This improves performance for the game "Grounded", on an AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
From my testing, it's possible to raise the threshold from 2 ms up to 5 ms or so, before the driver or GPU seems to reclock back to the lower power level. However, this measurement is questionable for several reasons. It seems to vary depending on the scene being rendered, and of course this will be specific to the game and driver and GPU in question anyway. The game also has a weird approach to vsync that seems to involve it presenting stale frames (and hence artificially inflating the FPS), which I'm not fully sure I accounted for while measuring. And of course, it's hard to be sure that 5 ms is actually the threshold for how long the driver will go before powering down the GPU. In any case, it seems better to err on the side of submitting more often, to make sure the fix affects more drivers.
While submission isn't cheap, it seems to me that submitting every 2 ms is unlikely to cause a bottleneck [consider that this is at most 8 (more) submissions per frame].
The maximum of 4 concurrent periodically submitted buffers was chosen arbitrarily. Removing the maximum altogether does not measurably affect performance for this game either way.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference.
--
v3: wined3d: Submit command buffers after 512 draw or dispatch commands.
    wined3d: Retrieve the VkCommandBuffer from wined3d_context_vk after executing RTV barriers.
From: Zebediah Figura <zfigura@codeweavers.com>
Part of beginning a render pass involves executing an RTV barrier, which itself needs to call wined3d_context_vk_get_command_buffer(). However, that function may decide to submit the command buffer, in order to prevent resource buildup, or [in the future] because it has been some length of time since the last submission.
Therefore we cannot retrieve and store a VkCommandBuffer pointer before executing an RTV barrier and then use it later.
---
 dlls/wined3d/context_vk.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c
index dc793a839bb..377c437ee09 100644
--- a/dlls/wined3d/context_vk.c
+++ b/dlls/wined3d/context_vk.c
@@ -2699,7 +2699,7 @@ static bool wined3d_context_vk_update_graphics_pipeline_key(struct wined3d_conte
 }

 static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *context_vk,
-        VkCommandBuffer vk_command_buffer, const struct wined3d_state *state, const struct wined3d_vk_info *vk_info)
+        const struct wined3d_state *state, const struct wined3d_vk_info *vk_info)
 {
     struct wined3d_device_vk *device_vk = wined3d_device_vk(context_vk->c.device);
     VkClearValue clear_values[WINED3D_MAX_RENDER_TARGETS + 1];
@@ -2709,6 +2709,7 @@ static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *cont
     struct wined3d_rendertarget_view *view;
     const VkPhysicalDeviceLimits *limits;
     struct wined3d_query_vk *query_vk;
+    VkCommandBuffer vk_command_buffer;
     VkRenderPassBeginInfo begin_info;
     unsigned int attachment_count, i;
     struct wined3d_texture *texture;
@@ -2814,6 +2815,12 @@ static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *cont
         ++attachment_count;
     }

+    if (!(vk_command_buffer = wined3d_context_vk_get_command_buffer(context_vk)))
+    {
+        ERR("Failed to get command buffer.\n");
+        return false;
+    }
+
     if (!(context_vk->vk_render_pass = wined3d_context_vk_get_render_pass(context_vk, &state->fb,
             ARRAY_SIZE(state->fb.render_targets), !!state->fb.depth_stencil, 0)))
     {
@@ -3772,19 +3779,15 @@ VkCommandBuffer wined3d_context_vk_apply_draw_state(struct wined3d_context_vk *c

     wined3d_context_vk_load_buffers(context_vk, state, indirect_vk, indexed);

-    if (!(vk_command_buffer = wined3d_context_vk_get_command_buffer(context_vk)))
-    {
-        ERR("Failed to get command buffer.\n");
-        return VK_NULL_HANDLE;
-    }
-
     if (wined3d_context_is_graphics_state_dirty(&context_vk->c, STATE_FRAMEBUFFER))
         wined3d_context_vk_end_current_render_pass(context_vk);
-    if (!wined3d_context_vk_begin_render_pass(context_vk, vk_command_buffer, state, vk_info))
+
+    if (!wined3d_context_vk_begin_render_pass(context_vk, state, vk_info))
     {
         ERR("Failed to begin render pass.\n");
         return VK_NULL_HANDLE;
     }
+    vk_command_buffer = context_vk->current_command_buffer.vk_command_buffer;

     while (invalidate_rt)
     {
From: Zebediah Figura <zfigura@codeweavers.com>
This improves performance for the game "Grounded", on an AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
Grounded generally makes about 4000 draw calls per frame, which seems not atypical. This change makes it submit at most an extra 8 times per frame, but in practice due to WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS it submits less (usually only 2-3).
The most demanding game I've seen made about 20,000 draw calls per frame, at which point the overhead of adapter_vk_draw_primitive() itself becomes a serious bottleneck. For such a game we would submit 40 more times per frame with these settings, although WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS means we would likely submit less than that. In any case if submission itself becomes a bottleneck, we should offload it to a separate thread.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference.
---
 dlls/wined3d/adapter_vk.c |  4 ++++
 dlls/wined3d/context_vk.c | 35 ++++++++++++++++++++++++++++++++++-
 dlls/wined3d/wined3d_vk.h |  3 +++
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/dlls/wined3d/adapter_vk.c b/dlls/wined3d/adapter_vk.c
index 418b67de8b6..39b263872ed 100644
--- a/dlls/wined3d/adapter_vk.c
+++ b/dlls/wined3d/adapter_vk.c
@@ -1807,6 +1807,8 @@ static void adapter_vk_draw_primitive(struct wined3d_device *device,
         context_vk->c.transform_feedback_active = 0;
     }

+    ++context_vk->command_buffer_work_count;
+
     context_release(&context_vk->c);
 }

@@ -1851,6 +1853,8 @@ static void adapter_vk_dispatch_compute(struct wined3d_device *device,
     VK_CALL(vkCmdPipelineBarrier(vk_command_buffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
             VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT, 0, 0, NULL, 0, NULL, 0, NULL));

+    ++context_vk->command_buffer_work_count;
+
     context_release(&context_vk->c);
 }

diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c
index 377c437ee09..200dcb57b7f 100644
--- a/dlls/wined3d/context_vk.c
+++ b/dlls/wined3d/context_vk.c
@@ -1771,6 +1771,37 @@ void wined3d_context_vk_cleanup(struct wined3d_context_vk *context_vk)
     wined3d_context_cleanup(&context_vk->c);
 }

+/* In general we only submit when necessary or when a frame ends. However,
+ * applications which do a lot of work per frame can end up with the GPU idle
+ * for long periods of time while the CPU is building commands, and drivers may
+ * choose to reclock the GPU to a lower power level if they detect it being idle
+ * for that long.
+ *
+ * This may also help performance simply by virtue of allowing more parallelism
+ * between the GPU and CPU, although no clear evidence of that has been seen
+ * yet. */
+
+#define WINED3D_PERIODIC_SUBMIT_WORK_COUNT 512
+#define WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS 3
+
+static bool should_periodic_submit(struct wined3d_context_vk *context_vk)
+{
+    uint64_t busy_count;
+
+    if (context_vk->command_buffer_work_count < WINED3D_PERIODIC_SUBMIT_WORK_COUNT)
+        return false;
+
+    /* The point of periodic submit is to keep the GPU busy, so if it's already
+     * busy with 4 or more command buffers, don't submit another one now. */
+    busy_count = context_vk->current_command_buffer.id - context_vk->completed_command_buffer_id - 1;
+    if (busy_count > WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS)
+        return false;
+
+    TRACE("Periodically submitting command buffer, %u draw/dispatch commands since last buffer, %I64u currently busy.\n",
+            context_vk->command_buffer_work_count, busy_count);
+    return true;
+}
+
 VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk *context_vk)
 {
     struct wined3d_device_vk *device_vk = wined3d_device_vk(context_vk->c.device);
@@ -1785,7 +1816,7 @@ VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk
     buffer = &context_vk->current_command_buffer;
     if (buffer->vk_command_buffer)
     {
-        if (context_vk->retired_bo_size > WINED3D_RETIRED_BO_SIZE_THRESHOLD)
+        if (context_vk->retired_bo_size > WINED3D_RETIRED_BO_SIZE_THRESHOLD || should_periodic_submit(context_vk))
             wined3d_context_vk_submit_command_buffer(context_vk, 0, NULL, NULL, 0, NULL);
         else
         {
@@ -1854,6 +1885,8 @@ VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk
         wined3d_query_vk_resume(query_vk, context_vk);
     }

+    context_vk->command_buffer_work_count = 0;
+
     TRACE("Created new command buffer %p with id 0x%s.\n",
             buffer->vk_command_buffer, wine_dbgstr_longlong(buffer->id));

diff --git a/dlls/wined3d/wined3d_vk.h b/dlls/wined3d/wined3d_vk.h
index 94a6b6c0c5e..ad8eb2453f5 100644
--- a/dlls/wined3d/wined3d_vk.h
+++ b/dlls/wined3d/wined3d_vk.h
@@ -614,6 +614,9 @@ struct wined3d_context_vk
     struct wined3d_command_buffer_vk current_command_buffer;
     uint64_t completed_command_buffer_id;
     VkDeviceSize retired_bo_size;
+    /* Number of draw or dispatch calls that have been recorded into the
+     * current command buffer. */
+    unsigned int command_buffer_work_count;

     struct
     {
Trying to do this based on time is very annoyingly hard.
* Testing the lower 32 bits of the interrupt time / tick count is fast enough, but the resolution isn't good enough. Increasing the resolution to 1 ms (i.e. reducing the timeout in the server) when an application calls timeBeginPeriod() is feasible, but we don't want to do it unless we need to, and we don't really know that ahead of time in d3d (e.g. we don't know whether we're in a game that's trying its best to make 60 fps, or a productivity application that shouldn't consume any more CPU than necessary).
* Trying to use a separate thread is a large amount of code to begin with, which isn't great. More concerningly, we need it to synchronize with the CS thread, and even the overhead of using a mutex hurts my artificial benchmark in a noticeable way.
* I briefly examined the idea of using a separate thread but submitting *to* the CS thread, basically as a client thread. The problem is that we can be way ahead of the CS thread, and we have no idea how long it's actually been since the last submission.
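To make the resolution problem from the first point concrete, here is a minimal sketch of a coarse tick counter. It is not Wine code. The 16 ms step used in the usage note is an assumed figure for illustration only, and coarse_ticks() and first_trigger_ms() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coarse clock: the reported value only advances in steps of
 * resolution_ms, like a tick count that is updated on a timer. */
static uint32_t coarse_ticks(uint32_t true_time_ms, uint32_t resolution_ms)
{
    assert(resolution_ms > 0);
    return true_time_ms - true_time_ms % resolution_ms;
}

/* Returns how many real milliseconds pass after start_ms before a check of
 * the form "coarse elapsed time >= threshold_ms" first succeeds. Unsigned
 * subtraction keeps the comparison valid across 32-bit wraparound. */
static uint32_t first_trigger_ms(uint32_t start_ms, uint32_t threshold_ms,
        uint32_t resolution_ms)
{
    uint32_t start_ticks = coarse_ticks(start_ms, resolution_ms);
    uint32_t t = start_ms;

    while ((uint32_t)(coarse_ticks(t, resolution_ms) - start_ticks) < threshold_ms)
        ++t;
    return t - start_ms;
}
```

With an assumed 16 ms step and a 2 ms threshold, first_trigger_ms(0, 2, 16) is 16 and first_trigger_ms(15, 2, 16) is 1, so the check fires anywhere from far too late to too early depending on where the last submission fell relative to a tick. With a 1 ms step, first_trigger_ms(0, 2, 1) is 2 as intended.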
So I'm giving up on that approach for now, and just counting draw and dispatch calls instead. This is of course cheap to measure, and should ultimately work just as well. It may result in submitting *too* often in case the application makes a *lot* of draw calls, but as stated in the commit message, we should prevent that by limiting the number of inflight command buffers, and anyway if vkQueueSubmit() itself does clearly become a bottleneck, we can offload it to a separate thread [synchronization there is a lot easier if that thread doesn't also have to decide to *end* a command buffer that the CS is currently using.]
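For a rough feel of how the counting approach behaves, here is a small standalone model of it. Only the two constants (512 and 3) come from the patch. struct submit_model, model_should_submit(), max_extra_submits() and simulate_frame() are hypothetical stand-ins, and the "GPU keeps up" / "GPU stalled" cases are simplifying assumptions, not measurements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch; everything else is a hypothetical model. */
#define PERIODIC_SUBMIT_WORK_COUNT 512
#define PERIODIC_SUBMIT_MAX_BUFFERS 3

/* Hypothetical stand-in for the relevant wined3d_context_vk fields. */
struct submit_model
{
    uint64_t current_id;     /* id of the command buffer being recorded */
    uint64_t completed_id;   /* id of the last buffer the GPU has finished */
    unsigned int work_count; /* draws/dispatches in the current buffer */
};

/* Same decision shape as should_periodic_submit() in the patch. */
static bool model_should_submit(const struct submit_model *m)
{
    uint64_t busy_count;

    if (m->work_count < PERIODIC_SUBMIT_WORK_COUNT)
        return false;

    /* Buffers submitted but not yet completed. */
    busy_count = m->current_id - m->completed_id - 1;
    return busy_count <= PERIODIC_SUBMIT_MAX_BUFFERS;
}

/* Upper bound on extra submissions per frame, ignoring the in-flight cap. */
static unsigned int max_extra_submits(unsigned int draws_per_frame)
{
    return draws_per_frame / PERIODIC_SUBMIT_WORK_COUNT;
}

/* Records `draws` draw calls and returns the number of periodic submissions.
 * If gpu_keeps_up is true, every submitted buffer is treated as completed
 * immediately; otherwise the GPU never completes anything. */
static unsigned int simulate_frame(struct submit_model *m, unsigned int draws,
        bool gpu_keeps_up)
{
    unsigned int i, submits = 0;

    assert(m->current_id > m->completed_id);

    for (i = 0; i < draws; ++i)
    {
        /* The check runs when the command buffer is fetched for a draw. */
        if (model_should_submit(m))
        {
            ++m->current_id; /* start recording into a new buffer */
            m->work_count = 0;
            ++submits;
            if (gpu_keeps_up)
                m->completed_id = m->current_id - 1;
        }
        ++m->work_count;
    }
    return submits;
}
```

Starting from { 1, 0, 0 }, a 4000-draw frame gives 7 periodic submissions when the GPU keeps up and 4 when it is stalled. max_extra_submits() gives 7 for 4000 draws and 39 for 20,000, in line with the "at most 8" and roughly 40 figures in the commit message.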
This merge request was approved by Jan Sikorski.