This improves performance for the game "Grounded", on a AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
From my testing, it's possible to raise the threshold from 2 ms up to 5 ms or so, before the driver or GPU seems to reclock back to the lower power level. However, this measurement is questionable for several reasons. It seems to vary depending on the scene being rendered, and of course this will be specific to the game and driver and GPU in question anyway. The game also has a weird approach to vsync that seems to involve it presenting stale frames (and hence artificially inflating the FPS), which I'm not fully sure I accounted for while measuring. And of course, it's hard to be sure that 5 ms is actually the threshold for how long the driver will go before powering down the GPU. In any case, it seems better to err on the side of submitting more often, to make sure the fix affects more drivers.
While submission isn't cheap, it seems to me that submitting every 2 ms is unlikely to cause a bottleneck [consider that this is at most 8 (more) submissions per frame].
The maximum of 4 concurrent periodically submitted buffers was chosen arbitrarily. Removing the maximum altogether does not measurably affect performance for this game either way.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference.
From: Zebediah Figura zfigura@codeweavers.com
This improves performance for the game "Grounded", on a AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
From my testing, it's possible to raise the threshold from 2 ms up to 5 ms or so, before the driver or GPU seems to reclock back to the lower power level. However, this measurement is questionable for several reasons. It seems to vary depending on the scene being rendered, and of course this will be specific to the game and driver and GPU in question anyway. The game also has a weird approach to vsync that seems to involve it presenting stale frames (and hence artificially inflating the FPS), which I'm not fully sure I accounted for while measuring. And of course, it's hard to be sure that 5 ms is actually the threshold for how long the driver will go before powering down the GPU. In any case, it seems better to err on the side of submitting more often, to make sure the fix affects more drivers.
While submission isn't cheap, it seems to me that submitting every 2 ms is unlikely to cause a bottleneck [consider that this is at most 8 (more) submissions per frame].
The maximum of 4 concurrent periodically submitted buffers was chosen arbitrarily. Removing the maximum altogether does not measurably affect performance for this game either way.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference. --- dlls/wined3d/adapter_vk.c | 34 ++++++++++++++++++++++++++++++++++ dlls/wined3d/context_vk.c | 2 ++ dlls/wined3d/wined3d_private.h | 1 + 3 files changed, 37 insertions(+)
diff --git a/dlls/wined3d/adapter_vk.c b/dlls/wined3d/adapter_vk.c index 5b6042659aa..1d330999b10 100644 --- a/dlls/wined3d/adapter_vk.c +++ b/dlls/wined3d/adapter_vk.c @@ -1711,6 +1711,36 @@ static void adapter_vk_flush_context(struct wined3d_context *context) wined3d_context_vk_submit_command_buffer(context_vk, 0, NULL, NULL, 0, NULL); }
+/* In general we only submit when necessary or when a frame ends. However, + * applications which do a lot of work per frame can end up with the GPU idle + * for long periods of time while the CPU is building commands, and drivers may + * choose to reclock the GPU to a lower power level if they detect it being idle + * for that long. + * + * This may also help performance simply by virtue of allowing more parallelism + * between the GPU and CPU, although no clear evidence of that has been seen + * yet. */ +static void perform_periodic_submit(struct wined3d_context_vk *context_vk) +{ + LARGE_INTEGER now, freq; + + assert(context_vk->current_command_buffer.vk_command_buffer); + + /* The point of periodic submit is to keep the GPU busy, so if it's already + * busy with 4 or more command buffers, don't submit another one now. */ + if (context_vk->current_command_buffer.id - context_vk->completed_command_buffer_id - 1 > 4) + return; + + QueryPerformanceCounter(&now); + QueryPerformanceFrequency(&freq); + + if (((now.QuadPart - context_vk->command_buffer_create_time.QuadPart) * 1000000) / freq.QuadPart >= 2000) + { + TRACE("Submitting command buffer because it's been 2 ms since it was created.\n"); + wined3d_context_vk_submit_command_buffer(context_vk, 0, NULL, NULL, 0, NULL); + } +} + static void adapter_vk_draw_primitive(struct wined3d_device *device, const struct wined3d_state *state, const struct wined3d_draw_parameters *parameters) { @@ -1796,6 +1826,8 @@ static void adapter_vk_draw_primitive(struct wined3d_device *device, context_vk->c.transform_feedback_active = 0; }
+ perform_periodic_submit(context_vk); + context_release(&context_vk->c); }
@@ -1840,6 +1872,8 @@ static void adapter_vk_dispatch_compute(struct wined3d_device *device, VK_CALL(vkCmdPipelineBarrier(vk_command_buffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT, 0, 0, NULL, 0, NULL, 0, NULL));
+ perform_periodic_submit(context_vk); + context_release(&context_vk->c); }
diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c index 10713985e7e..d4cf48ac579 100644 --- a/dlls/wined3d/context_vk.c +++ b/dlls/wined3d/context_vk.c @@ -1767,6 +1767,8 @@ VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk wined3d_query_vk_resume(query_vk, context_vk); }
+ QueryPerformanceCounter(&context_vk->command_buffer_create_time); + TRACE("Created new command buffer %p with id 0x%s.\n", buffer->vk_command_buffer, wine_dbgstr_longlong(buffer->id));
diff --git a/dlls/wined3d/wined3d_private.h b/dlls/wined3d/wined3d_private.h index f793c6c0cde..f13d8eeca48 100644 --- a/dlls/wined3d/wined3d_private.h +++ b/dlls/wined3d/wined3d_private.h @@ -2722,6 +2722,7 @@ struct wined3d_context_vk struct wined3d_command_buffer_vk current_command_buffer; uint64_t completed_command_buffer_id; VkDeviceSize retired_bo_size; + LARGE_INTEGER command_buffer_create_time;
struct {
+static void perform_periodic_submit(struct wined3d_context_vk *context_vk) +{ + LARGE_INTEGER now, freq; + + assert(context_vk->current_command_buffer.vk_command_buffer); + + /* The point of periodic submit is to keep the GPU busy, so if it's already + * busy with 4 or more command buffers, don't submit another one now. */ + if (context_vk->current_command_buffer.id - context_vk->completed_command_buffer_id - 1 > 4) + return;
I suppose it's not worth worrying about, but note that in principle this would do the wrong thing if command buffer IDs were to wrap.
@@ -1796,6 +1826,8 @@ static void adapter_vk_draw_primitive(struct wined3d_device *device, context_vk->c.transform_feedback_active = 0; } + perform_periodic_submit(context_vk); + context_release(&context_vk->c); } @@ -1840,6 +1872,8 @@ static void adapter_vk_dispatch_compute(struct wined3d_device *device, VK_CALL(vkCmdPipelineBarrier(vk_command_buffer, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT, 0, 0, NULL, 0, NULL, 0, NULL)); + perform_periodic_submit(context_vk); + context_release(&context_vk->c); }
I suppose this is fine, but couldn't we just call perform_periodic_submit() from wined3d_context_vk_get_command_buffer()? We already handle WINED3D_RETIRED_BO_SIZE_THRESHOLD there, and it would largely avoid having to worry about where the ideal places to insert these calls are. (E.g., would a ddraw application that largely just does blits hit these in time?)
It might be nice to use constants for "4" and "2000", much like how we have e.g. WINED3D_CS_QUERY_POLL_INTERVAL and WINED3D_ALLOCATOR_CHUNK_SIZE.
I suppose it's not worth worrying about, but note that in principle this would do the wrong thing if command buffer IDs were to wrap.
I don't think I see how, and anyway, don't we explicitly avoid wrapping in wined3d_context_vk_submit_command_buffer()?
I suppose this is fine, but couldn't we just call perform_periodic_submit() from wined3d_context_vk_get_command_buffer()? We already handle WINED3D_RETIRED_BO_SIZE_THRESHOLD there, and it would largely avoid having to worry about where the ideal places to insert these calls are. (E.g., would a ddraw application that largely just does blits hit these in time?)
That makes sense; I largely didn't consider it.
It might be nice to use constants for "4" and "2000", much like how we have e.g. WINED3D_CS_QUERY_POLL_INTERVAL and WINED3D_ALLOCATOR_CHUNK_SIZE.
Sure.
I suppose it's not worth worrying about, but note that in principle this would do the wrong thing if command buffer IDs were to wrap.
I don't think I see how, and anyway, don't we explicitly avoid wrapping in wined3d_context_vk_submit_command_buffer()?
If e.g. "completed_command_buffer_id" was ~(uint64_t)0, and "current_command_buffer.id" was 1, the subtraction would evaluate to something larger than 4, despite there only being a single pending command buffer. But you're right, we handle this in wined3d_context_vk_submit_command_buffer(), and there's other code that relies on that as well, so it's not an issue.
If e.g. "completed_command_buffer_id" was ~(uint64_t)0, and "current_command_buffer.id" was 1, the subtraction would evaluate to something larger than 4,
Actually, no, that works too.
I think this behavior should be put being a registry flag.
See: https://wiki.winehq.org/Useful_Registry_Keys
The threshold should be configurable in something like `Direct3D/CommandBufferThresholdMs` registry key, and be `0` to completely disable the behavior.
Also maybe 5ms is a better default threshold on low end systems?
I'd be in favor of putting it in a compile time constant like Henri suggested until someone demonstrates that there are cases where fine tuning the value (or disabling it altogether) improves things significantly. We probably don't want a wealth of registry keys that are not actually useful.
These submits would happen from the CSMT thread right? Does the command stream queue ever run empty? Should we submit any pending vulkan commands in this case?
Did you benchmark the impact of the full NT syscall every d3d draw?
These submits would happen from the CSMT thread right?
Yes.
Does the command stream queue ever run empty? Should we submit any pending vulkan commands in this case?
In Grounded? I can try that, though it's not clear it's better than the current approach...
Did you benchmark the impact of the full NT syscall every d3d draw?
I believe I tested it with this game at least, though ultimately I don't know how much that says. On the other hand, as often, I don't know how to fashion a benchmark that will prove meaningfully that it does or doesn't matter.
In theory, QueryPerformanceCounter() can be modified to avoid the syscall, at least on some architectures, though I'm told this is kind of hard.
Does the command stream queue ever run empty? Should we submit any pending vulkan commands in this case?
In Grounded? I can try that, though it's not clear it's better than the current approach...
It doesn't seem to be enough, which doesn't exactly surprise me. The problem is that the game's render thread just does a bunch of d3d11 calls as soon as the frame begins (for some definition of the beginning of the frame), and the GPU has already finished by the time this is going on.
Did you benchmark the impact of the full NT syscall every d3d draw?
I believe I tested it with this game at least, though ultimately I don't know how much that says. On the other hand, as often, I don't know how to fashion a benchmark that will prove meaningfully that it does or doesn't matter.
In theory, QueryPerformanceCounter() can be modified to avoid the syscall, at least on some architectures, though I'm told this is kind of hard.
I double checked this, and QueryPerformanceCounter() even on every call to wined3d_context_vk_get_command_buffer() doesn't really seem to matter—for that game, with that driver and hardware...
Does the command stream queue ever run empty? Should we submit any pending vulkan commands in this case?
In Grounded? I can try that, though it's not clear it's better than the current approach...
It doesn't seem to be enough, which doesn't exactly surprise me. The problem is that the game's render thread just does a bunch of d3d11 calls as soon as the frame begins (for some definition of the beginning of the frame), and the GPU has already finished by the time this is going on.
It seems largely orthogonal to me, although there's probably an argument to be made for calling adapter_flush_context() before calling wined3d_cs_wait_event().
Does the command stream queue ever run empty? Should we submit any pending vulkan commands in this case?
In Grounded? I can try that, though it's not clear it's better than the current approach...
It doesn't seem to be enough, which doesn't exactly surprise me. The problem is that the game's render thread just does a bunch of d3d11 calls as soon as the frame begins (for some definition of the beginning of the frame), and the GPU has already finished by the time this is going on.
It seems largely orthogonal to me, although there's probably an argument to be made for calling adapter_flush_context() before calling wined3d_cs_wait_event().
Agreed, it's a similar problem to what's handled by this MR but at a different "level". I have patches for that, I need to actually prepare and submit MRs...
One potential approach we could take is to spawn a dedicated thread whose role is to wait 2 ms and then submit, and kick that thread every time we start a command buffer. (Note that we'd need to use timeBeginPeriod to turn down the timer resolution). That would avoid the need for QueryPerformanceCounter(). It's not clear to me that it's worth the trouble, though.
On Thu May 11 08:30:04 2023 +0000, Zebediah Figura wrote:
One potential approach we could take is to spawn a dedicated thread whose role is to wait 2 ms and then submit, and kick that thread every time we start a command buffer. (Note that we'd need to use timeBeginPeriod to turn down the timer resolution). That would avoid the need for QueryPerformanceCounter(). It's not clear to me that it's worth the trouble, though.
If QueryPerformanceCounter() is too heavyweight, couldn't we just RDTSC and eyeball the threshold? We don't really care about the exact millisecond value, but I guess we could make a slightly more educated estimate by reading the CPU clock speed.
On Thu May 11 08:30:04 2023 +0000, Jan Sikorski wrote:
If QueryPerformanceCounter() is too heavyweight, couldn't we just RDTSC and eyeball the threshold? We don't really care about the exact millisecond value, but I guess we could make a slightly more educated estimate by reading the CPU clock speed.
In theory QueryPerformanceCounter() can itself use rdtsc. I think someone (Rémi?) has wip patches for that even. Perhaps upstreaming those would be better than using rdtsc directly.
I suppose the operative word is "if", though.