This improves performance for the game "Grounded", on an AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
From my testing, it's possible to raise the threshold from 2 ms up to 5 ms or so, before the driver or GPU seems to reclock back to the lower power level. However, this measurement is questionable for several reasons. It seems to vary depending on the scene being rendered, and of course this will be specific to the game and driver and GPU in question anyway. The game also has a weird approach to vsync that seems to involve it presenting stale frames (and hence artificially inflating the FPS), which I'm not fully sure I accounted for while measuring. And of course, it's hard to be sure that 5 ms is actually the threshold for how long the driver will go before powering down the GPU. In any case, it seems better to err on the side of submitting more often, to make sure the fix affects more drivers.
While submission isn't cheap, it seems to me that submitting every 2 ms is unlikely to cause a bottleneck [at 60 FPS a frame lasts about 16.7 ms, so this is at most 8 (more) submissions per frame].
The maximum of 4 concurrent periodically submitted buffers was chosen arbitrarily. Removing the maximum altogether does not measurably affect performance for this game either way.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference.
-- v2:
wined3d: Submit command buffers that are built for longer than 2 ms.
wined3d: Retrieve the VkCommandBuffer from wined3d_context_vk after executing RTV barriers.
From: Zebediah Figura zfigura@codeweavers.com
Part of beginning a render pass involves executing an RTV barrier, which itself needs to call wined3d_context_vk_get_command_buffer(). However, that function may decide to submit the command buffer, in order to prevent resource buildup, or [in the future] because it has been some length of time since the last submission.
Therefore we cannot retrieve and store a VkCommandBuffer pointer before executing an RTV barrier and then use it later.
---
 dlls/wined3d/context_vk.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c
index e59069ab809..9ed5b46ba96 100644
--- a/dlls/wined3d/context_vk.c
+++ b/dlls/wined3d/context_vk.c
@@ -2607,7 +2607,7 @@ static bool wined3d_context_vk_update_graphics_pipeline_key(struct wined3d_conte
 }
 
 static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *context_vk,
-        VkCommandBuffer vk_command_buffer, const struct wined3d_state *state, const struct wined3d_vk_info *vk_info)
+        const struct wined3d_state *state, const struct wined3d_vk_info *vk_info)
 {
     struct wined3d_device_vk *device_vk = wined3d_device_vk(context_vk->c.device);
     VkClearValue clear_values[WINED3D_MAX_RENDER_TARGETS + 1];
@@ -2617,6 +2617,7 @@ static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *cont
     struct wined3d_rendertarget_view *view;
     const VkPhysicalDeviceLimits *limits;
     struct wined3d_query_vk *query_vk;
+    VkCommandBuffer vk_command_buffer;
     VkRenderPassBeginInfo begin_info;
     unsigned int attachment_count, i;
     struct wined3d_texture *texture;
@@ -2722,6 +2723,12 @@ static bool wined3d_context_vk_begin_render_pass(struct wined3d_context_vk *cont
         ++attachment_count;
     }
 
+    if (!(vk_command_buffer = wined3d_context_vk_get_command_buffer(context_vk)))
+    {
+        ERR("Failed to get command buffer.\n");
+        return false;
+    }
+
     if (!(context_vk->vk_render_pass = wined3d_context_vk_get_render_pass(context_vk,
             &state->fb, ARRAY_SIZE(state->fb.render_targets), !!state->fb.depth_stencil, 0)))
     {
@@ -3680,19 +3687,15 @@ VkCommandBuffer wined3d_context_vk_apply_draw_state(struct wined3d_context_vk *c
 
     wined3d_context_vk_load_buffers(context_vk, state, indirect_vk, indexed);
 
-    if (!(vk_command_buffer = wined3d_context_vk_get_command_buffer(context_vk)))
-    {
-        ERR("Failed to get command buffer.\n");
-        return VK_NULL_HANDLE;
-    }
-
     if (wined3d_context_is_graphics_state_dirty(&context_vk->c, STATE_FRAMEBUFFER))
         wined3d_context_vk_end_current_render_pass(context_vk);
-    if (!wined3d_context_vk_begin_render_pass(context_vk, vk_command_buffer, state, vk_info))
+
+    if (!wined3d_context_vk_begin_render_pass(context_vk, state, vk_info))
     {
         ERR("Failed to begin render pass.\n");
         return VK_NULL_HANDLE;
     }
+    vk_command_buffer = context_vk->current_command_buffer.vk_command_buffer;
 
     while (invalidate_rt)
     {
From: Zebediah Figura zfigura@codeweavers.com
This improves performance for the game "Grounded", on an AMD Radeon RX 6700 XT, with radv from Mesa 22.3.6. Testing was done with the "cb_access_map_w" option enabled, which also improves performance with the game by itself.
From my testing, it's possible to raise the threshold from 2 ms up to 5 ms or so, before the driver or GPU seems to reclock back to the lower power level. However, this measurement is questionable for several reasons. It seems to vary depending on the scene being rendered, and of course this will be specific to the game and driver and GPU in question anyway. The game also has a weird approach to vsync that seems to involve it presenting stale frames (and hence artificially inflating the FPS), which I'm not fully sure I accounted for while measuring. And of course, it's hard to be sure that 5 ms is actually the threshold for how long the driver will go before powering down the GPU. In any case, it seems better to err on the side of submitting more often, to make sure the fix affects more drivers.
While submission isn't cheap, it seems to me that submitting every 2 ms is unlikely to cause a bottleneck [at 60 FPS a frame lasts about 16.7 ms, so this is at most 8 (more) submissions per frame].
The maximum of 4 concurrent periodically submitted buffers was chosen arbitrarily. Removing the maximum altogether does not measurably affect performance for this game either way.
Credit goes to Philip Rebohle and his work on DXVK for helping me to notice that periodic submission might make a difference.
---
 dlls/wined3d/context_vk.c | 41 ++++++++++++++++++++++++++++++++++++++-
 dlls/wined3d/wined3d_vk.h |  1 +
 2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c
index 9ed5b46ba96..d54fe9ad26e 100644
--- a/dlls/wined3d/context_vk.c
+++ b/dlls/wined3d/context_vk.c
@@ -1771,6 +1771,43 @@ void wined3d_context_vk_cleanup(struct wined3d_context_vk *context_vk)
     wined3d_context_cleanup(&context_vk->c);
 }
 
+/* In general we only submit when necessary or when a frame ends. However,
+ * applications which do a lot of work per frame can end up with the GPU idle
+ * for long periods of time while the CPU is building commands, and drivers may
+ * choose to reclock the GPU to a lower power level if they detect it being idle
+ * for that long.
+ *
+ * This may also help performance simply by virtue of allowing more parallelism
+ * between the GPU and CPU, although no clear evidence of that has been seen
+ * yet. */
+
+#define WINED3D_PERIODIC_SUBMIT_INTERVAL_MICROSECONDS 2000
+#define WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS 4
+
+static bool should_periodic_submit(struct wined3d_context_vk *context_vk)
+{
+    LARGE_INTEGER now, freq;
+    uint64_t busy_count;
+    ULONGLONG diff;
+
+    /* The point of periodic submit is to keep the GPU busy, so if it's already
+     * busy with 4 or more command buffers, don't submit another one now. */
+    busy_count = context_vk->current_command_buffer.id - context_vk->completed_command_buffer_id - 1;
+    if (busy_count > WINED3D_PERIODIC_SUBMIT_MAX_BUFFERS)
+        return false;
+
+    QueryPerformanceCounter(&now);
+    QueryPerformanceFrequency(&freq);
+
+    diff = ((now.QuadPart - context_vk->command_buffer_create_time.QuadPart) * 1000000) / freq.QuadPart;
+    if (diff < WINED3D_PERIODIC_SUBMIT_INTERVAL_MICROSECONDS)
+        return false;
+
+    TRACE("Periodically submitting command buffer, %I64u us since last buffer, %I64u currently busy.\n",
+            diff, busy_count);
+    return true;
+}
+
 VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk *context_vk)
 {
     struct wined3d_device_vk *device_vk = wined3d_device_vk(context_vk->c.device);
@@ -1785,7 +1822,7 @@ VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk
     buffer = &context_vk->current_command_buffer;
     if (buffer->vk_command_buffer)
     {
-        if (context_vk->retired_bo_size > WINED3D_RETIRED_BO_SIZE_THRESHOLD)
+        if (context_vk->retired_bo_size > WINED3D_RETIRED_BO_SIZE_THRESHOLD || should_periodic_submit(context_vk))
             wined3d_context_vk_submit_command_buffer(context_vk, 0, NULL, NULL, 0, NULL);
         else
         {
@@ -1854,6 +1891,8 @@ VkCommandBuffer wined3d_context_vk_get_command_buffer(struct wined3d_context_vk
         wined3d_query_vk_resume(query_vk, context_vk);
     }
 
+    QueryPerformanceCounter(&context_vk->command_buffer_create_time);
+
     TRACE("Created new command buffer %p with id 0x%s.\n", buffer->vk_command_buffer,
             wine_dbgstr_longlong(buffer->id));

diff --git a/dlls/wined3d/wined3d_vk.h b/dlls/wined3d/wined3d_vk.h
index e995ef3c408..fe4f96cfd57 100644
--- a/dlls/wined3d/wined3d_vk.h
+++ b/dlls/wined3d/wined3d_vk.h
@@ -599,6 +599,7 @@ struct wined3d_context_vk
     struct wined3d_command_buffer_vk current_command_buffer;
     uint64_t completed_command_buffer_id;
     VkDeviceSize retired_bo_size;
+    LARGE_INTEGER command_buffer_create_time;
 
     struct
     {
Rebased, moved the periodic submit to wined3d_context_vk_get_command_buffer(), and added a related fix.
Concerning performance:
* From my understanding (mostly informed by discussion with Arek Hiler), even recent hardware has serious reliability problems with rdtsc. Some of those problems go away for our more limited use case of measuring time differences, but not all do. E.g. it's possible for two cores to become seriously offset from each other, and unless we were to pin the CS to a single thread (dubious) that would mean our measurements could be constantly broken.
* On my machine, one call to QueryPerformanceCounter() seems to take about 80 ns.
* I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:
10000 draws: 217 vs 183
2000 draws: 860 vs 772
500 draws: 1900 vs 1840
* I can't perceive a difference in Grounded, however. This may be because Grounded doesn't call the function enough [it makes about 4200 draw calls per frame, at least in the scene I was testing, and that ends up being about 14600 calls to wined3d_context_vk_get_command_buffer()], or because the CS thread isn't a bottleneck [but it is pretty consistently at 90% CPU or more], or because the game's performance is too variable to measure.
If we are worried about performance—and I'm not really sure whether to be worried—then I'm open to trying the aforementioned approach with a separate submit thread.
On my machine, one call to QueryPerformanceCounter() seems to take about 80 ns.
How does that compare to e.g. QueryInterruptTime()?
If we are worried about performance—and I'm not really sure whether to be worried—then I'm open to trying the aforementioned approach with a separate submit thread.
How about something like CreateTimerQueueTimer() or SetWaitableTimer()?
On Thu Apr 4 14:14:51 2024 +0000, Elizabeth Figura wrote:
Rebased, moved the periodic submit to wined3d_context_vk_get_command_buffer(), and added a related fix.

Concerning performance:

- From my understanding (mostly informed by discussion with Arek Hiler), even recent hardware has serious reliability problems with rdtsc. Some of those problems go away for our more limited use case of measuring time differences, but not all do. E.g. it's possible for two cores to become seriously offset from each other, and unless we were to pin the CS to a single thread (dubious) that would mean our measurements could be constantly broken.

- On my machine, one call to QueryPerformanceCounter() seems to take about 80 ns.

- I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:

  10000 draws: 217 vs 183
  2000 draws: 860 vs 772
  500 draws: 1900 vs 1840

- I can't perceive a difference in Grounded, however. This may be because Grounded doesn't call the function enough [it makes about 4200 draw calls per frame, at least in the scene I was testing, and that ends up being about 14600 calls to wined3d_context_vk_get_command_buffer()], or because the CS thread isn't a bottleneck [but it is pretty consistently at 90% CPU or more], or because the game's performance is too variable to measure.

If we are worried about performance (and I'm not really sure whether to be worried), then I'm open to trying the aforementioned approach with a separate submit thread.
Re Grounded, what was the improvement vs not submitting? You're not seeing a difference between QueryPerformanceCounter and rdtsc?
I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:

10000 draws: 217 vs 183
2000 draws: 860 vs 772
500 draws: 1900 vs 1840
Is that just the `QueryPerformanceCounter()` overhead? I.e., what does the benchmark say if you always return false from `should_periodic_submit()`?
I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:
Fyi, I have a collection of microbenchmarks here:
https://gitlab.winehq.org/stefan/perftest
Though most are stuck in d3d9 times. Once upon a time I was running them in an automated fashion daily...
On Thu Apr 4 16:43:51 2024 +0000, Stefan Dösinger wrote:
I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:

Fyi, I have a collection of microbenchmarks here:

https://gitlab.winehq.org/stefan/perftest

Though most are stuck in d3d9 times. Once upon a time I was running them in an automated fashion daily...
Ah sorry, I wasn't clear enough.
The difference in Grounded vs not submitting happens because of the aforementioned GPU scheduling (and that's still true on the same machine with Mesa 24.0.4-1).
What I tested here, and was a bit too sleep-deprived to properly explain, was adding an *extra* few calls to QueryPerformanceCounter() in wined3d_context_vk_get_command_buffer(). I didn't see a difference between one call and five calls. I did see a difference with 100 calls, so the overhead does matter; it's just not clear whether *one* call is enough to matter.
On my machine, one call to QueryPerformanceCounter() seems to take about 80 ns.
How does that compare to e.g. QueryInterruptTime()?
Currently in Wine Query(Unbiased)InterruptTime has the same resolution as GetTickCount(), which is "16 ms or whenever someone does a server call". That's not good enough here.
On Windows it has 1 ms resolution, or at least it does if someone calls timeBeginPeriod(). But I think Alexandre has resisted turning up the resolution in the server due to performance concerns.
If we are worried about performance—and I'm not really sure whether to be worried—then I'm open to trying the aforementioned approach with a separate submit thread.
How about something like CreateTimerQueueTimer() or SetWaitableTimer()?
Yes, that's what I was planning to use. Actually it'd even be enough to use a regular sleep or wait, at least on Wine, but I'm not sure that's quite reliable on Windows.
I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:

10000 draws: 217 vs 183
2000 draws: 860 vs 772
500 draws: 1900 vs 1840
Is that just the `QueryPerformanceCounter()` overhead? I.e., what does the benchmark say if you always return false from `should_periodic_submit()`?
So... I don't know what I was testing originally, but I can't reproduce that huge difference anymore. Skipping should_periodic_submit entirely, I see 217 vs 213 fps for 10000 draws, and no consistent measurable difference for 2000 draws.
I'm pretty sure my original tests were at least missing patch 1/2, so maybe that made the difference.
But I think Alexandre has resisted turning up the resolution in the server due to performance concerns.
Which, to be clear, ultimately means that if we need 1 ms resolution, we need to cross the PE/Unix boundary to get it.
On my machine, one call to QueryPerformanceCounter() seems to take about 80 ns.
How does that compare to e.g. QueryInterruptTime()?
Currently in Wine Query(Unbiased)InterruptTime has the same resolution as GetTickCount(), which is "16 ms or whenever someone does a server call". That's not good enough here.
Well, I was mostly interested in the relative call overhead. It's perhaps worth pointing out though that even without a callback, just calling SetWaitableTimer() with a 1 ms period would probably get the resolution close enough for our purposes. Whether that's a good idea is perhaps a separate discussion, but it would do for testing purposes at least.
Well, I was mostly interested in the relative call overhead.
QueryInterruptTime() is a little over 2 ns on this machine. (Rerunning the tests I also see QueryPerformanceCounter() at more like 62 ns, which is a little off from the measurement I made earlier.)
It's perhaps worth pointing out though that even without a callback, just calling SetWaitableTimer() with a 1 ms period would probably get the resolution close enough for our purposes.
Ah, I see what you're getting at—artificially invoke a timeout on the server side...
Whether that's a good idea is perhaps a separate discussion, but it would do for testing purposes at least.
It would, but I'm not seeing anything obvious to test?
Whether that's a good idea is perhaps a separate discussion, but it would do for testing purposes at least.
It would, but I'm not seeing anything obvious to test?
You already have a benchmark, see if it improves? :) The hit from this doesn't necessarily seem bad enough to block the MR, but I think the benchmark results do seem to hint at adding roughly 10% draw call overhead. If we could get that down to e.g. 1% or less, that seems worth pursuing.
I constructed a simple benchmark that just makes a bunch of draw calls per frame (and nothing else). That benchmark is artificial, but it shows a consistent performance hit from this commit:

10000 draws: 217 vs 183
2000 draws: 860 vs 772
500 draws: 1900 vs 1840
Is that just the `QueryPerformanceCounter()` overhead? I.e., what does the benchmark say if you always return false from `should_periodic_submit()`?
So... I don't know what I was testing originally, but I can't reproduce that huge difference anymore. Skipping should_periodic_submit entirely, I see 217 vs 213 fps for 10000 draws, and no consistent measurable difference for 2000 draws.
I'm pretty sure my original tests were at least missing patch 1/2, so maybe that made the difference.
Oops, it did. By calling wined3d_context_vk_begin_render_pass() and then accessing the command buffer directly, it bypassed the periodic submit logic. With that fixed I'm getting the same numbers as before.
You already have a benchmark, see if it improves? :) The hit from this doesn't necessarily seem bad enough to block the MR, but I think the benchmark results do seem to hint at adding roughly 10% draw call overhead. If we could get that down to e.g. 1% or less, that seems worth pursuing.
I tried this [set a waitable timer to 1 ms, and use QueryInterruptTime()], and it gets me about 211 fps. The difference between that and the original 217 fps is mostly the submit, but QueryInterruptTime() does hurt a bit, probably because of the loop? Directly accessing user_shared_data->InterruptTime.LowPart (which is the only part we need) is better; I'm not seeing any measurable difference between that and upstream Wine.
So I see three potential ways forward:
* Decide we don't care about an artificial benchmark, and just use this merge request as-is.
* Decide the server is okay with raising the tick count interval to 1 ms after all, at least if guarded by timeBeginPeriod()/timeEndPeriod(), and then use the interrupt time or tick count here.
* Try an approach that uses a separate thread. This may be a good idea anyway, since my benchmark shows that submitting does have some nonnegligible overhead.
The difference between that and the original 217 fps is mostly the submit, but QueryInterruptTime() does hurt a bit, probably because of the loop? Directly accessing user_shared_data->InterruptTime.LowPart (which is the only part we need) is better; I'm not seeing any measurable difference between that and upstream Wine.
To be clear, what I did here is hack the final "return true" in should_periodic_submit() to "return false", to measure the overhead of measuring time without actually performing the submit.