From: Matteo Bruni mbruni@codeweavers.com
With the buffer / texture mapping acceleration bits working properly, we can get many outstanding commands and, to avoid blocking the client thread while waiting for free space, we need a larger queue. 16 MiB was always enough in my testing (8 MiB wasn't in a few cases).
Keep a lower limit on 32-bit since we're often times address space-starved there.
Tweaked by Zebediah Figura. --- dlls/wined3d/wined3d_private.h | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/dlls/wined3d/wined3d_private.h b/dlls/wined3d/wined3d_private.h index 001534e8c38..be576fcfbde 100644 --- a/dlls/wined3d/wined3d_private.h +++ b/dlls/wined3d/wined3d_private.h @@ -5074,7 +5074,11 @@ enum wined3d_push_constants };
#define WINED3D_CS_QUERY_POLL_INTERVAL 10u +#if defined(_WIN64) +#define WINED3D_CS_QUEUE_SIZE 0x1000000u +#else #define WINED3D_CS_QUEUE_SIZE 0x400000u +#endif #define WINED3D_CS_SPIN_COUNT 10000000u #define WINED3D_CS_QUEUE_MASK (WINED3D_CS_QUEUE_SIZE - 1)
From: Matteo Bruni mbruni@codeweavers.com
Avoid busy spinning for a potentially long time. --- dlls/wined3d/cs.c | 25 ++++++++++++++++++++++++- dlls/wined3d/wined3d_private.h | 3 ++- 2 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/dlls/wined3d/cs.c b/dlls/wined3d/cs.c index 6a09ec2a35a..14f2ca9bb4a 100644 --- a/dlls/wined3d/cs.c +++ b/dlls/wined3d/cs.c @@ -724,6 +724,8 @@ static void wined3d_cs_exec_present(struct wined3d_cs *cs, const void *data) }
InterlockedDecrement(&cs->pending_presents); + if (InterlockedCompareExchange(&cs->waiting_for_present, FALSE, TRUE)) + SetEvent(cs->present_event); }
void wined3d_cs_emit_present(struct wined3d_cs *cs, struct wined3d_swapchain *swapchain, @@ -734,6 +736,8 @@ void wined3d_cs_emit_present(struct wined3d_cs *cs, struct wined3d_swapchain *sw unsigned int i; LONG pending;
+ wined3d_not_from_cs(cs); + op = wined3d_device_context_require_space(&cs->c, sizeof(*op), WINED3D_CS_QUEUE_DEFAULT); op->opcode = WINED3D_CS_OP_PRESENT; op->dst_window_override = dst_window_override; @@ -757,8 +761,17 @@ void wined3d_cs_emit_present(struct wined3d_cs *cs, struct wined3d_swapchain *sw * ahead of the worker thread. */ while (pending >= swapchain->max_frame_latency) { - YieldProcessor(); + InterlockedExchange(&cs->waiting_for_present, TRUE); + pending = InterlockedCompareExchange(&cs->pending_presents, 0, 0); + if (pending >= swapchain->max_frame_latency || !InterlockedCompareExchange(&cs->waiting_for_present, FALSE, TRUE)) + { + TRACE_(d3d_perf)("Reached latency limit (%u frames), blocking to wait.\n", swapchain->max_frame_latency); + wined3d_mutex_unlock(); + WaitForSingleObject(cs->present_event, INFINITE); + wined3d_mutex_lock(); + TRACE_(d3d_perf)("Woken up from the wait.\n"); + } } }
@@ -3472,11 +3485,18 @@ struct wined3d_cs *wined3d_cs_create(struct wined3d_device *device, heap_free(cs->data); goto fail; } + if (!(cs->present_event = CreateEventW(NULL, FALSE, FALSE, NULL))) + { + ERR("Failed to create command stream present event.\n"); + heap_free(cs->data); + goto fail; + }
if (!(GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, (const WCHAR *)wined3d_cs_run, &cs->wined3d_module))) { ERR("Failed to get wined3d module handle.\n"); + CloseHandle(cs->present_event); CloseHandle(cs->event); heap_free(cs->data); goto fail; @@ -3486,6 +3506,7 @@ struct wined3d_cs *wined3d_cs_create(struct wined3d_device *device, { ERR("Failed to create wined3d command stream thread.\n"); FreeLibrary(cs->wined3d_module); + CloseHandle(cs->present_event); CloseHandle(cs->event); heap_free(cs->data); goto fail; @@ -3508,6 +3529,8 @@ void wined3d_cs_destroy(struct wined3d_cs *cs) { wined3d_cs_emit_stop(cs); CloseHandle(cs->thread); + if (!CloseHandle(cs->present_event)) + ERR("Closing present event failed.\n"); if (!CloseHandle(cs->event)) ERR("Closing event failed.\n"); } diff --git a/dlls/wined3d/wined3d_private.h b/dlls/wined3d/wined3d_private.h index be576fcfbde..c71ac59270b 100644 --- a/dlls/wined3d/wined3d_private.h +++ b/dlls/wined3d/wined3d_private.h @@ -5131,8 +5131,9 @@ struct wined3d_cs struct list query_poll_list; BOOL queries_flushed;
- HANDLE event; + HANDLE event, present_event; LONG waiting_for_event; + LONG waiting_for_present; LONG pending_presents; };
From: Matteo Bruni mbruni@codeweavers.com
Yielding to other threads might help to get new packets and (very slightly) reduce contention on the queue head and tail pointers. --- dlls/wined3d/cs.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/dlls/wined3d/cs.c b/dlls/wined3d/cs.c index 14f2ca9bb4a..2cd53cd7a5b 100644 --- a/dlls/wined3d/cs.c +++ b/dlls/wined3d/cs.c @@ -3421,6 +3421,7 @@ static DWORD WINAPI wined3d_cs_run(void *ctx) queue = &cs->queue[WINED3D_CS_QUEUE_DEFAULT]; if (wined3d_cs_queue_is_empty(cs, queue)) { + YieldProcessor(); if (++spin_count >= WINED3D_CS_SPIN_COUNT && list_empty(&cs->query_poll_list)) wined3d_cs_wait_event(cs); continue;
From: Matteo Bruni mbruni@codeweavers.com
It can't block because there are pending queries. --- dlls/wined3d/cs.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/dlls/wined3d/cs.c b/dlls/wined3d/cs.c index 2cd53cd7a5b..04e37f5e712 100644 --- a/dlls/wined3d/cs.c +++ b/dlls/wined3d/cs.c @@ -3422,8 +3422,13 @@ static DWORD WINAPI wined3d_cs_run(void *ctx) if (wined3d_cs_queue_is_empty(cs, queue)) { YieldProcessor(); - if (++spin_count >= WINED3D_CS_SPIN_COUNT && list_empty(&cs->query_poll_list)) - wined3d_cs_wait_event(cs); + if (++spin_count >= WINED3D_CS_SPIN_COUNT) + { + if (list_empty(&cs->query_poll_list)) + wined3d_cs_wait_event(cs); + else + Sleep(0); + } continue; } }
From: Matteo Bruni mbruni@codeweavers.com
Over the last few Wine releases we greatly reduced the need for the application thread to wait for replies from the CS thread. Compared to the time when the command stream was initially introduced, it's now quite likely that, when the command queues become empty, they are going to stay like that for a while (e.g. the game is throttling the framerate or is busy doing some CPU work on its part before generating more commands).
As a first step, reduce the spin count to reduce the busy waiting in wined3d_cs_run(). In my tests this significantly reduces CPU usage, with minimally decreased performance in a few rare cases but also significantly improved performance in many cases (notably with integrated GPUs where freeing the CPU directly allows more power to be allocated to the GPU). --- dlls/wined3d/wined3d_private.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/dlls/wined3d/wined3d_private.h b/dlls/wined3d/wined3d_private.h index c71ac59270b..efdcff41e58 100644 --- a/dlls/wined3d/wined3d_private.h +++ b/dlls/wined3d/wined3d_private.h @@ -5079,7 +5079,7 @@ enum wined3d_push_constants #else #define WINED3D_CS_QUEUE_SIZE 0x400000u #endif -#define WINED3D_CS_SPIN_COUNT 10000000u +#define WINED3D_CS_SPIN_COUNT 2000u #define WINED3D_CS_QUEUE_MASK (WINED3D_CS_QUEUE_SIZE - 1)
C_ASSERT(!(WINED3D_CS_QUEUE_SIZE & (WINED3D_CS_QUEUE_SIZE - 1)));
Wrt patch 4: is "yield CPU" an accurate description of what Sleep(0) does anymore? Separately, is that what we want, instead of NtYieldExecution()?
Wrt patch 5: I intend to sign off on the patch anyway, but I'm curious what if anything prevents us from going further. The subject mentions "minimally decreased" performance—was there a measurable (however minimal) loss, and do we remember with which applications? I had an out-of-tree patch to use TID alerts, do you know if that helps enough to reduce the spin count further or eliminate it altogether? I believe you had another out-of-tree patch somewhere that did remove the spin count altogether, I'm curious if you remember more of the details around that.
Wrt patch 4: is "yield CPU" an accurate description of what Sleep(0) does anymore? Separately, is that what we want, instead of NtYieldExecution()?
Indeed the comment isn't really accurate anymore. Still we might want to Sleep(), which is more similar to what we end up doing in wined3d_cs_wait_event(). I'll do some testing.
Wrt patch 5: I intend to sign off on the patch anyway, but I'm curious what if anything prevents us from going further. The subject mentions "minimally decreased" performance—was there a measurable (however minimal) loss, and do we remember with which applications?
That was a long time ago and I'm not entirely sure anymore. It was probably more to do with the (old) CPU of the box that I used for testing at the time not having any kind of boosting of individual cores based on total CPU load than any specific game. FWIW the game I was testing with this back then was probably World of Tanks, whatever version was current at the time, with whatever hacked wined3d version I had...
Basically, as I remember it, sometimes waiting in some form in the CS thread meant that we serviced the following commands a bit later and thus FPS went down (a couple of % points or so at most maybe? I can't say for sure). Nowadays it's very likely that we more than make up for that by freeing CPU time to be used by some application thread or just so that CPU cores already used by the application get more power headroom and can execute faster.
I could give some fresh testing to this as well.
I had an out-of-tree patch to use TID alerts, do you know if that helps enough to reduce the spin count further or eliminate it altogether? I believe you had another out-of-tree patch somewhere that did remove the spin count altogether, I'm curious if you remember more of the details around that.
Yeah, that would be in part 2 or 3 of the series :smile: We can probably just go ahead and get rid of the spinning entirely but I preferred to put the least controversial stuff at the front and get rid of most of the CPU cycles waste by the end of the first part. As I remember it, I didn't find downsides to just never spinning at the time, but then I recently found some other place in wined3d where maybe spinning a bit before waiting is a good idea. I'll reevaluate all that after this MR is done with.
Wrt patch 4: is "yield CPU" an accurate description of what Sleep(0) does anymore? Separately, is that what we want, instead of NtYieldExecution()?
Indeed the comment isn't really accurate anymore. Still we might want to Sleep(), which is more similar to what we end up doing in wined3d_cs_wait_event(). I'll do some testing.
Except that Sleep(0) and NtYieldExecution() basically do the same thing right now (I had forgot how the whole NtDelayExecution() / NtYieldExecution() thing went). NtDelayExecution(TRUE, ...) / server_wait() or similar seems a bit heavy handed here and I'm not sure we have any interesting alternative.
Yeah, that would be in part 2 or 3 of the series :smile: We can probably just go ahead and get rid of the spinning entirely but I preferred to put the least controversial stuff at the front and get rid of most of the CPU cycles waste by the end of the first part. As I remember it, I didn't find downsides to just never spinning at the time, but then I recently found some other place in wined3d where maybe spinning a bit before waiting is a good idea. I'll reevaluate all that after this MR is done with.
Agreed, I was definitely going to sign off on the commit anyway, I just wanted to know what the way forward from there looked like.
Except that Sleep(0) and NtYieldExecution() basically do the same thing right now (I had forgot how the whole NtDelayExecution() / NtYieldExecution() thing went). NtDelayExecution(TRUE, ...) / server_wait() or similar seems a bit heavy handed here and I'm not sure we have any interesting alternative.
Yeah, so right now the choice is I think between sched_yield() and select(0). I don't have an understanding of the problem so I don't know which one we want. I'll leave it to your testing, assuming you can come up with interesting results :-)
This merge request was approved by Zebediah Figura.
Yeah, so right now the choice is I think between sched_yield() and select(0). I don't have an understanding of the problem so I don't know which one we want. I'll leave it to your testing, assuming you can come up with interesting results :-)
I don't know that there is any understanding to be had aside from empirically trying all the options and observing if there is any significant difference :smile:
This merge request was approved by Jan Sikorski.