- Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=45893 - Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=47767 - Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=44546
From: Anton Baskanov baskanov@gmail.com
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=45893 Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=47767 Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=44546 --- dlls/wined3d/cs.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/dlls/wined3d/cs.c b/dlls/wined3d/cs.c index 8ee7e7fccc9..b2f682e9837 100644 --- a/dlls/wined3d/cs.c +++ b/dlls/wined3d/cs.c @@ -3431,6 +3431,19 @@ struct wined3d_cs *wined3d_cs_create(struct wined3d_device *device, } }
+ if (wined3d_settings.cs_multithreaded & WINED3D_CSMT_ENABLE) + { + DWORD_PTR affinity_mask; + if (GetProcessAffinityMask(GetCurrentProcess(), &affinity_mask, NULL)) + { + if (!(affinity_mask & (affinity_mask - 1))) + { + WARN("Disabling CSMT, process is running on a single processor.\n"); + wined3d_settings.cs_multithreaded &= ~WINED3D_CSMT_ENABLE; + } + } + } + if (wined3d_settings.cs_multithreaded & WINED3D_CSMT_ENABLE && !RtlIsCriticalSectionLockedByThread(NtCurrentTeb()->Peb->LoaderLock)) {
This will unlikely do it for a generic case. Affinity can be change after CS thread is created (both ways). Besides, cs thread serves other purposes besides optimization (e. g., shader compilation used to break on nv drivers sometimes when non-default FPU setup is used). I guess a better thing to do would be maybe to get rid of spinlocking in cs thread which should also be helping in favour of SRW lock / condition variable?
On 5 Nov 2022, at 10:39, Anton Baskanov wine@gitlab.winehq.org wrote:
From: Anton Baskanov baskanov@gmail.com
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=45893 Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=47767 Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=44546
dlls/wined3d/cs.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/dlls/wined3d/cs.c b/dlls/wined3d/cs.c index 8ee7e7fccc9..b2f682e9837 100644 --- a/dlls/wined3d/cs.c +++ b/dlls/wined3d/cs.c @@ -3431,6 +3431,19 @@ struct wined3d_cs *wined3d_cs_create(struct wined3d_device *device, } }
- if (wined3d_settings.cs_multithreaded & WINED3D_CSMT_ENABLE)
- {
DWORD_PTR affinity_mask;
if (GetProcessAffinityMask(GetCurrentProcess(), &affinity_mask, NULL))
{
if (!(affinity_mask & (affinity_mask - 1)))
{
WARN("Disabling CSMT, process is running on a single processor.\n");
wined3d_settings.cs_multithreaded &= ~WINED3D_CSMT_ENABLE;
}
}
- }
- if (wined3d_settings.cs_multithreaded & WINED3D_CSMT_ENABLE && !RtlIsCriticalSectionLockedByThread(NtCurrentTeb()->Peb->LoaderLock)) {
-- GitLab
On Sat Nov 5 17:01:44 2022 +0000, **** wrote:
Paul Gofman replied on the mailing list:
This will unlikely do it for a generic case. Affinity can be change after CS thread is created (both ways). Besides, cs thread serves other purposes besides optimization (e. g., shader compilation used to break on nv drivers sometimes when non-default FPU setup is used). I guess a better thing to do would be maybe to get rid of spinlocking in cs thread which should also be helping in favour of SRW lock / condition variable?
I don't think we can just replace the spin locks with condition variable as it would add a significant overhead to the map/unmap calls.
I tried replacing YieldProcessor() calls with SwitchToThread() and removing spinning in wined3d_cs_run() and although it improved performance for the 3 games from the bug reports, 2 of them (Midtown Madness Trial and Settlers II: 10th Anniversary Demo) were still slower compared to either disabling CSMT or disabling SetProcessAffinityMask(). Midtown Madness issues draw calls with user memory data, and Settlers II maps, writes and unmaps vertex and index buffers before each draw call.
I can think of the following options: - Rewrite buffer map/unmap implementation so that it does not wait for the CS thread to finish (at least in some cases). I don't know how hard this would be, but it should help many other games as well. - Ignore the CS thread in SetProcessAffinityMask() (probably too hacky).
On 6 Nov 2022, at 03:15, Anton Baskanov (@baskanov) wine@gitlab.winehq.org wrote:
On Sat Nov 5 17:01:44 2022 +0000, **** wrote:
Paul Gofman replied on the mailing list:
This will unlikely do it for a generic case. Affinity can be change after CS thread is created (both ways). Besides, cs thread serves other purposes besides optimization (e. g., shader compilation used to break on nv drivers sometimes when non-default FPU setup is used). I guess a better thing to do would be maybe to get rid of spinlocking in cs thread which should also be helping in favour of SRW lock / condition variable?
I don't think we can just replace the spin locks with condition variable as it would add a significant overhead to the map/unmap calls.
No, it won’t add significant overhead. On Linux I don’t see how it can end up in any overhead, as SRW/condition variables end up in futex which spins itself when thinks it makes sense (on kernel side). Unfortunately on other platforms cond vars are less efficient, yet I personally guess that removing the spinlock would do more good than harm generally.
I tried replacing YieldProcessor() calls with SwitchToThread()
You need a normal sync object there to try the effect, not tweaking the spin. Adding a hack to avoid cs thread is a dead end, I am afraid.
On Sun Nov 6 15:22:54 2022 +0000, **** wrote:
Paul Gofman replied on the mailing list:
> On 6 Nov 2022, at 03:15, Anton Baskanov (@baskanov) <wine@gitlab.winehq.org> wrote: > > On Sat Nov 5 17:01:44 2022 +0000, **** wrote: >> Paul Gofman replied on the mailing list: >> \`\`\` >> This will unlikely do it for a generic case. Affinity can be change >> after CS thread is created (both ways). Besides, cs thread serves other >> purposes besides optimization (e. g., shader compilation used to break >> on nv drivers sometimes when non-default FPU setup is used). I guess a >> better thing to do would be maybe to get rid of spinlocking in cs thread >> which should also be helping in favour of SRW lock / condition variable? >> \`\`\` > I don't think we can just replace the spin locks with condition variable as it would add a significant overhead to the map/unmap calls. > No, it won’t add significant overhead. On Linux I don’t see how it can end up in any overhead, as SRW/condition variables end up in futex which spins itself when thinks it makes sense (on kernel side). Unfortunately on other platforms cond vars are less efficient, yet I personally guess that removing the spinlock would do more good than harm generally. > I tried replacing YieldProcessor() calls with SwitchToThread() You need a normal sync object there to try the effect, not tweaking the spin. Adding a hack to avoid cs thread is a dead end, I am afraid.
I don't know whether this patch makes sense or not, but we should probably at least try getting rid of the spinning first. I know Matteo was working on this; @Mystral, how's the progress on that work going?
On Mon Nov 7 19:04:08 2022 +0000, Zebediah Figura wrote:
I don't know whether this patch makes sense or not, but we should probably at least try getting rid of the spinning first. I know Matteo was working on this; @Mystral, how's the progress on that work going?
I think I'm almost there, actually. I should be able to send a few MRs in the short term, barring new distractions :D
I haven't tested the games from the bugs referenced in this MR with my patches, or the patches on a single-core machine recently, but I expect them to help enough that there would be no need to explicitly disable CSMT on single-core machines. That said, I do happen to have a patch doing basically the same as this MR anyway (except I'm looking at dwNumberOfProcessors from GetSystemInfo()). The idea was that there isn't really anything to be gained by having a separate CS thread on a single-core machine, while the additional overhead would probably be non-negligible. FWIW, Windows happens (or happened, many years ago when I last looked into it) to do the same, not spawning a separate thread when e.g. forcing a single core. Admittedly, I didn't think about the Nvidia FPU control word issue, although that's arguably a driver bug and CSMT is an accidental workaround.
BTW,
Rewrite buffer map/unmap implementation so that it does not wait for the CS thread to finish (at least in some cases). I don't know how hard this would be, but it should help many other games as well.
That's largely already the case, which is also why it's now practical to just block instead of spinning when e.g. there are no CS commands in the queues. If there are still some cases where we wait when we could avoid to do so, we should "fix" those.
I generally agree with Paul here. It's perhaps also worth pointing out that one of the main points of the csmt mechanism in the first place was serialisation of D3D commands. One could perhaps argue that that's unlikely to be an issue for applications limited to running on a single core, but it doesn't seem like a very robust argument.
Something worth exploring might be to run the CS thread as a Unix thread; I think those may not be affected by the affinity mask set for Windows processes. That's not quite trivial, of course.
Separate from the merits of the general approach, I don't think we should override a csmt setting explicitly set by the user. If we're going to do this kind of thing at all, it would need to be before applying settings from the registry and/or environment.
If there are still some cases where we wait when we could avoid to do so, we should "fix" those.
Well, looks like index buffers are still mapped through CS. Can we use the chunk allocator for them?
If there are still some cases where we wait when we could avoid to do so, we should "fix" those.
Well, looks like index buffers are still mapped through CS. Can we use the chunk allocator for them?
Only when using the GL backend (there is no such limitation with the Vulkan backend). We can't trivially enable suballocation for index buffers on GL because there is no GL API that allows to offset the index buffer, specifically WRT indirect draws.
Do we still need something like this? CPU usage is much lower, and all of the linked bugs are marked fixed.
Note that it's not quite as simple as disabling CSMT either, because the CSMT thread has functional benefits as well (e.g. marshalling occlusion queries onto a single thread). If we really want something like this we'd have to find a different way to accomplish it.
This merge request was closed by Anton Baskanov.