https://bugs.winehq.org/show_bug.cgi?id=54979
Bug ID: 54979
Summary: Futexes which protect memory allocation congest, causing 100x performance degradation
Product: Wine
Version: 8.0.1
Hardware: x86-64
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: ntdll
Assignee: wine-bugs@winehq.org
Reporter: nagle@animats.com
Distribution: ---
Created attachment 74518 --> https://bugs.winehq.org/attachment.cgi?id=74518 Debugger output with multiple threads stuck in spinlocks.
SUMMARY
Under the wrong circumstances, a program doing many memory allocations from multiple threads can experience more than 100x performance degradation.
I'm seeing this in my own Rust program, which has 16 sometimes-compute-bound threads on a 12 CPU machine. This works fine on Microsoft Windows and when compiled for Linux. Under Wine, it can be over 100x slower.
Multiple threads are banging on a spinlock at
_InterlockedCompareExchange [Z:\usr\src\packages\BUILD\include\winnt.h:6630] in ntdll
spin_lock [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:937] in ntdll
This is the spin lock around the queue for pausing threads.
DISCUSSION
This looks like futex congestion. That's a known kind of problem, but doesn't seem to have been seen in Wine before. It may be that some programs just run really slow under Wine, and nobody knows why.
The code in NTDLL is clearly intended to handle the fast case without taking an OS-level lock. Under light load, acquiring the lock is O(1). But under heavy load, with multiple threads pounding on spinlocks, each waiter spins longer, and the number of waiters grows too, so total work becomes O(N^2). That means huge slowdowns.
See discussion at https://forum.winehq.org/viewtopic.php?t=37688 which has snippets of the code that's causing the problem. I can copy all that info over here if necessary. Let me know.
Zeb Figura z.figura12@gmail.com changed:
CC added: z.figura12@gmail.com
--- Comment #1 from Zeb Figura z.figura12@gmail.com --- I haven't been able to find a way to avoid the spinlock for RtlWaitOnAddress. That primitive needs to be fair *and* support timeouts *and* be as fast as possible, and I haven't found a faster approach than user-space spinlocks.
However, we can probably avoid RtlWaitOnAddress for critical sections.
--- Comment #2 from John Nagle nagle@animats.com --- I don't see an easy way to eliminate that innermost spinlock, either.
It's not a long lock. The slowest thing inside is searching a linear list of waiting threads, which ought to take maybe a microsecond. So it may be that the outer loops are asking for that lock too often.
There are three nested locks here, two futexes and a hard spinlock. The one that looks most suspicious is the middle one at wait_semaphore. That loop repeatedly calls RtlWaitOnAddress, which is probably why control is usually in the innermost spinlock on a debugger break. If wait_semaphore didn't do that, the problem might go away.
The outermost lock has
if (crit->LockCount > 0) break; /* more than one waiter, don't bother spinning */
so it's protected against this kind of problem. If that lock starts to build up a backlog, the thread blocks rather than spinning. But wait_semaphore doesn't have similar protection.
Thanks for looking at this.
--- Comment #3 from Zeb Figura z.figura12@gmail.com --- (In reply to Zeb Figura from comment #1)
> However, we can probably avoid RtlWaitOnAddress for critical sections.
This is harder than I thought. I thought it would be relatively easy to implement a lock-free wait queue on the stacks of the waiting threads, but after thinking it through I don't think it's actually possible at all.
In theory we could try supplying hints as to which thread to wake (say, with a bounded ring buffer). The problem with this is that we only have a pointer's worth of space to work with, and we can't allocate more because we underlie the heap implementation.
(In reply to John Nagle from comment #2)
> There are three nested locks here, two futexes and a hard spinlock. The one that looks most suspicious is the middle one at wait_semaphore. That loop repeatedly calls RtlWaitOnAddress, which is probably why control is usually in the innermost spinlock on a debugger break. If wait_semaphore didn't do that, the problem might go away.
I don't think this is suspicious. We use RtlWakeAddressSingle(), and win32 futexes are fair, so in theory a given waiting thread should only actually be woken once. At worst I think it can have one stale buffered wakeup from a previous alert.
mqudsi mqudsi@neosmart.net changed:
CC added: mqudsi@neosmart.net
--- Comment #4 from John Nagle nagle@animats.com --- Any progress on this?
What usually sets this off is growing an array. That calls realloc, which first attempts to grow the buffer in place. If that's not possible due to a lack of free space after the buffer, the realloc code allocates a new buffer, copies the data, and releases the old buffer. The bottleneck seems to be at the point where the old buffer is released. That's somewhat unexpected.
A test case might involve a large number of compute-bound threads, all growing arrays slowly.
--- Comment #5 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #4)
> Any progress on this?
No, I don't have any ideas for optimizing critical sections or Win32 futexes.
--- Comment #6 from John Nagle nagle@animats.com --- Hm. Real Windows does not seem to have this problem. What does Microsoft do?
--- Comment #7 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #6)
> Hm. Real Windows does not seem to have this problem. What does Microsoft do?
I don't know. Unfortunately I don't think it'll be easy to find out; note that because of copyright we can't go poking around in Windows internals.
--- Comment #8 from John Nagle nagle@animats.com --- Any progress on this?
You're not likely to get as good a bug report from anyone else. If this happens in some game, they won't have source code. I do, and was able to see the problem in GDB. There's probably some "My Game Hangs" bug report that's really this problem, but can't be diagnosed.
--- Comment #9 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #8)
> Any progress on this?
> You're not likely to get as good a bug report from anyone else. If this happens in some game, they won't have source code. I do, and was able to see the problem in GDB. There's probably some "My Game Hangs" bug report that's really this problem, but can't be diagnosed.
I understand, and I'm thankful to you for providing a clear bug report. The thing is, this bug is not going to be fixed until someone has an idea for how to write a fast user-space wait queue that doesn't allocate and whose "head" fits in a pointer.
--- Comment #10 from John Nagle nagle@animats.com --- Right.
You're three deep in futexes at that point, and I think they're interacting badly. I think it's the one in the middle that's part of the problem. But I'm not sure. GDB gives me a snapshot of the state but not the dynamics that got us there.
Maybe if the middle spinlock had a quick quit to a hard block...
--- Comment #11 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #10)
> Maybe if the middle spinlock had a quick quit to a hard block...
Sorry, what do you mean by "the middle spinlock"? I guess you probably mean futex_queue.lock, since it's the only spinlock.
We can't block there, because we have no fast way to wake up a waiter. [And frankly, even if we could block, I worry about the performance implications.]
--- Comment #12 from John Nagle nagle@animats.com --- There is an outer level of spinlock.
See summary here: https://forum.winehq.org/viewtopic.php?t=37688
As far as I can tell, this is not fine-grained locking. It's just one big lock for the whole heap. That should be a short spinlock, which, if it times out, causes a kernel-level lock and block.
The bottleneck is that when you realloc a buffer, the old buffer's contents may have to be copied to a new one, and the heap stays locked for the whole copy.
--- Comment #13 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #12)
> There is an outer level of spinlock.
> See summary here: https://forum.winehq.org/viewtopic.php?t=37688
Ah. That's not really a spinlock, though. The CS is a hybrid lock, but not a true spinlock, and we don't use the spin count for the heap anyway, so it's not even a hybrid lock.
> As far as I can tell, this is not fine-grained locking. It's just one big lock for the whole heap.
That's true, if making heap locking more fine-grained would be enough then that's an alternate solution. I don't really have the time/interest to implement that myself, though.
> That should be a short spinlock, which, if it times out, causes a kernel-level lock and block.
It is. In fact, it's not even the spinlock; we just block immediately. The problem is that in order to block we need to spin-lock the internal wait queue.
There are other ways to do a kernel wait, but they come with a significant performance cost.
--- Comment #14 from John Nagle nagle@animats.com --- Also reported as a Rend3 bug: https://github.com/BVE-Reborn/rend3/issues/570
--- Comment #15 from Zeb Figura z.figura12@gmail.com --- (In reply to John Nagle from comment #14)
> Also reported as a Rend3 bug: https://github.com/BVE-Reborn/rend3/issues/570
That seems to be a different bug?
--- Comment #16 from John Nagle nagle@animats.com --- Wrong bug. Sorry.