https://bugs.winehq.org/show_bug.cgi?id=50292
Bug ID: 50292
Summary: Process-local synchronization objects use private interfaces into the Unix library
Product: Wine
Version: 6.0-rc1
Hardware: x86-64
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: ntdll
Assignee: wine-bugs@winehq.org
Reporter: z.figura12@gmail.com
Distribution: ---
This is not exactly a bug, but it is something we'd like to change about the ntdll code, and we need somewhere to track the staged patch set (https://github.com/wine-staging/wine-staging/tree/master/patches/ntdll-NtAlertThreadByThreadId). The basic "problem" is that the process-local synchronization APIs (condition variables, SRW locks, critical sections, Win32 futexes) use private unixlib vectors (fast_RtlpWaitForCriticalSection() et al.) instead of syscalls.
Some testing suggests that Win32 futexes and condition variables are implemented on top of an internal, undocumented interface comprising the functions NtAlertThreadByThreadId() and NtWaitForAlertByThreadId(), which are themselves presumably syscalls. In particular, a thread waiting inside RtlSleepConditionVariable*() or RtlWaitOnAddress() can be awoken with NtAlertThreadByThreadId(). Because this interface is meant specifically for process-local synchronization (in particular, it is not possible to alert another process's thread), it is a good fit for Wine, both for those objects and for SRW locks and critical sections. Accordingly, I have created a patch set which implements these Nt* functions and reimplements the Rtl* APIs on top of them.
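For illustration, here is a minimal sketch (not taken from the patch set) of the two calls used as a per-thread park/unpark primitive. The prototypes are not in the public SDK headers, so they are declared by hand here and should be treated as assumptions, as should linking against ntdll and the address argument being purely informational:

    #include <windows.h>
    #include <winternl.h>

    /* Assumed prototypes; these are not declared in the public SDK headers. */
    NTSTATUS NTAPI NtWaitForAlertByThreadId( const void *address, const LARGE_INTEGER *timeout );
    NTSTATUS NTAPI NtAlertThreadByThreadId( HANDLE thread_id );

    static DWORD waiter_tid;

    static DWORD WINAPI waiter( void *arg )
    {
        (void)arg;
        /* Park this thread until some other thread alerts it by thread id.
         * The address argument appears to be informational only. */
        NtWaitForAlertByThreadId( &waiter_tid, NULL );
        return 0;
    }

    int main( void )
    {
        HANDLE thread = CreateThread( NULL, 0, waiter, NULL, 0, &waiter_tid );
        Sleep( 100 );                                            /* give the waiter time to park */
        NtAlertThreadByThreadId( ULongToHandle( waiter_tid ) );  /* unpark it */
        WaitForSingleObject( thread, INFINITE );
        CloseHandle( thread );
        return 0;
    }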
There is a problem with these patches, however. Some performance testing done by Etienne Juvigny reveals that Star Citizen, a game which makes heavy use of Win32 futexes, suffers a drop in performance from these patches. On one machine, configured to be maximally CPU-limited, performance drops from 92 to 80 FPS. There are two potential reasons for this:
(1) Locking of the TEB lists is slow, both on the PE side and the Unix side. This is especially likely given that performance suffered massively before the locks were converted to read-write locks. However, removing these locks entirely is difficult if not impossible.
(2) RtlWakeAddressAll() performs multiple syscalls (one per thread waiting on the given address). In Star Citizen, this can be up to six threads in normal usage patterns. This seems to be true on Windows as well: NtAlertThreadByThreadId() only seems capable of waking one thread at a time, and I can find no similar API that wakes more than one thread.
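To spell out (2): the wake-all path boils down to one NtAlertThreadByThreadId() call, and therefore one syscall, per waiter on the address. Roughly, and reusing the assumed declarations from the sketch above (this is an illustrative shape only; the real waiter-list management in the patches is different):

    struct addr_waiter
    {
        struct addr_waiter *next;
        const void *addr;    /* address the thread is waiting on */
        DWORD tid;           /* thread to alert */
    };

    static void wake_address_all( struct addr_waiter *waiters, const void *addr )
    {
        struct addr_waiter *w;
        for (w = waiters; w; w = w->next)
        {
            if (w->addr != addr) continue;
            /* one NtAlertThreadByThreadId() call, i.e. one syscall, per waiter */
            NtAlertThreadByThreadId( ULongToHandle( w->tid ) );
        }
    }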
Zebediah Figura z.figura12@gmail.com changed:
What            |Removed |Added
Staged patchset |        |https://github.com/wine-staging/wine-staging/tree/master/patches/ntdll-NtAlertThreadByThreadId
Status          |NEW     |STAGED
--- Comment #1 from Rémi Bernon rbernon@codeweavers.com --- Hi, I had a quick look at the patch series and here are a few nitpicks that I've found:
In 0004-ntdll-Implement-NtAlertThreadByThreadId-and-NtWaitFo.patch:
    if (teb->ClientId.UniqueThread == tid)
    {
        pthread_rwlock_unlock( &teb_list_lock );
        NtSetEvent( thread_data->tid_alert_event, NULL );
        return STATUS_SUCCESS;
    }
I think there's a race condition here, where the thread could potentially be interrupted after the TEB lock is released, but before the event is set.
The thread that thread_data refers to may then terminate, so the NtSetEvent call may set a non-existent event, or worse, if the TEB is reused and the new thread is itself waiting, wake the wrong thread.
It's probably unlikely to happen but from a correctness point of view I think it's wrong.
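One minimal way to close that window (a sketch, not necessarily what the staged patches ended up doing) is to signal the event while teb_list_lock is still held, so the target thread cannot be removed from the list and have its TEB reused in the meantime:

    if (teb->ClientId.UniqueThread == tid)
    {
        /* signal while still holding teb_list_lock: the target thread cannot
         * be removed from the list (and its TEB reused) until we release it */
        NtSetEvent( thread_data->tid_alert_event, NULL );
        pthread_rwlock_unlock( &teb_list_lock );
        return STATUS_SUCCESS;
    }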
--- Comment #2 from Rémi Bernon rbernon@codeweavers.com ---
In 0006-ntdll-Implement-thread-id-alerts-on-top-of-futexes-i.patch
    while (!InterlockedExchange( futex, 0 ))
    {
        if (timeout)
        {
            LONGLONG timeleft = update_timeout( end );
            struct timespec timespec;
            timespec.tv_sec = timeleft / (ULONGLONG)TICKSPERSEC;
            timespec.tv_nsec = (timeleft % TICKSPERSEC) * 100;
            ret = futex_wait( futex, 0, &timespec );
        }
        else
            ret = futex_wait( futex, 0, NULL );
        if (ret == -1 && errno == ETIMEDOUT) return STATUS_TIMEOUT;
    }
If FUTEX_WAIT_BITSET is available, I think you can use it to wait with an absolute timeout, which could save calls to NtQuerySystemTime.
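For reference, a rough sketch of such a wait, assuming the futex syscall is issued directly; FUTEX_WAIT_BITSET takes an absolute timeout, measured against CLOCK_MONOTONIC unless FUTEX_CLOCK_REALTIME is also passed:

    #include <errno.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    /* Wait until *futex != expected or until the absolute CLOCK_MONOTONIC
     * deadline passes; returns 0 on wake, or -1 with errno set (ETIMEDOUT,
     * EAGAIN if *futex already changed, EINTR on signal). */
    static int futex_wait_abs( int *futex, int expected, const struct timespec *deadline )
    {
        return syscall( SYS_futex, futex, FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG,
                        expected, deadline, NULL, FUTEX_BITSET_MATCH_ANY );
    }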
--- Comment #3 from Rémi Bernon rbernon@codeweavers.com --- (In reply to Rémi Bernon from comment #2)
> If FUTEX_WAIT_BITSET is available, I think you can use it to wait with an absolute timeout, which could save calls to NtQuerySystemTime.
Or maybe it's on purpose because of CLOCK_MONOTONIC NTP adjustments? In which case please just ignore my comment.
--- Comment #4 from Zebediah Figura z.figura12@gmail.com --- (In reply to Rémi Bernon from comment #1)
> Hi, I had a quick look at the patch series and here are a few nitpicks that I've found:
> 
> In 0004-ntdll-Implement-NtAlertThreadByThreadId-and-NtWaitFo.patch:
> 
>     if (teb->ClientId.UniqueThread == tid)
>     {
>         pthread_rwlock_unlock( &teb_list_lock );
>         NtSetEvent( thread_data->tid_alert_event, NULL );
>         return STATUS_SUCCESS;
>     }
> 
> I think there's a race condition here, where the thread could potentially be interrupted after the TEB lock is released, but before the event is set.
> The thread that thread_data refers to may then terminate, so the NtSetEvent call may set a non-existent event, or worse, if the TEB is reused and the new thread is itself waiting, wake the wrong thread.
> It's probably unlikely to happen but from a correctness point of view I think it's wrong.
I think you're right, this is a race and there's not much of a penalty to fixing it.
(In reply to Rémi Bernon from comment #3)
> (In reply to Rémi Bernon from comment #2)
> > If FUTEX_WAIT_BITSET is available, I think you can use it to wait with an absolute timeout, which could save calls to NtQuerySystemTime.
> Or maybe it's on purpose because of CLOCK_MONOTONIC NTP adjustments? In which case please just ignore my comment.
Probably, yes, we'd need a RAW flag. Note though that the timeout is always relative when called through kernel32, so it doesn't help us much anyway.
--- Comment #5 from Rémi Bernon rbernon@codeweavers.com --- (In reply to Zebediah Figura from comment #4)
> (In reply to Rémi Bernon from comment #3)
> > (In reply to Rémi Bernon from comment #2)
> > > If FUTEX_WAIT_BITSET is available, I think you can use it to wait with an absolute timeout, which could save calls to NtQuerySystemTime.
> > Or maybe it's on purpose because of CLOCK_MONOTONIC NTP adjustments? In which case please just ignore my comment.
> Probably, yes, we'd need a RAW flag.
Actually, calling NtQuerySystemTime also suffers from the same issue, as from what I can see, it uses CLOCK_MONOTONIC.
> Note though that the timeout is always relative when called through kernel32, so it doesn't help us much anyway.
Yes sure, and we always need to convert it to absolute first, but then we could call futex_wait with the absolute timeout directly and drop the update_timeout thing, making the code simpler.
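For illustration, the relative-to-absolute conversion could look roughly like this (a sketch; the helper name is made up, TICKSPERSEC is the usual 100ns-ticks-per-second constant, and a negative NT timeout is taken to mean "relative"):

    #include <time.h>

    /* NT timeouts are in 100ns ticks; negative values mean "relative". */
    #define TICKSPERSEC 10000000

    /* Turn a relative NT timeout into an absolute CLOCK_MONOTONIC deadline
     * that can be passed straight to a FUTEX_WAIT_BITSET wait. */
    static void relative_timeout_to_deadline( long long rel_ticks, struct timespec *deadline )
    {
        unsigned long long ticks = -rel_ticks;   /* rel_ticks is negative */

        clock_gettime( CLOCK_MONOTONIC, deadline );
        deadline->tv_sec  += ticks / TICKSPERSEC;
        deadline->tv_nsec += (ticks % TICKSPERSEC) * 100;
        if (deadline->tv_nsec >= 1000000000)
        {
            deadline->tv_nsec -= 1000000000;
            deadline->tv_sec++;
        }
    }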
--- Comment #6 from Paul Gofman pgofman@codeweavers.com --- (In reply to Zebediah Figura from comment #4)
> Probably, yes, we'd need a RAW flag. Note though that the timeout is always relative when called through kernel32, so it doesn't help us much anyway.
Please note that using CLOCK_MONOTONIC_RAW instead of CLOCK_MONOTONIC may end up as a syscall instead of a vDSO call, so it will have a performance impact.
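That is easy to check with a rough micro-benchmark (illustrative only; on kernels whose vDSO does handle CLOCK_MONOTONIC_RAW the difference largely disappears):

    #include <stdio.h>
    #include <time.h>

    /* Time a burst of clock_gettime() calls for a given clock.  On kernels
     * whose vDSO does not handle CLOCK_MONOTONIC_RAW, the RAW variant is
     * noticeably slower because each call is a real syscall. */
    static double time_clock( clockid_t id, int iters )
    {
        struct timespec t, start, end;
        clock_gettime( CLOCK_MONOTONIC, &start );
        for (int i = 0; i < iters; i++) clock_gettime( id, &t );
        clock_gettime( CLOCK_MONOTONIC, &end );
        return (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    }

    int main( void )
    {
        const int iters = 1000000;
        printf( "CLOCK_MONOTONIC:     %.0f ns total\n", time_clock( CLOCK_MONOTONIC, iters ) );
        printf( "CLOCK_MONOTONIC_RAW: %.0f ns total\n", time_clock( CLOCK_MONOTONIC_RAW, iters ) );
        return 0;
    }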
--- Comment #7 from Rémi Bernon rbernon@codeweavers.com --- Regarding the performance issue, I think the linked-list nature of the TEB list could also be part of issue, especially when a large number of threads are created. On synthetic benchmarks, it's very clear that it induces a lot of CPU cache misses.
IIUC it's not really possible to avoid a global lock here. As NtAlertThreadByThreadId function is supposed to return a specific status if the targeted TID does not exist, I guess there's no other way than tracking all threads.
However, it should be possible to make it faster by having a static TID + sync object array instead, to help the CPU speculative execution do its magic. Of course it will then have to support a dynamic number of threads, but perhaps a reasonably large fixed maximum number may be acceptable and could still perform better than the linked list?
--- Comment #8 from Zebediah Figura z.figura12@gmail.com --- (In reply to Rémi Bernon from comment #7)
> Regarding the performance issue, I think the linked-list nature of the TEB list could also be part of the issue, especially when a large number of threads are created. On synthetic benchmarks, it's very clear that it induces a lot of CPU cache misses.
> IIUC it's not really possible to avoid a global lock here. As the NtAlertThreadByThreadId function is supposed to return a specific status if the targeted TID does not exist, I guess there's no other way than tracking all threads.
> However, it should be possible to make it faster by having a static TID + sync object array instead, to help the CPU's speculative execution do its magic. Of course it will then have to support a dynamic number of threads, but perhaps a reasonably large fixed maximum number may be acceptable and could still perform better than the linked list?
It should be possible to do it lock-free with the same mechanism as we use on the PE side, i.e. a growable linked list of arrays. I hadn't thought of that when I first wrote the patch. It was my vague plan to try that optimization and see if it helps.
--- Comment #9 from Jacek Caban jacek@codeweavers.com --- I took a look at the series and have some suggestions. It would be good to make NtAlertThreadByThreadId lock-free, but I'd suggest trying to do it in a spirit similar to the fd cache. Looking at the pid/tid allocation strategy in the server, we'd rarely need more than one cache array allocation for the entire process.
It should be possible to do the tid->TEB mapping this way. Another possible way would be to simply map to a thread-specific int value. That value could then be used directly as futex or HANDLE storage (with maybe a special value to distinguish between a non-waiting and a non-existent tid, if needed).
For the PE side of things, I wonder if you really need such complicated wait entry allocation. Could we just store the wait entry on the waiting thread's stack?
--- Comment #10 from Zebediah Figura z.figura12@gmail.com --- (In reply to Jacek Caban from comment #9)
> I took a look at the series and have some suggestions. It would be good to make NtAlertThreadByThreadId lock-free, but I'd suggest trying to do it in a spirit similar to the fd cache. Looking at the pid/tid allocation strategy in the server, we'd rarely need more than one cache array allocation for the entire process.
The only problem with that is that if we run out of space in the fd cache, the only effect is that we have to go to the server each time. If we run out of space in the TID/addr or TID/TEB mappings, the functions can't work. Maybe that's not worth worrying about, though...
> It should be possible to do the tid->TEB mapping this way. Another possible way would be to simply map to a thread-specific int value. That value could then be used directly as futex or HANDLE storage (with maybe a special value to distinguish between a non-waiting and a non-existent tid, if needed).
You mean, map to the value directly rather than mapping to the TEB and looking up the value from there? Sure, and if we make it lock-free that definitely seems like a good idea.
> For the PE side of things, I wonder if you really need such complicated wait entry allocation. Could we just store the wait entry on the waiting thread's stack?
We need some way to append it to, and arbitrarily remove it from, a list of waiting threads. It would have been better than the original approach (i.e. iterating over TEBs), but it would still require locking (AFAIK only singly linked lists can be lock-free, and you can't pop from the middle of one).
--- Comment #11 from Jacek Caban jacek@codeweavers.com --- (In reply to Zebediah Figura from comment #10)
> The only problem with that is that if we run out of space in the fd cache, the only effect is that we have to go to the server each time. If we run out of space in the TID/addr or TID/TEB mappings, the functions can't work. Maybe that's not worth worrying about, though...
We can ensure that the cache slot is allocated somewhere in NtCreateThreadEx, where we can properly propagate the error.
> You mean, map to the value directly rather than mapping to the TEB and looking up the value from there?
Yes.
> > For the PE side of things, I wonder if you really need such complicated wait entry allocation. Could we just store the wait entry on the waiting thread's stack?
> We need some way to append it to, and arbitrarily remove it from, a list of waiting threads. It would have been better than the original approach (i.e. iterating over TEBs), but it would still require locking (AFAIK only singly linked lists can be lock-free, and you can't pop from the middle of one).
Reading it again in more detail, it's probably better to stick with your current solution for the PE part.
FWIW, we wouldn't need to pop from the middle if we had a separate list for each address. The problem would be the lock-free mapping from the input address to such a singly linked list. This could be done, but the 64-bit key size makes it much less appealing (than, say, the 30-bit key for the fd cache).
--- Comment #12 from Zebediah Figura z.figura12@gmail.com --- (In reply to Jacek Caban from comment #11)
> We can ensure that the cache slot is allocated somewhere in NtCreateThreadEx, where we can properly propagate the error.
I guess putting a hard limit on the number of threads works too...
> FWIW, we wouldn't need to pop from the middle if we had a separate list for each address. The problem would be the lock-free mapping from the input address to such a singly linked list. This could be done, but the 64-bit key size makes it much less appealing (than, say, the 30-bit key for the fd cache).
Applying a hash map might help in general. I'm not sure how even separate lists for each address help us, though: the waits have arbitrary timeouts, so it'd still be possible to pop from the middle.
--- Comment #13 from Jacek Caban jacek@codeweavers.com --- (In reply to Zebediah Figura from comment #12)
> (In reply to Jacek Caban from comment #11)
> > We can ensure that the cache slot is allocated somewhere in NtCreateThreadEx, where we can properly propagate the error.
> I guess putting a hard limit on the number of threads works too...
I'm not sure how that would help; I guess we're thinking about different implementation details. There is a global limit on the number of threads implied by the fact that thread ids are 30 bits (a DWORD value with the two least significant bits ignored). We could have a tid_cache similar to the fd_cache (a rough sketch follows the list):
- in NtCreateThreadEx: ensure that tid_cache[tid >> tid_cache_shift] is allocated and mark tid_cache[tid >> tid_cache_shift][tid & tid_cache_mask] as valid
- in NtWaitForAlertByThreadId and NtAlertThreadByThreadId: use only the tid_cache for thread-specific data
- when terminating a thread: mark the appropriate tid_cache entry as invalid
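A rough sketch of that tid_cache idea, assuming a slot only needs to hold a 32-bit value (e.g. a futex) and that the usual interlocked helpers and NTSTATUS codes are available; the block-size constants are arbitrary and the names are not from actual Wine code:

    #include <stdlib.h>

    #define TID_CACHE_BLOCK_SHIFT 14                       /* 16384 slots per block */
    #define TID_CACHE_BLOCK_SIZE  (1u << TID_CACHE_BLOCK_SHIFT)
    #define TID_CACHE_BLOCK_MASK  (TID_CACHE_BLOCK_SIZE - 1)
    #define TID_CACHE_BLOCKS      (1u << (30 - TID_CACHE_BLOCK_SHIFT))

    /* one slot per possible tid; blocks are never freed, so readers need no lock */
    static int *tid_cache[TID_CACHE_BLOCKS];

    /* tids are 30 bits once the two always-zero low bits are dropped;
     * only valid after alloc_tid_slot() was called for this tid */
    static int *get_tid_slot( unsigned int tid )
    {
        unsigned int idx = tid >> 2;
        return &tid_cache[idx >> TID_CACHE_BLOCK_SHIFT][idx & TID_CACHE_BLOCK_MASK];
    }

    /* called from thread creation, where an allocation failure can be reported */
    static NTSTATUS alloc_tid_slot( unsigned int tid )
    {
        unsigned int block = (tid >> 2) >> TID_CACHE_BLOCK_SHIFT;

        if (!tid_cache[block])
        {
            int *new_block = calloc( TID_CACHE_BLOCK_SIZE, sizeof(int) );
            if (!new_block) return STATUS_NO_MEMORY;
            /* install the block unless another thread beat us to it */
            if (InterlockedCompareExchangePointer( (void **)&tid_cache[block], new_block, NULL ))
                free( new_block );
        }
        return STATUS_SUCCESS;
    }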
> > FWIW, we wouldn't need to pop from the middle if we had a separate list for each address. The problem would be the lock-free mapping from the input address to such a singly linked list. This could be done, but the 64-bit key size makes it much less appealing (than, say, the 30-bit key for the fd cache).
> Applying a hash map might help in general. I'm not sure how even separate lists for each address help us, though: the waits have arbitrary timeouts, so it'd still be possible to pop from the middle.
Oh right, I forgot about timeouts.
--- Comment #14 from Zebediah Figura z.figura12@gmail.com --- (In reply to Jacek Caban from comment #13)
> I'm not sure how that would help; I guess we're thinking about different implementation details. There is a global limit on the number of threads implied by the fact that thread ids are 30 bits (a DWORD value with the two least significant bits ignored). We could have a tid_cache similar to the fd_cache:
> - in NtCreateThreadEx: ensure that tid_cache[tid >> tid_cache_shift] is allocated and mark tid_cache[tid >> tid_cache_shift][tid & tid_cache_mask] as valid
> - in NtWaitForAlertByThreadId and NtAlertThreadByThreadId: use only the tid_cache for thread-specific data
> - when terminating a thread: mark the appropriate tid_cache entry as invalid
Sure. On the other hand, even for a two-tiered array like that, 2**30 entries (times 8 or 16 bytes) is arguably a lot of storage to reserve. I don't really know what's reasonable, though.
--- Comment #15 from Zebediah Figura z.figura12@gmail.com --- Wait, maybe I'm stupid: were you proposing a tid -> alert primitive cache *in addition to* the TEB list that already exists? I was assuming it to be a replacement. (Honestly, if it's reasonable to replace it entirely, and just fail NtCreateThreadEx() if we run out of space, I'd rather do that, but...)
--- Comment #16 from Jacek Caban jacek@codeweavers.com --- I meant it as a replacement, not an addition. I may have been overly optimistic about just a 2-dimensional array; a 3-dimensional one could be less wasteful (although there may be a better structure). The point is to make a single tid slot cheap enough (4 bytes seems more than enough, but it could be even larger if needed) that we can over-allocate and never free. Over-allocation makes allocation of a single slot cheap (because it will usually already be allocated), and never freeing avoids any need for locking.
It may seem like a wasteful strategy, but note that the server gives us good locality of tid values and reuses previously freed ids. In practice, I'd expect a single block of a few thousand entries to be enough for the majority of applications. If we optimize for that (while still being able to handle even the theoretical case of 2**30 created threads, given enough allocations), I'd hope we can get decent results. I can't be sure without trying; maybe I'm missing something.
--- Comment #17 from Zebediah Figura z.figura12@gmail.com --- (In reply to Jacek Caban from comment #16)
> I meant it as a replacement, not an addition. I may have been overly optimistic about just a 2-dimensional array; a 3-dimensional one could be less wasteful (although there may be a better structure). The point is to make a single tid slot cheap enough (4 bytes seems more than enough, but it could be even larger if needed) that we can over-allocate and never free. Over-allocation makes allocation of a single slot cheap (because it will usually already be allocated), and never freeing avoids any need for locking.
> It may seem like a wasteful strategy, but note that the server gives us good locality of tid values and reuses previously freed ids. In practice, I'd expect a single block of a few thousand entries to be enough for the majority of applications. If we optimize for that (while still being able to handle even the theoretical case of 2**30 created threads, given enough allocations), I'd hope we can get decent results. I can't be sure without trying; maybe I'm missing something.
Okay, that's good, that sounds like the ideal strategy to me too. I just wasn't sure if there was a reasonable tradeoff between maximum number of threads possible + amount of memory unnecessarily reserved + lookup time. (I am less than completely enthused about the idea of a 3-dimensional list—I think we should probably be optimizing for lookup time above all else.)
I think we do need 8 bytes, at least for 64-bit, or perhaps just for 64-bit Mac. A futex is 4, but on Mac we use semaphore_t, which is a pointer (though I haven't looked to see if there's a better solution), and if neither is available we use an event (we could possibly use a keyed event, though we'd have to introduce some locking—but we're already on the slow path).
Thinking out loud here: a two-dimensional array of 65536 bytes * 4096 entries would give us 2**26 threads on 32-bit, 2**25 on 64-bit, while using 16+(4 or 8) pages in the common case. 32-bit has a much lower limit imposed by the stack size, so that's not a concern, but is that enough threads for 64-bit? Raising that to 2**30 for 64-bit would be a reservation of 65536 entries = 128 pages, which is kind of a lot, though in practice only the first is ever faulted in and we have VA space to spare.
Does that sound like a reasonable compromise? I think we could even shrink the 32-bit reservations to use less address space.
Zebediah Figura z.figura12@gmail.com changed:
What          |Removed |Added
Status        |STAGED  |RESOLVED
Fixed by SHA1 |        |f3e9498740d675b74d0f84b8120ef7e1e2122ec1
Resolution    |---     |FIXED
--- Comment #18 from Zebediah Figura z.figura12@gmail.com --- Fixed upstream as of https://source.winehq.org/git/wine.git/commitdiff/f3e9498740d675b74d0f84b8120ef7e1e2122ec1.
Alexandre Julliard julliard@winehq.org changed:
What   |Removed  |Added
Status |RESOLVED |CLOSED
--- Comment #19 from Alexandre Julliard julliard@winehq.org --- Closing bugs fixed in 6.22.