Non-Wine Dev here, but always interested in achieving top-notch performance in Wine for my own use cases.
I decided to pull some of this patchset (v5) into a test build of my own custom Wine, strictly incorporating only the ntdll changes (along with adjusting a lock in my wineserver to match; that lock doesn't exist in upstream Wine). Below are some quick tests I conducted using a Windows executable that performs rapid mutex tests, comparing different versions of Wine.
```
1. Wine-9.12:

[0] obtained the mutex 93975 times (23493 times per second)
[1] obtained the mutex 93979 times (23494 times per second)
[2] obtained the mutex 93988 times (23497 times per second)
[3] obtained the mutex 93999 times (23499 times per second)
[4] obtained the mutex 93977 times (23494 times per second)

2. Wine-Staging-9.12:

[0] obtained the mutex 96179 times (24044 times per second)
[1] obtained the mutex 96705 times (24176 times per second)
[2] obtained the mutex 96861 times (24215 times per second)
[3] obtained the mutex 97627 times (24406 times per second)
[4] obtained the mutex 95905 times (23976 times per second)

3. Wine-NSPA: Atomic Locks / FSYNC / RT

[0] obtained the mutex 663948 times (165987 times per second)
[1] obtained the mutex 646190 times (161547 times per second)
[2] obtained the mutex 615943 times (153985 times per second)
[3] obtained the mutex 657453 times (164363 times per second)
[4] obtained the mutex 697292 times (174323 times per second)

4. Wine-NSPA: Pi Mutexes / FSYNC / RT

[0] obtained the mutex 696384 times (174096 times per second)
[1] obtained the mutex 650608 times (162652 times per second)
[2] obtained the mutex 657671 times (164417 times per second)
[3] obtained the mutex 665090 times (166272 times per second)
[4] obtained the mutex 661732 times (165433 times per second)
```
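For context, a contention loop that produces output in this shape can be sketched roughly as follows. This is a native pthread analogue written purely for illustration, not the Windows executable used above, so it does not exercise Wine's handling of Windows mutex objects at all. The names `run_mutex_test`, `worker` and `NUM_THREADS` are made up for the sketch.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 5   /* illustrative: matches the five result lines above */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;
static unsigned long counts[NUM_THREADS];

/* Each worker repeatedly takes the shared mutex and bumps its own counter. */
static void *worker(void *arg)
{
    unsigned long *count = arg;

    while (!atomic_load(&stop)) {
        pthread_mutex_lock(&lock);
        (*count)++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Let the workers contend for `millis` milliseconds, print one line per
 * thread, and return the total number of acquisitions. */
unsigned long run_mutex_test(unsigned int millis)
{
    pthread_t threads[NUM_THREADS];
    struct timespec ts = { .tv_sec = millis / 1000,
                           .tv_nsec = (long)(millis % 1000) * 1000000L };
    unsigned long total = 0;
    int started = 0;

    if (millis == 0)
        return 0;   /* avoid dividing by zero in the rate below */

    atomic_store(&stop, false);
    for (int i = 0; i < NUM_THREADS; i++) {
        counts[i] = 0;
        if (pthread_create(&threads[i], NULL, worker, &counts[i]) != 0)
            break;
        started++;
    }

    nanosleep(&ts, NULL);
    atomic_store(&stop, true);

    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        printf("[%d] obtained the mutex %lu times (%lu times per second)\n",
               i, counts[i], counts[i] * 1000UL / millis);
        total += counts[i];
    }
    return total;
}
```

Calling `run_mutex_test(4000)` would run the loop for four seconds, which is consistent with the ratio between the two numbers on each result line above.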
Obviously, 1/2 don't do very well on these tests (for reasons mentioned below), but I thought I would include them anyway as a baseline of upstream.
3/4 are my own custom builds, which are the easiest for me to test against. These builds are based on Wine-8.19 with plenty of backports, performance enhancements, and RT-related features. In both cases:
-> FSYNC is enabled (futex_waitv, futex ops for Windows synchronization stuff).
-> Wineserver is multi-threaded (shmem per thread for server requests/replies).
-> RT Scheduling is enabled (Time Critical threads are SCHED_FIFO, the others are SCHED_RR. Wineserver is SCHED_FIFO).
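For anyone unfamiliar with the RT part, the scheduling policies in the list above are the standard Linux ones and can be applied per thread through pthreads. The snippet below is only a minimal sketch of that mechanism, not the code from my builds; `clamp_rt_priority` and `set_thread_rt` are hypothetical helper names, and the priority value in the usage note is arbitrary.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Clamp a requested priority into the range the kernel accepts for `policy`. */
int clamp_rt_priority(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (prio < lo) return lo;
    if (prio > hi) return hi;
    return prio;
}

/* Switch `thread` to SCHED_FIFO or SCHED_RR at the given priority.
 * Returns 0 on success or an errno value; EPERM is the usual result when
 * the process lacks CAP_SYS_NICE or a suitable RLIMIT_RTPRIO. */
int set_thread_rt(pthread_t thread, int policy, int prio)
{
    struct sched_param param = {
        .sched_priority = clamp_rt_priority(policy, prio)
    };

    return pthread_setschedparam(thread, policy, &param);
}
```

Something like `set_thread_rt(pthread_self(), SCHED_FIFO, 10)` would move the calling thread to SCHED_FIFO, provided the process is allowed to use realtime priorities.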
In 3: I am using your atomic locks (v5) within ntdll and wineserver. The rest of Wine is using pi_mutexes (replacing pthread_mutexes), except where FSYNC is used.
In 4: I am using purely pi_mutexes in Wine (e.g., FUTEX_LOCK_PI, FUTEX_CMP_REQUEUE_PI, etc.), except where FSYNC is used. It tended to have less variance between tests.
As you may be aware, Futex PI has more overhead and is slower compared to regular futex ops. So based on that alone, one would think that your code should be able to obtain more mutexes per second, but clearly that isn't the case here (they are very similar, despite pi_mutex likely being 3x slower than your mutexes/locks).
I have my doubts that a faster mutex implementation will have any real benefits in terms of real applications running in Wine (atm). I don't think it's *the* bottleneck or limiting factor here. I mean, isn't there some kind of global locking in the heap code? As #1/2 vs. #3/4 show, the single-threaded Wineserver is a bigger problem, along with the lack of RT scheduling, the lack of a better implementation for Windows synchronization primitives (be it Esync, Fsync, MSync, or Winesync/NTSync), and so on.
Anyway, all that aside:
-> You may want to look at where Wine uses pthread condvars, as I am not sure how well that plays with your atomic locks (pthread_cond_wait() expects a pthread_mutex_t). If I recall correctly, there is also a place or two where pthread rwlocks are used.
On a final positive note: I didn't notice any real regressions using your code, and I did test it with some pretty CPU-heavy applications with lots of threads. So that's good, but I also didn't see any indication of improvements over using pi_mutexes either.
fun experiment either way ;-)