On Wed Jul 10 13:07:06 2024 +0000, Grigory Vasilyev wrote:
You are right, FUTEX_WAKE makes the mutex noticeably slower.
t1 is the custom mutex, t2 is the pthread mutex.
With FUTEX_WAKE:
Time elapsed: t1=89.413000ms, t2=5.800000ms
Without:
Time elapsed: t1=3.665000ms, t2=5.786000ms
simple benchmark:
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <omp.h>

#define WINE_MUTEX_TYPE _Atomic unsigned int
#define WINE_MUTEX_INIT ATOMIC_VAR_INIT(0)
#define WINE_MUTEX_LOCK(RESOURCE) do { \
    unsigned int expected = 0; \
    while(!atomic_compare_exchange_weak(RESOURCE, &expected, 1)) { \
        syscall(SYS_futex, RESOURCE, FUTEX_WAIT, 1, NULL, NULL, 0); \
    } \
} while(0)
#define WINE_MUTEX_UNLOCK(RESOURCE) do { \
    atomic_store(RESOURCE, 0); \
} while(0)

#define COUNT 1000000

void test(){
    size_t t1_sum, t1_part;
    size_t t2_sum, t2_part;
    clock_t t1_start, t1_stop;
    clock_t t2_start, t2_stop;
    double t1_elapsed, t2_elapsed;
    WINE_MUTEX_TYPE m1 = WINE_MUTEX_INIT;
    pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

    t1_start = clock();
    #pragma omp parallel private(t1_part) shared(t1_sum)
    {
        t1_sum = 0;
        t1_part = 0;
        #pragma omp for
        {
            for (size_t i = 0; i < COUNT; i++) {
                WINE_MUTEX_LOCK(&m1);
                t1_part = t1_part + i;
                WINE_MUTEX_UNLOCK(&m1);
            }
        }
        #pragma omp critical
        {
            t1_sum += t1_part;
        }
    }
    t1_stop = clock();

    t2_start = clock();
    #pragma omp parallel private(t2_part) shared(t2_sum)
    {
        t2_sum = 0;
        t2_part = 0;
        #pragma omp for
        {
            for (size_t i = 0; i < COUNT; i++) {
                pthread_mutex_lock(&m2);
                t2_part = t2_part + i;
                pthread_mutex_unlock(&m2);
            }
        }
        #pragma omp critical
        {
            t2_sum += t2_part;
        }
    }
    t2_stop = clock();

    printf("t1=%zu td2=%zu\n", t1_sum, t2_sum);
    t1_elapsed = (double)(t1_stop - t1_start) * 1000.0 / CLOCKS_PER_SEC;
    t2_elapsed = (double)(t2_stop - t2_start) * 1000.0 / CLOCKS_PER_SEC;
    printf("Time elapsed: t1=%fms, t2=%fms\n", t1_elapsed, t2_elapsed);
}

int main()
{
    test();
    return 0;
}
clang -O2 -std=gnu17 test_mutex.c -o test_mutex
OMP_NUM_THREADS=8; export OMP_NUM_THREADS
./test_mutex
While this implementation works, it is inefficient when releasing an uncontended lock: releasing the lock involves making a system call to futex_wake, even though there is no thread waiting for the lock.
~quote from your github link
If there are no waiters, then there's nobody to wake up, and the syscall does nothing - but merely entering a syscall takes a lot longer than an atomic operation or two.
There's a better implementation in that exact repo.
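For illustration, here is a minimal sketch of the usual way to avoid the wake syscall on an uncontended unlock: the three-state scheme described in Ulrich Drepper's "Futexes Are Tricky". It is not copied from the linked repo or from Wine, and every name in it (sketch_mutex, sketch_mutex_lock, sketch_mutex_unlock, sketch_mutex_demo) is hypothetical. The lock word records whether anyone might be sleeping, so unlock only enters the kernel when that is the case.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration */
#endif
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical type. States: 0 = unlocked, 1 = locked with no waiters,
 * 2 = locked and some thread may be sleeping on the futex. */
typedef struct { _Atomic unsigned int state; } sketch_mutex;

static void sketch_mutex_lock(sketch_mutex *m)
{
    unsigned int expected = 0;

    /* Fast path: uncontended acquire, a single atomic and no syscall. */
    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return;

    /* Slow path: mark the lock as contended, then sleep for as long as the
     * word still reads 2. A return value of 0 from the exchange means the
     * lock was free and is now ours (held in the conservative state 2). */
    while (atomic_exchange(&m->state, 2) != 0)
        syscall(SYS_futex, &m->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
}

static void sketch_mutex_unlock(sketch_mutex *m)
{
    /* Only pay for FUTEX_WAKE if a waiter may exist. */
    if (atomic_exchange(&m->state, 0) == 2)
        syscall(SYS_futex, &m->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

struct sketch_job {
    sketch_mutex *mutex;
    unsigned long *counter;
    unsigned long iterations;
};

static void *sketch_worker(void *arg)
{
    struct sketch_job *job = arg;

    for (unsigned long i = 0; i < job->iterations; i++) {
        sketch_mutex_lock(job->mutex);
        ++*job->counter; /* plain increment, protected by the mutex */
        sketch_mutex_unlock(job->mutex);
    }
    return NULL;
}

/* Runs `threads` workers (capped at 16) and returns the final counter.
 * If mutual exclusion holds, the result is started threads * iterations. */
static unsigned long sketch_mutex_demo(unsigned int threads, unsigned long iterations)
{
    enum { MAX_THREADS = 16 };
    pthread_t ids[MAX_THREADS];
    sketch_mutex mutex = { 0 };
    unsigned long counter = 0;
    struct sketch_job job = { &mutex, &counter, iterations };
    unsigned int started = 0;

    if (threads > MAX_THREADS)
        threads = MAX_THREADS;
    for (unsigned int i = 0; i < threads; i++) {
        if (pthread_create(&ids[started], NULL, sketch_worker, &job) != 0)
            break;
        started++;
    }
    for (unsigned int i = 0; i < started; i++)
        pthread_join(ids[i], NULL);
    return counter;
}
```

Calling sketch_mutex_demo(4, 50000) from a test program should return 200000 if the lock is doing its job. In the uncontended case, lock and unlock are one atomic operation each, which is the property the quoted passage asks for.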
While 3.6ms is indeed less than 5.7ms, that's just a synthetic benchmark containing the mutex and exactly nothing else. I'm curious if you can find any benchmark that shows an improvement to Wine itself, or if the difference becomes too small to measure.
I agree with extracting this thing from winewayland. Mutexes are common; improving them for just one small piece is a weird choice. However, unlike winewayland, ntdll runs on macOS, which doesn't have futexes, so you need to keep the pthread path working.
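One possible shape for keeping both paths, as a rough sketch rather than a proposal for the actual ntdll code: select the backend at compile time and expose the same lock/unlock interface either way. The names demo_mutex, DEMO_MUTEX_INIT, demo_mutex_lock, demo_mutex_unlock and demo_guarded_add are all hypothetical, and the `__linux__` check is an assumption about how the platform would be detected.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration */
#endif

#ifdef __linux__
/* Futex backend: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { _Atomic unsigned int state; } demo_mutex;
#define DEMO_MUTEX_INIT { 0 }

static void demo_mutex_lock(demo_mutex *m)
{
    unsigned int expected = 0;

    if (atomic_compare_exchange_strong(&m->state, &expected, 1))
        return; /* uncontended: no syscall */
    while (atomic_exchange(&m->state, 2) != 0)
        syscall(SYS_futex, &m->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
}

static void demo_mutex_unlock(demo_mutex *m)
{
    if (atomic_exchange(&m->state, 0) == 2)
        syscall(SYS_futex, &m->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}
#else
/* Fallback for platforms without futexes (for example macOS). */
#include <pthread.h>

typedef pthread_mutex_t demo_mutex;
#define DEMO_MUTEX_INIT PTHREAD_MUTEX_INITIALIZER

static void demo_mutex_lock(demo_mutex *m)   { pthread_mutex_lock(m); }
static void demo_mutex_unlock(demo_mutex *m) { pthread_mutex_unlock(m); }
#endif

/* Backend-independent caller: adds delta to *value under the lock and
 * returns the updated value. */
static long demo_guarded_add(demo_mutex *m, long *value, long delta)
{
    long result;

    demo_mutex_lock(m);
    *value += delta;
    result = *value;
    demo_mutex_unlock(m);
    return result;
}
```

Callers such as demo_guarded_add never see which backend was chosen, so the pthread path keeps getting compiled and exercised on platforms where the futex path is unavailable.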