On Wed Jul 10 13:07:06 2024 +0000, Grigory Vasilyev wrote:
You are right, FUTEX_WAKE makes the mutex noticeably slower. t1 is the custom mutex, t2 is the pthread mutex.

With FUTEX_WAKE:
```
Time elapsed: t1=89.413000ms, t2=5.800000ms
```
Without:
```
Time elapsed: t1=3.665000ms, t2=5.786000ms
```

Simple benchmark:
```C
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <omp.h>

#define WINE_MUTEX_TYPE _Atomic unsigned int
#define WINE_MUTEX_INIT ATOMIC_VAR_INIT(0)

#define WINE_MUTEX_LOCK(RESOURCE) do { \
    unsigned int expected = 0; \
    while (!atomic_compare_exchange_weak(RESOURCE, &expected, 1)) { \
        syscall(SYS_futex, RESOURCE, FUTEX_WAIT, 1, NULL, NULL, 0); \
        expected = 0; /* a failed CAS overwrites expected; reset it for the next attempt */ \
    } \
} while(0)

#define WINE_MUTEX_UNLOCK(RESOURCE) do { \
    atomic_store(RESOURCE, 0); \
    /* the "with FUTEX_WAKE" variant additionally does: */ \
    /* syscall(SYS_futex, RESOURCE, FUTEX_WAKE, 1, NULL, NULL, 0); */ \
} while(0)

#define COUNT 1000000

void test()
{
    size_t t1_sum, t1_part;
    size_t t2_sum, t2_part;
    clock_t t1_start, t1_stop;
    clock_t t2_start, t2_stop;
    double t1_elapsed, t2_elapsed;
    WINE_MUTEX_TYPE m1 = WINE_MUTEX_INIT;
    pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

    t1_sum = 0; /* initialize before the parallel region, not inside it */
    t1_start = clock();
    #pragma omp parallel private(t1_part) shared(t1_sum)
    {
        t1_part = 0;
        #pragma omp for
        for (size_t i = 0; i < COUNT; i++) {
            WINE_MUTEX_LOCK(&m1);
            t1_part = t1_part + i;
            WINE_MUTEX_UNLOCK(&m1);
        }
        #pragma omp critical
        {
            t1_sum += t1_part;
        }
    }
    t1_stop = clock();

    t2_sum = 0;
    t2_start = clock();
    #pragma omp parallel private(t2_part) shared(t2_sum)
    {
        t2_part = 0;
        #pragma omp for
        for (size_t i = 0; i < COUNT; i++) {
            pthread_mutex_lock(&m2);
            t2_part = t2_part + i;
            pthread_mutex_unlock(&m2);
        }
        #pragma omp critical
        {
            t2_sum += t2_part;
        }
    }
    t2_stop = clock();

    printf("t1=%zu t2=%zu\n", t1_sum, t2_sum);
    t1_elapsed = (double)(t1_stop - t1_start) * 1000.0 / CLOCKS_PER_SEC;
    t2_elapsed = (double)(t2_stop - t2_start) * 1000.0 / CLOCKS_PER_SEC;
    printf("Time elapsed: t1=%fms, t2=%fms\n", t1_elapsed, t2_elapsed);
}

int main()
{
    test();
    return 0;
}
```

```bash
clang -O2 -std=gnu17 -fopenmp test_mutex.c -o test_mutex
OMP_NUM_THREADS=8; export OMP_NUM_THREADS
./test_mutex
```

> While this implementation works, it is inefficient when releasing an uncontended lock: releasing the lock involves making a system call to futex_wake, even though there is no thread waiting for the lock.
~quote from your github link

If there are no waiters, then there's nobody to wake up, and the syscall does nothing - but merely entering a syscall takes a lot longer than an atomic operation or two. There's a better implementation in that exact repo.

While 3.6ms is indeed less than 5.7ms, that's just a synthetic benchmark containing the mutex and exactly nothing else. I'm curious whether you can find any benchmark that shows an improvement to Wine itself, or whether the difference becomes too small to measure.

I agree with extracting this thing from winewayland. Mutexes are common; improving them for just one small piece is a weird choice. However, unlike winewayland, ntdll runs on mac, which doesn't have futexes, so you need to keep the pthread path working.

--
https://gitlab.winehq.org/wine/wine/-/merge_requests/6031#note_75779