Based on [a patch](https://www.winehq.org/mailman3/hyperkitty/list/wine-devel@winehq.org/messag...) by Jinoh Kang (@iamahuman) from February 2022.
I removed the need for the event object and implemented fast paths for Linux and macOS. On macOS 10.14+, `thread_get_register_pointer_values` is called on every thread of the process. On Linux 4.14+, `membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, ...)` is used. On x86 Linux <= 4.13 and on other platforms, `madvise(..., MADV_DONTNEED)` is used, which sends IPIs to all cores, causing them to execute a memory barrier.
-- v11:
ntdll: Add thread_get_register_pointer_values-based implementation of NtFlushProcessWriteBuffers.
ntdll: Add sys_membarrier-based implementation of NtFlushProcessWriteBuffers.
ntdll: Add MADV_DONTNEED-based implementation of NtFlushProcessWriteBuffers.
From: Torge Matthies tmatthies@codeweavers.com
Credits to Avi Kivity (scylladb) and Aliaksei Kandratsenka (gperftools) for this trick, see [1].
[1] https://github.com/scylladb/seastar/commit/77a58e4dc020233f66fccb8d9e8f7a8b7...
---
 dlls/ntdll/unix/virtual.c  | 52 +++++++++++++++++++++++++++++++++++++-
 tools/winapi/nativeapi.dat |  1 +
 2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index 0faf3e343e3..a6fb19c807a 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -216,6 +216,9 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+static void *dontneed_page;
+static pthread_mutex_t dontneed_page_mutex = PTHREAD_MUTEX_INITIALIZER;
+
 static inline BOOL is_beyond_limit( const void *addr, size_t size, const void *limit )
 {
@@ -5170,13 +5173,60 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }

+static BOOL try_madvise( void )
+{
+#ifdef __aarch64__
+    static int once = 0;
+#endif
+    BOOL success = FALSE;
+    char *mem;
+
+    pthread_mutex_lock(&dontneed_page_mutex);
+    /* Credits to Avi Kivity (scylladb) and Aliaksei Kandratsenka (gperftools) for this trick,
+       see https://github.com/scylladb/seastar/commit/77a58e4dc020233f66fccb8d9e8f7a8b7... */
+    mem = dontneed_page;
+    if (!mem)
+    {
+        int ret;
+        /* Allocate one page of memory that we can call madvise() on */
+        mem = anon_mmap_alloc( page_size, PROT_READ | PROT_WRITE );
+        if (mem == MAP_FAILED)
+            goto failed;
+        /* If the memory is locked, e.g. by a call to mlockall(MCL_FUTURE), the madvise() call below
+           will fail with error EINVAL, so unlock it here */
+        ret = munlock( mem, page_size );
+        /* munlock() may fail on old kernels if we don't have sufficient permissions, but that is not
+           a problem since in that case we didn't have permission to lock the memory either */
+        if (ret && errno != EPERM)
+            goto failed;
+        dontneed_page = mem;
+    }
+    /* Force the page into memory to make madvise() have real work to do */
+    *mem = 3;
+    /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
+       which has the side effect of executing a memory barrier in those threads */
+    success = !madvise( mem, page_size, MADV_DONTNEED );
+#ifdef __aarch64__
+    /* Some ARMv8 processors can broadcast TLB invalidations using the TLBI instruction,
+       the madvise trick does not work on those */
+    if (success && !once++)
+        FIXME( "memory barrier may not work on this platform\n" );
+#endif
+failed:
+    pthread_mutex_unlock(&dontneed_page_mutex);
+    return success;
+}
+
+
 /**********************************************************************
  *           NtFlushProcessWriteBuffers  (NTDLL.@)
  */
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
-    if (!once++) FIXME( "stub\n" );
+    if (try_madvise())
+        return;
+    if (!once++) FIXME( "no implementation available on this platform\n" );
 }
diff --git a/tools/winapi/nativeapi.dat b/tools/winapi/nativeapi.dat
index ade20b5ee68..5512c4f1833 100644
--- a/tools/winapi/nativeapi.dat
+++ b/tools/winapi/nativeapi.dat
@@ -134,6 +134,7 @@ log10
 logb
 longjmp
 lseek
+madvise
 malloc
 mblen
 memccpy
From: Torge Matthies tmatthies@codeweavers.com
Uses the MEMBARRIER_CMD_PRIVATE_EXPEDITED membarrier command introduced in Linux 4.14.
---
 dlls/ntdll/unix/virtual.c | 47 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index a6fb19c807a..ee2a12ecd54 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -39,6 +39,9 @@
 #ifdef HAVE_SYS_SYSINFO_H
 # include <sys/sysinfo.h>
 #endif
+#ifdef HAVE_SYS_SYSCALL_H
+# include <sys/syscall.h>
+#endif
 #ifdef HAVE_SYS_SYSCTL_H
 # include <sys/sysctl.h>
 #endif
@@ -216,6 +219,11 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+#if defined(__linux__) && defined(__NR_membarrier)
+static BOOL membarrier_exp_available;
+static pthread_once_t membarrier_init_once = PTHREAD_ONCE_INIT;
+#endif
+
 static void *dontneed_page;
 static pthread_mutex_t dontneed_page_mutex = PTHREAD_MUTEX_INITIALIZER;

@@ -5173,6 +5181,43 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }

+#if defined(__linux__) && defined(__NR_membarrier)
+
+#define MEMBARRIER_CMD_QUERY                      0x00
+#define MEMBARRIER_CMD_PRIVATE_EXPEDITED          0x08
+#define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED 0x10
+
+static int membarrier( int cmd, unsigned int flags, int cpu_id )
+{
+    return syscall( __NR_membarrier, cmd, flags, cpu_id );
+}
+
+static void membarrier_init( void )
+{
+    static const int exp_required_cmds =
+        MEMBARRIER_CMD_PRIVATE_EXPEDITED | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED;
+    int available_cmds = membarrier( MEMBARRIER_CMD_QUERY, 0, 0 );
+    if (available_cmds == -1)
+        return;
+    if ((available_cmds & exp_required_cmds) == exp_required_cmds)
+        membarrier_exp_available = !membarrier( MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0 );
+}
+
+static BOOL try_exp_membarrier( void )
+{
+    pthread_once(&membarrier_init_once, membarrier_init);
+    if (!membarrier_exp_available)
+        return FALSE;
+    return !membarrier( MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0 );
+}
+
+#else /* defined(__linux__) && defined(__NR_membarrier) */
+
+static BOOL try_exp_membarrier( void ) { return 0; }
+
+#endif /* defined(__linux__) && defined(__NR_membarrier) */
+
+
 static BOOL try_madvise( void )
 {
 #ifdef __aarch64__
@@ -5224,6 +5269,8 @@ failed:
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
+    if (try_exp_membarrier())
+        return;
     if (try_madvise())
         return;
     if (!once++) FIXME( "no implementation available on this platform\n" );
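For readers who want to try the membarrier path outside of Wine, the query/register/issue sequence from the patch above can be sketched as a standalone helper. The names `sketch_flush_write_buffers` and `SKETCH_CMD_*` are made up for this illustration; the command values mirror the ones the patch defines. Unlike the patch, which registers once through `pthread_once`, this sketch registers on every call to stay short.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Command values as defined in the patch above; the SKETCH_ names are hypothetical. */
enum
{
    SKETCH_CMD_QUERY                      = 0x00,
    SKETCH_CMD_PRIVATE_EXPEDITED          = 0x08,
    SKETCH_CMD_REGISTER_PRIVATE_EXPEDITED = 0x10,
};

/* Hypothetical helper: returns true only if the kernel accepted a private
   expedited membarrier for this process, false if the path is unavailable. */
static bool sketch_flush_write_buffers( void )
{
#if defined(__linux__) && defined(SYS_membarrier)
    const long required = SKETCH_CMD_PRIVATE_EXPEDITED | SKETCH_CMD_REGISTER_PRIVATE_EXPEDITED;
    /* QUERY returns a bitmask of supported commands, or -1 if the syscall is missing */
    long supported = syscall( SYS_membarrier, SKETCH_CMD_QUERY, 0, 0 );

    if (supported < 0 || (supported & required) != required)
        return false;
    /* The process has to register before it may issue the private expedited command */
    if (syscall( SYS_membarrier, SKETCH_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0 ))
        return false;
    return syscall( SYS_membarrier, SKETCH_CMD_PRIVATE_EXPEDITED, 0, 0 ) == 0;
#else
    return false;
#endif
}
```

A caller would fall back to another mechanism whenever the helper returns false, which is how `NtFlushProcessWriteBuffers` chains `try_exp_membarrier()` and `try_madvise()` in the patch.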
From: Torge Matthies tmatthies@codeweavers.com
---
 dlls/ntdll/unix/virtual.c | 70 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index ee2a12ecd54..7a5778a6d8c 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -65,6 +65,9 @@
 #if defined(__APPLE__)
 # include <mach/mach_init.h>
 # include <mach/mach_vm.h>
+# include <mach/task.h>
+# include <mach/thread_state.h>
+# include <mach/vm_map.h>
 #endif

 #include "ntstatus.h"
@@ -219,6 +222,11 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+#ifdef __APPLE__
+static kern_return_t (*p_thread_get_register_pointer_values)( thread_t, uintptr_t*, size_t*, uintptr_t* );
+static pthread_once_t tgrpvs_init_once = PTHREAD_ONCE_INIT;
+#endif
+
 #if defined(__linux__) && defined(__NR_membarrier)
 static BOOL membarrier_exp_available;
 static pthread_once_t membarrier_init_once = PTHREAD_ONCE_INIT;
@@ -5181,6 +5189,66 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }

+#ifdef __APPLE__
+
+static void tgrpvs_init( void )
+{
+    p_thread_get_register_pointer_values = dlsym( RTLD_DEFAULT, "thread_get_register_pointer_values" );
+}
+
+static BOOL try_mach_tgrpvs( void )
+{
+    /* Taken from https://github.com/dotnet/runtime/blob/7be37908e5a1cbb83b1062768c1649827eeac... */
+    mach_msg_type_number_t count, i = 0;
+    thread_act_array_t threads;
+    kern_return_t kret;
+    BOOL success = FALSE;
+
+    pthread_once(&tgrpvs_init_once, tgrpvs_init);
+    if (!p_thread_get_register_pointer_values)
+        return FALSE;
+
+    /* Get references to all threads of this process */
+    kret = task_threads( mach_task_self(), &threads, &count );
+    if (kret)
+        return FALSE;
+
+    /* Iterate through the threads in the list */
+    while (i < count)
+    {
+        uintptr_t reg_values[128];
+        size_t reg_count = ARRAY_SIZE( reg_values );
+        uintptr_t sp;
+
+        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
+        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
+        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
+           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
+        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
+            goto fail;
+
+        /* Deallocate thread reference once we're done with it */
+        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
+        if (kret)
+            goto fail;
+    }
+    success = TRUE;
+fail:
+    /* Deallocate remaining thread references */
+    while (i < count)
+        mach_port_deallocate( mach_task_self(), threads[i++] );
+    /* Deallocate thread list */
+    vm_deallocate( mach_task_self(), (vm_address_t)threads, count * sizeof(threads[0]) );
+    return success;
+}
+
+#else /* defined(__APPLE__) */
+
+static BOOL try_mach_tgrpvs( void ) { return 0; }
+
+#endif /* defined(__APPLE__) */
+
+
 #if defined(__linux__) && defined(__NR_membarrier)

 #define MEMBARRIER_CMD_QUERY                      0x00
@@ -5269,6 +5337,8 @@ failed:
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
+    if (try_mach_tgrpvs())
+        return;
     if (try_exp_membarrier())
         return;
     if (try_madvise())
test-linux-32 has problems, as always. The HTTP status code is already set to 200 before/during the callback for the `BINDSTATUS_CONNECTING`. But idk exactly why. The `BINDSTATUS_CONNECTING` comes from `HttpSendRequest(Ex)?W`, while the status 200 is only set in `HttpEndRequestW`, which should come after that.
Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:
+        if (mem == MAP_FAILED)
+            goto failed;
+        /* If the memory is locked, e.g. by a call to mlockall(MCL_FUTURE), the madvise() call below
+           will fail with error EINVAL, so unlock it here */
+        ret = munlock( mem, page_size );
+        /* munlock() may fail on old kernels if we don't have sufficient permissions, but that is not
+           a problem since in that case we didn't have permission to lock the memory either */
+        if (ret && errno != EPERM)
+            goto failed;
+        dontneed_page = mem;
+    }
+    /* Force the page into memory to make madvise() have real work to do */
+    *mem = 3;
+    /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
+       which has the side effect of executing a memory barrier in those threads */
+    success = !madvise( mem, page_size, MADV_DONTNEED );
It turns out that `MADV_DONTNEED` semantics are [not portable]. [For example, it doesn't necessarily trigger a TLB shootdown immediately on native FreeBSD][freebsd_dontneed]. I'm not sure ScyllaDB has first-class support for non-Linux OSes, either.
I suggest reverting to mprotect(), as documented in the paper I cited earlier in this MR discussion.
[not portable]: https://www.man7.org/linux/man-pages/man2/madvise.2.html#DESCRIPTION
[freebsd_dontneed]: https://github.com/freebsd/freebsd-src/blob/23d4d0fcc1be3d2f44054dd12725098a...
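As an illustration of what an `mprotect()`-based variant could look like, here is a minimal standalone sketch. It is an assumption about the general shape of the technique, not the code from the cited paper or from any earlier revision of this MR, and the names `sketch_page`, `sketch_page_mutex` and `sketch_mprotect_barrier` are hypothetical. Whether revoking access actually makes every other CPU execute a barrier still depends on the kernel and the architecture.

```c
#define _DEFAULT_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper page, allocated once and reused for every flush */
static void *sketch_page;
static pthread_mutex_t sketch_page_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Returns true if both mprotect() calls succeeded. */
static bool sketch_mprotect_barrier( void )
{
    bool ok = false;
    long size = sysconf( _SC_PAGESIZE );

    if (size <= 0) return false;
    pthread_mutex_lock( &sketch_page_mutex );
    if (!sketch_page)
    {
        void *mem = mmap( NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
        if (mem == MAP_FAILED) goto done;
        sketch_page = mem;
    }
    /* Make the page writable and dirty it, so that a mapping for it is really in use */
    if (mprotect( sketch_page, size, PROT_READ | PROT_WRITE )) goto done;
    *(volatile char *)sketch_page = 1;
    /* Revoke access; the kernel now has to invalidate the stale translation on the
       other CPUs running this process, which is the side effect the trick relies on */
    ok = !mprotect( sketch_page, size, PROT_NONE );
done:
    pthread_mutex_unlock( &sketch_page_mutex );
    return ok;
}
```

The mutex plays the same role as `dontneed_page_mutex` in the patch: it keeps two concurrent callers from toggling the page protection underneath each other.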
Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:
+    if (kret)
+        return FALSE;
+    /* Iterate through the threads in the list */
+    while (i < count)
+    {
+        uintptr_t reg_values[128];
+        size_t reg_count = ARRAY_SIZE( reg_values );
+        uintptr_t sp;
+        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
+        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
+        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
+           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
+        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
+            goto fail;
This error should be rare enough, so we don't have to bail out early for performance. It's sufficient to just set `success = FALSE;`[^1] here and continue looping. This will eliminate the extra port deallocate loop at `fail:`.
[^1]: After initializing `success` to `TRUE`.
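The control flow suggested here can be sketched in portable C with stand-ins for the Mach calls. Everything named `sketch_*` is hypothetical: the `query` callback models `thread_get_register_pointer_values()`, `release` models `mach_port_deallocate()`, and the two `sketch_fake_*` functions exist only to exercise the loop. Release errors are ignored in this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes standing in for kern_return_t values */
enum sketch_status { SKETCH_OK, SKETCH_OTHER_FAILURE, SKETCH_BUFFER_TOO_SMALL };

typedef enum sketch_status (*sketch_query_fn)( int thread );
typedef void (*sketch_release_fn)( int thread );

/* Visit every thread exactly once: remember the one failure that matters,
   keep looping, and release each reference in the same iteration. */
static bool sketch_barrier_all( const int *threads, size_t count,
                                sketch_query_fn query, sketch_release_fn release )
{
    bool success = true;
    size_t i;

    for (i = 0; i < count; i++)
    {
        if (query( threads[i] ) == SKETCH_BUFFER_TOO_SMALL)
            success = false;  /* no early exit, so no separate cleanup loop is needed */
        release( threads[i] );
    }
    return success;
}

/* Fake callbacks for demonstration: thread 1 fails harmlessly, thread 2 fails fatally */
static size_t sketch_released;

static enum sketch_status sketch_fake_query( int thread )
{
    if (thread == 2) return SKETCH_BUFFER_TOO_SMALL;
    if (thread == 1) return SKETCH_OTHER_FAILURE;
    return SKETCH_OK;
}

static void sketch_fake_release( int thread )
{
    (void)thread;
    sketch_released++;
}
```

With this shape, every reference is released no matter which query failed, so the trailing deallocation loop after `fail:` has nothing left to do.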
Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:
+    {
+        uintptr_t reg_values[128];
+        size_t reg_count = ARRAY_SIZE( reg_values );
+        uintptr_t sp;
+        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
+        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
+        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
+           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
+        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
+            goto fail;
+        /* Deallocate thread reference once we're done with it */
+        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
+        if (kret)
+            goto fail;
Likewise, breaking out of the loop seems unnecessary. We don't even have to fail, since the barrier operation itself isn't affected by this. If we want some error handling, `ERR()` logging should be enough.
Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:
+        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
+        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
+           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
+        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
+            goto fail;
+        /* Deallocate thread reference once we're done with it */
+        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
+        if (kret)
+            goto fail;
+    }
+    success = TRUE;
+fail:
+    /* Deallocate remaining thread references */
+    while (i < count)
+        mach_port_deallocate( mach_task_self(), threads[i++] );
(to continue from above, note that we don't check for the error here; we don't need it anyway.)
Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:
+           a problem since in that case we didn't have permission to lock the memory either */
+        if (ret && errno != EPERM)
+            goto failed;
+        dontneed_page = mem;
+    }
+    /* Force the page into memory to make madvise() have real work to do */
+    *mem = 3;
+    /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
+       which has the side effect of executing a memory barrier in those threads */
+    success = !madvise( mem, page_size, MADV_DONTNEED );
+#ifdef __aarch64__
+    /* Some ARMv8 processors can broadcast TLB invalidations using the TLBI instruction,
+       the madvise trick does not work on those */
+    if (success && !once++)
+        FIXME( "memory barrier may not work on this platform\n" );
+#endif
I think this should be unconditional and not specific to aarch64. `mprotect()` is and always will be a hack, and other architectures might implement TLBI-like operations too.
Hello! It's been a while. I'm not sure if you're still interested in submitting this patch, so feel free to disregard these comments if you wish.