[PATCH v11 0/3] MR741: ntdll: Implement NtFlushProcessWriteBuffers.


Torge Matthies (＠tmatthies)

28 Mar 2023

10:53 a.m.

Based on [a patch](https://www.winehq.org/mailman3/hyperkitty/list/wine-devel@winehq.org/messag...) by Jinoh Kang (@iamahuman) from February 2022.

I removed the need for the event object and implemented fast paths for Linux and macOS. On macOS 10.14+ `thread_get_register_pointer_values` is called on every thread of the process. On Linux 4.14+ `membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, ...)` is used. On x86 Linux <= 4.13 and on other platforms `madvise(..., MADV_DONTNEED)` is used, which sends IPIs to all cores, causing them to execute a memory barrier.
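To see which path a given Linux system would take, the membarrier capability check can be tried in isolation. The following is a minimal sketch, not code from this series: the function name `private_expedited_supported` and the `SKETCH_*` constant names are made up for illustration (the numeric values match the `MEMBARRIER_CMD_*` defines in patch 2/3), and unlike the patch it only queries and does not register the process.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE
#endif
#include <unistd.h>
#if defined(__linux__)
# include <sys/syscall.h>
#endif

/* Illustrative names; values mirror the MEMBARRIER_CMD_* defines in patch 2/3. */
enum
{
    SKETCH_CMD_QUERY                      = 0,
    SKETCH_CMD_PRIVATE_EXPEDITED          = 1 << 3,
    SKETCH_CMD_REGISTER_PRIVATE_EXPEDITED = 1 << 4,
};

/* Returns 1 if the kernel reports both commands needed for the fast path,
   0 otherwise (including non-Linux builds and kernels without membarrier). */
int private_expedited_supported( void )
{
#if defined(__linux__) && defined(__NR_membarrier)
    const long wanted = SKETCH_CMD_PRIVATE_EXPEDITED | SKETCH_CMD_REGISTER_PRIVATE_EXPEDITED;
    /* MEMBARRIER_CMD_QUERY returns a bitmask of supported commands, or -1 on error */
    long cmds = syscall( __NR_membarrier, SKETCH_CMD_QUERY, 0, 0 );

    if (cmds < 0)
        return 0;
    return (cmds & wanted) == wanted;
#else
    return 0;
#endif
}
```

A result of 0 means the `madvise` fallback described above would be used on that system.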

--
v11:
ntdll: Add thread_get_register_pointer_values-based implementation of NtFlushProcessWriteBuffers.
ntdll: Add sys_membarrier-based implementation of NtFlushProcessWriteBuffers.
ntdll: Add MADV_DONTNEED-based implementation of NtFlushProcessWriteBuffers.

https://gitlab.winehq.org/wine/wine/-/merge_requests/741


Torge Matthies

28 Mar

10:53 a.m.

New subject: [PATCH v11 1/3] ntdll: Add MADV_DONTNEED-based implementation of NtFlushProcessWriteBuffers.

From: Torge Matthies tmatthies@codeweavers.com

Credits to Avi Kivity (scylladb) and Aliaksei Kandratsenka (gperftools) for this trick, see [1].

[1] https://github.com/scylladb/seastar/commit/77a58e4dc020233f66fccb8d9e8f7a8b7...
---
 dlls/ntdll/unix/virtual.c  | 52 +++++++++++++++++++++++++++++++++++++-
 tools/winapi/nativeapi.dat |  1 +
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index 0faf3e343e3..a6fb19c807a 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -216,6 +216,9 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+static void *dontneed_page;
+static pthread_mutex_t dontneed_page_mutex = PTHREAD_MUTEX_INITIALIZER;
+

 static inline BOOL is_beyond_limit( const void *addr, size_t size, const void *limit )
 {
@@ -5170,13 +5173,60 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }


+static BOOL try_madvise( void )
+{
+#ifdef __aarch64__
+    static int once = 0;
+#endif
+    BOOL success = FALSE;
+    char *mem;
+
+    pthread_mutex_lock(&dontneed_page_mutex);
+    /* Credits to Avi Kivity (scylladb) and Aliaksei Kandratsenka (gperftools) for this trick,
+       see https://github.com/scylladb/seastar/commit/77a58e4dc020233f66fccb8d9e8f7a8b7... */
+    mem = dontneed_page;
+    if (!mem)
+    {
+        int ret;
+        /* Allocate one page of memory that we can call madvise() on */
+        mem = anon_mmap_alloc( page_size, PROT_READ | PROT_WRITE );
+        if (mem == MAP_FAILED)
+            goto failed;
+        /* If the memory is locked, e.g. by a call to mlockall(MCL_FUTURE), the madvise() call below
+           will fail with error EINVAL, so unlock it here */
+        ret = munlock( mem, page_size );
+        /* munlock() may fail on old kernels if we don't have sufficient permissions, but that is not
+           a problem since in that case we didn't have permission to lock the memory either */
+        if (ret && errno != EPERM)
+            goto failed;
+        dontneed_page = mem;
+    }
+    /* Force the page into memory to make madvise() have real work to do */
+    *mem = 3;
+    /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
+       which has the side effect of executing a memory barrier in those threads */
+    success = !madvise( mem, page_size, MADV_DONTNEED );
+#ifdef __aarch64__
+    /* Some ARMv8 processors can broadcast TLB invalidations using the TLBI instruction,
+       the madvise trick does not work on those */
+    if (success && !once++)
+        FIXME( "memory barrier may not work on this platform\n" );
+#endif
+failed:
+    pthread_mutex_unlock(&dontneed_page_mutex);
+    return success;
+}
+
+
 /**********************************************************************
  *           NtFlushProcessWriteBuffers  (NTDLL.@)
  */
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
-    if (!once++) FIXME( "stub\n" );
+    if (try_madvise())
+        return;
+    if (!once++) FIXME( "no implementation available on this platform\n" );
 }


diff --git a/tools/winapi/nativeapi.dat b/tools/winapi/nativeapi.dat
index ade20b5ee68..5512c4f1833 100644
--- a/tools/winapi/nativeapi.dat
+++ b/tools/winapi/nativeapi.dat
@@ -134,6 +134,7 @@ log10
 logb
 longjmp
 lseek
+madvise
 malloc
 mblen
 memccpy

-- GitLab https://gitlab.winehq.org/wine/wine/-/merge_requests/741

Torge Matthies

10:53 a.m.

New subject: [PATCH v11 2/3] ntdll: Add sys_membarrier-based implementation of NtFlushProcessWriteBuffers.

From: Torge Matthies tmatthies@codeweavers.com

Uses the MEMBARRIER_CMD_PRIVATE_EXPEDITED membarrier command introduced in Linux 4.14.
---
 dlls/ntdll/unix/virtual.c | 47 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index a6fb19c807a..ee2a12ecd54 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -39,6 +39,9 @@
 #ifdef HAVE_SYS_SYSINFO_H
 # include <sys/sysinfo.h>
 #endif
+#ifdef HAVE_SYS_SYSCALL_H
+# include <sys/syscall.h>
+#endif
 #ifdef HAVE_SYS_SYSCTL_H
 # include <sys/sysctl.h>
 #endif
@@ -216,6 +219,11 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+#if defined(__linux__) && defined(__NR_membarrier)
+static BOOL membarrier_exp_available;
+static pthread_once_t membarrier_init_once = PTHREAD_ONCE_INIT;
+#endif
+
 static void *dontneed_page;
 static pthread_mutex_t dontneed_page_mutex = PTHREAD_MUTEX_INITIALIZER;

@@ -5173,6 +5181,43 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }


+#if defined(__linux__) && defined(__NR_membarrier)
+
+#define MEMBARRIER_CMD_QUERY                        0x00
+#define MEMBARRIER_CMD_PRIVATE_EXPEDITED            0x08
+#define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED   0x10
+
+static int membarrier( int cmd, unsigned int flags, int cpu_id )
+{
+    return syscall( __NR_membarrier, cmd, flags, cpu_id );
+}
+
+static void membarrier_init( void )
+{
+    static const int exp_required_cmds =
+        MEMBARRIER_CMD_PRIVATE_EXPEDITED | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED;
+    int available_cmds = membarrier( MEMBARRIER_CMD_QUERY, 0, 0 );
+    if (available_cmds == -1)
+        return;
+    if ((available_cmds & exp_required_cmds) == exp_required_cmds)
+        membarrier_exp_available = !membarrier( MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0 );
+}
+
+static BOOL try_exp_membarrier( void )
+{
+    pthread_once(&membarrier_init_once, membarrier_init);
+    if (!membarrier_exp_available)
+        return FALSE;
+    return !membarrier( MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0 );
+}
+
+#else /* defined(__linux__) && defined(__NR_membarrier) */
+
+static BOOL try_exp_membarrier( void ) { return 0; }
+
+#endif /* defined(__linux__) && defined(__NR_membarrier) */
+
+
 static BOOL try_madvise( void )
 {
 #ifdef __aarch64__
@@ -5224,6 +5269,8 @@ failed:
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
+    if (try_exp_membarrier())
+        return;
     if (try_madvise())
         return;
     if (!once++) FIXME( "no implementation available on this platform\n" );


Torge Matthies

10:53 a.m.

New subject: [PATCH v11 3/3] ntdll: Add thread_get_register_pointer_values-based implementation of NtFlushProcessWriteBuffers.

From: Torge Matthies tmatthies@codeweavers.com

---
 dlls/ntdll/unix/virtual.c | 70 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index ee2a12ecd54..7a5778a6d8c 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -65,6 +65,9 @@
 #if defined(__APPLE__)
 # include <mach/mach_init.h>
 # include <mach/mach_vm.h>
+# include <mach/task.h>
+# include <mach/thread_state.h>
+# include <mach/vm_map.h>
 #endif

 #include "ntstatus.h"
@@ -219,6 +222,11 @@ struct range_entry
 static struct range_entry *free_ranges;
 static struct range_entry *free_ranges_end;

+#ifdef __APPLE__
+static kern_return_t (*p_thread_get_register_pointer_values)( thread_t, uintptr_t*, size_t*, uintptr_t* );
+static pthread_once_t tgrpvs_init_once = PTHREAD_ONCE_INIT;
+#endif
+
 #if defined(__linux__) && defined(__NR_membarrier)
 static BOOL membarrier_exp_available;
 static pthread_once_t membarrier_init_once = PTHREAD_ONCE_INIT;
@@ -5181,6 +5189,66 @@ NTSTATUS WINAPI NtFlushInstructionCache( HANDLE handle, const void *addr, SIZE_T
 }


+#ifdef __APPLE__
+
+static void tgrpvs_init( void )
+{
+    p_thread_get_register_pointer_values = dlsym( RTLD_DEFAULT, "thread_get_register_pointer_values" );
+}
+
+static BOOL try_mach_tgrpvs( void )
+{
+    /* Taken from https://github.com/dotnet/runtime/blob/7be37908e5a1cbb83b1062768c1649827eeac... */
+    mach_msg_type_number_t count, i = 0;
+    thread_act_array_t threads;
+    kern_return_t kret;
+    BOOL success = FALSE;
+
+    pthread_once(&tgrpvs_init_once, tgrpvs_init);
+    if (!p_thread_get_register_pointer_values)
+        return FALSE;
+
+    /* Get references to all threads of this process */
+    kret = task_threads( mach_task_self(), &threads, &count );
+    if (kret)
+        return FALSE;
+
+    /* Iterate through the threads in the list */
+    while (i < count)
+    {
+        uintptr_t reg_values[128];
+        size_t reg_count = ARRAY_SIZE( reg_values );
+        uintptr_t sp;
+
+        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
+        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
+        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
+           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
+        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
+            goto fail;
+
+        /* Deallocate thread reference once we're done with it */
+        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
+        if (kret)
+            goto fail;
+    }
+    success = TRUE;
+fail:
+    /* Deallocate remaining thread references */
+    while (i < count)
+        mach_port_deallocate( mach_task_self(), threads[i++] );
+    /* Deallocate thread list */
+    vm_deallocate( mach_task_self(), (vm_address_t)threads, count * sizeof(threads[0]) );
+    return success;
+}
+
+#else /* defined(__APPLE__) */
+
+static BOOL try_mach_tgrpvs( void ) { return 0; }
+
+#endif /* defined(__APPLE__) */
+
+
 #if defined(__linux__) && defined(__NR_membarrier)

 #define MEMBARRIER_CMD_QUERY                        0x00
@@ -5269,6 +5337,8 @@ failed:
 void WINAPI NtFlushProcessWriteBuffers(void)
 {
     static int once = 0;
+    if (try_mach_tgrpvs())
+        return;
     if (try_exp_membarrier())
         return;
     if (try_madvise())


Torge Matthies (＠tmatthies)

12:41 p.m.

test-linux-32 has problems, as always. The HTTP status code is already set to 200 before/during the callback for the `BINDSTATUS_CONNECTING`. But idk exactly why. The `BINDSTATUS_CONNECTING` comes from `HttpSendRequest(Ex)?W`, while the status 200 is only set in `HttpEndRequestW`, which should come after that.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_28209

Jinoh Kang (＠iamahuman)

16 Feb

3:34 p.m.

Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:

...

```
        if (mem == MAP_FAILED)
            goto failed;
        /* If the memory is locked, e.g. by a call to mlockall(MCL_FUTURE), the madvise() call below
           will fail with error EINVAL, so unlock it here */
        ret = munlock( mem, page_size );
        /* munlock() may fail on old kernels if we don't have sufficient permissions, but that is not
           a problem since in that case we didn't have permission to lock the memory either */
        if (ret && errno != EPERM)
            goto failed;
        dontneed_page = mem;
    }
    /* Force the page into memory to make madvise() have real work to do */
    *mem = 3;
    /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
       which has the side effect of executing a memory barrier in those threads */
    success = !madvise( mem, page_size, MADV_DONTNEED );
```

It turns out that `MADV_DONTNEED` semantics are [not portable]. [For example, it doesn't necessarily trigger a TLB shootdown immediately on native FreeBSD][freebsd_dontneed]. I'm not sure ScyllaDB has first-class support for non-Linux OSes, either.

I suggest reverting to mprotect(), as documented in the paper I cited earlier in this MR discussion.

[not portable]: https://www.man7.org/linux/man-pages/man2/madvise.2.html#DESCRIPTION
[freebsd_dontneed]: https://github.com/freebsd/freebsd-src/blob/23d4d0fcc1be3d2f44054dd12725098a...
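
For illustration, an `mprotect()`-based variant could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from this MR or from the cited paper: the names `barrier_page` and `mprotect_barrier` are made up, there is no locking around the shared page, and whether the permission change really interrupts every core depends on the kernel and CPU.

```c
#ifndef _DEFAULT_SOURCE
# define _DEFAULT_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper page, allocated lazily and reused across calls. */
static char *barrier_page;

/* Returns 1 if both permission changes succeeded, 0 otherwise.
   Not thread-safe as written; a real implementation would need a mutex. */
int mprotect_barrier( void )
{
    size_t size = (size_t)sysconf( _SC_PAGESIZE );

    if (!barrier_page)
    {
        void *p = mmap( NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
        if (p == MAP_FAILED)
            return 0;
        barrier_page = p;
    }
    /* Make the page writable and touch it so that a mapping actually exists */
    if (mprotect( barrier_page, size, PROT_READ | PROT_WRITE ))
        return 0;
    *(volatile char *)barrier_page = 1;
    /* Revoking access forces the kernel to invalidate the mapping on other CPUs */
    if (mprotect( barrier_page, size, PROT_NONE ))
        return 0;
    return 1;
}
```

The page is left at `PROT_NONE` between calls, so each call has a real permission downgrade to perform.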

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61468

Jinoh Kang (＠iamahuman)

3:35 p.m.

Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:

...

```
    if (kret)
        return FALSE;

    /* Iterate through the threads in the list */
    while (i < count)
    {
        uintptr_t reg_values[128];
        size_t reg_count = ARRAY_SIZE( reg_values );
        uintptr_t sp;

        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
            goto fail;
```

This error should be rare enough, so we don't have to bail out early for performance. It's sufficient to just set `success = FALSE;`[^1] here and continue looping. This will eliminate the extra port deallocate loop at `fail:`.

[^1]: After initializing `success` to `TRUE`.
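
The shape of that loop can be sketched in portable C. Everything here is a stand-in so the control flow can be read without the Mach headers: `query_thread`, `release_thread`, `released` and the `QUERY_*` values are hypothetical and are not part of the patch or of any macOS API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes standing in for kern_return_t values. */
enum { QUERY_OK = 0, QUERY_OTHER_ERROR = 1, QUERY_BUFFER_TOO_SMALL = 2 };

/* Counts released handles so the behaviour can be checked. */
static size_t released;

/* Stand-in for thread_get_register_pointer_values(): in this sketch the
   "thread handle" simply encodes the result we want to simulate. */
static int query_thread( int thread ) { return thread; }

/* Stand-in for mach_port_deallocate(). */
static void release_thread( int thread ) { (void)thread; released++; }

bool flush_all_threads( const int *threads, size_t count )
{
    bool success = true;
    size_t i;

    for (i = 0; i < count; i++)
    {
        /* Record the failure but keep going, so every handle is released
           in this one loop and no cleanup loop is needed afterwards. */
        if (query_thread( threads[i] ) == QUERY_BUFFER_TOO_SMALL)
            success = false;
        release_thread( threads[i] );
    }
    return success;
}
```

With this structure the function has a single exit and the `fail:` label disappears.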

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61469

Jinoh Kang (＠iamahuman)

3:35 p.m.

Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:

...

```
    {
        uintptr_t reg_values[128];
        size_t reg_count = ARRAY_SIZE( reg_values );
        uintptr_t sp;

        /* Request the thread's register pointer values to force the thread to go through a memory barrier */
        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
            goto fail;

        /* Deallocate thread reference once we're done with it */
        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
        if (kret)
            goto fail;
```

Likewise, breaking out of the loop seems unnecessary. We don't even have to fail, since the barrier operation itself isn't affected by this. If we want to do some error handling, then `ERR()` logging should be enough.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61470

Jinoh Kang (＠iamahuman)

3:35 p.m.

Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:

...

```
        kret = p_thread_get_register_pointer_values( threads[i], &sp, &reg_count, reg_values );
        /* This function always fails when querying Rosetta's exception handling thread, so we only treat
           KERN_INSUFFICIENT_BUFFER_SIZE as an error, like .NET core does. */
        if (kret == KERN_INSUFFICIENT_BUFFER_SIZE)
            goto fail;

        /* Deallocate thread reference once we're done with it */
        kret = mach_port_deallocate( mach_task_self(), threads[i++] );
        if (kret)
            goto fail;
    }
    success = TRUE;
fail:
    /* Deallocate remaining thread references */
    while (i < count)
        mach_port_deallocate( mach_task_self(), threads[i++] );
```

(to continue from above, note that we don't check for the error here; we don't need it anyway.)

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61471

Jinoh Kang (＠iamahuman)

3:35 p.m.

Jinoh Kang (@iamahuman) commented about dlls/ntdll/unix/virtual.c:

...

               a problem since in that case we didn't have permission to lock the memory either */
            if (ret && errno != EPERM)
                goto failed;
            dontneed_page = mem;
        }
        /* Force the page into memory to make madvise() have real work to do */
        *mem = 3;
        /* Evict the page from memory to force the kernel to send an IPI to all threads of this process,
           which has the side effect of executing a memory barrier in those threads */
        success = !madvise( mem, page_size, MADV_DONTNEED );
    #ifdef __aarch64__
        /* Some ARMv8 processors can broadcast TLB invalidations using the TLBI instruction,
           the madvise trick does not work on those */
        if (success && !once++)
            FIXME( "memory barrier may not work on this platform\n" );
    #endif

I think this should be unconditional and not specific to aarch64. `mprotect()` is and always will be a hack, and other architectures might implement a TLBI-like operation too.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61472

Jinoh Kang (＠iamahuman)

3:35 p.m.

Hello! It's been a while. I'm not sure if you're still interested in submitting this patch, so feel free to disregard these comments if you wish.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/741#note_61473
