The `RtlRunOnce` family of functions is implemented using (a variant of) the Double-Checked Locking Pattern (DCLP). The DCLP requires memory fences for correctness, on both the writer and reader sides. It's pretty obvious why a fence is needed on the writer side, but quite surprising that one is needed on the reader side as well! On strong memory model architectures (x86 and x86_64), only compiler-level fences are required. On weak memory model architectures like ARM64, both CPU and compiler fences are needed.
That's explained well in books like _Concurrent Programming on Windows_ by Joe Duffy and in online resources like [1].
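As a rough illustration of where the two fences sit, here is a minimal DCLP sketch in portable C11. It is not Wine code: `struct config`, `get_config`, `init_count` and the spin lock built on `atomic_flag` are hypothetical names chosen for the example, and a real implementation would use a proper lock rather than a busy loop.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lazily-initialized object (illustration only). */
struct config { int value; };

static struct config storage;
static _Atomic(struct config *) instance;          /* NULL until published */
static atomic_flag init_lock = ATOMIC_FLAG_INIT;   /* toy spin lock */
int init_count;                                    /* how many times init ran */

struct config *get_config(void)
{
    /* First check, without the lock. The acquire load pairs with the
     * release store below, so a reader that sees a non-NULL pointer
     * also sees the fully initialized object behind it. */
    struct config *p = atomic_load_explicit(&instance, memory_order_acquire);
    if (p) return p;

    /* Slow path: take the lock (spin until the flag was previously clear). */
    while (atomic_flag_test_and_set_explicit(&init_lock, memory_order_acquire))
        ;

    /* Second check, under the lock. Relaxed is enough here because the
     * lock itself orders this load against other initializers. */
    p = atomic_load_explicit(&instance, memory_order_relaxed);
    if (!p)
    {
        storage.value = 42;   /* initialize first ... */
        init_count++;
        p = &storage;
        /* ... then publish with release (the writer-side fence). */
        atomic_store_explicit(&instance, p, memory_order_release);
    }

    atomic_flag_clear_explicit(&init_lock, memory_order_release);
    return p;
}
```

If the first load were a plain read, a weakly ordered CPU (or the compiler) could let the reader observe the published pointer before the initialized fields, which is the reader-side problem described above.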
The Wine implementation has fences on the writer side (`RtlRunOnceComplete`), because `InterlockedCompareExchangePointer` inserts a full memory fence. However, some code paths on the reader side (`RtlRunOnceBeginInitialize`) are missing fences, specifically the (`RTL_RUN_ONCE_CHECK_ONLY`) branch and the (`!RTL_RUN_ONCE_CHECK_ONLY && (val & 3) == 2`) branch.
Add the missing fences using GCC's atomic builtins [2].
Note: with this MR, the generated code should change only for ARM64.
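The writer/reader pairing can be sketched in isolation with the same GCC builtins. This is a simplified model, not the Wine code: `struct once`, `once_complete` and `once_check` are hypothetical stand-ins, and the writer here goes straight from the initial state to the complete state, skipping the pending states the real functions handle. Only the `(val & 3) == 2` encoding is taken from the description above.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for RTL_RUN_ONCE: a pointer whose low two bits
 * carry the state, with 2 meaning "initialization complete". */
struct once { void *ptr; };

#define ONCE_STATE_MASK ((uintptr_t)3)
#define ONCE_COMPLETE   ((uintptr_t)2)

/* Writer side: publish the context with a sequentially consistent
 * compare-and-swap, which acts as a full fence. Returns nonzero if this
 * caller performed the transition. */
int once_complete(struct once *once, void *context)
{
    void *expected = NULL;
    void *desired = (void *)((uintptr_t)context | ONCE_COMPLETE);
    return __atomic_compare_exchange_n(&once->ptr, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Reader side: the acquire load guarantees that, once the "complete"
 * state is observed, everything written before the CAS is visible too. */
int once_check(struct once *once, void **context)
{
    uintptr_t val = (uintptr_t)__atomic_load_n(&once->ptr, __ATOMIC_ACQUIRE);

    if ((val & ONCE_STATE_MASK) != ONCE_COMPLETE) return 0;
    if (context) *context = (void *)(val & ~ONCE_STATE_MASK);
    return 1;
}
```

The sketch assumes the context pointer is at least 4-byte aligned, so its low two bits are free to hold the state.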
### References:
1. [Double-Checked Locking is Fixed In C++11](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
2. [GCC's Built-in Functions for Memory Model Aware Atomic Operations](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)
From: Luca Bacci luca.bacci982@gmail.com
This is needed on architectures with weak memory models like ARM64.
---
 dlls/ntdll/sync.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/dlls/ntdll/sync.c b/dlls/ntdll/sync.c
index 522ce0a2142..8c4e2dbf79b 100644
--- a/dlls/ntdll/sync.c
+++ b/dlls/ntdll/sync.c
@@ -61,7 +61,7 @@ DWORD WINAPI RtlRunOnceBeginInitialize( RTL_RUN_ONCE *once, ULONG flags, void **
 {
     if (flags & RTL_RUN_ONCE_CHECK_ONLY)
     {
-        ULONG_PTR val = (ULONG_PTR)once->Ptr;
+        ULONG_PTR val = (ULONG_PTR)__atomic_load_n (&once->Ptr, __ATOMIC_ACQUIRE);
 
         if (flags & RTL_RUN_ONCE_ASYNC) return STATUS_INVALID_PARAMETER;
         if ((val & 3) != 2) return STATUS_UNSUCCESSFUL;
@@ -71,7 +71,7 @@ DWORD WINAPI RtlRunOnceBeginInitialize( RTL_RUN_ONCE *once, ULONG flags, void **
 
     for (;;)
     {
-        ULONG_PTR next, val = (ULONG_PTR)once->Ptr;
+        ULONG_PTR next, val = (ULONG_PTR)__atomic_load_n (&once->Ptr, __ATOMIC_ACQUIRE);
 
         switch (val & 3)
         {
You should probably use `ReadAcquire`; there are some shenanigans with `__atomic` builtins being broken on old GCC versions.
Thanks!
Is it OK to introduce a new helper in `winnt.h` for pointer values? Otherwise I have to use preprocessor checks to select between `ReadAcquire` and `ReadAcquire64`.
Windows SDK conditionally defines a `ReadLongPtrAcquire` macro to either one, we should probably do the same.
On Wed Oct 15 09:34:07 2025 +0000, Rémi Bernon wrote:
> Windows SDK conditionally defines a `ReadLongPtrAcquire` macro to either one, we should probably do the same.
At least Windows 10 SDK defines ReadPointerAcquire() which uses either ReadAcquire() or ReadAcquire64() depending on platform.
On Wed Oct 15 09:34:07 2025 +0000, Dmitry Timoshkov wrote:
> At least Windows 10 SDK defines ReadPointerAcquire() which uses either ReadAcquire() or ReadAcquire64() depending on platform.
Ah yeah that one as well, probably better for pointers.
On Wed Oct 15 09:42:01 2025 +0000, Rémi Bernon wrote:
> Ah yeah that one as well, probably better for pointers.
Great! I have found previous work by @yshui: https://gitlab.winehq.org/wine/wine/-/merge_requests/3504/diffs?commit_id=a6...
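For reference, the pointer-sized acquire read discussed in this thread could look roughly like the sketch below. It is only an illustration of the "pick the 32-bit or 64-bit read depending on platform" idea: `read_acquire32`, `read_acquire64` and `read_pointer_acquire` are hypothetical names, not the `winnt.h` definitions, and they are built on `__atomic_load_n` purely to keep the example self-contained (the actual helper would presumably sit on top of whatever `ReadAcquire`/`ReadAcquire64` already use).

```c
#include <stdint.h>

/* Hypothetical stand-ins for ReadAcquire / ReadAcquire64. */
static inline int32_t read_acquire32(const volatile int32_t *src)
{
    return __atomic_load_n(src, __ATOMIC_ACQUIRE);
}

static inline int64_t read_acquire64(const volatile int64_t *src)
{
    return __atomic_load_n(src, __ATOMIC_ACQUIRE);
}

/* Pointer-sized acquire read: dispatch on the pointer width at compile
 * time, so callers need no preprocessor checks of their own. The cast
 * reinterprets the pointer slot as an integer of the same size. */
static inline void *read_pointer_acquire(void *const volatile *src)
{
#if UINTPTR_MAX > 0xffffffffu
    return (void *)(intptr_t)read_acquire64((const volatile int64_t *)src);
#else
    return (void *)(intptr_t)read_acquire32((const volatile int32_t *)src);
#endif
}
```

With such a helper, the two loads in the patch would become a single call on `&once->Ptr` with no `#ifdef` at the call site.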