The `RtlRunOnce` family of functions is implemented using (a variant of) the Double-Checked Locking Pattern (DCLP). The DCLP requires memory fences for correctness, on both the writer and reader sides. It's pretty obvious why one is needed on the writer side, but quite surprising that one is needed on the reader side too! On strong memory model architectures (x86 and x86_64), only compiler-level fences are required. On weak memory model architectures like ARM64, both CPU and compiler fences are needed.
That's explained well in books like _Concurrent Programming on Windows_ by Joe Duffy and in online resources like [1].
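
For illustration, here is a minimal sketch of the DCLP written with GCC's atomic builtins. It is not Wine code: `instance`, `storage`, `init_lock`, `init_count` and `get_instance` are hypothetical names, and a pthread mutex stands in for whatever lock the real implementation uses. The release store is the writer-side fence and the acquire load is the reader-side one.

```c
#include <stddef.h>
#include <pthread.h>

/* Hypothetical names for illustration only; this is not Wine code. */
static int *instance;   /* published pointer, NULL until initialized */
static int storage;     /* the data being lazily initialized */
static int init_count;  /* counts how many times initialization ran */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

static int *get_instance(void)
{
    /* Reader side: the acquire load keeps later reads of *p from being
     * reordered before the load of the pointer itself. */
    int *p = __atomic_load_n(&instance, __ATOMIC_ACQUIRE);
    if (p) return p;

    pthread_mutex_lock(&init_lock);
    /* Second check, under the lock; the mutex already orders this load. */
    p = __atomic_load_n(&instance, __ATOMIC_RELAXED);
    if (!p)
    {
        storage = 42;
        init_count++;
        p = &storage;
        /* Writer side: the release store keeps the initialization above
         * from being reordered after the pointer is published. */
        __atomic_store_n(&instance, p, __ATOMIC_RELEASE);
    }
    pthread_mutex_unlock(&init_lock);
    return p;
}
```

Without the acquire on the fast path, a weakly ordered CPU could in principle observe the non-NULL pointer and still read stale contents of `storage`.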
The Wine implementation has fences on the writer side (`RtlRunOnceComplete`), because `InterlockedCompareExchangePointer` inserts a full memory fence. However, some code paths on the reader side (`RtlRunOnceBeginInitialize`) are missing fences, specifically the (`RTL_RUN_ONCE_CHECK_ONLY`) branch and the (`!RTL_RUN_ONCE_CHECK_ONLY && (val & 3) == 2`) branch.
Add the missing fences using GCC's atomic builtins [2].
Note: with this MR, the generated code should change only for ARM64.
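
As a rough sketch of what a fenced reader path can look like with these builtins (one possible shape, not the actual patch): `struct once`, `once_check_only` and `once_complete` below are hypothetical stand-ins, and the state encoding is assumed from the `(val & 3) == 2` test mentioned above, with the remaining bits holding the context pointer.

```c
#include <stdint.h>

/* Hypothetical stand-in for the run-once state word: the low two bits
 * hold the state and the remaining bits hold the context pointer.
 * State 2 is assumed to mean "initialization complete". */
struct once { uintptr_t word; };

/* Writer side (assumed shape): publish the context with release
 * semantics so that earlier initialization writes become visible first. */
static void once_complete(struct once *once, void *context)
{
    __atomic_store_n(&once->word, (uintptr_t)context | 2, __ATOMIC_RELEASE);
}

/* Reader side, check-only: returns 1 and stores the context if
 * initialization has completed, otherwise returns 0. */
static int once_check_only(struct once *once, void **context)
{
    uintptr_t val = __atomic_load_n(&once->word, __ATOMIC_RELAXED);
    if ((val & 3) != 2) return 0;
    /* Acquire fence: reads of the initialized data made after this point
     * cannot be reordered before the load of the state word. */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
    if (context) *context = (void *)(val & ~(uintptr_t)3);
    return 1;
}
```

On x86 and x86_64 an acquire fence like this only constrains the compiler, while on ARM64 it is expected to produce a barrier instruction, which is consistent with the note above.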
### References:
1. [Double-Checked Locking is Fixed In C++11](https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/)
2. [GCC's Built-in Functions for Memory Model Aware Atomic Operations](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html)