Supersedes !741.
On macOS 10.14+, `thread_get_register_pointer_values` is called on every thread of the process. On Linux 4.14+, `membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, ...)` is used. On x86 Linux <= 4.13 and on other platforms, `NtFlushProcessWriteBuffers` falls back to calling `NtGetContextThread()` on each thread.
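
For reviewers unfamiliar with the Linux path, here is a minimal sketch of how an expedited private membarrier can be issued. It is not the code from this MR: the function name `sketch_exp_membarrier` and the `SKETCH_` constants are hypothetical, and the command values are copied from the kernel's `enum membarrier_cmd` so the example builds against older headers. Note that the kernel requires a one-time `MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED` before `MEMBARRIER_CMD_PRIVATE_EXPEDITED` can be used.

```c
#define _GNU_SOURCE

#if defined(__linux__)
#include <unistd.h>
#include <sys/syscall.h>
#endif

/* Values from the kernel's enum membarrier_cmd, redefined locally with a
 * SKETCH_ prefix (hypothetical names) to avoid depending on new headers. */
#define SKETCH_MEMBARRIER_CMD_QUERY                      0
#define SKETCH_MEMBARRIER_CMD_PRIVATE_EXPEDITED          (1 << 3)
#define SKETCH_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED (1 << 4)

/* Returns 1 if an expedited private barrier was issued, 0 if the
 * platform or kernel does not support it. */
static int sketch_exp_membarrier(void)
{
#if defined(__linux__) && defined(__NR_membarrier)
    /* Not thread-safe; a real implementation would guard the registration. */
    static int registered;
    long mask;

    if (!registered)
    {
        /* QUERY returns a bitmask of supported commands, or -1 on error. */
        mask = syscall(__NR_membarrier, SKETCH_MEMBARRIER_CMD_QUERY, 0, 0);
        if (mask < 0) return 0;
        if (!(mask & SKETCH_MEMBARRIER_CMD_PRIVATE_EXPEDITED)) return 0;

        /* The process must register its intent once before using the command. */
        if (syscall(__NR_membarrier, SKETCH_MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0, 0))
            return 0;
        registered = 1;
    }

    /* Executes a full memory barrier on every running thread of this process. */
    return syscall(__NR_membarrier, SKETCH_MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0, 0) == 0;
#else
    return 0;
#endif
}
```

A caller would treat a return value of 0 as "fast path unavailable" and continue with the next strategy.
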
The fast path patches from @tmatthies are slightly modified in the following ways:
1. On unsupported platforms, the `try_*()` functions return `FALSE` instead of `0`.
2. `try_exp_membarrier()` is called first, then `try_mach_tgrpvs()`.
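
The resulting call order can be sketched roughly as follows. This is an illustration only: the bodies of `try_exp_membarrier()` and `try_mach_tgrpvs()` are stand-ins driven by hypothetical `sketch_have_*` flags, and `slow_fallback()`, `enum flush_path`, and `flush_process_write_buffers_sketch()` are names invented for the example, not the ones used in the patches.

```c
typedef int BOOL;
#define FALSE 0
#define TRUE  1

/* Which strategy ended up handling the flush (hypothetical, for illustration). */
enum flush_path { PATH_MEMBARRIER, PATH_MACH, PATH_FALLBACK };

/* Hypothetical switches that simulate platform support in this sketch. */
static BOOL sketch_have_membarrier;
static BOOL sketch_have_mach;

/* Stand-in: the real function would issue an expedited membarrier on Linux
 * and return FALSE on platforms where it is unsupported. */
static BOOL try_exp_membarrier(void)
{
    return sketch_have_membarrier ? TRUE : FALSE;
}

/* Stand-in: the real function would call thread_get_register_pointer_values
 * on every thread on macOS and return FALSE elsewhere. */
static BOOL try_mach_tgrpvs(void)
{
    return sketch_have_mach ? TRUE : FALSE;
}

/* Stand-in for the slow path that queries the context of each thread. */
static void slow_fallback(void)
{
}

/* Tries the fast paths in order and only then uses the slow fallback. */
static enum flush_path flush_process_write_buffers_sketch(void)
{
    if (try_exp_membarrier()) return PATH_MEMBARRIER;
    if (try_mach_tgrpvs()) return PATH_MACH;
    slow_fallback();
    return PATH_FALLBACK;
}
```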
---
Known applications fixed by this MR:
- osu! (rhythm game) song selection menu stuttering
- .NET CoreCLR GC
- HotSpot JVM (-XX:+UseSystemMemoryBarrier) safepoints

--
v3: kernel32/tests: Add a store buffering litmus test involving FlushProcessWriteBuffers.
    ntdll: Add slow fallback implementation of NtFlushProcessWriteBuffers.