On Sun Sep 4 13:46:59 2022 +0000, Jinoh Kang wrote:
I don't think I can/should do that, since if thread 2 issues a memory barrier while thread 1 is already waiting on one, thread 1 will also wait for the second memory barrier to complete instead of just its own.

We can serialize the barrier calls with a mutex here, too. This avoids excessive APCs when multiple threads call NtFlushProcessWriteBuffers concurrently (e.g. RCU with concurrent writers, or GC in multiple arenas/isolates). If we neither serialize the barrier nor coalesce APCs, the total number of simultaneous APCs will be `N * M`, where N = number of threads in the process and M = number of concurrent calls to NtFlushProcessWriteBuffers. Since not all applications use membarrier in the first place, we can also avoid extra object allocation for threads that never end up using the global memory barrier. (Yet another approach to this problem would be to keep track of generations. It would let us coalesce APCs, but it sounds like overkill.)

In general, we want to minimize the complexity and overhead of the fallback path since it will rarely be used: newer operating systems will just use mprotect/membarrier/mach calls, and the fallback is only used when all else fails.
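
For reference, the generation-tracking idea could look roughly like the sketch below. It is a standalone pthreads model, not Wine code: `struct coalescer`, `coalesced_barrier()` and `count_broadcast()` are hypothetical names, and `count_broadcast()` merely stands in for queueing an APC to every thread and waiting for all of them to finish. A caller that arrives while a round is in flight waits for the next round instead of starting its own, so concurrent callers share one round of APCs.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical model of generation tracking; none of these names exist in Wine. */
struct coalescer
{
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    unsigned long   started;    /* number of the last round that has begun */
    unsigned long   completed;  /* number of the last round that has finished */
};

#define COALESCER_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* Stand-in for "queue an APC to every thread and wait for completion". */
static int broadcast_calls;
static void count_broadcast(void) { broadcast_calls++; }

static void coalesced_barrier(struct coalescer *c, void (*broadcast)(void))
{
    unsigned long target, gen;

    pthread_mutex_lock(&c->lock);
    /* A round already in flight may have passed some threads before we
     * arrived, so only a round that begins after this point is good enough. */
    target = c->started + 1;
    while (c->completed < target)
    {
        if (c->started == c->completed)
        {
            /* Nothing in flight: this caller runs the next round itself. */
            gen = ++c->started;
            pthread_mutex_unlock(&c->lock);
            broadcast();
            pthread_mutex_lock(&c->lock);
            assert(c->started == gen);  /* nobody else starts a round meanwhile */
            c->completed = gen;
            pthread_cond_broadcast(&c->cond);
        }
        else
        {
            /* A round is in flight: wait for it, then re-check. */
            pthread_cond_wait(&c->cond, &c->lock);
        }
    }
    pthread_mutex_unlock(&c->lock);
}
```

In this model only the thread that runs a round sees the outcome of `broadcast()`, so passing an error on to the callers that merely waited would need extra state.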
I protected the APC path with a mutex and made the memory barrier object a global object that is only created once. This means that the `wake_up(...)` calls might do a little unnecessary work if multiple processes issue a memory barrier at the same time, but I don't think that matters much, and we no longer have to create one object per process (or thread).
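
A simplified, single-process sketch of that arrangement, assuming POSIX threads. This is not the actual patch: `struct barrier_object`, `create_global_barrier()` and `flush_write_buffers_fallback()` are made-up names, and the APC round is reduced to a counter increment.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the shared memory barrier object. */
struct barrier_object
{
    int rounds;  /* number of barrier rounds completed so far */
};

static pthread_once_t  barrier_once  = PTHREAD_ONCE_INIT;
static pthread_mutex_t barrier_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct barrier_object global_barrier;
static int create_count;  /* how many times the object was created */

static void create_global_barrier(void)
{
    global_barrier.rounds = 0;
    create_count++;
}

/* Fallback path: create the object lazily, then serialize callers. */
static int flush_write_buffers_fallback(void)
{
    int round;

    /* The object is created on first use only, so processes that never
     * issue a barrier pay nothing for it. */
    pthread_once(&barrier_once, create_global_barrier);
    assert(create_count == 1);

    /* One caller at a time: at most one round of APCs is in flight. */
    pthread_mutex_lock(&barrier_mutex);
    /* Placeholder for "queue an APC to each thread and wait for them". */
    round = ++global_barrier.rounds;
    pthread_mutex_unlock(&barrier_mutex);
    return round;
}
```

With the mutex held for the whole round, M concurrent callers cost M consecutive rounds of N APCs each instead of `N * M` APCs at once.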
I have thought up a way to coalesce the APCs too, but it's more complex and doesn't easily allow reporting errors back to the originating thread. Not sure if I should implement it?