How about making the membarrier object per-process instead?
I don't think I should. With a single per-process object, if thread 2 issues a memory barrier while thread 1 is already waiting on one, thread 1 ends up waiting for the second barrier to complete as well, instead of just its own.
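
To make the concern concrete, here is a toy, single-threaded model of it. This is only a sketch and not the actual implementation: the `Barrier` class, its `pending` counter, and the fixed count of three target threads are all assumptions made up for illustration. It counts how many acknowledgements have to arrive before thread 1's wait can finish, once with a shared object and once with one object per request.

```python
class Barrier:
    """Hypothetical barrier state: acknowledgements still outstanding."""

    def __init__(self):
        self.pending = 0

    def issue(self, n_targets):
        # A new request adds one expected ack per target thread.
        self.pending += n_targets

    def ack(self):
        self.pending -= 1

    def done(self):
        return self.pending == 0


def acks_until_thread1_done(shared):
    """Count the acks delivered before thread 1's wait finishes."""
    n_targets = 3  # assumed number of threads that must acknowledge
    b1 = Barrier()
    b2 = b1 if shared else Barrier()

    b1.issue(n_targets)  # thread 1 issues its barrier and starts waiting
    b2.issue(n_targets)  # thread 2 issues one while thread 1 still waits

    delivered = 0
    # Acks for thread 1's request arrive first, then those for thread 2's.
    for target in [b1] * n_targets + [b2] * n_targets:
        if b1.done():
            break
        target.ack()
        delivered += 1
    return delivered


print("shared object:", acks_until_thread1_done(shared=True))
print("one per request:", acks_until_thread1_done(shared=False))
```

In this model the shared object keeps thread 1 blocked until all six acks are in, while separate objects release it after its own three.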