(I've seen up to 15x slower)
This is because Windows' scheduling granularity is larger, so my test program ended up sleeping longer. Once that's accounted for, the overall runtime is comparable.
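For anyone who wants to check the sleep effect on their own machine, here is a minimal sketch of how the overshoot could be measured. It is not the test program used for the numbers above; `average_sleep` is a hypothetical helper, and it uses portable C++ (`std::this_thread::sleep_for`) rather than the Win32 `Sleep` call.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not from the MR): returns the average wall-clock
// duration of `iterations` sleeps of `requested` length. With a coarse
// scheduler tick the result can be much larger than `requested`.
std::chrono::nanoseconds average_sleep(std::chrono::nanoseconds requested,
                                       int iterations)
{
    using clock = std::chrono::steady_clock;
    std::chrono::nanoseconds total{0};
    for (int i = 0; i < iterations; ++i) {
        auto start = clock::now();
        std::this_thread::sleep_for(requested);
        total += std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now() - start);
    }
    return total / iterations;
}
```

Comparing the returned average against the requested duration on each platform gives a rough per-sleep correction factor for the total runtime.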
I also measured the overhead of the locking operations themselves. In general, this MR is comparable with the current implementation and a bit faster than native. However, when there is no contention (i.e. only AcquireShared), native is faster: IIRC it is about twice as fast as the current implementation and about 2.5x as fast as this MR.
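The uncontended case can be approximated with a single-threaded loop like the sketch below. This is an illustration only, not the benchmark behind the numbers above: `ns_per_shared_acquire` is a hypothetical name, and `std::shared_mutex` (C++17) stands in for the SRW lock so the example stays portable.

```cpp
#include <chrono>
#include <shared_mutex>

// Hypothetical micro-benchmark (not from the MR): average cost in
// nanoseconds of one shared acquire/release pair when no other thread
// touches the lock, i.e. the "only AcquireShared" case.
double ns_per_shared_acquire(long iterations)
{
    std::shared_mutex lock;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        lock.lock_shared();    // stand-in for AcquireSRWLockShared
        lock.unlock_shared();  // stand-in for ReleaseSRWLockShared
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() / iterations;
}
```

A loop like this only captures the fast path; it says nothing about behaviour under contention, which is where the numbers above differ the other way.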