I made a small perf test just in case (attached). As expected, it can't detect any difference in RtlWakeAddressAll() time, because the pre-existing variance in execution time is much greater. The change is very subtle perf-wise.
Unrelated to this change, but the test shows some interesting things by itself. E.g., the time of RtlWakeAddressAll() with 6 threads is more than 2 times greater than on Windows, and as far as my further profiling went, the vast majority of that time is spent in Linux futex_wake() (not even in the multiple NtAlertThreadByThreadId() calls through the dispatcher; hacking those into one call instead of 6 improves things a bit on average, but not significantly).
[waitonaddr.c](/uploads/f4ac4129f4fb5ccfb0d364313936c90c/waitonaddr.c)