Rebased, and there are no longer any performance regressions on x86_64.
I also did a bit more testing on arm64 Windows and am now relatively certain that it does a Query + Write + Flush in the good case and a Query + Protect + Write + Protect + Flush in the bad case. That conclusion comes from adding up the timings and from the fact that it seems to always flush, which isn't really testable on x86_64.
I don't think (or at least I hope) that any application relies on this exact sequence of events, so it is probably fine to leave it as is, given that it now offers the same behaviour.
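
For reference, here is a toy model of the two sequences described above, in case it helps when comparing traces. It is only a sketch: the step names and the `write_sequence` helper are hypothetical and do not correspond to real API calls, and `needs_protect` is my assumption for what separates the bad case from the good one (the write has to be bracketed by two protection changes).

```c
#include <stddef.h>

/* Hypothetical labels for the observed steps; these are not real API calls. */
enum step { STEP_QUERY, STEP_PROTECT, STEP_WRITE, STEP_FLUSH };

/*
 * Fills `out` with the steps in the order they appear to happen and
 * returns how many steps were recorded (3 or 5).
 *
 * needs_protect == 0: "good" case, Query + Write + Flush.
 * needs_protect != 0: "bad" case (assumed to mean the protection has to be
 *                     changed first), Query + Protect + Write + Protect + Flush.
 */
static size_t write_sequence(int needs_protect, enum step out[5])
{
    size_t n = 0;

    out[n++] = STEP_QUERY;                       /* always query first */
    if (needs_protect) out[n++] = STEP_PROTECT;  /* change protection before the write */
    out[n++] = STEP_WRITE;
    if (needs_protect) out[n++] = STEP_PROTECT;  /* second protection change after the write */
    out[n++] = STEP_FLUSH;                       /* the flush seems to happen in both cases */

    return n;
}
```

Calling `write_sequence(0, steps)` records three steps and `write_sequence(1, steps)` records five, which matches the step counts I used when adding up the timings.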