Starting with Linux kernel 6.7, userfaultfd (UFFD) supports functionality similar to Windows write watches.
1. Motivation.
a) The primary motivation for this patchset is performance. The issue was originally observed with the game Streets of Rage 4, which sometimes took about 1-1.5 min to load levels (while that happened almost instantly on Windows). The performance problem comes from the .NET Core memory management / garbage collector code. The huge slowdown was caused not only by the direct turnaround difference for accesses to write-watched memory, but also by specifics of the algorithm, which effectively chose between different memory pools / strategies based on memory access timing and ended up in an especially unfortunate pattern that hit write-protected (for write watch) pages much more often than it normally would. Since then something has changed either in the game or in the .NET Core version it uses, and the difference in level loads is no longer that dramatic without this optimization. Still, a load takes some 6-8 sec without this patch versus 1.5-2 sec with it or on Windows, which suggests the patch still provides a large performance improvement, at least for .NET Core memory management.
I am also attaching an ad-hoc microbenchmark program, along with results comparing Windows times and the times with / without the patch, all on the same machine. Here are the results:
```
Parameters:
- number of concurrent threads;
- number of pages;
- delay between reading / resetting write watches (ms);
- random (1) or sequential (0) page write access;
- reset with WRITE_WATCH_FLAG_RESET in GetWriteWatch (1) or in a separate ResetWriteWatch call (0).

Result is in the form of <average write to page time, ns> / <average GetWriteWatch() time, ms>

Parameters        Windows       Kernel watches    Fallback
6 1080 3 1 1      210 / 40      178 / 40          1800 / 1000
6 1080 3 1 0      210 / 45      175 / 45          1300 / 1500
8 8192 3 1 1      290 / 290     245 / 295         70000 / 1800
8 8192 3 1 0      290 / 275     250 / 310         73000 / 1750
8 8192 1 1 1      410 / 265     340 / 285         71000 / 1480
8 8192 1 1 0      400 / 375     350 / 300         73000 / 1450
8 64 1 1 1        245 / 5       210 / 7           230 / 10
8 64 1 1 0        245 / 6       205 / 8           235 / 10
```
The most significant performance difference comes from segfault handling for pages protected for write watches. The major overhead comes not only from the segfault turnaround itself, but also from the following:
- mprotect() is expensive because it splits / merges VMAs on Linux, which happens very often with memory ranges fragmented by write protection for write watches;
- all of that happens under a global lock in Wine, so while such a segfault is being handled, parallel accesses to write-watched memory and other VM operations are blocked.
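
For readers less familiar with the fallback path, the mprotect() / SIGSEGV technique described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Wine's actual code: all `ww_*` names are hypothetical, the sketch is single-threaded with one fixed-size region and no global lock, and it calls mprotect() from a signal handler, which works on Linux in practice but is not on the POSIX async-signal-safe list.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WW_MAX_PAGES 64                /* hypothetical limit for this sketch */

static char *ww_base;                  /* start of the watched region */
static size_t ww_pages, ww_page_size;
static volatile sig_atomic_t ww_dirty[WW_MAX_PAGES];

/* SIGSEGV handler: mark the faulting page dirty and make it writable,
 * so that the interrupted write is retried and succeeds. */
static void ww_on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)ww_base;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + ww_pages * ww_page_size)
    {
        /* Not a write-watch fault: restore the default action and
         * return, so the retried access terminates the process. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    idx = (addr - base) / ww_page_size;
    ww_dirty[idx] = 1;
    /* One mprotect() per first write to a page: this is the call that
     * splits / merges VMAs when applied to scattered pages. */
    mprotect(ww_base + idx * ww_page_size, ww_page_size,
             PROT_READ | PROT_WRITE);
}

/* Map `pages` zeroed, read-only pages and install the fault handler.
 * Returns the region start, or NULL on failure. */
char *ww_alloc(size_t pages)
{
    struct sigaction sa;
    void *mem;

    if (!pages || pages > WW_MAX_PAGES) return NULL;
    ww_page_size = (size_t)sysconf(_SC_PAGESIZE);
    mem = mmap(NULL, pages * ww_page_size, PROT_READ,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;
    ww_base = mem;
    ww_pages = pages;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = ww_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL)) return NULL;
    return ww_base;
}

/* Count pages written since the last reset. If `reset` is nonzero,
 * clear the dirty flags and write-protect those pages again. */
size_t ww_get_dirty(int reset)
{
    size_t i, count = 0;

    for (i = 0; i < ww_pages; i++)
    {
        if (!ww_dirty[i]) continue;
        count++;
        if (reset)
        {
            ww_dirty[i] = 0;
            mprotect(ww_base + i * ww_page_size, ww_page_size, PROT_READ);
        }
    }
    return count;
}
```

With this sketch, every first write to a watched page costs a signal delivery plus an mprotect() call, and every reset costs another mprotect() per dirty page. That is the per-page overhead which kernel-side dirty tracking avoids. Accessing the region through a `volatile char *` in a caller keeps the compiler from reordering the stores around the `ww_get_dirty()` calls.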
b) There are cases where the current way of handling write watches just doesn't work. Tools like apitrace (https://github.com/misyltoad/apitrace) want to track write watches on GPU-accessed memory, and write-protected memory provided for Vulkan resources doesn't work, at least with the AMD driver (while dirty page tracking through UFFD does work there).
2. Caveats
There are two things that are going to work differently with this patch:
a) Unlike generic file I/O, socket I/O done to write-watched buffers doesn't result in pages being marked dirty on Windows. That is currently supported in Wine write watches (specifically for socket I/O, while there might be more cases), and this MR turns it into a TODO. It is possible to implement on top of this patch, and I even have a WIP local implementation, but it complicates the code a lot (introducing mechanics to sync local vprot with the kernel-watched ones, resetting kernel-side watches when necessary and then joining the results of kernel-reported and Wine-tracked dirty pages). Also, the currently implemented way is racy. I believe that most likely nothing in the wild depends on this corner case. From my reading of the commit history, this handling and the corresponding tests were introduced as part of general handling of WW-protected memory for socket I/O (before which it was just broken: an attempt to recv() into such memory would simply fail, which is not the case with the present patch). Besides, the case looks very special, and a patch similar to this one has been in Proton for years and is not known to introduce problems (before UFFD support was added in kernel 6.7, a similar mechanism was provided by a custom patch in SteamOS and some custom kernels that Proton was using).
So it seems to me that, unless something is known to depend on that, it isn't worth the complication, although it is possible to do.
b) The patch changes corner-case behaviour WRT preserving dirty page status across decommitting / committing pages coupled with a protection change. This is where I mostly added additional changes. The pre-existing test could give the impression that write watches are preserved through decommit / commit (and the current implementation works like that), but my additional tests show that this is not actually the case and that the preservation instead has a convoluted dependence on protection changes. So the handling of that is not fully correct either with or without my changes. I believe the specific current behaviour is unlikely to be depended upon, basically for the same reasons as in p. a). Yet it should be possible to implement correctly on top, although it looks a bit more convoluted than p. a).
[win2.c](/uploads/c1c8b69304d81ae80aae0c45e6ef3484/win2.c)