This uses the Mach copy-on-write (COW) mechanism to implement write watch functionality.
Below is the same micro-benchmark @gofman used in his [UFFD MR](https://gitlab.winehq.org/wine/wine/-/merge_requests/7871).
```
Parameters:
- number of concurrent threads;
- number of pages;
- delay between reading / resetting write watches (ms);
- random (1) or sequential (0) page write access;
- reset with WRITE_WATCH_FLAG_RESET in GetWriteWatch (1) or in a separate ResetWriteWatch call (0).
Result is in the form <average write to page time, ns> / <average GetWriteWatch() time, µs>
Parameters       Windows        Mach COW          Fallback
6 1080 3 1 1     897 / 80       371 / 12634       66202 / 186
6 1080 3 1 0     855 / 87       369 / 12637       66766 / 187
8 8192 3 1 1     6526 / 268     627 / 113263      111053 / 485
8 8192 3 1 0     1197 / 509     623 / 113810      122921 / 489
8 8192 1 1 1     1227 / 412     636 / 118930      150628 / 388
8 8192 1 1 0     5721 / 144     631 / 120538      146392 / 384
8 64 1 1 1       572 / 7        490 / 1078        1000 / 89
8 64 1 1 0       530 / 13       500 / 1075        1167 / 77
```
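For reference, the Windows-side pattern such a benchmark exercises looks roughly like this (a minimal sketch, not @gofman's actual benchmark code; the allocation size and the pages written are purely illustrative):

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T size = 64 * 0x1000;
    void *addresses[64];
    ULONG_PTR count = 64;
    DWORD granularity;
    /* memory has to be allocated with MEM_WRITE_WATCH for write watches to apply */
    char *base = VirtualAlloc( NULL, size, MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                               PAGE_READWRITE );

    base[0] = 1;            /* dirties page 0 */
    base[5 * 0x1000] = 1;   /* dirties page 5 */

    /* query the dirty pages and reset them in the same call
       (the "reset with WRITE_WATCH_FLAG_RESET in GetWriteWatch" mode above) */
    if (!GetWriteWatch( WRITE_WATCH_FLAG_RESET, base, size, addresses, &count, &granularity ))
        printf( "%lu dirty pages\n", (unsigned long)count );

    /* the other mode resets separately instead: ResetWriteWatch( base, size ); */
    VirtualFree( base, 0, MEM_RELEASE );
    return 0;
}
```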
All benchmark numbers above are from the same M2 Max machine: the Windows column is Windows 11 on ARM in a VM running the x64 binary under emulation, and the Mach COW and Fallback columns are Wine through Rosetta with and without this MR, respectively.
Unlike UFFD, which is always better than the fallback and comparable to Windows performance, here a good average write-to-page time is traded for a bad average `GetWriteWatch()` time (in roughly equal ratios).
However, in real-world applications (like the FFXIV + Dalamud mod framework/loader use case), this change reduces the cold-start time from about 25.5s to 23.6s; that includes loading a modern dotnet 9 runtime into the game process and initializing a complex mod collection, with fairly high GC pressure.
This is probably because the `GetWriteWatch()` calls made by the GC mostly happen concurrently, whereas in Wine's fallback implementation running threads are interrupted and often end up waiting on Wine's global virtual lock while the segfault is handled, blocking parallel accesses to write-watched memory and other VM operations.
Another advantage is that `VPROT_WRITEWATCH` could then be freed up for other purposes in the future. Also, Rosetta is sometimes a bit finicky about the protections it reports with the current implementation, whereas with the new one it has so far always behaved as expected in my testing.
On native ARM64 the `VM_PROT_COPY`/`SM_COW` mechanism also works as expected with native 16k pages (not that this matters much at the moment).
`GetWriteWatch()` with the reset flag also does not need to be transactional (unlike UFFD), since only marked pages are reset here and not the entire range.
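For illustration, here is a minimal standalone sketch of the underlying trick outside of Wine's plumbing: arm a page by re-protecting it with `VM_PROT_COPY`, then treat any region whose share mode is no longer `SM_COW` as dirty. This is only a sketch of the mechanism under the behavior described above (the exact share-mode transitions differ between Rosetta and native arm64, as noted in the patch comment), not the patch itself:

```c
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    mach_vm_address_t addr = 0;
    mach_vm_size_t size = vm_page_size;
    vm_region_extended_info_data_t info;
    mach_msg_type_number_t info_count = VM_REGION_EXTENDED_INFO_COUNT;
    mach_vm_address_t region_addr;
    mach_vm_size_t region_size;
    mach_port_t object_name;

    mach_vm_allocate( mach_task_self(), &addr, size, VM_FLAGS_ANYWHERE );
    memset( (void *)addr, 0, size );  /* fault the page in first */

    /* "arm" the watch: VM_PROT_COPY pushes a shadow object, the region now reports SM_COW */
    mach_vm_protect( mach_task_self(), addr, size, 0,
                     VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY );

    ((char *)addr)[0] = 0x42;  /* the write resolves the COW, changing the share mode */

    region_addr = addr;
    mach_vm_region( mach_task_self(), &region_addr, &region_size, VM_REGION_EXTENDED_INFO,
                    (vm_region_info_t)&info, &info_count, &object_name );

    printf( "page %s been written to\n", info.share_mode == SM_COW ? "has not" : "has" );
    return 0;
}
```

In these terms, `GetWriteWatch()` corresponds to walking the range with `mach_vm_region()` and collecting everything that is no longer `SM_COW`, and the reset corresponds to re-applying `VM_PROT_COPY` to exactly those pages.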
-- v2: ntdll: Use Mach COW for write watches support on macOS.
From: Marc-Aurel Zent <mzent@codeweavers.com>
---
 dlls/ntdll/unix/virtual.c | 209 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 209 insertions(+)
diff --git a/dlls/ntdll/unix/virtual.c b/dlls/ntdll/unix/virtual.c
index 994f76fb72a..abe42376d5b 100644
--- a/dlls/ntdll/unix/virtual.c
+++ b/dlls/ntdll/unix/virtual.c
@@ -414,6 +414,215 @@ static void kernel_get_write_watches( void *base, SIZE_T size, void **buffer, UL
         addr = next_addr;
     }
 }
+#elif defined(__APPLE__)
+static BYTE get_host_page_vprot( const void *addr );
+
+static int get_unix_prot( BYTE vprot );
+
+static vm_prot_t get_mach_prot( mach_vm_address_t addr )
+{
+    BYTE vprot;
+    int unix_prot;
+    vm_prot_t mach_prot = VM_PROT_NONE;
+
+    vprot = get_host_page_vprot( (const void *)addr );
+    unix_prot = get_unix_prot( vprot );
+
+    if (unix_prot & PROT_READ) mach_prot |= VM_PROT_READ;
+    if (unix_prot & PROT_WRITE) mach_prot |= VM_PROT_WRITE;
+    if (unix_prot & PROT_EXEC) mach_prot |= VM_PROT_EXECUTE;
+
+    return mach_prot;
+}
+
+static void kernel_writewatch_init(void)
+{
+    use_kernel_writewatch = 1;
+    TRACE( "Using mach write watches.\n" );
+}
+
+static void kernel_writewatch_reset( void *start, SIZE_T len )
+{
+    mach_vm_address_t current_address = (mach_vm_address_t)ROUND_ADDR( start, host_page_mask );
+    SIZE_T end = current_address + ROUND_SIZE( start, len, host_page_mask );
+    kern_return_t kr;
+
+    while (current_address < end)
+    {
+        vm_prot_t mach_prot = get_mach_prot( current_address );
+
+        kr = mach_vm_protect( mach_task_self(), current_address, host_page_size, 0,
+                              mach_prot | VM_PROT_COPY );
+
+        if (kr != KERN_SUCCESS)
+        {
+            ERR( "mach_vm_protect failed on address %p: %d\n", (void *)current_address, kr );
+            break;
+        }
+
+        current_address += host_page_size;
+    }
+}
+
+static void kernel_writewatch_register_range( struct file_view *view, void *base, size_t size )
+{
+    mach_vm_address_t current_address = (mach_vm_address_t)ROUND_ADDR( base, host_page_mask );
+    mach_vm_address_t region_address;
+    mach_vm_size_t region_size;
+    mach_msg_type_number_t info_count;
+    mach_port_t object_name;
+    vm_region_extended_info_data_t info;
+    SIZE_T end = current_address + ROUND_SIZE( base, size, host_page_mask );
+    kern_return_t kr;
+
+    if (!(view->protect & VPROT_WRITEWATCH) || !use_kernel_writewatch) return;
+
+    while (current_address < end)
+    {
+        vm_prot_t mach_prot = get_mach_prot( current_address );
+
+        region_address = current_address;
+        info_count = VM_REGION_EXTENDED_INFO_COUNT;
+        kr = mach_vm_region( mach_task_self(), &region_address, &region_size, VM_REGION_EXTENDED_INFO,
+                             (vm_region_info_t)&info, &info_count, &object_name );
+
+        if (kr != KERN_SUCCESS)
+        {
+            ERR( "mach_vm_region failed: %d\n", kr );
+            break;
+        }
+
+        if (region_address > current_address)
+        {
+            ERR( "trying to register unmapped region\n" );
+            break;
+        }
+
+        assert( info.protection == mach_prot );
+
+        /*
+         * Calling mach_vm_protect with VM_PROT_COPY will create a new shadow object
+         * for the page, so that we can track writes to it.
+         * If the page is already COW, this still works and increases shadow depth
+         * by one even with already existing identical protection.
+         * We need this per host page, to keep track of the writes when the share
+         * mode changes to/from SM_COW.
+         * This operation can always be done (and was even designed for this),
+         * originally to increase the maximum protection set, but it works well
+         * for our purpose too.
+         * Once the page flips back from SM_COW to another share mode (usually
+         * SM_PRIVATE), XNU might do some funky things like merging regions together
+         * or even worse keep SM_COW after the write and increase shadow depth
+         * and point it to a new shadow object with identical contents (usually
+         * only happens though on native arm64, not on Rosetta).
+         * This can still be handled correctly, if we were to keep track of
+         * the shadow_depth and pages_shared_now_private per page, but this is
+         * extra complexity we don't need.
+         * Creating a mach memory entry makes sure the vm_map_entry is backed by
+         * exactly one unique vm_object and avoids the headaches mentioned above
+         * and potential submaps.
+         * This is because mach_make_memory_entry_64() is in essence like the first
+         * step of a mach_vm_remap() operation, which calls into vm_map_remap_extract()
+         * and ensures the above requirement.
+         * The cleanup happens once the last reference to the vm_entry port and the
+         * mapped memory at that address is deallocated.
+         */
+
+        region_size = (mach_vm_size_t)host_page_size;
+        kr = mach_vm_protect( mach_task_self(), current_address, region_size, 0,
+                              mach_prot | VM_PROT_COPY );
+
+        if (kr != KERN_SUCCESS)
+        {
+            ERR( "mach_vm_protect failed: %d\n", kr );
+            break;
+        }
+
+        kr = mach_make_memory_entry_64( mach_task_self(), &region_size, current_address, mach_prot,
+                                        &object_name, MACH_PORT_NULL );
+
+        if (kr != KERN_SUCCESS)
+        {
+            ERR( "mach_make_memory_entry_64 failed: %d\n", kr );
+            current_address += host_page_size;
+            continue;
+        }
+
+        assert( region_size == host_page_size );
+        mach_port_deallocate( mach_task_self(), object_name );
+        current_address += host_page_size;
+    }
+}
+
+static void kernel_get_write_watches( void *base, SIZE_T size, void **buffer, ULONG_PTR *count, BOOL reset )
+{
+    mach_vm_address_t current_address;
+    mach_vm_address_t region_address;
+    mach_vm_size_t region_size;
+    mach_msg_type_number_t info_count;
+    mach_port_t object_name;
+    vm_region_extended_info_data_t info;
+    data_size_t remaining_size;
+    SIZE_T buffer_len = *count;
+    size_t end;
+    kern_return_t kr;
+
+    assert( !(size & page_mask) );
+
+    end = (size_t)((char *)base + size);
+    remaining_size = ROUND_SIZE( base, size, host_page_mask );
+    current_address = (mach_vm_address_t)ROUND_ADDR( base, host_page_mask );
+    *count = 0;
+
+    while (remaining_size && buffer_len)
+    {
+        region_address = current_address;
+        info_count = VM_REGION_EXTENDED_INFO_COUNT;
+        kr = mach_vm_region( mach_task_self(), &region_address, &region_size, VM_REGION_EXTENDED_INFO,
+                             (vm_region_info_t)&info, &info_count, &object_name );
+
+        if (kr != KERN_SUCCESS)
+        {
+            ERR( "mach_vm_region failed: %d\n", kr );
+            break;
+        }
+
+        if (region_address > min( current_address, (mach_vm_address_t)end )) break;
+
+        if (info.share_mode != SM_COW)
+        {
+            size_t c_addr = max( (size_t)current_address, (size_t)base );
+            size_t region_end = min( (size_t)(region_address + region_size), end );
+
+            while (buffer_len && c_addr < region_end)
+            {
+                buffer[(*count)++] = (void *)c_addr;
+                --buffer_len;
+                c_addr += page_size;
+            }
+        }
+
+        current_address += region_size;
+        remaining_size -= region_size;
+    }
+
+    if (reset)
+    {
+        ULONG_PTR i;
+        vm_prot_t mach_prot;
+
+        for (i = 0; i < *count; i++)
+        {
+            current_address = (mach_vm_address_t)buffer[i];
+            mach_prot = get_mach_prot( current_address );
+            kr = mach_vm_protect( mach_task_self(), current_address, page_size, 0,
+                                  mach_prot | VM_PROT_COPY );
+
+            if (kr != KERN_SUCCESS)
+                ERR( "mach_vm_protect failed: %d\n", kr );
+        }
+    }
+}
 #else
 static void kernel_writewatch_init(void)
 {
On Fri Oct 3 10:30:47 2025 +0000, Marc-Aurel Zent wrote:
> changed this line in [version 2 of the diff](/wine/wine/-/merge_requests/9090/diffs?diff_id=214154&start_sha=62d2baa3f3185f1e4c6fcc7a018c8863432a2477#584f4313ed133393a8f1903c21dcaf4967ba7ab9_429_428)
I wanted to avoid forward declaring get_host_page_vprot and get_unix_prot, but maybe that was not worth the effort.
And indeed, if we wanted to build a 32-bit unix side on older OSX versions, `!_WIN64` would be relevant here.
I changed it in v2 to use the existing implementation, which is probably cleaner all in all (hopefully).
On Thu Oct 2 18:31:25 2025 +0000, Tim Clem wrote:
> Is there a reason to do this page by page? Can you not just do one mach_vm_protect for the whole range?
I tried explaining the rationale behind this in a comment in v2 at that spot, since it is not immediately obvious why it is done this way.
On Thu Oct 2 18:31:45 2025 +0000, Tim Clem wrote:
> What's the point of making this memory entry?
See above.
Tim Clem (@tclem) commented about dlls/ntdll/unix/virtual.c:
> +    mach_msg_type_number_t info_count;
> +    mach_port_t object_name;
> +    vm_region_extended_info_data_t info;
> +    SIZE_T end = current_address + ROUND_SIZE( base, size, host_page_mask );
> +    kern_return_t kr;
> +
> +    if (!(view->protect & VPROT_WRITEWATCH) || !use_kernel_writewatch) return;
> +
> +    while (current_address < end)
> +    {
> +        vm_prot_t mach_prot = get_mach_prot( current_address );
> +
> +        region_address = current_address;
> +        info_count = VM_REGION_EXTENDED_INFO_COUNT;
> +        kr = mach_vm_region( mach_task_self(), &region_address, &region_size, VM_REGION_EXTENDED_INFO,
> +                             (vm_region_info_t)&info, &info_count, &object_name );
Have you confirmed that the protection from these `vm_region_info_t`s is accurate under Rosetta? I seem to remember it reporting the actual underlying page protection that Mach is using, which in the case of wx pages might be different than how it acts - or was allocated - under Rosetta (since w|x isn't supported in native ARM and is emulated by Rosetta by handling the exceptions). So for instance for a MEM_WRITE_WATCH page with PAGE_EXECUTE_READWRITE protections, I imagine the assert below might fail.
On Fri Oct 3 17:02:49 2025 +0000, Tim Clem wrote:
> Have you confirmed that the protection from these `vm_region_info_t`s is accurate under Rosetta? I seem to remember it reporting the actual underlying page protection that Mach is using, which in the case of wx pages might be different than how it acts - or was allocated - under Rosetta (since w|x isn't supported in native ARM and is emulated by Rosetta by handling the exceptions). So for instance for a MEM_WRITE_WATCH page with PAGE_EXECUTE_READWRITE protections, I imagine the assert below might fail.
IMO the major problem here is that while this greatly optimizes one part (access to watched pages), it greatly degrades the performance of the watch query / reset part at the same time. The latter is probably not salvageable within the API limitation: even if the query part could be done over bigger regions, the reset part must still only reset the detected-dirty pages, or it will miss dirty state under concurrent access. This may of course still help some usage patterns but regress others. So the main question is whether this tradeoff is overall beneficial?
> Have you confirmed that the protection from these `vm_region_info_t`s is accurate under Rosetta?
Yeah it is accurate (in process that is, since that is the Rosetta wrapped version).
Also modern dotnet with `DOTNET_EnableWriteXorExecute=0` hammers this exact use case quite hard and I haven't seen it fail there once.
I think part of the issue with Rosetta and the current write watch implementation is that both Rosetta and Wine flip pages back and forth between RX and RW to implement their respective functionality, and Rosetta sometimes does an imperfect job in what it reports in the page fault handler.
Creating the vm shadow object here happens fully on the kernel side and is also correctly visible cross-process (unlike Rosetta's RWX reporting).
> So the main question is whether this tradeoff is overall beneficial?
I asked that myself, and I believe it is; from what I have seen at least.
If there is an application whose performance actually degrades from this, that might be problematic, but I haven't been able to find one yet.
This functionality could also easily be added behind an env var, but I believe it to be always correct, slightly faster in total with modern dotnet workloads, and as mentioned before potentially freeing up `VPROT_WRITEWATCH` for other purposes in the future.
> Have you confirmed that the protection from these `vm_region_info_t`s is accurate under Rosetta?
> Yeah it is accurate (in process that is, since that is the Rosetta wrapped version).
Ah ok, thanks for confirming - and come to think of it, I believe I was testing that behavior in the middle of a fault handler, when things are probably inconsistent.
> So the main question is whether this tradeoff is overall beneficial?
I don't really know how to judge that. The access vs. query tradeoff does seem to balance out, roughly, and I would believe that it improves some .Net workflows. OTOH it is a little finicky in that it relies on some undocumented behavior.