I can put the addition of NtFlushInstructionCache into a separate commit. While doing a bit of research on the differences between NtWriteVirtualMemory and WriteProcessMemory, IIRC there was a mention somewhere that, in addition to changing the protection, WriteProcessMemory also flushes the instruction cache (I can try to find some more evidence for that on arm64 Windows).
I have been thinking about unconditionally changing the protection to writable and then writing; however, for well-behaved applications that would be three Nt* operations versus two when querying first. Here are some intra-process performance numbers, in microseconds, for a writable 1024-byte region with this change (10000 iterations on x86_64):

| | native | wine |
| ------ | ------ | ------ |
| WriteProcessMemory | 1.414879 | 22.287070 |
| NtWriteVirtualMemory | 1.053687 | 19.914850 |
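
To make the call-count trade-off concrete, here is a rough, portable model of the two strategies. It is only a sketch: `struct region` and the `model_*` helpers are hypothetical stand-ins for NtQueryVirtualMemory, NtProtectVirtualMemory and NtWriteVirtualMemory, not the actual Wine code, and I am assuming the unconditional variant restores the old protection after writing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a target memory region; not a Wine structure. */
struct region {
    unsigned char data[1024];
    bool writable;
    int nt_calls;   /* number of simulated Nt* operations */
};

/* Stand-in for NtQueryVirtualMemory: reports whether the region is writable. */
bool model_query_writable(struct region *r)
{
    r->nt_calls++;
    return r->writable;
}

/* Stand-in for NtProtectVirtualMemory: returns the previous state. */
bool model_protect(struct region *r, bool writable)
{
    bool old = r->writable;
    r->nt_calls++;
    r->writable = writable;
    return old;
}

/* Stand-in for NtWriteVirtualMemory: fails on a read-only region. */
bool model_write(struct region *r, const void *src, size_t len)
{
    r->nt_calls++;
    if (!r->writable || len > sizeof(r->data)) return false;
    memcpy(r->data, src, len);
    return true;
}

/* Strategy A: query first, change protection only when needed. */
bool write_query_first(struct region *r, const void *src, size_t len)
{
    bool ok;
    assert(r != NULL);
    if (model_query_writable(r)) return model_write(r, src, len);
    model_protect(r, true);
    ok = model_write(r, src, len);
    model_protect(r, false);   /* restore read-only */
    return ok;
}

/* Strategy B: always make writable, write, then restore. */
bool write_unconditional(struct region *r, const void *src, size_t len)
{
    bool ok, old;
    assert(r != NULL);
    old = model_protect(r, true);
    ok = model_write(r, src, len);
    model_protect(r, old);     /* restore whatever was there before */
    return ok;
}
```

In this model, an already-writable region costs two operations with query-first and three with the unconditional variant, while a read-only region costs four with query-first. That is why query-first looks better for well-behaved applications.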
I can add a few more tests as well. Thanks for looking over it!