This removes 20 `movaps` instructions from every syscall that calls a `sysv_abi` function, plus an `and` for stack alignment and a few other instructions, depending on the function.
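For readers unfamiliar with where the `movaps` come from: in the Microsoft x64 ABI `xmm6`-`xmm15` are callee-saved, while in the System V ABI every `xmm` register is caller-saved, so an `ms_abi` function that calls a `sysv_abi` function generally has to spill and reload those ten registers (10 saves + 10 restores). Below is a minimal sketch of that situation. All `example_*` names are made up for illustration and are not Wine functions; the attributes are GCC/Clang extensions and only take effect on x86-64.

```c
#include <assert.h>

/* The ABI attributes only exist on x86-64 GCC/Clang; elsewhere they
 * expand to nothing so the sketch still compiles. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define EXAMPLE_MS_ABI   __attribute__((ms_abi))
#define EXAMPLE_SYSV_ABI __attribute__((sysv_abi))
#define EXAMPLE_NOINLINE __attribute__((noinline))
#else
#define EXAMPLE_MS_ABI
#define EXAMPLE_SYSV_ABI
#define EXAMPLE_NOINLINE
#endif

/* Hypothetical Unix-side helper using the System V ABI. */
EXAMPLE_SYSV_ABI EXAMPLE_NOINLINE long example_unix_helper(long a, long b)
{
    return a + b;
}

/* "Before": the entry point is ms_abi, so xmm6-xmm15 are callee-saved for
 * it, but the sysv_abi callee is allowed to clobber them. */
EXAMPLE_MS_ABI long example_syscall_ms(long a, long b)
{
    return example_unix_helper(a, b);
}

/* "After": the entry point is sysv_abi as well, so no xmm registers need
 * to be preserved across the call. */
EXAMPLE_SYSV_ABI long example_syscall_sysv(long a, long b)
{
    return example_unix_helper(a, b);
}
```

Compiling this with something like `gcc -O2 -S` on x86-64 and comparing the two entry points should show the difference, though the exact output depends on the compiler version and flags.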
In `NtAllocateLocallyUniqueId`, for example, this reduces the instruction count from 63 to 36. I don't entirely understand the llvm-mca output, but here are the before and after stats it reports for that function:

| | Before | After |
|---|---|---|
| Iterations | 100 | 100 |
| Instructions | 6300 | 3600 |
| Total Cycles | 3335 | 1514 |
| Total uOps | 6300 | 3600 |
| Dispatch Width | 6 | 6 |
| uOps Per Cycle | 1.89 | 2.38 |
| IPC | 1.89 | 2.38 |
| Block RThroughput | 15.0 | 6.0 |

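As a rough reading aid for these numbers: llvm-mca simulates the block repeatedly, so `IPC` is simply `Instructions / Total Cycles`, and `Total Cycles / Iterations` gives the simulated cycles per pass through the function. The sketch below recomputes those values from the figures above; the struct and function names are made up for illustration. Keep in mind this is a static estimate from a simulator, not a measurement on real hardware.

```c
#include <assert.h>

/* Summary fields copied from the llvm-mca output quoted above. */
struct example_mca_summary {
    unsigned iterations;
    unsigned instructions;
    unsigned total_cycles;
};

const struct example_mca_summary example_before = { 100, 6300, 3335 };
const struct example_mca_summary example_after  = { 100, 3600, 1514 };

/* Instructions retired per simulated cycle. */
double example_ipc(struct example_mca_summary s)
{
    return (double)s.instructions / s.total_cycles;
}

/* Simulated cycles spent per iteration of the block. */
double example_cycles_per_iteration(struct example_mca_summary s)
{
    return (double)s.total_cycles / s.iterations;
}

/* Ratio of simulated cycles, before versus after. */
double example_speedup(struct example_mca_summary before,
                       struct example_mca_summary after)
{
    return (double)before.total_cycles / after.total_cycles;
}
```

With these inputs the simulated cost drops from about 33.4 to about 15.1 cycles per iteration, roughly a 2.2x reduction in simulated cycles.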
This currently depends on the syscall dispatcher aligning the stack, which as far as I can tell is the case if `sizeof(struct syscall_frame) % 16 == 0`. If that is not good enough, I can add an `andq $~15,%rsp` somewhere.
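To make the alignment condition concrete, here is a small sketch of how it could be checked at compile time, together with the arithmetic that `andq $~15,%rsp` performs. `struct example_frame` is a hypothetical stand-in and not Wine's real `struct syscall_frame`, and the helper name is made up. Note that the size check only helps under the assumption that the frame itself starts at a 16-byte aligned address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a syscall frame; the real layout in Wine
 * differs. 18 fields * 8 bytes = 144 bytes = 9 * 16. */
struct example_frame {
    uint64_t gprs[16];
    uint64_t rip;
    uint64_t flags;
};

/* Compile-time version of the condition named in the description: if the
 * frame size is a multiple of 16, pushing one frame onto a 16-byte aligned
 * stack leaves the stack 16-byte aligned. */
_Static_assert(sizeof(struct example_frame) % 16 == 0,
               "frame size must keep the stack 16-byte aligned");

/* C equivalent of `andq $~15,%rsp`: round a stack pointer value down to
 * the nearest multiple of 16. */
uintptr_t example_align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}
```

A static assertion has no runtime cost, whereas the explicit `andq` costs one instruction per syscall but does not rely on the frame layout.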
One question I have is whether we want to continue supporting CDECL syscalls (currently only `wine_server_call`, `wine_server_fd_to_handle` and `wine_server_handle_to_fd`). If we do, this adds a bit of complexity to the syscall dispatcher; see the commit "FIXUP ntdll: Support CDECL syscalls." If we don't, and make those syscalls WINAPI instead, then every call to those functions on x86 seems to either stay the same or gain one `add` instruction. However, we would of course lose the ability to make CDECL syscalls.
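For context on the CDECL versus WINAPI question: on 32-bit x86, WINAPI is `stdcall`, and the practical difference from `cdecl` is who removes the arguments from the stack (the callee for `stdcall`, the caller for `cdecl`). The sketch below declares one function of each kind; the `example_*` names are made up, and the attributes are GCC/Clang extensions that only matter when targeting i386, so they are compiled out elsewhere.

```c
#include <assert.h>

/* cdecl/stdcall are only meaningful on 32-bit x86; on other targets the
 * macros expand to nothing so the sketch still builds. */
#if defined(__i386__) && (defined(__GNUC__) || defined(__clang__))
#define EXAMPLE_CDECL   __attribute__((cdecl))
#define EXAMPLE_STDCALL __attribute__((stdcall))
#else
#define EXAMPLE_CDECL
#define EXAMPLE_STDCALL
#endif

/* cdecl: the caller pops the arguments after the call returns. */
EXAMPLE_CDECL int example_call_cdecl(int a, int b)
{
    return a - b;
}

/* stdcall (what WINAPI means on i386): the callee pops its own
 * arguments, typically with `ret $8` for two 4-byte arguments. */
EXAMPLE_STDCALL int example_call_stdcall(int a, int b)
{
    return a - b;
}
```

Building this with `-m32 -O2 -S` and inspecting the call sites is one way to see how the stack cleanup differs between the two conventions.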
-- v2:

- Revert "ntdll: Make CDECL syscalls WINAPI instead."
- FIXUP ntdll: Support CDECL syscalls.
- ntdll: Make syscall functions sysv_abi on x64.
- ntdll: Make CDECL syscalls WINAPI instead.
- win32u: Make syscalls use the SYSCALL calling convention.
- ntdll: Make syscalls use the SYSCALL calling convention.
- include: Add SYSCALL calling convention.