I'm not sure that going to those lengths to have a single asm version for both 32- and 64-bit x86 helps more than it hurts. Those defines make it work, but they don't look super nice: e.g. the first line of the asm for 64-bit ends up being `mov %ecx, %ecx`.
I was hoping the defines would also make the code easier to follow.
Just splitting the two versions apart might help.
I moved more code to C and made the argument reading code separate for 32- and 64-bit versions in v2. Hope this makes it more readable.
Otherwise, using compiler intrinsics could be an option. I guess it would leave some performance on the table by handing register allocation to the compiler, but maybe not that much?
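To make the intrinsics idea concrete, here is a minimal sketch in C. It is not code from this merge request: `add_floats` is a hypothetical helper chosen only for illustration. It shows the general shape, where the code names SSE operations but never names a register, so the same source builds for 32- and 64-bit x86 and the compiler picks the `xmm` registers. The `__SSE__` check assumes a GCC- or Clang-style compiler, which defines that macro when SSE code generation is enabled (for example with `-msse`, or by default on x86-64).

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical example, not from the merge request: dst[i] = a[i] + b[i]. */
void add_floats(float *dst, const float *a, const float *b, size_t count)
{
    size_t i = 0;

#if defined(__SSE__)
    /* Process four floats per iteration. No register is named here;
     * the compiler allocates xmm registers for va and vb itself. */
    for (; i + 4 <= count; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar tail, and the whole loop when SSE is not enabled. */
    for (; i < count; i++)
        dst[i] = a[i] + b[i];
}
```

The scalar fallback keeps the file compilable on non-x86 targets, which is one possible way to handle the "only on x86" part; whether that fits a particular build system is a separate question.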
I made some measurements, and it turns out that the performance difference is very small (less than 1%). I'll try to come up with a new version, although I have no idea how to integrate this into our build system. The SSE code would have to be in a separate .c file, which is compiled with `-msse`, but only on x86. -- https://gitlab.winehq.org/wine/wine/-/merge_requests/10716#note_138693