Otherwise, using compiler intrinsics could be an option. I guess handing register allocation to the compiler would leave some performance on the table, but maybe not that much?
I made some measurements, and it turns out that the performance difference is very small (less than 1%). I'll try to come up with a new version, although I have no idea how to integrate this into our build system. The SSE code would have to be in a separate .c file, which is compiled with `-msse`, but only on x86.
IIRC !9588 took care of that via a couple of `#ifdef`s, essentially only building the SSE version when `-msse` is included in or implied by the CFLAGS (e.g. because of `-march=nocona`). -- https://gitlab.winehq.org/wine/wine/-/merge_requests/10716#note_138723
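For illustration, here is a minimal sketch of that kind of `#ifdef` guard. It is not the code from !9588; `add_floats` is a hypothetical helper made up for this example. It relies on GCC and Clang defining `__SSE__` whenever SSE is enabled (by `-msse` or by a `-march` that implies it), so the intrinsics path is only compiled in that case and the scalar loop is used otherwise.

```c
#include <stddef.h>

/* __SSE__ is predefined by GCC and Clang when SSE code generation is enabled. */
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical helper (not from the Wine tree): dst[i] = a[i] + b[i]. */
void add_floats(float *dst, const float *a, const float *b, size_t count)
{
    size_t i = 0;

#ifdef __SSE__
    /* SSE path: four floats per iteration, using unaligned loads and stores. */
    for (; i + 4 <= count; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif

    /* Scalar path: handles the tail, or the whole array when SSE is unavailable. */
    for (; i < count; i++)
        dst[i] = a[i] + b[i];
}
```

With this layout no separate .c file or per-file compiler flag is needed; the trade-off is that the SSE path silently disappears from builds whose CFLAGS do not enable SSE.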