 
            On Wed, Mar 30, 2022 at 11:46 AM Gabriel Ivăncescu gabrielopcode@gmail.com wrote:
Why not just copy pasting it from msvcrt since it's already done?
This is interesting. Mine is more efficient than the msvcrt version (at least for sse2), because the msvcrt version relies on unaligned moves whereas mine uses bit shifting to allow both src and dst to be aligned. Also, in my testing I found that it's most efficient to use plain old bytes copy for the first/last parts of the function, anything else just causes branch mispredictions.
It would also be optimal to write an AVX2 version of this function for systems that support it (which is most of them, nowadays). AVX2 has a 32-byte register and can move data a lot faster. It needs a cpuid check though. And probably not worth it for anything other than memcpy.
As for memset, while the code looks like it deals in aligned pointers, examining the output with objdump shows that GCC fails to produce any MOVDQA/MOVAPS/MOVNTDQ instructions. So it's not really doing what it claims to do. In order to emit those instructions you need to define an aligned 16-byte structure and use it in your pointers.
On Wed, Mar 30, 2022 at 10:34 AM Rémi Bernon rbernon@codeweavers.com wrote:
IIUC upstream isn't very interested in assembly optimized routine, unless really necessary.
I don't think hand-optimized asm is necessary here. Though the usage of _mm_ functions is pretty close to it. I think my routine can be refactored to be more platform-independent. The main annoyance is the fact that PSLLDQ/PSRLDQ only accept compile-time constants as shift values, meaning each of the fastcpy functions has slightly different machine code - hence the jump table. You really don't want conditional jumps inside that inner loop. For memcpy I think the annoyance is worth it, but I'd hesitate to use it elsewhere.
To be honest, I had planned to submit patches for several of these functions, including memset, and strlen. Most of these can be done without the need for special intrinsics, but the two-argument functions like strcmp get a lot more ugly. It might be best to create a separate .c file for platform-specific implementations. IMO, string.c is widely-used enough to justify extra complexity, but of course I'm not the project maintainer.
The duplication between msvcrt and ntdll seems strange. From what I can tell (may be wrong), most apps are probably using the msvcrt version, but it looks like Wine libraries run the ntdll version instead. Also, msvcrt seems to have some routines that are more optimized than ntdll, but memset and only memset has the optimized version in both. So it looks like I'm not the first person to miss this detail. Is it possible to remove the code from ntdll and only call msvcrt (or the other way around)? Or, worst case, have them both compile the same C file so we don't have the same function in two places?
- Elaine