Signed-off-by: Jan Sikorski <jsikorski@codeweavers.com>
---
It's about 13x faster on my machine than the byte version. memcmp performance is important to wined3d, where it's used to find pipelines in the cache, and the keys are pretty big.
---
 dlls/msvcrt/string.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/dlls/msvcrt/string.c b/dlls/msvcrt/string.c
index 3b352ac0bf2..66761e7a282 100644
--- a/dlls/msvcrt/string.c
+++ b/dlls/msvcrt/string.c
@@ -34,6 +34,10 @@
 #include "wine/asm.h"
 #include "wine/debug.h"
 
+#ifdef __x86_64__
+#include <immintrin.h>
+#endif
+
 WINE_DEFAULT_DEBUG_CHANNEL(msvcrt);
 
 /*********************************************************************
@@ -2675,11 +2679,52 @@ int CDECL I10_OUTPUT(MSVCRT__LDOUBLE ld80, int prec, int flag, struct _I10_OUTPUT_DATA *data)
 }
 #undef I10_OUTPUT_MAX_PREC
 
+#ifdef __x86_64__
+static int sse2_memcmp(const void *p1, const void *p2, size_t size)
+{
+    const unsigned char *first = p1, *second = p2;
+    size_t remainder = size & 0xf;
+    size_t size_16 = size / 16;
+    uint16_t mask;
+    DWORD index;
+
+    while (size_16-- > 0)
+    {
+        __m128i value_1 = _mm_loadu_si128((__m128i *)first);
+        __m128i value_2 = _mm_loadu_si128((__m128i *)second);
+        __m128i compare = _mm_cmpeq_epi8(value_1, value_2);
+        if ((mask = ~_mm_movemask_epi8(compare)))
+        {
+            _BitScanForward(&index, mask);
+            if (first[index] < second[index]) return -1;
+            else return 1;
+        }
+
+        first += 16;
+        second += 16;
+    }
+
+    while (remainder-- > 0)
+    {
+        if (*first < *second) return -1;
+        if (*first > *second) return 1;
+
+        first++;
+        second++;
+    }
+
+    return 0;
+}
+#endif
+
 /*********************************************************************
  *                  memcmp (MSVCRT.@)
  */
 int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
 {
+#ifdef __x86_64__
+    return sse2_memcmp(ptr1, ptr2, n);
+#else
     const unsigned char *p1, *p2;
 
     for (p1 = ptr1, p2 = ptr2; n; n--, p1++, p2++)
@@ -2688,6 +2733,7 @@ int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
         if (*p1 > *p2) return 1;
     }
     return 0;
+#endif
 }
 
 #if defined(__i386__) || defined(__x86_64__)
On Fri, Apr 1, 2022 at 7:13 AM Jan Sikorski jsikorski@codeweavers.com wrote:
Signed-off-by: Jan Sikorski jsikorski@codeweavers.com
It's about 13x faster on my machine than the byte version. memcmp performance is important to wined3d, where it's used to find pipelines in the cache, and the keys are pretty big.
Should be noted that SSE2 also exists on 32-bit processors, and in this same file you can find usage of "sse2_supported", which would enable you to use this code path on i386. You can put __attribute__((target("sse2"))) on the declaration of sse2_memcmp to allow GCC to emit SSE2 instructions even when the file's architecture forbids it.
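For illustration, the shape could be something like this (sse2_supported already exists in string.c; the dispatch wrapper below is only a sketch, and byte_memcmp is a hypothetical name for the plain C fallback):

#if defined(__i386__) || defined(__x86_64__)
/* target("sse2") lets GCC emit SSE2 instructions in this one function
   even when the baseline architecture (plain i386) forbids it. */
static int __attribute__((target("sse2")))
sse2_memcmp(const void *p1, const void *p2, size_t size);
#endif

int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n)
{
#ifdef __i386__
    if (!sse2_supported)
        return byte_memcmp(ptr1, ptr2, n);  /* hypothetical plain C fallback */
#endif
    return sse2_memcmp(ptr1, ptr2, n);
}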
I think this could be even faster if you forced ptr1 to be aligned by byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't reasonably force-align both pointers, but aligning at least one should give measurably better performance.
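Concretely, the prologue might look something like this (untested sketch, using the names from the patch):

    /* Byte-compare until first is 16-byte aligned, then run the vector loop. */
    while (size && ((uintptr_t)first & 15))
    {
        if (*first != *second) return *first < *second ? -1 : 1;
        first++;
        second++;
        size--;
    }
    /* first can now use _mm_load_si128 (aligned); second still needs _mm_loadu_si128. */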
I have a similar patch (labelled 230501 on https://source.winehq.org/patches/ - not sure how to link the whole discussion, sorry) which triggered a discussion about duplication between ntdll and msvcrt. memcmp is also a function that appears in both dlls. Do you have any input on that? (sorry if I'm out of line for butting in here. I just noticed we're working on the same basic thing)
- Elaine
Wouldn't it make much more sense if we simply copied optimized copy routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
Also worth noting is that Wine is licensed under LGPL, which makes it compatible with most open-source libcs out there. Basically what we would need is some ABI adaptations, such as calling convention adjustment and SEH.
Another option is to just call system libc routines directly, although in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
On 4/2/22 12:51, Jin-oh Kang wrote:
Wouldn't it make much more sense if we simply copied optimized copy routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
The question is, do we really need and want the complexity induced by hand-crafted assembly (or intrinsics) routines?
* at build time but also runtime, we'll need to carefully check hardware capability,
* it increases maintenance burden as they may need to be updated when hardware performance profile changes, or when new features are added,
* other libc implementations may be hard to integrate into our code base, especially if they rely on some dispatch mechanism or assembly source,
Or do we want to rely as much as possible on the compiler to do it for us?
I don't know the rationale behind the choice of the other libc, but as far as I understand for Wine an efficient C implementation is usually preferred over assembly, unless a convincing argument is made that doing it in assembly significantly improves things for some applications.
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
Also worth noting is that Wine is licensed under LGPL, which makes it compatible with most open-source libcs out there. Basically what we would need is some ABI adaptations, such as calling convention adjustment and SEH.
Another option is to just call system libc routines directly, although in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.
On 4/2/22 20:19, Rémi Bernon wrote:
On 4/2/22 12:51, Jin-oh Kang wrote:
Wouldn't it make much more sense if we simply copied optimized copy routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
The question is, do we really need and want the complexity induced by hand-crafted assembly (or intrinsics) routines?
- at build time but also runtime, we'll need to carefully check hardware capability,
We already do this for SSE2 on i386, and for FXSAVE/XSAVE/XSAVEC on both i386 and x86-64. At build time we can simply disable SSE/AVX routines on old enough GCC.
- it increases maintenance burden as they may need to be updated when hardware performance profile changes, or when new features are added,
As long as correctness and (any sort of) performance advantages are preserved, no further maintenance effort would be _strictly_ necessary.
We can set up performance regression tests for C vs. SIMD implementations, and possibly revert to C version if the gap ever becomes severe enough (which is unlikely).
- other libc implementations may be hard to integrate into our code base, especially if they rely on some dispatch mechanism or assembly source,
We only copy (and adapt) the implementation and _not_ its supporting infrastructure. Also, we don't really have to do it for every string routine; we only need to do so for the crucial ones.
Or do we want to rely as much as possible on the compiler to do it for us?
I don't know the rationale behind the choice of the other libc, but as far as I understand for Wine an efficient C implementation is usually preferred over assembly,
We may as well copy efficient C implementations from other libcs. It also avoids the 3 problems you've pointed out.
unless a convincing argument is made that doing it in assembly significantly improves things for some applications.
Most modern libcs (and presumably MSVCRT as well) using SSE/AVX is a convincing argument in and of itself.
I think what we're specifically asking here is whether "performance benefits reaped from optimizing string routines outweigh the maintenance burden imposed by the use of machine-specific instructions." Personally I haven't run into a case where msvcrt string routines show up as a bottleneck in perf profiling, but others can chip in and share their numbers.
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
Also worth noting is that Wine is licensed under LGPL, which makes it compatible with most open-source libcs out there. Basically what we would need is some ABI adaptations, such as calling convention adjustment and SEH.
Another option is to just call system libc routines directly, although in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.
Yes. This is exactly what I meant by interference with/from "clear PE/unix separation," or lack thereof.
On Sat, Apr 2, 2022 at 4:51 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
Wouldn't it make much more sense if we simply copied optimized copy routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
I think this is a really good point.
Another option is to just call system libc routines directly, although in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
Also a good idea, but the problem is that Windows dlls expect Windows calling conventions. There's no way (at least none I can immediately find) of wrapping a call to the system library without crashing.
On Sat, Apr 2, 2022 at 5:19 AM Rémi Bernon rbernon@codeweavers.com wrote:
Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.
I don't think that works either, since these functions live in an .so and not in the kernel. Now, if it were possible, the system libraries are _significantly_ faster than anything Wine offers (even with SSE2 optimizations), so I think their raw speed would make up for any overhead.
On Sat, Apr 2, 2022 at 7:24 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
As long as correctness and (any sort of) performance advantages are preserved, no further maintenance effort would be _strictly_ necessary.
Agree with this. It's not terribly difficult to prove their correctness. Once that's done you should never need to update them. A new architecture might introduce instructions that are even more performant, but I don't think it's conceivable that vector instructions would ever become slower than non-vectors. Doing so would cripple ~15 years of software development, nobody would buy a CPU that does that.
Here's how I see it: vector instructions were created specifically to solve this problem of operating on large regions of memory very quickly. Nearly every other program with similar requirements is either 1) Using these instructions, or 2) Relying on an external library that does so (note: that library is often msvcrt!). So I think Wine should do one of those two as well.
On Sat, Apr 2, 2022 at 8:59 AM Piotr Caban piotr.caban@gmail.com wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it will be good to add a more efficient C implementation first (it will also show if the SSE2 implementation is really needed).
Thanks, Piotr
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. Because the bottleneck isn't CPU power, the bottleneck is memory access. Like I said, vectors were created specifically to solve this problem, and IME you won't find notable performance gains without using them.
Now, we CAN use #ifdefs and preprocessor macros to define a fake __m128i on systems that don't natively support it. Then write emulation for each operation so that GCC can compile real vector instructions when possible (x86-64) and fallback to smaller types on systems without vector support. That way we'd avoid large vendor-specific code blocks. But you're not going to escape this idea of "we need to read large chunks and operate on them all at once".
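As a rough illustration of the emulation idea (hypothetical names, untested):

#include <stdint.h>
#include <string.h>

#ifdef __SSE2__
#include <emmintrin.h>
typedef __m128i vec128;
#define vec_load(p)     _mm_loadu_si128((const __m128i *)(p))
#define vec_equal(a, b) (_mm_movemask_epi8(_mm_cmpeq_epi8((a), (b))) == 0xffff)
#else
/* Fallback: the same 16 bytes handled as two plain words. */
typedef struct { uint64_t lo, hi; } vec128;
static inline vec128 vec_load(const void *p)
{
    vec128 v;
    memcpy(&v, p, sizeof(v));
    return v;
}
#define vec_equal(a, b) ((a).lo == (b).lo && (a).hi == (b).hi)
#endif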
Personally I think Jinoh's suggestion to find a compatible-licensed library and copy their code is best. Otherwise I sense this will become an endless circle of "do we really need it?" (yes, but this type of code is annoying to review) and Wine could benefit from using an implementation that's already widely-tested.
- Elaine
On Sun, Apr 3, 2022, 11:36 AM Elaine Lefler elaineclefler@gmail.com wrote:
On Sat, Apr 2, 2022 at 4:51 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
Wouldn't it make much more sense if we simply copied optimized copy routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
I think this is a really good point.
Another option is to just call system libc routines directly, although in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
Also a good idea, but the problem is that Windows dlls expect Windows calling conventions. There's no way (at least none I can immediately find) of wrapping a call to the system library without crashing.
It should of course move around argument registers and deal with caller/callee-saved registers; this is implied in "some ABI adaptations, such as calling convention adjustment and SEH."
On Sat, Apr 2, 2022 at 5:19 AM Rémi Bernon rbernon@codeweavers.com wrote:
Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.
I don't think that works either, since these functions live in an .so and not in the kernel. Now, if it were possible, the system libraries are _significantly_ faster than anything Wine offers (even with SSE2 optimizations), so I think their raw speed would make up for any overhead.
It's not a real syscall per se; rather, it's more like a gate between the PE side (corresponding to Windows userspace) and the Unix side (Wine's pseudo kernel space which interacts directly with the host OS). The PE/Unix separation is designed so that every interaction with the system goes through the syscall gate, just like on Windows (we're not there yet, but we'll get there eventually). This helps satisfy video game anti-cheat technologies and conceal the Unix (.so) code which would otherwise cause confusion for Win32 apps and debuggers tracing the execution path.
On Sat, Apr 2, 2022 at 7:24 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
As long as correctness and (any sort of) performance advantages are preserved, no further maintenance effort would be _strictly_ necessary.
Agree with this. It's not terribly difficult to prove their correctness. Once that's done you should never need to update them. A new architecture might introduce instructions that are even more performant, but I don't think it's conceivable that vector instructions would ever become slower than non-vectors. Doing so would cripple ~15 years of software development, nobody would buy a CPU that does that.
Here's how I see it: vector instructions were created specifically to solve this problem of operating on large regions of memory very quickly. Nearly every other program with similar requirements is either 1) Using these instructions, or 2) Relying on an external library that does so (note: that library is often msvcrt!). So I think Wine should do one of those two as well.
Also worth noting is that Wine already does it, with SSE2 memcpy etc.
On Sat, Apr 2, 2022 at 8:59 AM Piotr Caban piotr.caban@gmail.com wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it will be good to add a more efficient C implementation first (it will also show if the SSE2 implementation is really needed).
Thanks, Piotr
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. Because the bottleneck isn't CPU power, the bottleneck is memory access.
It should be.
Like I said, vectors were created specifically to solve this problem, and IME you won't find notable performance gains without using them.
I think Rémi is aware of that. However, optimizing the C implementation is arguably much more universally applicable to a broader range of (micro-)architectures.
Now, we CAN use #ifdefs and preprocessor macros to define a fake __m128i on systems that don't natively support it. Then write emulation for each operation so that GCC can compile real vector instructions when possible (x86-64) and fallback to smaller types on systems without vector support. That way we'd avoid large vendor-specific code blocks. But you're not going to escape this idea of "we need to read large chunks and operate on them all at once".
What you're thinking of is a SIMD abstraction library. I don't see how it would be highly necessary, since we're okay with vendor-specific code blocks as long as they are justified. Note that we now only support 4 architectures (IA-32, x86-64, ARM AArch32, and ARM AArch64).
Personally I think Jinoh's suggestion to find a compatible-licensed library and copy their code is best. Otherwise I sense this will become an endless circle of "do we really need it?" (yes, but this type of code is annoying to review) and Wine could benefit from using an implementation that's already widely-tested.
- Elaine
On Sat, Apr 2, 2022 at 11:09 PM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
It's not a real syscall per se; rather, it's more like a gate between the PE side (corresponding to Windows userspace) and the Unix side (Wine's pseudo kernel space which interacts directly with the host OS). The PE/Unix separation is designed so that every interaction with the system goes to the syscall gate, just like on Windows (we're not there yet, but we'll eventually). This helps satisfy video game anti-cheat technologies and conceal the Unix (.so) code which would otherwise cause confusion for Win32 apps and debuggers tracing the execution path.
Ah. That makes sense. In this case I think Remi is correct that there's too much overhead.
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. Because the bottleneck isn't CPU power, the bottleneck is memory access.
It should be.
It's a margin of ~25%, versus a margin of ~500%. Unless you're moving gigabytes it's unlikely to be noticeable.
That said, another confounding issue is the fact that a large number of small movements will have very different performance characteristics from a small number of large movements. It's possible there are cases where using, say, dwords would be much faster than trying to vectorize. I haven't found them in testing, but this is another argument for using someone else's code rather than trying to roll our own - because a library dedicated to this purpose has likely done all kinds of profiling to find exactly where that threshold lies.
What you're thinking of is a SIMD abstraction library. I don't see how it would be highly necessary, since we're okay with vendor-specific code blocks as long as they are justified. Note that we now only support 4 architectures (IA-32, x86-64, ARM AArch32, and ARM AArch64).
Right. The reason I bring it up is because it would satisfy the requirement to be portable (as long as you stick to the abstraction library, you're writing regular C) and would get you close enough to the performance of real intrinsics that it should leave no need for inline asm. So if we don't want to import another library, this may be the best compromise between speed and simplicity.
On 4/3/22 04:35, Elaine Lefler wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it will be good to add a more efficient C implementation first (it will also show if the SSE2 implementation is really needed).
Thanks, Piotr
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. Because the bottleneck isn't CPU power, the bottleneck is memory access. Like I said, vectors were created specifically to solve this problem, and IME you won't find notable performance gains without using them.
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.
For this specific memcmp case, I believe using larger data types and avoiding unnecessary branches, you can already improve the C code well enough.
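Something along these lines, for instance (a sketch only, not the proposed patch; memcmp_words is an illustrative name):

#include <stdint.h>
#include <string.h>

static int memcmp_words(const void *p1, const void *p2, size_t n)
{
    const unsigned char *a = p1, *b = p2;
    while (n >= sizeof(uint64_t))
    {
        uint64_t x, y;
        memcpy(&x, a, sizeof(x));  /* memcpy keeps unaligned loads well-defined */
        memcpy(&y, b, sizeof(y));
        if (x != y) break;         /* the byte loop below pinpoints the difference */
        a += sizeof(x);
        b += sizeof(y);
        n -= sizeof(x);
    }
    while (n--)
    {
        if (*a != *b) return *a < *b ? -1 : 1;
        a++;
        b++;
    }
    return 0;
}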
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether buffers are always entirely valid and if you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr for instance. [1]
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
Personally I think Jinoh's suggestion to find a compatible-licensed library and copy their code is best. Otherwise I sense this will become an endless circle of "do we really need it?" (yes, but this type of code is annoying to review) and Wine could benefit from using an implementation that's already widely-tested.
I personally don't like the idea at all. Copying code from other libs is just the best way to get code with no history, whose characteristics and rationale no one really understands.
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is, probably for some obscure historical or specific target architecture optimization, and, if for some reason we need to optimize it further I would just be unable to without rewriting it entirely.
Cheers,
On Sun, Apr 3, 2022 at 7:00 AM Rémi Bernon rbernon@codeweavers.com wrote:
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.
For this specific memcmp case, I believe using larger data types and avoiding unnecessary branches, you can already improve the C code well enough.
Out of curiosity, what data type would you expect an optimized C version to use? I'd think size_t is most obvious, but then you deal with the problem that it's a different number of bytes depending on the processor.
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to parse it differently depending on the native word size of the cpu, or use vector intrinsics when possible (note: ARM has these too, though I don't know how to write them offhand). Otherwise you'll have to write conditionals/for-loops and hope the compiler is smart enough to unroll them. Or you could pick uint32_t/uint64_t and use that all the time, but you run the risk of making the cpu angry, especially if you're doing any math.
As for compiler vectorization, GCC _can_ emit vector instructions, but it's not very smart. For instance, we have a function named memset_aligned_32, which writes aligned 32-byte chunks. But GCC doesn't know that. It just writes regular quadword instructions. So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.
Bear in mind that most of these functions read single bytes, presently. So it can't get slower than it already is.
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether buffers are always entirely valid and if you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr for instance. [1]
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
I'm curious whether or not this would crash on native windows. You'd have that problem even if writing a C optimized memcmp. Maybe the patch author should be required to test edge cases like this?
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is, probably for some obscure historical or specific target architecture optimization, and, if for some reason we need to optimize it further I would just be unable to without rewriting it entirely.
Yeah, agreed... that function is awful. I'd like to see code with more comments.
On 4/5/22 12:12, Elaine Lefler wrote:
On Sun, Apr 3, 2022 at 7:00 AM Rémi Bernon rbernon@codeweavers.com wrote:
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.
For this specific memcmp case, I believe using larger data types and avoiding unnecessary branches, you can already improve the C code well enough.
Out of curiosity, what data type would you expect an optimized C version to use? I'd think size_t is most obvious, but then you deal with the problem that it's a different number of bytes depending on the processor.
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to parse it
Did you mean "maximum vector register size supported by the processor?" Note that AVX2 has 32-byte registers, and AVX512 takes it further with 64-byte ones. Also worth noting is that ARM64 has Scalable Vector Extensions, which do *not* define the vector size at all. It leaves it to the processor implementation, allowing for seamless vector size expansion.
differently depending on the native word size of the cpu, or use vector intrinsics when possible (note: ARM has these too, though I don't know how to write them offhand). Otherwise you'll have to write conditionals/for-loops and hope the compiler is smart enough to unroll them. Or you could pick uint32_t/uint64_t and use that all the time, but you run the risk of making the cpu angry, especially if you're doing any math.
As for compiler vectorization, GCC _can_ emit vector instructions, but
I don't think we're going to use -O3 anytime soon.
There's "-ftree-vectorize" but I'm not sure if it exists on Clang.
it's not very smart. For instance, we have a function named memset_aligned_32, which writes aligned 32-byte chunks. But GCC doesn't know that. It just writes regular quadword instructions.
Wouldn't __attribute__((aligned(32))) work?
So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.
Bear in mind that most of these functions read single bytes, presently. So it can't get slower than it already is.
Well, it *does* get slower due to extra branches if the caller makes frequent calls to mem* functions with small data size (e.g. 1-15).
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether buffers are always entirely valid and if you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr for instance. [1]
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
I'm curious whether or not this would crash on native windows. You'd have that problem even if writing a C optimized memcmp. Maybe the patch author should be required to test edge cases like this?
It might be helpful to pass partially mapped buffers to the function and see exactly where the access violation happens.
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is, probably for some obscure historical or specific target architecture optimization, and, if for some reason we need to optimize it further I would just be unable to without rewriting it entirely.
Yeah, agreed... that function is awful. I'd like to see code with more comments.
Here's how it works:
1. The dst - src >= n check tests whether either A. dst precedes src (dst < src), or B. the intervals [dst, dst+n) and [src, src+n) do not overlap each other (|dst - src| >= n). The comparison takes advantage of unsigned integer overflow for condition A, which allows testing both conditions with only a single comparison.
2. In the forward copy case, the function first copies a few heading bytes until dst is word-aligned.
3. The function then checks if src is also word-aligned. If it is, the function performs a word-by-word copy.
4. Otherwise, the function still avoids unaligned access. This is achieved by doing an aligned load of the two consecutive words that contain the unaligned source block, and bit-shifting/merging the loaded words so that the desired part is copied into the destination. The loop keeps a double-word window of the source buffer until the (aligned) end.
5. Lastly, the trailing bytes are copied.
6. Steps 2-5 are done similarly in the reverse direction case.
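A minimal sketch of the step-1 trick, written on integer values for clarity (can_copy_forward is an illustrative name, not the real code):

#include <stddef.h>
#include <stdint.h>

/* Forward copy is safe if dst precedes src (the unsigned subtraction
   wraps around to a huge value) or if [dst, dst+n) and [src, src+n)
   don't overlap (difference >= n). One comparison covers both. */
static int can_copy_forward(const void *dst, const void *src, size_t n)
{
    return (uintptr_t)dst - (uintptr_t)src >= n;
}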
That said, it does need more work on its verbose pointer arithmetic...
Hello everyone,
On 2 Apr 2022, at 06:44, Elaine Lefler elaineclefler@gmail.com wrote:
Should be noted that SSE2 also exists on 32-bit processors, and in this same file you can find usage of "sse2_supported", which would enable you to use this code path on i386. You can put __attribute__((target("sse2"))) on the declaration of sse2_memcmp to allow GCC to emit SSE2 instructions even when the file's architecture forbids it.
True, I intentionally left it out in this patch, because it’s possibly more compiler dependent.
I think this could be even faster if you forced ptr1 to be aligned by byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't reasonably force-align both pointers, but aligning at least one should give measurably better performance.
Right, this memcmp isn’t really an optimized routine, it was not supposed to be. It’s just to get baseline reasonable performance with the simplest possible code. More careful optimizations can follow.
memcmp is also a function that appears in both dlls. Do you have any input on that?
Not really, I don’t know who uses the one in ntdll and how much they care about speed. Just copy it if needed?
On 2 Apr 2022, at 13:19, Rémi Bernon rbernon@codeweavers.com wrote:
The question is, do we really need and want the complexity induced by hand-crafted assembly (or intrinsics) routines?
I’d argue there’s not much complexity here.
- at build time but also runtime, we'll need to carefully check hardware capability,
AMD64 has SSE2. There’s nothing to carefully check?
- it increases maintenance burden as they may need to be updated when hardware performance profile changes, or when new features are added,
If we want to be on top in terms of speed, yes, but that’s the work—what else to do?
Or do we want to rely as much as possible on the compiler to do it for us?
I hope not. I mean, if the compiler included its own, known good, mem* intrinsics, sure, but I wouldn’t count on it recognizing patterns in C code, unless we want to deal with compiler regressions as well.
On 3 Apr 2022, at 14:59, Rémi Bernon rbernon@codeweavers.com wrote:
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.
Same as above, no thanks to relying on compiler smartness.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.
I’d say if a program is bottlenecked by tiny memcmp’s it’s the program that’s slow, not the memcmp. That’s what you generally get for dealing with many little things one at a time.
On 2 Apr 2022, at 17:00, Piotr Caban piotr.caban@gmail.com wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it will be good to add a more efficient C implementation first (it will also show if the SSE2 implementation is really needed).
It wouldn’t hurt, sure. For speed, we probably can't expect more than 8x by comparing 8 bytes at a time, which is less than the 13x I measured. And I want it to be reasonably fast for wined3d. Besides, a platform-independent C version wouldn’t be any simpler--barring Intel’s wonderful naming, I’d say the code is about as trivial as it can get. I don’t think a plain C version would be useful, but if that’s the bar for getting it done, so be it.
- Jan
On Tue, Apr 5, 2022 at 2:14 AM Jan Sikorski jsikorski@codeweavers.com wrote:
Hello everyone,
On 2 Apr 2022, at 06:44, Elaine Lefler elaineclefler@gmail.com wrote:
Should be noted that SSE2 also exists on 32-bit processors, and in this same file you can find usage of "sse2_supported", which would enable you to use this code path on i386. You can put __attribute__((target("sse2"))) on the declaration of sse2_memcmp to allow GCC to emit SSE2 instructions even when the file's architecture forbids it.
True, I intentionally left it out in this patch, because it’s possibly more compiler dependent.
AFAIK this dll will only ever be compiled with mingw-gcc. Should be safe to assume GCC unless there are plans to support other cross-compilers.
I think this could be even faster if you forced ptr1 to be aligned by byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't reasonably force-align both pointers, but aligning at least one should give measurably better performance.
Right, this memcmp isn’t really an optimized routine, it was not supposed to be. It’s just to get baseline reasonable performance with the simplest possible code. More careful optimizations can follow.
Based on the discussion in these threads, it looks like there's a lot of inertia in merging patches like this, so it's probably best to be as optimal as possible on the first go. Just my two cents.
memcmp is also a function that appears in both dlls. Do you have any input on that?
Not really, I don’t know who uses the one in ntdll and how much they care about speed. Just copy it if needed?
Copying it is ok, although I'm concerned about the code duplication. If you delete the implementation from msvcrt (which links to ntdll) and then put the optimized version in ntdll, it should result in msvcrt using the ntdll implementation, which removes the duplication and gives the optimized code to both dlls.
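If the export were forwarded instead, the msvcrt .spec entry might look roughly like this (a sketch, assuming the ntdll export keeps the same shape):

@ cdecl memcmp(ptr ptr long) ntdll.memcmp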
Besides, a platform independent C version wouldn’t be any simpler--barring Intel’s wonderful naming, I’d say the code is about as trivial as it can get. I don’t think a plain C version would be useful, but if that’s the bar for getting it done, so be it..
+1 on this. I don't think a C version would be any simpler.
On Wed, Apr 6, 2022 at 6:02 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to parse it
Did you mean "maximum vector register size supported by the processor?" Note that AVX2 has 32-byte registers, and AVX512 takes it further with 64-byte ones. Also worth noting is that ARM64 has Scalable Vector Extensions, which do *not* define the vector size at all. It leaves it to the processor implementation, allowing for seamless vector size expansion.
You can't safely run AVX2 instructions without checking the cpu first. AVX512 is a waste of time imo - Intel actually stopped supporting it on Alder Lake, it's complicated and developers don't like to work with it. AMD has never supported AVX512 at all.
It's safe to assume any 64-bit cpu will support 16-byte vectors. If you want something bigger then you need to write CPU-specific code.
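As a sketch of what such a runtime check could look like (__builtin_cpu_supports is a GCC/Clang builtin; avx2_memcmp is a hypothetical name):

static int dispatch_memcmp(const void *p1, const void *p2, size_t n)
{
    if (__builtin_cpu_supports("avx2"))
        return avx2_memcmp(p1, p2, n);  /* hypothetical wide-vector path */
    return sse2_memcmp(p1, p2, n);      /* safe baseline: SSE2 is part of x86-64 */
}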
it's not very smart. For instance, we have a function named memset_aligned_32, which writes aligned 32-byte chunks. But GCC doesn't know that. It just writes regular quadword instructions.
Wouldn't __attribute__((aligned(32))) work?
I think it needs to be enclosed in a struct as well. I use objdump after compiling to make sure the expected instructions are present.
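Something like this is what I mean (illustrative sketch only; the real memset_aligned_32 in Wine may be shaped differently):

#include <stdint.h>

struct block32
{
    uint64_t q[4];
} __attribute__((aligned(32)));

static void memset_aligned_32(struct block32 *dst, uint64_t v, size_t blocks)
{
    /* With the struct type carrying the alignment, the compiler may
       vectorize these stores into aligned 32-byte instructions. */
    while (blocks--)
    {
        dst->q[0] = v;
        dst->q[1] = v;
        dst->q[2] = v;
        dst->q[3] = v;
        dst++;
    }
}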
So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
IME, it's almost impossible to hand-write ASM that outperforms a compiler. You might have to rearrange the C code a bit, but well-written intrinsics are usually just as good (within margin of error) as ASM, and much more flexible.
When writing high-performance code it's usually necessary to try multiple variations in order to find the fastest path. You can't easily do that with ASM, and it leads to incomplete optimization. For instance, I was able to write a C memcpy function that outperforms msvcrt's hand-written assembly, for the following reasons:
- The ASM version has too many branches for small copies, branch misprediction is a huge source of latency.
- The ASM version puts the main loop in the middle of the function, leading to a very large jump when writing small data. GCC puts it at the end, which is more optimal (basically, a large copy can afford to eat the lag from the jump, but a small copy can't).
- When copying large data it's better to force alignment on both src and dst, even though you have to do some math.
- Stores should be done with MOVNTDQ rather than MOVDQA. MOVNTDQ avoids cache evictions so you don't have to refill the entire cache pool after memcpy returns. Note that this would have been an easy one-line change even for ASM, but I suspect the developer was too exhausted to experiment.
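The MOVNTDQ point in a hedged sketch (copy_stream is an illustrative name; assumes a 16-byte-aligned dst and a size that's a multiple of 16):

#include <stddef.h>
#include <emmintrin.h>

static void copy_stream(void *dst, const void *src, size_t n)
{
    __m128i *d = dst;
    const __m128i *s = src;
    for (; n >= 16; n -= 16)
        _mm_stream_si128(d++, _mm_loadu_si128(s++));  /* MOVNTDQ: bypasses the cache */
    _mm_sfence();  /* make the non-temporal stores globally visible */
}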
So I'm strongly opposed to ASM unless a C equivalent is completely impossible. The code that a developer _thinks_ will be fast and the code that is _actually_ fast are often not the same thing. C makes it much easier to tweak.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and handle the border case.
Bear in mind that most of these functions read single bytes, presently. So it can't get slower than it already is.
Well, it *does* get slower due to extra branches if the caller makes frequent calls to mem* functions with small data size (e.g. 1-15).
True. I think I mentioned earlier that it's usually faster for small copies to go byte by byte instead of minmaxing the data size. This is the reason why.
On Thu, Apr 7, 2022, 10:56 AM Elaine Lefler elaineclefler@gmail.com wrote:
On Wed, Apr 6, 2022 at 6:02 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
IME, it's almost impossible to hand-write ASM that outperforms a compiler. You might have to rearrange the C code a bit, but well-written intrinsics are usually just as good (within margin of error) as ASM, and much more flexible.
Perhaps I've misphrased myself here. Note that "direct assembly != completely hand-written assembly." It's a bold claim that a *human* could outperform a compiler in machine code optimization in the first place. I said we should stick to assembler because instruction scheduling is more predictable across compilers that way, *not* because a human could do better at scheduling. We can take the assembly output from the best of the compilers and do whatever we please on it. (That's even how it's usually done!) This will bring the optimization work to much older and/or less capable compilers, since we're not relying on the user's compiler's performance. Note that Wine still supports GCC 4.x. Also, future compiler regressions may affect the performance of the optimized code (as Jan puts it).
llvm-mca simulates CPU pipeline and shows how well your code would perform on a superscalar architecture. Perhaps we can use that as well.
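For example (file names hypothetical):

$ clang -O2 -S memcmp.c -o memcmp.s
$ llvm-mca -mcpu=skylake -timeline memcmp.s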
When writing high-performance code it's usually necessary to try multiple variations in order to find the fastest path. You can't easily do that with ASM, and it leads to incomplete optimization.
Yeah, we can first write the first version in C with intrinsics, look at differences between outputs of several compilers, and choose the best one.
For instance, I was able to write a C memcpy function that outperforms msvcrt's hand-written assembly, for the following reasons:
- The ASM version has too many branches for small copies, branch misprediction is a huge source of latency.
- The ASM version puts the main loop in the middle of the function, leading to a very large jump when writing small data. GCC puts it at the end, which is more optimal (basically, a large copy can afford to eat the lag from the jump, but a small copy can't).
- When copying large data it's better to force alignment on both src and dst, even though you have to do some math.
- Stores should be done with MOVNTDQ rather than MOVDQA. MOVNTDQ avoids cache evictions so you don't have to refill the entire cache pool after memcpy returns. Note that this would have been an easy one-line change even for ASM, but I suspect the developer was too exhausted to experiment.
These improvements (except code placement for I-cache utilization) have nothing to do with compiler optimization. The programmer can make these mistakes either way (ASM or C).
So I'm strongly opposed to ASM unless a C equivalent is completely impossible. The code that a developer _thinks_ will be fast and the code that is _actually_ fast are often not the same thing. C makes it much easier to tweak.
Or rather harder to tweak, since code arrangement is not something the programmer has control over.
On Wed, Apr 6, 2022 at 8:28 PM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
Perhaps I've misphrased myself here. Note that "direct assembly != completely hand-written assembly." It's a bold claim that a *human* could outperform a compiler in machine code optimization in the first place. I said we should stick to assembler because instruction scheduling is more predictable across compilers that way, *not* because a human could do better at scheduling. We can take the assembly output from the best of the compilers and do whatever we please on it. (That's even how it's usually done!) This will bring the optimization work to much older and/or less capable compilers, since we're not relying on the user's compiler's performance.
llvm-mca simulates CPU pipeline and shows how well your code would perform on a superscalar architecture. Perhaps we can use that as well.
Yeah, we can first write the first version in C with intrinsics, look at differences between outputs of several compilers, and choose the best one.
Fair points. Although inline ASM is still a pain for code reviewers.
I have noticed that Clang produces much better instructions than GCC. In my case I solved this problem by rewriting the C code with different intrinsics, so now it's basically the same on both compilers. It's definitely worth _looking_ at the output from multiple compilers, but I'm not sure how often (if ever) this can't be solved by rewriting your C code.
Note that Wine still supports GCC 4.x. Also, future compiler regressions may affect the performance of the optimized code (as Jan puts it).
True, regressions happen, but that could affect any code, not just the optimized stuff.
In my mind, if you're compiling Wine with a very old GCC, and it performs poorly as a result, that's a "you" problem. Binary packagers should be using more modern compilers. You're not going to have a good experience running the latest AAA games on a system that's too old for GCC 11, even if it's theoretically possible.
These improvements (except code placement for I-cache utilization) have nothing to do with compiler optimization. The programmer can make these mistakes either way (ASM or C).
You're correct. My point is that it's much harder to _fix_ these mistakes when writing ASM. And code that is difficult to fix is often left untouched.
So I'm strongly opposed to ASM unless a C equivalent is completely impossible. The code that a developer _thinks_ will be fast and the code that is _actually_ fast are often not the same thing. C makes it much easier to tweak.
Or rather harder to tweak, since code arrangement is not something programmer has control over.
In most cases, performance is dictated by memory access patterns rather than instruction arrangement. Compilers generally won't mess with that. CPUs don't execute instructions in the order they're written, but rather in the order that the CPU's microcode believes to be fastest.
As a developer, you mainly want to avoid dependency chains (i.e. code that requires a strict "A, then B, then C" order of operations) but you can do that in either C or in ASM. Branches are also a bottleneck, but those are hint-able, and even an old compiler likely understands them better than a human.
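A tiny illustration of the dependency-chain point (sketch; sum_u64 is a made-up example, not from the thread):

#include <stddef.h>
#include <stdint.h>

static uint64_t sum_u64(const uint64_t *data, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
    {
        acc0 += data[i];      /* independent of the next line... */
        acc1 += data[i + 1];  /* ...so both adds can execute in parallel */
    }
    if (i < n) acc0 += data[i];  /* odd-length tail */
    return acc0 + acc1;          /* a single accumulator would serialize every add */
}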
Also, by the same token as "compiler regressions could hamper performance", compiler improvements could produce better code, with no need for more developer work. Whereas pinning the ASM requires checking it again in the future to see if code generation has improved.
To be clear, I don't think it's impossible for inline ASM to be superior, I just think it's unlikely to be worth the effort.
On Wed, 6 Apr 2022, Elaine Lefler wrote:
On Tue, Apr 5, 2022 at 2:14 AM Jan Sikorski jsikorski@codeweavers.com wrote:
Hello everyone,
On 2 Apr 2022, at 06:44, Elaine Lefler elaineclefler@gmail.com wrote:
Should be noted that SSE2 also exists on 32-bit processors, and in this same file you can find usage of "sse2_supported", which would enable you to use this code path on i386. You can put __attribute__((target("sse2"))) on the declaration of sse2_memcmp to allow GCC to emit SSE2 instructions even when the file's architecture forbids it.
True, I intentionally left it out in this patch, because it’s possibly more compiler dependent.
AFAIK this dll will only ever be compiled with mingw-gcc. Should be safe to assume GCC unless there are plans to support other cross-compilers.
Clang is also supported as cross compiler - both with mingw targets and msvc targets.
// Martin