On Tue, Apr 5, 2022 at 2:14 AM Jan Sikorski jsikorski@codeweavers.com wrote:
Hello everyone,
On 2 Apr 2022, at 06:44, Elaine Lefler elaineclefler@gmail.com wrote:
Should be noted that SSE2 also exists on 32-bit processors, and in this same file you can find usage of "sse2_supported", which would enable you to use this code path on i386. You can put __attribute__((target("sse2"))) on the declaration of sse2_memcmp to allow GCC to emit SSE2 instructions even when the file's architecture forbids it.
True, I intentionally left it out in this patch, because it’s possibly more compiler dependent.
AFAIK this dll will only ever be compiled with mingw-gcc. Should be safe to assume GCC unless there are plans to support other cross-compilers.
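For illustration, this is all the attribute change amounts to; a minimal, untested sketch (assumes a GCC recent enough to accept SSE2 intrinsics inside a target("sse2") function, and callers would still check sse2_supported at runtime):

#include <emmintrin.h>
#include <stddef.h>

/* Hypothetical sketch: the target attribute lets GCC emit SSE2 instructions
 * for this one function even when the file is built for plain i386. */
__attribute__((target("sse2")))
static int sse2_memcmp( const void *p1, const void *p2, size_t n )
{
    const unsigned char *s1 = p1, *s2 = p2;

    while (n >= 16)
    {
        __m128i a = _mm_loadu_si128( (const __m128i *)s1 );
        __m128i b = _mm_loadu_si128( (const __m128i *)s2 );
        /* all 16 bytes equal -> mask is 0xffff */
        if (_mm_movemask_epi8( _mm_cmpeq_epi8( a, b ) ) != 0xffff) break;
        s1 += 16; s2 += 16; n -= 16;
    }
    while (n--)
    {
        if (*s1 != *s2) return *s1 < *s2 ? -1 : 1;
        s1++; s2++;
    }
    return 0;
}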
I think this could be even faster if you forced ptr1 to be aligned by byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't reasonably force-align both pointers, but aligning at least one should give measurably better performance.
Right, this memcmp isn’t really an optimized routine, it was not supposed to be. It’s just to get baseline reasonable performance with the simplest possible code. More careful optimizations can follow.
Based on the discussion in these threads, it looks like there's a lot of inertia in merging patches like this, so it's probably best to be as optimal as possible on the first go. Just my two cents.
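FWIW, the alignment prologue I'm suggesting above is only a handful of extra lines. Untested sketch (names made up; only ptr1 ends up aligned, so the main loop can use an aligned load on that side):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: byte-compare until s1 reaches a 16-byte boundary,
 * then use aligned loads for s1 and unaligned loads for s2. */
__attribute__((target("sse2")))
static int sse2_memcmp_aligned( const void *ptr1, const void *ptr2, size_t n )
{
    const unsigned char *s1 = ptr1, *s2 = ptr2;
    size_t head = (16 - ((uintptr_t)s1 & 15)) & 15;

    if (head > n) head = n;
    n -= head;
    while (head--)                           /* prologue: align s1 */
    {
        if (*s1 != *s2) return *s1 < *s2 ? -1 : 1;
        s1++; s2++;
    }
    while (n >= 16)                          /* main loop */
    {
        __m128i a = _mm_load_si128( (const __m128i *)s1 );   /* aligned */
        __m128i b = _mm_loadu_si128( (const __m128i *)s2 );  /* unaligned */
        if (_mm_movemask_epi8( _mm_cmpeq_epi8( a, b ) ) != 0xffff) break;
        s1 += 16; s2 += 16; n -= 16;
    }
    while (n--)                              /* tail / first differing block */
    {
        if (*s1 != *s2) return *s1 < *s2 ? -1 : 1;
        s1++; s2++;
    }
    return 0;
}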
memcmp is also a function that appears in both dlls. Do you have any input on that?
Not really, I don’t know who uses the one in ntdll and how much they care about speed. Just copy it if needed?
Copying it is ok, although I'm concerned about the code duplication. If you delete the implementation from msvcrt (which links to ntdll) and then put the optimized version in ntdll, it should result in msvcrt using the ntdll implementation, which removes the duplication and gives the optimized code to both dlls.
Besides, a platform-independent C version wouldn’t be any simpler--barring Intel’s wonderful naming, I’d say the code is about as trivial as it can get. I don’t think a plain C version would be useful, but if that’s the bar for getting it done, so be it.
+1 on this. I don't think a C version would be any simpler.
On Wed, Apr 6, 2022 at 6:02 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to parse it
Did you mean "maximum vector register size supported by the processor?" Note that AVX2 has 32-byte registers, and AVX512 takes it further with 64-byte ones. Also worth noting is that ARM64 has Scalable Vector Extensions, which do *not* define the vector size at all. It leaves it to the processor implementation, allowing for seamless vector size expansion.
You can't safely run AVX2 instructions without checking the CPU first. AVX512 is a waste of time imo - Intel actually stopped supporting it on Alder Lake, it's complicated, and developers don't like to work with it. AMD has never supported AVX512 at all.
It's safe to assume any 64-bit CPU will support 16-byte vectors. If you want something bigger then you need to write CPU-specific code.
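E.g. something along these lines with the GCC builtins; a sketch only (avx2_memcmp/sse2_memcmp are placeholder names for the two implementations):

#include <stddef.h>

extern int avx2_memcmp( const void *p1, const void *p2, size_t n );
extern int sse2_memcmp( const void *p1, const void *p2, size_t n );

static int (*memcmp_impl)( const void *p1, const void *p2, size_t n );

/* Hypothetical dispatcher sketch: AVX2 has to be detected at runtime,
 * whereas a 16-byte SSE2 path can be assumed on any x86-64 CPU. */
static void memcmp_init( void )
{
    __builtin_cpu_init();   /* populate GCC's CPU feature cache */
    memcmp_impl = __builtin_cpu_supports( "avx2" ) ? avx2_memcmp : sse2_memcmp;
}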
it's not very smart. For instance, we have a function named memset_aligned_32, which writes aligned 32-byte chunks. But GCC doesn't know that. It just emits regular quadword instructions.
Wouldn't __attribute__((aligned(32))) work?
I think it needs to be enclosed in a struct as well. I use objdump after compiling to make sure the expected instructions are present.
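Something like this is what I mean; untested sketch (the uint64x4 wrapper name is made up):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: wrapping the data in an aligned struct tells GCC the
 * destination is 32-byte aligned, so the struct copy can be emitted as
 * aligned 16-byte SSE stores (or 32-byte AVX stores if enabled) instead of
 * plain quadword moves. */
struct uint64x4
{
    uint64_t v[4];
} __attribute__((aligned(32)));

static void memset_aligned_32( void *dst, uint64_t value, size_t size )
{
    struct uint64x4 *d = dst;
    const struct uint64x4 fill = { { value, value, value, value } };

    /* dst is assumed 32-byte aligned and size a multiple of 32 */
    while (size >= 32)
    {
        *d++ = fill;
        size -= 32;
    }
}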
So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
IME, it's almost impossible to hand-write ASM that outperforms a compiler. You might have to rearrange the C code a bit, but well-written intrinsics are usually just as good (within margin of error) as ASM, and much more flexible.
When writing high-performance code it's usually necessary to try multiple variations in order to find the fastest path. You can't easily do that with ASM, and it leads to incomplete optimization. For instance, I was able to write a C memcpy function that outperforms msvcrt's hand-written assembly, for the following reasons:
- The ASM version has too many branches for small copies; branch misprediction is a huge source of latency.
- The ASM version puts the main loop in the middle of the function, leading to a very large jump when writing small data. GCC puts it at the end, which is more optimal (basically, a large copy can afford to eat the lag from the jump, but a small copy can't).
- When copying large data it's better to force alignment on both src and dst, even though you have to do some math.
- Stores should be done with MOVNTDQ rather than MOVDQA (see the sketch below). MOVNTDQ avoids cache evictions, so you don't have to refill the entire cache pool after memcpy returns. Note that this would have been an easy one-line change even for ASM, but I suspect the developer was too exhausted to experiment.
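For that last point, in C it really is just the choice of store intrinsic in the inner loop. Sketch (assumes dst is already 16-byte aligned and that the tail under 16 bytes is handled elsewhere):

#include <emmintrin.h>
#include <stddef.h>

/* Hypothetical sketch of the large-copy inner loop: _mm_stream_si128() emits
 * MOVNTDQ (non-temporal store, bypasses the cache), whereas _mm_store_si128()
 * would emit MOVDQA (regular cached store). */
static void copy_large( unsigned char *dst, const unsigned char *src, size_t n )
{
    while (n >= 16)
    {
        __m128i chunk = _mm_loadu_si128( (const __m128i *)src );
        _mm_stream_si128( (__m128i *)dst, chunk );   /* MOVNTDQ */
        dst += 16; src += 16; n -= 16;
    }
    _mm_sfence();   /* non-temporal stores need a fence before the data is reused */
}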
So I'm strongly opposed to ASM unless a C equivalent is completely impossible. The code that a developer _thinks_ will be fast and the code that is _actually_ fast are often not the same thing. C makes it much easier to tweak.
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle data alignment and the border cases.
Bear in mind that most of these functions read single bytes, presently. So it can't get slower than it already is.
Well, it *does* get slower due to extra branches if the caller makes frequent calls to mem* functions with small data size (e.g. 1-15).
True. I think I mentioned earlier that it's usually faster for small copies to go byte by byte instead of minmaxing the data size. This is the reason why.