On Sun, Apr 3, 2022 at 7:00 AM Rémi Bernon rbernon@codeweavers.com wrote:
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent in standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes that's useful.
For this specific memcmp case, I believe that by using larger data types and avoiding unnecessary branches, you can already improve the C code well enough.
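To illustrate what that might look like (purely a sketch, the function name and structure are made up, not anything proposed in the patch): compare size_t-sized chunks first, then fall back to bytes for the tail and to locate the exact differing byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: memcmp over size_t-sized chunks, with a byte
 * loop for the tail and for pinpointing the first mismatch. */
static int memcmp_words(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;

    while (n >= sizeof(size_t))
    {
        size_t x, y;
        memcpy(&x, p, sizeof(x));  /* memcpy avoids unaligned-access UB */
        memcpy(&y, q, sizeof(y));
        if (x != y) break;         /* mismatch: rescan this chunk below */
        p += sizeof(size_t);
        q += sizeof(size_t);
        n -= sizeof(size_t);
    }

    while (n--)                    /* tail, or the chunk that differed */
    {
        if (*p != *q) return *p < *q ? -1 : 1;
        p++; q++;
    }
    return 0;
}
```

Note that byte-order matters here: you can't compare the two words with `<` directly on little-endian machines, which is why the mismatching chunk is rescanned byte by byte.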
Out of curiosity, what data type would you expect an optimized C version to use? I'd think size_t is most obvious, but then you deal with the problem that it's a different number of bytes depending on the processor.
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to process it differently depending on the native word size of the CPU, or use vector intrinsics when possible (note: ARM has these too, though I don't know how to write them offhand). Otherwise you'll have to write conditionals/for-loops and hope the compiler is smart enough to unroll them. Or you could pick uint32_t/uint64_t and use that all the time, but you run the risk of making the CPU angry, especially if you're doing any math.
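Something like this, say (a sketch of the macro idea only; the names are invented and the specialization here is SSE2 vs. a generic two-uint64_t fallback):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: one 16-byte "chunk differs" primitive, selected
 * per target by the preprocessor. */
#ifdef __SSE2__
#include <emmintrin.h>
/* Returns nonzero if the two 16-byte chunks differ. */
static int chunk16_differs(const void *a, const void *b)
{
    __m128i x = _mm_loadu_si128((const __m128i *)a);
    __m128i y = _mm_loadu_si128((const __m128i *)b);
    /* cmpeq sets each equal byte to 0xff; movemask collects one bit per
     * byte, so all-equal yields 0xffff. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) != 0xffff;
}
#else
static int chunk16_differs(const void *a, const void *b)
{
    uint64_t x[2], y[2];
    memcpy(x, a, 16);
    memcpy(y, b, 16);
    return (x[0] != y[0]) | (x[1] != y[1]);
}
#endif
```

The point being that callers only ever see the 16-byte primitive, and the per-architecture ugliness stays in one place.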
As for compiler vectorization, GCC _can_ emit vector instructions, but it's not very smart. For instance, we have a function named memset_aligned_32, which writes 32-byte-aligned chunks. But GCC doesn't know that. It just emits regular quadword instructions. So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually does what you expect.
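For comparison, here's roughly what the intrinsic version would look like (a sketch only; the real Wine function may differ, and the AVX path is guarded so the fallback compiles anywhere). The contract being sketched: dst is 32-byte aligned and n is a multiple of 32.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of an intrinsic memset over 32-byte-aligned
 * chunks; dst must be 32-byte aligned, n a multiple of 32. */
#ifdef __AVX__
#include <immintrin.h>
static void memset_aligned_32(void *dst, unsigned char c, size_t n)
{
    __m256i v = _mm256_set1_epi8((char)c);   /* c broadcast to 32 bytes */
    unsigned char *p = dst;
    for (size_t i = 0; i < n; i += 32)
        _mm256_store_si256((__m256i *)(p + i), v);  /* aligned 32-byte store */
}
#else
static void memset_aligned_32(void *dst, unsigned char c, size_t n)
{
    memset(dst, c, n);  /* portable fallback */
}
#endif
```

With the intrinsic, the 32-byte store is guaranteed; with plain C, you're hoping the optimizer rediscovers it.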
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and the border cases.
Bear in mind that most of these functions read single bytes, presently. So it can't get slower than it already is.
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether buffers are always entirely valid and whether you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr for instance. [1]
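The usual way out for memchr (this is a sketch of the standard trick, not code from the patch) is to read only *aligned* words: an aligned load never crosses a page boundary, so even though it may read a few bytes past the match - possibly past the caller's buffer - those bytes live on the same page as valid data and the load cannot fault.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: word-at-a-time memchr. Aligned word loads never
 * cross a page boundary, so reading bytes past the match (within the
 * same aligned word) cannot fault even if they are outside the buffer. */
static void *memchr_words(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char uc = (unsigned char)c;

    /* Byte loop until p is word-aligned. */
    while (n && ((uintptr_t)p % sizeof(size_t)))
    {
        if (*p == uc) return (void *)p;
        p++; n--;
    }

    /* "Has a zero byte" bit trick: XOR the target byte out, then a zero
     * byte in the word signals a match somewhere in that word. */
    size_t lo  = (size_t)-1 / 0xff;        /* 0x0101...01 */
    size_t hi  = lo << 7;                  /* 0x8080...80 */
    size_t rep = lo * uc;                  /* uc repeated in every byte */
    while (n >= sizeof(size_t))
    {
        size_t w = *(const size_t *)p ^ rep;  /* aligned load, on purpose */
        if ((w - lo) & ~w & hi) break;        /* match in this word: rescan */
        p += sizeof(size_t);
        n -= sizeof(size_t);
    }

    while (n--)                            /* tail, or the matching word */
    {
        if (*p == uc) return (void *)p;
        p++;
    }
    return NULL;
}
```

Whether that kind of reasoning is acceptable in Wine's C code (it leans on the target's page-granular memory protection, not on the C standard) is exactly the sort of thing I'd want the patch to spell out.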
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
I'm curious whether or not this would crash on native windows. You'd have that problem even if writing a C optimized memcmp. Maybe the patch author should be required to test edge cases like this?
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is - probably some obscure historical reason or an optimization for a specific target architecture - and, if for some reason we needed to optimize it further, I would simply be unable to without rewriting it entirely.
Yeah, agreed... that function is awful. I'd like to see code with more comments.