On 4/3/22 04:35, Elaine Lefler wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it would be good to add a more efficient C implementation first (it will also show whether an SSE2 implementation is really needed).
Thanks, Piotr
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. Because the bottleneck isn't CPU power, the bottleneck is memory access. Like I said, vectors were created specifically to solve this problem, and IME you won't find notable performance gains without using them.
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes it's useful.
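To illustrate what "writing the code in a nice enough way" could look like, here is a minimal sketch of a loop with no early exit and no data-dependent branches. The helper name blocks_differ is hypothetical and is not taken from Wine or from any patch in this thread. Whether a compiler actually vectorizes it depends on the compiler, the target and the optimization flags.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: returns nonzero if the two
 * buffers differ anywhere in their first n bytes. */
static int blocks_differ(const unsigned char *a, const unsigned char *b, size_t n)
{
    unsigned char acc = 0;
    size_t i;

    /* No early exit: every iteration does the same work, which is the
     * kind of loop an optimizing compiler may turn into vector code. */
    for (i = 0; i < n; i++)
        acc |= a[i] ^ b[i];

    return acc != 0;
}
```

The trade-off is that the loop always reads both buffers entirely, so it only fits cases where the whole range is known to be valid and an early exit is not required.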
Then it's always a matter of a trade-off between optimizing for the large data case vs optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle the data alignment and the border cases.
For this specific memcmp case, I believe that by using larger data types and avoiding unnecessary branches, you can already improve the C code well enough.
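As a rough sketch of that idea, and not the code that is or will be in msvcrt, a word-at-a-time comparison could look like the following. The name memcmp_words is hypothetical, and the sketch assumes both buffers are entirely valid for n bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name, for illustration only. Assumes both buffers are
 * fully readable for n bytes. */
static int memcmp_words(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;

    /* Compare 8 bytes at a time. memcpy sidesteps alignment and aliasing
     * problems; compilers typically reduce it to a single load. */
    while (n >= sizeof(uint64_t))
    {
        uint64_t x, y;
        memcpy(&x, p, sizeof(x));
        memcpy(&y, q, sizeof(y));
        if (x != y) break; /* the first differing byte is in this word */
        p += sizeof(x);
        q += sizeof(y);
        n -= sizeof(x);
    }

    /* Byte-wise loop: either the tail of fewer than 8 bytes, or the
     * search for the differing byte inside the mismatching word. */
    while (n--)
    {
        if (*p != *q) return *p < *q ? -1 : 1;
        p++;
        q++;
    }
    return 0;
}
```

Falling back to the byte loop on a mismatch keeps the result independent of endianness, at the cost of re-reading at most 8 bytes.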
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether buffers are always entirely valid and whether you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr for instance. [1]
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
Personally I think Jinoh's suggestion to find a compatible-licensed library and copy their code is best. Otherwise I sense this will become an endless circle of "do we really need it?" (yes, but this type of code is annoying to review) and Wine could benefit from using an implementation that's already widely-tested.
I personally don't like the idea at all. Copying code from another library is just the best way to get code with no history, whose characteristics and underlying rationale no one really understands.
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is, probably for some obscure historical or specific target architecture optimization, and, if for some reason we need to optimize it further I would just be unable to without rewriting it entirely.
Cheers,