On 4/5/22 12:12, Elaine Lefler wrote:
On Sun, Apr 3, 2022 at 7:00 AM Rémi Bernon rbernon@codeweavers.com wrote:
Vectorized instructions and intrinsics are just an extension of the idea of using larger types to process more data at a time. You can already do that to some extent using standard C, and, if you write the code in a nice enough way, the compiler may even be able to understand the intent and extend it further with vectorized instructions when it believes that's useful.
For this specific memcmp case, I believe that by using larger data types and avoiding unnecessary branches you can already improve the C code well enough.
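For illustration, a minimal sketch of that idea (memcmp_words is a made-up name, not the actual patch): compare word-sized chunks for equality, then fall back to bytes to locate the difference.

#include <stddef.h>
#include <string.h>

/* Sketch: compare word-sized chunks for equality, then fall back to
 * bytes to find the differing position. memcpy() keeps the loads free
 * of unaligned-access UB; compilers lower it to a single load. */
static int memcmp_words(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a, *q = b;

    while (n >= sizeof(size_t))
    {
        size_t x, y;
        memcpy(&x, p, sizeof(x));
        memcpy(&y, q, sizeof(y));
        if (x != y) break; /* the difference is inside this word */
        p += sizeof(size_t);
        q += sizeof(size_t);
        n -= sizeof(size_t);
    }
    while (n--)
    {
        if (*p != *q) return *p < *q ? -1 : 1;
        p++;
        q++;
    }
    return 0;
}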
Out of curiosity, what data type would you expect an optimized C version to use? I'd think size_t is the most obvious choice, but then you have to deal with the problem that it's a different number of bytes depending on the processor.
If anything, I'd say a standardized 16-byte data size is the cleaner solution, because you can write preprocessor macros to parse it
Did you mean "maximum vector register size supported by the processor"? Note that AVX2 has 32-byte registers, and AVX512 takes it further with 64-byte ones. Also worth noting is that ARM64 has the Scalable Vector Extension (SVE), which does *not* define the vector size at all. SVE leaves it to the processor implementation, allowing for seamless vector size expansion.
differently depending on the native word size of the CPU, or use vector intrinsics when possible (note: ARM has these too, though I don't know how to write them offhand). Otherwise you'll have to write conditionals/for-loops and hope the compiler is smart enough to unroll them. Or you could pick uint32_t/uint64_t and use that all the time, but you run the risk of making the CPU angry, especially if you're doing any math.
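FWIW, here's a sketch of what that could look like, using GCC/Clang vector extensions as the fast path and plain words as the fallback (block16 is a made-up name):

#include <stdint.h>

/* Sketch: one fixed 16-byte block type, chosen per target. */
#if defined(__SSE2__) || defined(__ARM_NEON)
typedef unsigned char block16 __attribute__((vector_size(16)));
#else
typedef struct { uint64_t w[2]; } block16; /* two native words */
#endif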
As for compiler vectorization, GCC _can_ emit vector instructions, but
I don't think we're going to use -O3 anytime soon.
There's "-ftree-vectorize" but I'm not sure if it exists on Clang.
it's not very smart. For instance, we have a function named memset_aligned_32, which writes aligned 32-byte chunks. But GCC doesn't know that. It just emits regular quadword instructions.
Wouldn't __attribute__((aligned(32))) work?
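E.g. something like this (a sketch with a made-up signature; __builtin_assume_aligned is the GCC/Clang way to assert alignment of a pointer parameter):

#include <stdint.h>
#include <stddef.h>

/* Sketch (made-up signature): telling the compiler about alignment.
 * Assumes n is a multiple of 32. */
static void memset_aligned_32(void *dst, uint64_t v, size_t n)
{
    /* GCC/Clang builtin: the returned pointer is known to be
     * 32-byte aligned, so the loop below can be vectorized */
    uint64_t *d = __builtin_assume_aligned(dst, 32);
    uint64_t *end = (uint64_t *)((char *)d + n);
    while (d < end)
    {
        d[0] = v; d[1] = v; d[2] = v; d[3] = v;
        d += 4;
    }
}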
So that's some complicated code which isn't actually better than a straightforward uint64_t loop. I think that's the reason I prefer seeing intrinsics - granted, I have a lot of experience reading them, and I understand they're unfriendly to people who aren't familiar - but they give you assurance that the compiler actually works as expected.
I think writing assembly directly is still best for performance, since we can control instruction scheduling that way.
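For reference, here's roughly what the intrinsics form of such an inner loop looks like (SSE2 sketch; first_diff_chunk is a made-up helper):

#include <emmintrin.h> /* SSE2 */
#include <stddef.h>

/* Sketch: scan 16-byte chunks, stop at the first one that differs. */
static size_t first_diff_chunk(const unsigned char *a,
                               const unsigned char *b, size_t n)
{
    size_t i = 0;
    while (i + 16 <= n)
    {
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(b + i));
        /* one bit per byte: 0xffff means all 16 bytes are equal */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) != 0xffff) break;
        i += 16;
    }
    return i; /* offset of the first differing chunk (or the tail) */
}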
Then it's always a trade-off between optimizing for the large data case and optimizing for the small data case. The larger the building blocks you use, the more you will cripple the small data case, as you will need to carefully handle data alignment and the boundary cases.
Bear in mind that most of these functions currently read single bytes. So it can't get slower than it already is.
Well, it *does* get slower, due to the extra branches, if the caller makes frequent mem* calls with small data sizes (e.g. 1-15 bytes).
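Roughly, the shape is this (a sketch; memcmp_dispatch and memcmp_wide are made-up names, and memcmp_wide just stands in for whatever wide-chunk path we pick):

#include <stddef.h>
#include <string.h>

/* Placeholder for the wide-chunk path discussed above. */
static int memcmp_wide(const unsigned char *p, const unsigned char *q,
                       size_t n)
{
    return memcmp(p, q, n);
}

/* Sketch: dispatch tiny sizes before any alignment prologue, so they
 * pay for one branch instead of the full setup. */
static int memcmp_dispatch(const unsigned char *p, const unsigned char *q,
                           size_t n)
{
    if (n < 16)
    {
        while (n--)
        {
            if (*p != *q) return *p < *q ? -1 : 1;
            p++;
            q++;
        }
        return 0;
    }
    return memcmp_wide(p, q, n);
}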
Note that, especially for the functions which are supposed to stop their iteration early, you also need to consider whether the buffers are always entirely valid and whether you are allowed to read larger chunks of data at a time. It seems to be the case for memcmp, but not for memchr, for instance. [1]
[1] https://trust-in-soft.com/blog/2015/12/21/memcmp-requires-pointers-to-fully-...
I'm curious whether or not this would crash on native Windows. You'd have that problem even when writing an optimized C memcmp. Maybe the patch author should be required to test edge cases like this?
It might be helpful to pass partially mapped buffers to the function and see exactly where the access violation happens.
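Something along these lines (a sketch; it commits one page and leaves the next one reserved, so any read past the end faults exactly at the boundary):

#include <windows.h>
#include <stdio.h>
#include <string.h>

/* Sketch: put a buffer at the very end of a committed page, with the
 * next page reserved but not committed, so any out-of-bounds read
 * raises an access violation right at the page boundary. */
int main(void)
{
    SYSTEM_INFO si;
    unsigned char *base, *buf;
    SIZE_T page;

    GetSystemInfo(&si);
    page = si.dwPageSize;

    base = VirtualAlloc(NULL, 2 * page, MEM_RESERVE, PAGE_NOACCESS);
    VirtualAlloc(base, page, MEM_COMMIT, PAGE_READWRITE); /* 1st page only */

    buf = base + page - 16; /* 16 valid bytes, then unmapped memory */
    memset(buf, 'x', 16);

    /* a conforming memchr must not touch base + page here */
    printf("%p\n", memchr(buf, 'y', 16));
    return 0;
}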
Like I said in another thread, the memcpy C code that's been adapted from glibc to msvcrt is IMHO a good example. It may very well be correct, but looking at it I'm simply unable to say that it is.
Maybe I'm unable to read code, but my first and only impression is that it's unnecessarily complex. I don't know why it is the way it is (probably some obscure historical reason, or an optimization for a specific target architecture), and if for some reason we needed to optimize it further, I would simply be unable to without rewriting it entirely.
Yeah, agreed... that function is awful. I'd like to see code with more comments.
Here's how it works:
1. The dst - src >= n check tests whether either A. dst precedes src (dst < src), or B. the intervals [dst, dst+n) and [src, src+n) do not overlap each other (|dst - src| >= n). The comparison takes advantage of unsigned integer wraparound for condition A, which allows testing both conditions with a single comparison.
2. In the forward copy case, the function first copies a few head bytes until dst is word-aligned.
3. The function then checks whether src is word-aligned as well. If it is, the function performs a word-by-word copy.
4. Otherwise, the function still avoids unaligned access. It does an aligned load of the two consecutive words that contain the unaligned source block, then bit-shifts and merges the loaded words so that the desired part is copied into the destination. The loop slides this two-word window over the source buffer until the (aligned) end.
5. Lastly, the trailing bytes are copied.
6. Steps 2-5 are done similarly in the reverse direction. (A rough sketch of this structure follows below.)
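Here is that stripped-down sketch (forward direction fleshed out; little-endian, 64-bit words, strict-aliasing niceties ignored for brevity; the names are mine, not the msvcrt code's):

#include <stdint.h>
#include <stddef.h>

static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* step 1: unsigned wraparound makes this one compare cover both
     * "dst precedes src" and "the buffers do not overlap" */
    if ((uintptr_t)d - (uintptr_t)s >= n)
    {
        /* step 2: head bytes until dst is 8-byte aligned */
        while (n && ((uintptr_t)d & 7)) { *d++ = *s++; n--; }

        if (!((uintptr_t)s & 7)) /* step 3: src is aligned too */
        {
            for (; n >= 8; n -= 8, d += 8, s += 8)
                *(uint64_t *)d = *(const uint64_t *)s;
        }
        else if (n >= 16) /* step 4: aligned loads, shift-merged */
        {
            unsigned off = (uintptr_t)s & 7;      /* 1..7 */
            unsigned lsh = off * 8, rsh = 64 - lsh;
            /* note: this aligned load may touch up to 7 bytes before
             * src; safe in practice since an aligned load never
             * crosses a page boundary, though not strict ISO C */
            const uint64_t *w = (const uint64_t *)(s - off);
            uint64_t cur = *w++;
            while (n >= 16) /* keeps the window inside the source */
            {
                uint64_t next = *w++;
                *(uint64_t *)d = (cur >> lsh) | (next << rsh);
                cur = next;
                d += 8; s += 8; n -= 8;
            }
        }
        /* step 5: trailing bytes */
        while (n--) *d++ = *s++;
    }
    else /* step 6: overlap, copy backwards (byte loop here for brevity) */
    {
        d += n; s += n;
        while (n--) *--d = *--s;
    }
    return dst;
}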
That said, it does need more work on its verbose pointer arithmetic...