Hi Fabian,
I'm still looking into possible ways of optimizing the function. Your patch only affects a subset of memmove calls. It also slows down some cases a lot (around 1.5-2 times). I don't have code ready yet, but it looks like it will be possible to write a C implementation that is ~10% slower than native.
Also, quick testing shows that gcc and clang optimize a simple implementation very well. Something like https://source.winehq.org/patches/data/191083 (it's incorrect; I didn't mean to send it to wine-devel yet) performs similarly to native if the -O2 option is used. The same implementation is terribly slow if -O0 is used.
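To illustrate what "a simple implementation" means here, a minimal byte-wise sketch follows. It is not the code from the patch linked above: the name simple_memmove is made up for this example, and the direction check via unsigned pointer subtraction is just one common way to write it.

```c
#include <stddef.h>
#include <stdint.h>

/* simple_memmove is a hypothetical name used only for this sketch. */
void *simple_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Unsigned wrap-around makes this true when dst is below src or
     * at least n bytes above it, i.e. when a forward copy is safe. */
    if ((uintptr_t)d - (uintptr_t)s >= n)
    {
        while (n--) *d++ = *s++;
    }
    else
    {
        /* dst lies inside [src, src + n): copy from the end backwards. */
        d += n;
        s += n;
        while (n--) *--d = *--s;
    }
    return dst;
}
```

Note that there is no early return for dst == src, so every byte is still read and written in that case.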
I'm not sure yet how complicated code that doesn't depend on the compiler to optimize it will be. I'm planning to implement a proof-of-concept patch to check that next.
I'm hoping that we will come up with a better patch, but here are a few comments about yours:
- the __GNUC__ checks are not needed
- the WT alias is not needed
- it doesn't work correctly in the d==s case with invalid pointers / write watches
- it decreases performance a lot if the buffers overlap or the word-copying path is not used
I've also tested the full implementation from musl (which uses their memcpy implementation in some cases). It performs much better. However, it's still much slower than native if the buffers overlap (around 3 times slower). It should be possible to optimize this case as well.
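For anyone who wants to reproduce the overlapping vs. non-overlapping comparison, a rough timing sketch is below. This is not the harness used for the numbers above: time_memmove and its parameters are placeholders for illustration, and clock() only gives coarse CPU-time resolution.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* time_memmove is a hypothetical helper: CPU seconds spent on iters
 * calls that move size bytes to an address offset bytes higher.
 * 0 < offset < size gives overlapping buffers.
 * Returns -1.0 if the allocation fails. */
double time_memmove(size_t size, size_t offset, int iters)
{
    /* Volatile pointer so the compiler cannot drop or merge the calls. */
    void *(*volatile fn)(void *, const void *, size_t) = memmove;
    unsigned char *buf = malloc(size + offset);
    clock_t start, end;
    int i;

    if (!buf) return -1.0;
    memset(buf, 0x55, size + offset);

    start = clock();
    for (i = 0; i < iters; i++)
        fn(buf + offset, buf, size);
    end = clock();

    free(buf);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Pointing fn at the implementation under test instead of memmove, and comparing an offset smaller than size against one equal to size, should show the overlap penalty.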
Thanks,
Piotr