Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=49663
Signed-off-by: Fabian Maurer <dark.shadow4@web.de>
---
 dlls/msvcrt/string.c | 53 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 12 deletions(-)
diff --git a/dlls/msvcrt/string.c b/dlls/msvcrt/string.c
index ab6747db5c..fb2ef3ee01 100644
--- a/dlls/msvcrt/string.c
+++ b/dlls/msvcrt/string.c
@@ -30,6 +30,7 @@
 #include <stdio.h>
 #include <math.h>
 #include <limits.h>
+#include <stdint.h>
 #include <errno.h>
 #include "msvcrt.h"
 #include "bnum.h"
@@ -2482,22 +2483,50 @@ int __cdecl MSVCRT_memcmp(const void *ptr1, const void *ptr2, MSVCRT_size_t n)
 /*********************************************************************
  *		memmove (MSVCRT.@)
+ * Copied from musl (http://www.musl-libc.org/): src/string/memmove.c
+ *
+ * ====================================================
+ * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
+ *
+ * Developed at SunSoft, a Sun Microsystems, Inc. business.
+ * Permission to use, copy, modify, and distribute this
+ * software is freely granted, provided that this notice
+ * is preserved.
+ * ====================================================
  */
 void * __cdecl MSVCRT_memmove(void *dst, const void *src, MSVCRT_size_t n)
 {
-    volatile unsigned char *d = dst;  /* avoid gcc optimizations */
-    const unsigned char *s = src;
-
-    if ((MSVCRT_size_t)dst - (MSVCRT_size_t)src >= n)
-    {
-        while (n--) *d++ = *s++;
-    }
-    else
-    {
-        d += n - 1;
-        s += n - 1;
-        while (n--) *d-- = *s--;
+    char *d = dst;
+    const char *s = src;
+    typedef __attribute__((__may_alias__)) size_t WT;
+    const int WS = sizeof(WT);
+
+    if (d==s) return d;
+
+    if (d<s) {
+#ifdef __GNUC__
+        if ((uintptr_t)s % WS == (uintptr_t)d % WS) {
+            while ((uintptr_t)d % WS) {
+                if (!n--) return dst;
+                *d++ = *s++;
+            }
+            for (; n>=WS; n-=WS, d+=WS, s+=WS) *(WT *)d = *(WT *)s;
+        }
+#endif
+        for (; n; n--) *d++ = *s++;
+    } else {
+#ifdef __GNUC__
+        if ((uintptr_t)s % WS == (uintptr_t)d % WS) {
+            while ((uintptr_t)(d+n) % WS) {
+                if (!n--) return dst;
+                d[n] = s[n];
+            }
+            while (n>=WS) n-=WS, *(WT *)(d+n) = *(WT *)(s+n);
+        }
+#endif
+        while (n) n--, d[n] = s[n];
     }
+    return dst;
 }
--
2.28.0
Hi Fabian,
I'm still looking into possible ways of optimizing the function. Your patch only affects a subset of memmove calls. It also slows down some cases a lot (around 1.5-2 times). I don't have ready code yet, but it looks like it will be possible to write a C implementation that is only ~10% slower than native.
Also, quick testing shows that gcc and clang optimize a simple implementation very well. Something like https://source.winehq.org/patches/data/191083 (it's incorrect, I didn't mean to send it to wine-devel yet) has performance similar to native when the -O2 option is used. The same implementation is terribly slow with -O0.
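For illustration only (this is not the implementation linked above, and the function name is made up), the kind of simple byte-loop memmove that gcc/clang auto-vectorize well at -O2 looks roughly like this:

#include <stddef.h>

/* naive byte-loop memmove; compilers vectorize the two loops well at -O2,
 * while at -O0 it degrades to a plain byte-at-a-time copy */
static void *naive_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d <= s)
        while (n--) *d++ = *s++;    /* forward copy is safe when dst <= src */
    else
        while (n--) d[n] = s[n];    /* backward copy is safe when dst > src */
    return dst;
}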
I'm not sure yet how complicated code that doesn't depend on the compiler to optimize it will be. I'm planning to implement a proof-of-concept patch next to check.
I'm hoping that we will come up with a better patch, but here are a few comments about yours:
- the __GNUC__ checks are not needed
- the WT alias is not needed
- it doesn't work correctly in the d==s case on invalid pointers / write watches
- it decreases performance a lot if the buffers overlap or the word-copying path is not used
I've also tested the full implementation from musl (which uses their memcpy implementation in some cases). It performs much better in general, but it's still much slower than native if the buffers overlap (around 3 times slower). It should be possible to optimize this case as well.
Thanks, Piotr
Hello Piotr,
It would probably also be useful to benchmark the different glibc implementations, because for games 10% more speed would be nice. I doubt a C implementation can compete with an AVX-based one.
Your patch only affects a subset of memmove calls. It also slows down
some cases a lot (around 1.5-2 times).
You mean the cases where we could use memcpy?
I've also tested the full implementation from musl (which uses their memcpy
implementation in some cases). It performs much better in general, but it's still much slower than native if the buffers overlap (around 3 times slower).
musl is slower in a lot of cases. I'm attaching a cheap test program. You can compile it normally with "gcc", or you can link musl statically with "musl-gcc". That should compare the best glibc implementation against the best musl implementation. Correct me if I'm wrong, though.
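The attachment is not reproduced here; for illustration only, a harness of that kind might look roughly like this (the buffer size, iteration count and file name are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* times an overlapping forward memmove of a large block; build with
 * "gcc -O2 bench.c" against glibc or "musl-gcc -static -O2 bench.c"
 * against musl and compare the printed timings */
int main(void)
{
    const size_t size = 64 * 1024 * 1024;   /* 64MB block */
    const int iterations = 100;
    char *buf = malloc(size + 64);
    struct timespec t0, t1;
    int i;

    if (!buf) return 1;
    memset(buf, 0x55, size + 64);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iterations; i++)
        memmove(buf, buf + 64, size);       /* overlapping, copies forward */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("Elapsed time %.0fms.\n", (t1.tv_sec - t0.tv_sec) * 1000.0 +
                                     (t1.tv_nsec - t0.tv_nsec) / 1000000.0);
    free(buf);
    return 0;
}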
Regards, Fabian Maurer
Hi Fabian,
On 8/20/20 7:13 PM, Fabian Maurer wrote:
It would probably also be useful to benchmark the different glibc implementations, because for games 10% more speed would be nice. I doubt a C implementation can compete with an AVX-based one.
Yes, we might need to add a better (platform-specific) implementation to wine at some point. It still makes sense to improve the generic code (especially if its performance can easily be increased 2-3 times for bigger move operations).
Your patch only affects a subset of memmove calls. It also slows down
some cases a lot (around 1.5-2 times).
You mean the cases where we could use memcpy?
And the cases where the buffers can't be word-aligned.
I've also tested the full implementation from musl (which uses their memcpy
implementation in some cases). It performs much better in general, but it's still much slower than native if the buffers overlap (around 3 times slower).
musl is slower in a lot of cases. I'm attaching a cheap test program. You can compile it normally with "gcc", or you can link musl statically with "musl-gcc". That should compare the best glibc implementation against the best musl implementation. Correct me if I'm wrong, though.
I was comparing the musl, glibc and native implementations. In the i386 case, on average, glibc was the fastest (e.g. ~2 times faster for big memmoves than both musl and native msvcrt). musl performed similarly to native msvcrt in the memcpy case. When memory was copied starting from the end, musl was much slower (~3 times).
Thanks, Piotr
FWIW "rep movsb" is supposedly the fastest when transferring larger blocks (I think more than 128 bytes?) on recent CPUs. The cool thing is that the CPU handles everything, no matter the alignment or "memcpy vs memmove", so it's by far the simplest, and since it knows about the alignment requirements of that particular CPU it can optimize it internally itself.
Same story with "rep stosb" for memset. Unfortunately these are very slow on older CPUs. I think there's a CPUID flag that says whether they are fast; we could use that.
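For reference, a sketch of such a "rep movsb" copy with GNU inline assembly (forward/non-overlapping case only; a real memmove would still need a backward path for overlapping buffers):

#include <stddef.h>

/* sketch only: copy n bytes forward with "rep movsb"; the "+D"/"+S"/"+c"
 * constraints let the instruction advance edi/esi/ecx itself */
static void *movsb_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      : : "memory");
    return ret;
}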
On 8/21/20 14:51, Gabriel Ivăncescu wrote:
FWIW "rep movsb" is supposedly the fastest when transferring larger blocks (I think more than 128 bytes?) on recent CPUs. The cool thing is that the CPU handles everything, no matter the alignment or "memcpy vs memmove", so it's by far the simplest, and since it knows about the alignment requirements of that particular CPU it can optimize it internally itself.
Same story with "rep stosb" for memset. Unfortunately these are very slow on older CPUs. I think there's a CPUID flag that says whether they are fast; we could use that.
Yeah, that is (__cpuid(7):EBX & (1 << 9)).
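That is the ERMS ("enhanced rep movsb/stosb") feature bit; with gcc's <cpuid.h> the check could look roughly like this (the function name is made up):

#include <cpuid.h>

/* returns non-zero when CPUID leaf 7 reports ERMS (EBX bit 9),
 * i.e. fast rep movsb/stosb */
static int have_fast_rep_movsb(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ebx & (1u << 9)) != 0;
}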
On 8/21/20 1:51 PM, Gabriel Ivăncescu wrote:
FWIW "rep movsb" is supposedly the fastest when transferring larger blocks (I think more than 128 bytes?) on recent CPUs. The cool thing is that the CPU handles everything, no matter the alignment or "memcpy vs memmove", so it's by far the simplest, and since it knows about the alignment requirements of that particular CPU it can optimize it internally itself.
Same story with "rep stosb" for memset. Unfortunately these are very slow on older CPUs. I think there's a CPUID flag that says whether they are fast; we could use that.
"rep movsb" is ~3 times slower than "rep movl" on my cpu (AMD Ryzen 7 2700X). Maybe the single byte variant is better optimized on Intel cpus.
Also, the "rep movl" implementation is still around 2 times slower than memmove from glibc (tested on 64MB data blocks).
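For reference, a dword-sized "rep movsl" copy of the kind being compared here can be sketched like this (forward copy only; not the code used for the measurements above):

#include <stddef.h>

/* sketch only: copy n/4 dwords with "rep movsl", then the remaining
 * 0-3 bytes one at a time; dst/src are advanced by the asm itself */
static void *movsl_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    size_t dwords = n / 4;
    unsigned char *d;
    const unsigned char *s;

    __asm__ volatile ("rep movsl"
                      : "+D"(dst), "+S"(src), "+c"(dwords)
                      : : "memory");
    d = dst;                       /* now points just past the copied dwords */
    s = src;
    for (n %= 4; n; n--) *d++ = *s++;
    return ret;
}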
Thanks, Piotr
Hi Gabriel,
I was experimenting with various attempts at implementing memmove. I'm attaching a modified version of Paul's test application. It compares memmove performance from the ucrtbase, msvcr100 and msvcrt dlls. It also contains an assembler (i386) implementation of the function.
Thanks, Piotr
On 8/21/20 20:21, Piotr Caban wrote:
Hi Gabriel,
I was experimenting with various attempts at implementing memmove. I'm attaching a modified version of Paul's test application. It compares memmove performance from the ucrtbase, msvcr100 and msvcrt dlls. It also contains an assembler (i386) implementation of the function.
Thanks, Piotr
Was there any reason why you excluded the latest vcruntime140 from the comparison? As it is a newer implementation, one would expect it to be better optimized for modern CPUs.
On 8/21/20 9:29 PM, Paul Gofman wrote:
Was there any reason why you excluded the latest vcruntime140 from the comparison? As it is a newer implementation, one would expect it to be better optimized for modern CPUs.
No, I must have dropped it at some point. On my system ucrtbase is newer than vcruntime140 (it got updated in Windows 10 1909).
Thanks, Piotr
On 21/08/2020 20:21, Piotr Caban wrote:
Hi Gabriel,
I was experimenting with various attempts at implementing memmove. I'm attaching a modified version of Paul's test application. It compares memmove performance from the ucrtbase, msvcr100 and msvcrt dlls. It also contains an assembler (i386) implementation of the function.
Thanks, Piotr
Hi Piotr,
Here are the results on a Haswell Xeon E3-1241 v3 CPU (all 32-bit to compare with your assembly implementation). I've also added an extra test (attached function) that simply uses `rep movsb`.
Quick summary: your assembler implementation is very good overall compared to the one from Windows 10 (ucrtbase). The only case where it is significantly slower (by 10%) is the "aligned non-overlap" case (the first test). In the other cases it performs just as well as ucrtbase.
The simple "rep movsb" function I added as a quick test is also faster than your assembly implementation for this case only (aligned, non-overlapped).
However, it is extremely slow in the overlap cases, where we copy backwards. I guess the CPU is not optimized for backward copies with it. On your CPU, I understand `rep movsl` is faster even in the first test than `rep movsb`?
One last thing worth mentioning is the "small moves" case: it seems the older runtimes do much better here. I think we can handle those separately, without using movsb/movsl, which I understand require some startup time from the CPU for alignment checks and so on before they copy at full bandwidth.
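For illustration only (the 16-byte threshold is a guess and the bulk path is just a placeholder), a small-size dispatch of that kind could look like:

#include <stddef.h>
#include <string.h>

/* hypothetical sketch: copy tiny blocks with plain byte loops to avoid the
 * startup cost of rep movs / wide copies, and only enter the bulk path above
 * a small threshold */
static void *memmove_small_first(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n <= 16)
    {
        if (d <= s)
            while (n--) *d++ = *s++;    /* forward byte copy */
        else
            while (n--) d[n] = s[n];    /* backward byte copy */
        return dst;
    }
    return memmove(dst, src, n);        /* bulk path (movs/SSE/...) */
}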
Here's the entire log:
Test ucrtbase implementation
  Aligned              Elapsed time 2659ms.
  Non-aligned          Elapsed time 3004ms.
  Aligned overlap      Elapsed time 2817ms.
  Non-aligned overlap  Elapsed time 2871ms.
  src==dst             Elapsed time 2345ms.
  Small moves          Elapsed time 310ms.
  Small moves          Elapsed time 313ms.
  Small moves          Elapsed time 308ms.
  correctness test     Elapsed time 2163ms.

Test msvcr100 implementation
  Aligned              Elapsed time 3674ms.
  Non-aligned          Elapsed time 2998ms.
  Aligned overlap      Elapsed time 2808ms.
  Non-aligned overlap  Elapsed time 2853ms.
  src==dst             Elapsed time 2397ms.
  Small moves          Elapsed time 115ms.
  Small moves          Elapsed time 196ms.
  Small moves          Elapsed time 328ms.
  correctness test     Elapsed time 2142ms.

Test msvcrt implementation
  Aligned              Elapsed time 3669ms.
  Non-aligned          Elapsed time 2967ms.
  Aligned overlap      Elapsed time 2829ms.
  Non-aligned overlap  Elapsed time 2872ms.
  src==dst             Elapsed time 2410ms.
  Small moves          Elapsed time 129ms.
  Small moves          Elapsed time 197ms.
  Small moves          Elapsed time 332ms.
  correctness test     Elapsed time 2168ms.

Test assembler implementation
  Aligned              Elapsed time 2940ms.
  Non-aligned          Elapsed time 2985ms.
  Aligned overlap      Elapsed time 2809ms.
  Non-aligned overlap  Elapsed time 2848ms.
  src==dst             Elapsed time 2813ms.
  Small moves          Elapsed time 271ms.
  Small moves          Elapsed time 491ms.
  Small moves          Elapsed time 292ms.
  correctness test     Elapsed time 2156ms.

Test rep movsb implementation
  Aligned              Elapsed time 2731ms.
  Non-aligned          Elapsed time 3042ms.
  Aligned overlap      Elapsed time 5910ms.
  Non-aligned overlap  Elapsed time 5910ms.
  src==dst             Elapsed time 5912ms.
  Small moves          Elapsed time 289ms.
  Small moves          Elapsed time 287ms.
  Small moves          Elapsed time 288ms.
  correctness test     Elapsed time 2181ms.

done
On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
I understand `rep movsl` is faster even in the first test than `rep movsb`?
No, it was faster in "Non-aligned", "Aligned overlap" and "Non-aligned overlap" tests. In the "Aligned" case the performance was identical no matter if movsb or movsl was used.
I'm also attaching a simple SSE2 implementation for comparison. It's faster than the previous one on my machine. I'm also attaching results from running the test on Windows (in a VM).
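The attached SSE2 code is not reproduced here; for illustration only, a minimal SSE2 copy loop with intrinsics (unaligned movdqu loads/stores, forward direction only; not the attached implementation) looks roughly like this:

#include <emmintrin.h>
#include <stddef.h>

/* sketch only: forward SSE2 copy using unaligned 16-byte loads/stores,
 * with the remaining tail bytes copied one at a time */
static void sse2_copy_forward(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    for (; n >= 16; n -= 16, d += 16, s += 16)
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
    while (n--) *d++ = *s++;
}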
Thanks, Piotr
On 25/08/2020 20:15, Piotr Caban wrote:
On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
I understand `rep movsl` is faster even in the first test than `rep movsb`?
No, it was faster in "Non-aligned", "Aligned overlap" and "Non-aligned overlap" tests. In the "Aligned" case the performance was identical no matter if movsb or movsl was used.
I'm also attaching a simple SSE2 implementation for comparison. It's faster than the previous one on my machine. I'm also attaching results from running the test on Windows (in a VM).
Thanks, Piotr
In most cases, the SSE version performs very well, in fact slightly better than the Windows implementation, and does very well for small moves.
Unfortunately, for some reason, it seems to be significantly slower (20% or more), but only in the "non-overlapped" case. Results are attached.
Thanks, Gabriel
On 26/08/2020 17:01, Gabriel Ivăncescu wrote:
On 25/08/2020 20:15, Piotr Caban wrote:
On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
I understand `rep movsl` is faster even in the first test than `rep movsb`?
No, it was faster in "Non-aligned", "Aligned overlap" and "Non-aligned overlap" tests. In the "Aligned" case the performance was identical no matter if movsb or movsl was used.
I'm also attaching a simple SSE2 implementation for comparison. It's faster than the previous one on my machine. I'm also attaching results from running the test on Windows (in a VM).
Thanks, Piotr
In most cases, the SSE version performs very well, in fact slightly better than the Windows implementation, and does very well for small moves.
Unfortunately, for some reason, it seems to be significantly slower (20% or more), but only in the "non-overlapped" case. Results are attached.
Thanks, Gabriel
Also, sorry, I forgot to mention a small thing: is there a reason you're using movdq(a|u) instead of movaps/movups (which are SSE1 rather than SSE2)? They have a smaller encoding and should very slightly help with the instruction cache, and no CPU cares about float vs. integer domains when doing only moves. (Even if one did, most SSE operations tend to be on floats anyway, assuming some broken CPU has a false dependency on them, but I doubt it.)
On 8/26/20 5:19 PM, Gabriel Ivăncescu wrote:
Also, sorry, I forgot to mention a small thing: is there a reason you're using movdq(a|u) instead of movaps/movups (which are SSE1 rather than SSE2)? They have a smaller encoding and should very slightly help with the instruction cache, and no CPU cares about float vs. integer domains when doing only moves. (Even if one did, most SSE operations tend to be on floats anyway, assuming some broken CPU has a false dependency on them, but I doubt it.)
I was considering adding a "fast path" that uses movntdq for large moves. It generally speeds things up if the whole destination buffer doesn't fit into the cache. movntdq has no SSE1 equivalent and I didn't want to mix SSE1 and SSE2 instructions. (I was also planning to guard this code with the sse2_enabled variable so it will not run on non-SSE2-capable hardware; see the _set_SSE2_enable function.)
I'm still not sure what the final implementation will look like. Maybe it will make sense to use movaps instead.
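For reference, a movntdq-based bulk copy of the kind described above can be sketched with intrinsics roughly as follows (forward only, destination assumed 16-byte aligned; the real code would additionally be guarded by an sse2_enabled-style check and handle overlap elsewhere):

#include <emmintrin.h>
#include <stddef.h>

/* sketch of a large-move fast path: non-temporal stores (movntdq) bypass the
 * cache; _mm_stream_si128 requires the destination to be 16-byte aligned */
static void sse2_copy_nontemporal(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;

    for (; n >= 16; n -= 16, d += 16, s += 16)
        _mm_stream_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
    _mm_sfence();                       /* order the non-temporal stores */
    while (n--) *d++ = *s++;            /* tail bytes */
}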
Thanks, Piotr