On 9/11/21 8:30 PM, Rémi Bernon wrote:
On 9/11/21 7:38 PM, Piotr Caban wrote:
On 9/11/21 4:41 PM, Rémi Bernon wrote:
On 9/11/21 8:51 AM, Piotr Caban wrote:
Signed-off-by: Piotr Caban <piotr@codeweavers.com>
dlls/msvcrt/string.c | 126 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 126 insertions(+)
FWIW as far as I can see on my simple throughput benchmarks, and with the default optimization flags (-O2), the unrolled C version:
- Outperforms the SSE2 assembly on x86_64 for n <= 32 (20GB/s vs 12GB/s for n = 32), and performs equally well for "aligned" operations on larger sizes.
- Runs at roughly a third of the SSE2 throughput (25GB/s vs 70GB/s on my computer) for unaligned operations like memset(dst + 1, src, n) with n >= 256.
- On i686, performs equally for small sizes (n <= 128), then at half the throughput (35GB/s vs 70GB/s) for aligned operations and a third for unaligned ones.
It still has the advantage of being C code, benefiting all architectures.
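For illustration, here is a minimal sketch of that unrolled-C idea (not the actual patch; memset_unrolled is a hypothetical name, and the memcpy calls are just a portable way to emit unaligned 64-bit stores, which compilers turn into plain movs at -O2):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *memset_unrolled(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    uint64_t v = (unsigned char)c * 0x0101010101010101ull; /* splat byte to 64 bits */

    while (n >= 32) /* four 8-byte stores per iteration */
    {
        memcpy(d,      &v, 8);
        memcpy(d + 8,  &v, 8);
        memcpy(d + 16, &v, 8);
        memcpy(d + 24, &v, 8);
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c; /* scalar tail */
    return dst;
}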
I think we should also improve the C implementation (I was planning to encourage you to upstream it).
Sure, I will then.
I don't have your full benchmark results, but I think the general conclusion is that the SSE implementation is equally good or much faster for n >= 64. I will need to improve the n < 64 case.
Here are some results from my machine (x86_64, showing how the SSE implementation compares to yours):
- 64MB aligned block - 1.2 * faster
- 64MB unaligned - 1.3 * faster
- 1MB aligned - 2 * faster
- 1MB unaligned - 5 * faster
- 32 bytes aligned - 2 * slower
- 32 bytes unaligned - 2.3 * slower
- 9 bytes - 1.3 * slower
Thanks,
Piotr
The SSE2 version is definitely still better in a lot of cases, especially for large sizes.
For those cases, I'm thinking the ERMS ("rep stosb") approach is probably the most future-proof.
Its implementation is very simple, and it provides the best possible performance for large sizes, at least on recent CPUs, usually faster than SSE2, and possibly better for the CPU cache (I believe).
It also looks like Intel and AMD intend to keep improving rep movsb/stosb performance and make it the preferred way to copy or clear memory, over any vectorized instruction implementation, so it could even end up being best for small sizes.
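For reference, the ERMS approach really is just a single instruction plus argument setup; a minimal sketch, assuming GCC-style inline assembly on x86 (memset_erms is an illustrative name, not the actual patch):

#include <stddef.h>

/* Sketch only: let the CPU's fast-string ("ERMS") microcode pick the
 * store strategy. "rep stosb" stores AL to [RDI], RCX times. */
static void *memset_erms(void *dst, int c, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n) /* RDI = dest, RCX = count */
                      : "a" (c)            /* AL  = fill byte */
                      : "memory");
    return dst;
}

The CPU then chooses the store strategy itself, which is what makes this future-proof across microarchitectures.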
I'm attaching the results I have for all the versions, including an AVX implementation that I had (although I've done it with Intel intrinsics instead of assembly).
(I actually modified the unrolled C version a bit for these results, as reversing the order of the assignments in the loops did seem to improve performance for the unaligned cases.)
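(For reference, an intrinsics-based AVX memset is shaped roughly like this; a rough sketch only, not the exact code, with memset_avx as an illustrative name:)

#include <immintrin.h>
#include <stddef.h>

static void *memset_avx(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    __m256i v = _mm256_set1_epi8((char)c); /* splat fill byte to 32 lanes */

    while (n >= 32)
    {
        _mm256_storeu_si256((__m256i *)d, v); /* unaligned 32-byte store */
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c;      /* scalar tail */
    return dst;
}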
Cheers,
Forgot to mention it, but I think this is an interesting read on that topic:
https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-rout...