On 9/11/21 8:30 PM, Rémi Bernon wrote:
On 9/11/21 7:38 PM, Piotr Caban wrote:
On 9/11/21 4:41 PM, Rémi Bernon wrote:
On 9/11/21 8:51 AM, Piotr Caban wrote:
Signed-off-by: Piotr Caban <piotr@codeweavers.com>
dlls/msvcrt/string.c | 126 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 126 insertions(+)
FWIW as far as I can see on my simple throughput benchmarks, and with the default optimization flags (-O2), the unrolled C version:
- Outperforms the SSE2 assembly on x86_64 for n <= 32 (20GB/s vs 12GB/s for n = 32), and performs equally well for "aligned" operations on larger sizes.
- Runs at roughly a third of the SSE2 throughput (25GB/s vs 70GB/s on my computer) for unaligned operations like memset(dst + 1, src, n) with n >= 256.
- On i686, performs equally for small sizes (n <= 128), then at half the throughput (35GB/s vs 70GB/s) for aligned operations and a third for unaligned ones.
It still has the advantage of being C code, benefiting all architectures.
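For illustration, here is a minimal sketch of that unrolled-C idea (not the actual patch; memset_unrolled is a hypothetical name, and the memcpy calls are just a portable way to emit unaligned 64-bit stores, which compilers turn into plain movs at -O2):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *memset_unrolled(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    uint64_t v = (unsigned char)c * 0x0101010101010101ull; /* splat byte to 64 bits */

    while (n >= 32) /* four 8-byte stores per iteration */
    {
        memcpy(d,      &v, 8);
        memcpy(d + 8,  &v, 8);
        memcpy(d + 16, &v, 8);
        memcpy(d + 24, &v, 8);
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c; /* scalar tail */
    return dst;
}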
I think we should also improve the C implementation (I was planning to encourage you to upstream it).
Sure, I will then.
I don't have your full benchmark results, but I think the general conclusion is that the SSE implementation is equally good or much faster for n >= 64. I will need to improve the n < 64 case.
Here are some results from my machine (x86_64, showing how the SSE implementation compares to yours):
- 64MB aligned block - 1.2 * faster
- 64MB unaligned - 1.3 * faster
- 1MB aligned - 2 * faster
- 1MB unaligned - 5 * faster
- 32 bytes aligned - 2 * slower
- 32 bytes unaligned - 2.3 * slower
- 9 bytes - 1.3 * slower
Thanks,
Piotr
The SSE2 version is definitely still better in a lot of cases, especially for large sizes.
For those cases, I'm thinking the ERMS ("rep stosb") approach is probably the most future-proof.
Its implementation is very simple, and it provides the best possible performance for large sizes, at least on recent CPUs, usually faster than SSE2, and possibly better for the CPU cache (I believe).
It also looks like Intel and AMD intend to keep improving rep movsb/stosb performance and make it the preferred way to copy or clear memory, over any vectorized instruction implementation, so it could even end up being best for small sizes.
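For reference, the ERMS approach really is just a single instruction plus argument setup; a minimal sketch, assuming GCC-style inline assembly on x86 (memset_erms is an illustrative name, not the actual patch):

#include <stddef.h>

/* Sketch only: let the CPU's fast-string ("ERMS") microcode pick the
 * store strategy. "rep stosb" stores AL to [RDI], RCX times. */
static void *memset_erms(void *dst, int c, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n) /* RDI = dest, RCX = count */
                      : "a" (c)            /* AL  = fill byte */
                      : "memory");
    return dst;
}

The CPU then chooses the store strategy itself, which is what makes this future-proof across microarchitectures.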
I'm attaching the results I have for all the versions, including an AVX implementation that I had (although I've done it with Intel intrinsics instead of assembly).
(I actually modified the unrolled C version a bit for these results, as reversing the order of the assignments in the loops did seem to improve performance for the unaligned cases.)
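(For reference, an intrinsics-based AVX memset is shaped roughly like this; a rough sketch only, not the exact code, with memset_avx as an illustrative name:)

#include <immintrin.h>
#include <stddef.h>

static void *memset_avx(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    __m256i v = _mm256_set1_epi8((char)c); /* splat fill byte to 32 lanes */

    while (n >= 32)
    {
        _mm256_storeu_si256((__m256i *)d, v); /* unaligned 32-byte store */
        d += 32;
        n -= 32;
    }
    while (n--) *d++ = (unsigned char)c;      /* scalar tail */
    return dst;
}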
Cheers,
Forgot to mention it, but I think this is an interesting read on that topic:
https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-rout...