On 9/13/21 4:50 PM, Piotr Caban wrote:
Hi Rémi,
I think you're undervaluing the SSE2 code path. While ERMS was introduced on Intel CPUs quite a long time ago, it's a fairly new thing on AMD CPUs (as far as I understand, the first AMD CPU to set the cpuid flag was released in mid-2019).
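(As context for readers: the erms_supported flag discussed here would come from CPUID leaf 7, where EBX bit 9 advertises Enhanced REP MOVSB/STOSB. A minimal detection sketch, assuming GCC/clang's <cpuid.h> helper; the function name is illustrative, not from the patch:

#include <cpuid.h>
#include <stdbool.h>

/* Illustrative sketch: the ERMS feature bit is CPUID.(EAX=7,ECX=0):EBX
 * bit 9. detect_erms is a hypothetical name, not from the patch. */
static bool detect_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;  /* CPUID leaf 7 not available */
    return (ebx >> 9) & 1;
}
)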
Okay, I admit I don't know precisely which CPU generations are covered. But even in that case, I'm not sure it's worth introducing an SSE2 code path.
Although the non-vectorized code is two to three times slower than what SSE2 could do, it's still 25 times faster than the current code, which IMHO is good enough for most CPUs, and it doesn't require any CPU-specific instructions.
I'm also sure SSE2 (and ERMS as well) has plenty of quirks and performance variations across CPU models, and I feel it can be very tricky, and of dubious value, to try to fine-tune for them.
Still, I'm only arguing because I felt it was possible to write a good enough implementation in plain C. I don't mind very much in the end.
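(For illustration only, not the actual patch: a minimal sketch of such a plain-C wide-store fill, assuming d is 8-byte aligned and n is a non-zero multiple of 32. The memcpy calls compile down to single 8-byte stores:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, not the patch under review: fill n bytes at d
 * with byte c using four 8-byte stores per iteration. Plain C, no
 * SSE2 or ERMS required; assumes d is 8-byte aligned and n is a
 * non-zero multiple of 32. */
static void fill_wide_c(unsigned char *d, int c, size_t n)
{
    uint64_t v = 0x0101010101010101ull * (unsigned char)c;
    unsigned char *end = d + n;
    while (d < end)
    {
        memcpy(d +  0, &v, 8);
        memcpy(d +  8, &v, 8);
        memcpy(d + 16, &v, 8);
        memcpy(d + 24, &v, 8);
        d += 32;
    }
}
)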
On 9/13/21 2:23 PM, Rémi Bernon wrote:
+#ifdef __i386__
+    if (n < 2048 && sse2_supported)

if ((n < 2048 && sse2_supported) || !erms_supported)

+#else
+    if (n < 2048)

if (n < 2048 || !erms_supported)

+#endif
+    {
+        __asm__ __volatile__ (
+            "movd %1, %%xmm0\n\t"
+            "pshufd $0, %%xmm0, %%xmm0\n\t"
+            "test $0x20, %2\n\t"
+            "je 1f\n\t"
+            "sub $0x20, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "je 2f\n\t"
+            "1:\n\t"
+            "sub $0x40, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
+            "ja 1b\n\t"
+            "2:\n\t"
+            :
+            : "r"(d), "r"((uint32_t)v), "c"(n)
+            : "memory"
+        );
Shouldn't xmm0 be added to the clobbered registers list?
I guess so, yes; and "cc" for the flags too, I suppose.
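(For reference, the amended constraint section would then presumably read as follows; only the clobber list changes, the asm body stays as quoted above:

            :
            : "r"(d), "r"((uint32_t)v), "c"(n)
            : "memory", "cc", "xmm0"
)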