For n larger than 16 we store 16 bytes at each end of the buffer, possibly overlapping, and then 16 additional bytes at each end for n > 32.
Then we can find a 32-byte aligned range overlapping the remaining part of the destination buffer, which is filled 32 bytes at a time in a loop.
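To make the overlap trick concrete (a minimal illustrative sketch, not the patch itself): for any 16 <= n <= 32, two pairs of 8-byte stores cover the whole buffer with no per-size branching. For n = 20, for instance, the head pair covers bytes [0, 16) and the tail pair covers bytes [4, 20), overlapping in the middle.

    /* Sketch only: fill any 16 <= n <= 32 bytes with possibly overlapping
     * head and tail stores, v being the fill byte repeated 8 times. */
    static void set_16_to_32(unsigned char *d, uint64_t v, size_t n)
    {
        *(uint64_t *)(d + 0) = v;      /* head: bytes [0, 8)      */
        *(uint64_t *)(d + 8) = v;      /* head: bytes [8, 16)     */
        *(uint64_t *)(d + n - 16) = v; /* tail: bytes [n-16, n-8) */
        *(uint64_t *)(d + n - 8) = v;  /* tail: bytes [n-8, n)    */
    }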
Signed-off-by: Rémi Bernon <rbernon@codeweavers.com>
---
So, this is what I was thinking instead of having fully specialized assembly versions.
Overall I believe the performance should be better than SSE2 for very small sizes and very large sizes (when ERMS kicks in), but a bit worse for 128 <= n <= 1024.
I'm also not sure the last patch is really useful: it only improves performance for the intermediate sizes, and I would think the ERMS path could cover those instead, once ERMS is good enough on most CPUs for that size range.
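These size ranges can be sanity-checked with a trivial harness along the following lines (hypothetical code, not how the numbers in this thread were measured; it times repeated memset calls at a fixed size, with an asm barrier so the stores aren't optimized away):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static unsigned char buf[1 << 20];

    int main(void)
    {
        const size_t sizes[] = { 8, 64, 128, 1024, 4096, 1 << 20 };
        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int j = 0; j < (1 << 16); j++)
            {
                memset(buf, 0xcd, sizes[i]);
                /* compiler barrier: keep the stores alive */
                __asm__ __volatile__ ("" : : "r"(buf) : "memory");
            }
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("n=%zu: %.6f s\n", sizes[i],
                   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
        }
        return 0;
    }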
 dlls/msvcrt/string.c | 60 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 56 insertions(+), 4 deletions(-)
diff --git a/dlls/msvcrt/string.c b/dlls/msvcrt/string.c
index 4d09405094d..3a7312572ab 100644
--- a/dlls/msvcrt/string.c
+++ b/dlls/msvcrt/string.c
@@ -2855,13 +2855,65 @@ void * __cdecl memcpy(void *dst, const void *src, size_t n)
     return memmove(dst, src, n);
 }
+static void memset_aligned_32(unsigned char *d, uint64_t v, size_t n)
+{
+    while (n >= 32)
+    {
+        *(uint64_t*)(d + n - 32) = v;
+        *(uint64_t*)(d + n - 24) = v;
+        *(uint64_t*)(d + n - 16) = v;
+        *(uint64_t*)(d + n - 8) = v;
+        n -= 32;
+    }
+}
+
 /*********************************************************************
  *                  memset (MSVCRT.@)
  */
-void* __cdecl memset(void *dst, int c, size_t n)
-{
-    volatile unsigned char *d = dst;  /* avoid gcc optimizations */
-    while (n--) *d++ = c;
+void *__cdecl memset(void *dst, int c, size_t n)
+{
+    uint64_t v = 0x101010101010101ull * (unsigned char)c;
+    unsigned char *d = (unsigned char *)dst;
+    size_t a = 0x20 - ((uintptr_t)d & 0x1f);
+
+    if (n >= 16)
+    {
+        *(uint64_t *)(d + 0) = v;
+        *(uint64_t *)(d + 8) = v;
+        *(uint64_t *)(d + n - 16) = v;
+        *(uint64_t *)(d + n - 8) = v;
+        if (n <= 32) return dst;
+        *(uint64_t *)(d + 16) = v;
+        *(uint64_t *)(d + 24) = v;
+        *(uint64_t *)(d + n - 32) = v;
+        *(uint64_t *)(d + n - 24) = v;
+        if (n <= 64) return dst;
+        memset_aligned_32(d + a, v, (n - a) & ~0x1f);
+        return dst;
+    }
+    if (n >= 8)
+    {
+        *(uint64_t *)d = v;
+        *(uint64_t *)(d + n - 8) = v;
+        return dst;
+    }
+    if (n >= 4)
+    {
+        *(uint32_t *)d = v;
+        *(uint32_t *)(d + n - 4) = v;
+        return dst;
+    }
+    if (n >= 2)
+    {
+        *(uint16_t *)d = v;
+        *(uint16_t *)(d + n - 2) = v;
+        return dst;
+    }
+    if (n >= 1)
+    {
+        *(uint8_t *)d = v;
+        return dst;
+    }
     return dst;
 }
Signed-off-by: Rémi Bernon <rbernon@codeweavers.com>
---
 include/msvcrt/intrin.h | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/include/msvcrt/intrin.h b/include/msvcrt/intrin.h
index 8b84929bc02..bc8a7e20ff7 100644
--- a/include/msvcrt/intrin.h
+++ b/include/msvcrt/intrin.h
@@ -24,6 +24,10 @@ static inline void __cpuid(int info[4], int ax)
 {
     return __cpuidex(info, ax, 0);
 }
+static inline void __stosb(unsigned char* dst, unsigned char c, size_t n)
+{
+    __asm__ __volatile__ ("cld; rep; stosb" : "=D"(dst) : "a"(c), "D"(dst), "c"(n) : "memory", "cc");
+}
 #endif
#ifdef __aarch64__
Hi Rémi,
On 9/13/21 2:23 PM, Rémi Bernon wrote:
> +static inline void __stosb(unsigned char* dst, unsigned char c, size_t n)
> +{
> +    __asm__ __volatile__ ("cld; rep; stosb" : "=D"(dst) : "a"(c), "D"(dst), "c"(n) : "memory", "cc");
> +}
I don't know if it's important here, but Microsoft's i386 cdecl ABI specifies the direction flag value on function call. Maybe if __cdecl is added, the cld instruction can be removed.
Thanks, Piotr
On 9/13/21 4:51 PM, Piotr Caban wrote:
> I don't know if it's important here, but Microsoft's i386 cdecl ABI
> specifies the direction flag value on function call. Maybe if __cdecl
> is added, the cld instruction can be removed.
One more thing - shouldn't %ecx also be added as an output, so gcc knows that it has changed?
On 9/13/21 4:51 PM, Piotr Caban wrote:
> I don't know if it's important here, but Microsoft's i386 cdecl ABI
> specifies the direction flag value on function call. Maybe if __cdecl
> is added, the cld instruction can be removed.
All the ABIs apparently require it to be cleared before a function call, or am I missing something? So it looks like it's not needed anywhere and I was just being overcautious.
On 9/13/21 10:25 AM, Rémi Bernon wrote:
> All the ABIs apparently require it to be cleared before a function
> call, or am I missing something? So it looks like it's not needed
> anywhere and I was just being overcautious.
Well, ABIs do, but you're not defining that as an asm function; you're using inline assembly. So you can't guarantee anything.
On 9/13/21 6:42 PM, Zebediah Figura wrote:
> Well, ABIs do, but you're not defining that as an asm function; you're
> using inline assembly. So you can't guarantee anything.
But it's wrapped in a function, so doesn't it inherit whatever its calling convention ABI implies?
On 9/13/21 11:53 AM, Rémi Bernon wrote:
> But it's wrapped in a function, so doesn't it inherit whatever its
> calling convention ABI implies?
No, not really. The compiler is free to insert whatever assembly it wants before and after the __asm__ block, as long as it satisfies the constraints.
Not only that, but because it's a static function, the compiler is also free not to give it a standard calling convention at all.
On 9/13/21 7:00 PM, Zebediah Figura wrote:
> No, not really. The compiler is free to insert whatever assembly it
> wants before and after the __asm__ block, as long as it satisfies the
> constraints.
>
> Not only that, but because it's a static function, the compiler is
> also free not to give it a standard calling convention at all.
Well, anyway, MSVC doesn't generate cld with this intrinsic, so I think we should not either.
On 9/13/21 12:01 PM, Rémi Bernon wrote:
> Well, anyway, MSVC doesn't generate cld with this intrinsic, so I
> think we should not either.
I don't see why that means anything. At best, that just means MSVC is checking whether the direction flag was already clear, and not clearing it again. In theory, GCC could do that too, but I don't see any clear way to make the value of DF an input constraint.
On 9/13/21 7:08 PM, Zebediah Figura wrote:
> I don't see why that means anything. At best, that just means MSVC is
> checking whether the direction flag was already clear, and not
> clearing it again. In theory, GCC could do that too, but I don't see
> any clear way to make the value of DF an input constraint.
I don't think it's doing that, and there's also probably no point.
These intrinsics are meant to generate assembly instructions, so you could very well combine them by setting the direction flag beforehand, for instance with __writeeflags, to effectively reverse a later __stosb or __movsb.
Then again, although that's what it does, it's not documented and maybe isn't very safe.
Probably I should just add the "cld; rep; stosb" inline instead.
On 9/13/21 7:56 PM, Rémi Bernon wrote:
> Probably I should just add the "cld; rep; stosb" inline instead.
Yes. This will also roughly match what LLVM does (they force-inline memset in this case).
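As a sketch of what that inlining could look like (hypothetical helper name, not the final committed form; it also declares %ecx as an output, per the earlier review remark, since rep stosb modifies both %edi and %ecx):

    /* Sketch: "cld; rep; stosb" kept inside msvcrt itself rather than
     * exposed through the __stosb() intrinsic; cld stays because inline
     * asm cannot assume the state of the direction flag. */
    static inline void memset_stosb(unsigned char *d, unsigned char c, size_t n)
    {
        __asm__ __volatile__ ("cld; rep; stosb"
                              : "=D"(d), "=c"(n)       /* both registers are modified */
                              : "a"(c), "0"(d), "1"(n)
                              : "memory", "cc");
    }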
Signed-off-by: Rémi Bernon <rbernon@codeweavers.com>
---
 dlls/msvcrt/math.c   | 16 ++++++++++++++++
 dlls/msvcrt/msvcrt.h |  1 +
 dlls/msvcrt/string.c |  5 +++++
 3 files changed, 22 insertions(+)
diff --git a/dlls/msvcrt/math.c b/dlls/msvcrt/math.c
index 7f59a4d20d4..6639bb5ee23 100644
--- a/dlls/msvcrt/math.c
+++ b/dlls/msvcrt/math.c
@@ -43,6 +43,7 @@
 #include <limits.h>
 #include <locale.h>
 #include <math.h>
+#include <intrin.h>
 
 #include "msvcrt.h"
 #include "winternl.h"
@@ -64,11 +65,26 @@ typedef int (CDECL *MSVCRT_matherr_func)(struct _exception *);
 
 static MSVCRT_matherr_func MSVCRT_default_matherr_func = NULL;
 
+BOOL erms_supported;
 BOOL sse2_supported;
 static BOOL sse2_enabled;
 
 void msvcrt_init_math( void *module )
 {
+#if defined(__i386__) || defined(__x86_64__)
+    int regs[4];
+
+    __cpuid(regs, 0);
+    if (regs[0] < 7) erms_supported = FALSE;
+    else
+    {
+        __cpuidex(regs, 7, 0);
+        erms_supported = ((regs[1] >> 9) & 1);
+    }
+#else
+    erms_supported = FALSE;
+#endif
+
     sse2_supported = IsProcessorFeaturePresent( PF_XMMI64_INSTRUCTIONS_AVAILABLE );
 #if _MSVCR_VER <=71
     sse2_enabled = FALSE;
diff --git a/dlls/msvcrt/msvcrt.h b/dlls/msvcrt/msvcrt.h
index 60f8c2f5ef2..022eced35d9 100644
--- a/dlls/msvcrt/msvcrt.h
+++ b/dlls/msvcrt/msvcrt.h
@@ -33,6 +33,7 @@
 #undef strncpy
 #undef wcsncpy
 
+extern BOOL erms_supported DECLSPEC_HIDDEN;
 extern BOOL sse2_supported DECLSPEC_HIDDEN;
 
 #define DBL80_MAX_10_EXP 4932
diff --git a/dlls/msvcrt/string.c b/dlls/msvcrt/string.c
index 3a7312572ab..d09b44fbcd6 100644
--- a/dlls/msvcrt/string.c
+++ b/dlls/msvcrt/string.c
@@ -27,6 +27,7 @@
 #include <math.h>
 #include <limits.h>
 #include <locale.h>
+#include <intrin.h>
 #include <float.h>
 #include "msvcrt.h"
 #include "bnum.h"
@@ -2857,6 +2858,10 @@ void * __cdecl memcpy(void *dst, const void *src, size_t n)
 
 static void memset_aligned_32(unsigned char *d, uint64_t v, size_t n)
 {
+#if defined(__i386__) || defined(__x86_64__)
+    if (n >= 2048 && erms_supported) __stosb(d, v, n);
+    else
+#endif
     while (n >= 32)
     {
         *(uint64_t*)(d + n - 32) = v;
On 9/13/21 2:23 PM, Rémi Bernon wrote:
>  void msvcrt_init_math( void *module )
>  {
> +#if defined(__i386__) || defined(__x86_64__)
> +    int regs[4];
> +
> +    __cpuid(regs, 0);
> +    if (regs[0] < 7) erms_supported = FALSE;
> +    else
> +    {
> +        __cpuidex(regs, 7, 0);
> +        erms_supported = ((regs[1] >> 9) & 1);
> +    }
> +#else
> +    erms_supported = FALSE;
> +#endif
erms_supported is zero-initialized anyway (memset is even called before msvcrt_init_math runs), so it can be simplified to:

#if defined(__i386__) || defined(__x86_64__)
    int regs[4];

    __cpuid(regs, 0);
    if (regs[0] >= 7)
    {
        __cpuid(regs, 7);
        erms_supported = ((regs[1] >> 9) & 1);
    }
#endif
There's one more thing worth mentioning: in ntdll we check whether cpuid is available, while on the other hand it's used without the check in wineboot. I guess it's OK to use cpuid without the check.
>  static void memset_aligned_32(unsigned char *d, uint64_t v, size_t n)
>  {
> +#if defined(__i386__) || defined(__x86_64__)
> +    if (n >= 2048 && erms_supported) __stosb(d, v, n);
> +    else
> +#endif

How about changing the code in a way that introduces no weird indentation:

    if (n >= 2048 && erms_supported)
    {
        __stosb(d, v, n);
        return;
    }
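For clarity, the resulting helper would then look roughly like this (a sketch combining this suggestion with the patch above, assuming the __stosb() and erms_supported pieces from the earlier patches):

    static void memset_aligned_32(unsigned char *d, uint64_t v, size_t n)
    {
    #if defined(__i386__) || defined(__x86_64__)
        /* large fills: defer to rep stosb when ERMS is available */
        if (n >= 2048 && erms_supported)
        {
            __stosb(d, v, n);
            return;
        }
    #endif
        /* plain C fallback: 32 bytes per iteration, back to front */
        while (n >= 32)
        {
            *(uint64_t*)(d + n - 32) = v;
            *(uint64_t*)(d + n - 24) = v;
            *(uint64_t*)(d + n - 16) = v;
            *(uint64_t*)(d + n - 8) = v;
            n -= 32;
        }
    }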
Thanks, Piotr
For intermediate sizes.
Signed-off-by: Rémi Bernon <rbernon@codeweavers.com>
---
 dlls/msvcrt/string.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/dlls/msvcrt/string.c b/dlls/msvcrt/string.c
index d09b44fbcd6..6e9fb8d119d 100644
--- a/dlls/msvcrt/string.c
+++ b/dlls/msvcrt/string.c
@@ -2859,7 +2859,35 @@ void * __cdecl memcpy(void *dst, const void *src, size_t n)
 static void memset_aligned_32(unsigned char *d, uint64_t v, size_t n)
 {
 #if defined(__i386__) || defined(__x86_64__)
-    if (n >= 2048 && erms_supported) __stosb(d, v, n);
+#ifdef __i386__
+    if (n < 2048 && sse2_supported)
+#else
+    if (n < 2048)
+#endif
+    {
+        __asm__ __volatile__ (
+            "movd %1, %%xmm0\n\t"
+            "pshufd $0, %%xmm0, %%xmm0\n\t"
+            "test $0x20, %2\n\t"
+            "je 1f\n\t"
+            "sub $0x20, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "je 2f\n\t"
+            "1:\n\t"
+            "sub $0x40, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
+            "ja 1b\n\t"
+            "2:\n\t"
+            :
+            : "r"(d), "r"((uint32_t)v), "c"(n)
+            : "memory"
+        );
+    }
+    else if (erms_supported) __stosb(d, v, n);
     else
 #endif
     while (n >= 32)
Hi Rémi,
I think you're undervaluing the SSE2 code path. While ERMS was introduced on Intel CPUs quite long ago, it's a fairly new thing on AMD CPUs (as far as I understand, the first AMD CPU to set the cpuid flag was released in mid-2019).
On 9/13/21 2:23 PM, Rémi Bernon wrote:
> +#ifdef __i386__
> +    if (n < 2048 && sse2_supported)

    if ((n < 2048 && sse2_supported) || !erms_supported)

> +#else
> +    if (n < 2048)

    if (n < 2048 || !erms_supported)

> +#endif
> +    {
> +        __asm__ __volatile__ (
> +            "movd %1, %%xmm0\n\t"
> +            "pshufd $0, %%xmm0, %%xmm0\n\t"
> +            "test $0x20, %2\n\t"
> +            "je 1f\n\t"
> +            "sub $0x20, %2\n\t"
> +            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
> +            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
> +            "je 2f\n\t"
> +            "1:\n\t"
> +            "sub $0x40, %2\n\t"
> +            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
> +            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
> +            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
> +            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
> +            "ja 1b\n\t"
> +            "2:\n\t"
> +            :
> +            : "r"(d), "r"((uint32_t)v), "c"(n)
> +            : "memory"
> +        );
Shouldn't xmm0 be added to the clobbered registers list?
Thanks, Piotr
On 9/13/21 4:50 PM, Piotr Caban wrote:
> I think you're undervaluing the SSE2 code path. While ERMS was
> introduced on Intel CPUs quite long ago, it's a fairly new thing on
> AMD CPUs (as far as I understand, the first AMD CPU to set the cpuid
> flag was released in mid-2019).
Okay, I admit I don't know precisely which CPU eras are covered. But even in that case, I'm not sure it's worth introducing an SSE2 code path.
Although the non-vectorized code is two or three times slower than what SSE2 could do, it's still 25 times faster than the current code, which IMHO is good enough for most CPUs, and doesn't need specific instructions.
I'm also sure SSE2 (and ERMS as well) have a lot of quirks and performance variations across CPU models, and I feel it can be very tricky and somewhat pointless to try to optimize them finely.
Yet, I'm only arguing because I felt it was possible to write a good enough implementation in C. I don't mind very much in the end.
> Shouldn't xmm0 be added to the clobbered registers list?
I guess yes, and "cc" for the flags too, I suppose.
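Put together, the constraint lists would then presumably look like this (a sketch; it also switches to named operands so the count can be declared as an in/out operand, since the asm modifies it with sub, a point beyond what was discussed here):

    /* Sketch: same SSE2 loop with "xmm0" and "cc" added to the clobber
     * list, and the count register declared as an in/out operand ("+c")
     * because the asm decrements it. */
    __asm__ __volatile__ (
        "movd %[v], %%xmm0\n\t"
        "pshufd $0, %%xmm0, %%xmm0\n\t"
        "test $0x20, %[n]\n\t"
        "je 1f\n\t"
        "sub $0x20, %[n]\n\t"
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "je 2f\n\t"
        "1:\n\t"
        "sub $0x40, %[n]\n\t"
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x20(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x30(%[d],%[n])\n\t"
        "ja 1b\n\t"
        "2:\n\t"
        : [n] "+c"(n)
        : [d] "r"(d), [v] "r"((uint32_t)v)
        : "memory", "cc", "xmm0"
    );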