Hi Rémi,
I think you're undervaluing the SSE2 codepath. While ERMS was introduced on Intel CPUs quite long ago, it's a fairly new thing on AMD CPUs (as far as I understand, the first AMD CPU to set the cpuid flag was released in mid-2019).
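For reference, a minimal sketch of how the flag can be read, assuming GCC or Clang and their <cpuid.h> helper (the function name cpu_has_erms is hypothetical, and this is not necessarily how the patch detects it). ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9:

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header, assumed available */
#endif

/* Hypothetical helper: returns 1 if the CPU reports ERMS, 0 otherwise. */
static int cpu_has_erms(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 9) & 1;  /* CPUID.(EAX=7,ECX=0):EBX bit 9 = ERMS */
#else
    return 0;  /* not an x86 target, so there is no ERMS flag to read */
#endif
}
```

On a CPU that doesn't set this bit, the check returns 0 and the SSE2 loop would be the only fast path left.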
On 9/13/21 2:23 PM, Rémi Bernon wrote:
+#ifdef __i386__
+    if (n < 2048 && sse2_supported)
if ((n < 2048 && sse2_supported) || !erms_supported)
+#else
+    if (n < 2048)
if (n < 2048 || !erms_supported)
+#endif
+    {
+        __asm__ __volatile__ (
+            "movd %1, %%xmm0\n\t"
+            "pshufd $0, %%xmm0, %%xmm0\n\t"
+            "test $0x20, %2\n\t"
+            "je 1f\n\t"
+            "sub $0x20, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "je 2f\n\t"
+            "1:\n\t"
+            "sub $0x40, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
+            "ja 1b\n\t"
+            "2:\n\t"
+            :: "r"(d), "r"((uint32_t)v), "c"(n)
+            : "memory");
Shouldn't xmm0 be added to the clobbered registers list?
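Something along these lines is what I have in mind; treat it as a rough sketch rather than a drop-in replacement. The function name sse2_fill32 is made up, and the sketch assumes d is 16-byte aligned and n is a non-zero multiple of 32. Besides listing "xmm0" as clobbered, it uses named operands and marks n as read-write ("+c"), since the loop decrements it; that last part is my own addition. The non-x86 branch is only there so the sketch builds everywhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fills n bytes at d with the 32-bit pattern v.
 * Assumes d is 16-byte aligned and n is a non-zero multiple of 32. */
static void sse2_fill32(unsigned char *d, uint32_t v, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ (
        "movd %[v], %%xmm0\n\t"            /* load the 32-bit pattern */
        "pshufd $0, %%xmm0, %%xmm0\n\t"    /* broadcast it to all 4 lanes */
        "test $0x20, %[n]\n\t"             /* odd number of 32-byte blocks? */
        "je 1f\n\t"
        "sub $0x20, %[n]\n\t"              /* store the trailing 32 bytes */
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "je 2f\n\t"                        /* flags still come from the sub */
        "1:\n\t"
        "sub $0x40, %[n]\n\t"              /* then 64 bytes per iteration */
        "movdqa %%xmm0, 0x00(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x10(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x20(%[d],%[n])\n\t"
        "movdqa %%xmm0, 0x30(%[d],%[n])\n\t"
        "ja 1b\n\t"
        "2:\n\t"
        : [n] "+c"(n)                      /* n is modified by the loop */
        : [d] "r"(d), [v] "r"(v)
        : "xmm0", "memory");               /* xmm0 is overwritten */
#else
    /* Portable fallback so the sketch still builds on non-x86 targets. */
    for (size_t i = 0; i < n; i += 4)
        memcpy(d + i, &v, 4);
#endif
}
```

Without the "xmm0" clobber the compiler is free to keep a live value in xmm0 across the asm statement, which the movd/pshufd pair would then silently overwrite.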
Thanks, Piotr