On 9/13/21 4:50 PM, Piotr Caban wrote:
Hi Rémi,
I think you're undervaluing the SSE2 code path. While ERMS was introduced on Intel CPUs quite a long time ago, it's a fairly new thing on AMD CPUs (as far as I understand, the first AMD CPU to set the cpuid flag was released in mid-2019).
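(As context for readers: the erms_supported flag discussed here would come from CPUID leaf 7, where EBX bit 9 advertises Enhanced REP MOVSB/STOSB. A minimal detection sketch, assuming GCC/clang's <cpuid.h> helper; the function name is illustrative, not from the patch:

#include <cpuid.h>
#include <stdbool.h>

/* Illustrative sketch: the ERMS feature bit is CPUID.(EAX=7,ECX=0):EBX
 * bit 9. detect_erms is a hypothetical name, not from the patch. */
static bool detect_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;  /* CPUID leaf 7 not available */
    return (ebx >> 9) & 1;
}
)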
Okay, I admit I don't know precisely which CPU generations are covered. But even in that case, I'm not sure it's worth introducing an SSE2 code path.
Although the non-vectorized code is two to three times slower than what SSE2 could do, it's still 25 times faster than the current code, which IMHO is good enough for most CPUs, and it doesn't require any CPU-specific instructions.
I'm also sure SSE2 (and ERMS as well) has plenty of quirks and performance variations across CPU models, and I feel it can be very tricky, and of dubious value, to try to fine-tune for them.
Still, I'm only arguing because I felt it was possible to write a good enough implementation in plain C. I don't mind very much in the end.
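(For illustration only, not the actual patch: a minimal sketch of such a plain-C wide-store fill, assuming d is 8-byte aligned and n is a non-zero multiple of 32. The memcpy calls compile down to single 8-byte stores:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, not the patch under review: fill n bytes at d
 * with byte c using four 8-byte stores per iteration. Plain C, no
 * SSE2 or ERMS required; assumes d is 8-byte aligned and n is a
 * non-zero multiple of 32. */
static void fill_wide_c(unsigned char *d, int c, size_t n)
{
    uint64_t v = 0x0101010101010101ull * (unsigned char)c;
    unsigned char *end = d + n;
    while (d < end)
    {
        memcpy(d +  0, &v, 8);
        memcpy(d +  8, &v, 8);
        memcpy(d + 16, &v, 8);
        memcpy(d + 24, &v, 8);
        d += 32;
    }
}
)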
On 9/13/21 2:23 PM, Rémi Bernon wrote:
+#ifdef __i386__
+    if (n < 2048 && sse2_supported)

if ((n < 2048 && sse2_supported) || !erms_supported)

+#else
+    if (n < 2048)

if (n < 2048 || !erms_supported)

+#endif
+    {
+        __asm__ __volatile__ (
+            "movd %1, %%xmm0\n\t"
+            "pshufd $0, %%xmm0, %%xmm0\n\t"
+            "test $0x20, %2\n\t"
+            "je 1f\n\t"
+            "sub $0x20, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "je 2f\n\t"
+            "1:\n\t"
+            "sub $0x40, %2\n\t"
+            "movdqa %%xmm0, 0x00(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x10(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x20(%0,%2)\n\t"
+            "movdqa %%xmm0, 0x30(%0,%2)\n\t"
+            "ja 1b\n\t"
+            "2:\n\t"
+            :
+            : "r"(d), "r"((uint32_t)v), "c"(n)
+            : "memory"
+        );
Shouldn't xmm0 be added to the clobbered registers list?
I guess so, yes; and "cc" for the flags too, I suppose.
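(For reference, the amended constraint section would then presumably read as follows; only the clobber list changes, the asm body stays as quoted above:

            :
            : "r"(d), "r"((uint32_t)v), "c"(n)
            : "memory", "cc", "xmm0"
)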