On 4/19/22 10:39, Rémi Bernon wrote:
On 4/13/22 15:50, Piotr Caban wrote:
Hi Jan,
On 4/6/22 19:14, Jan Sikorski wrote:
-/*********************************************************************
- * memcmp (MSVCRT.@)
- */
-int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n) +static inline int memcmp_unaligned(const void *ptr1, const void *ptr2, size_t n) { const unsigned char *p1, *p2; @@ -2690,6 +2687,64 @@ int __cdecl memcmp(const void *ptr1, const void *ptr2, size_t n) return 0; }
I think it would be good to optimize memcmp_unaligned a little. I'm thinking about something along these lines (untested): static int memcmp_size_t(size_t s1, size_t s2) { const uint8_t *p1 = (const uint8_t*)&s1, *p2 = (const uint8_t*)&s2; while (*p1 == *p2) { p1++; p2++; } return *p1 > *p2 ? 1 : -1; }
You can make the small case branchless (and slightly faster) by converting to big endian then comparing the value at once.
static int memcmp_unaligned(const char *c1, const char *c2, size_t len) { int sh1 = 8 * ((size_t)c2 % sizeof(size_t)); int sh2 = 8 * sizeof(size_t) - sh1; const size_t *s1 = (const size_t*)c1; const size_t *s2 = (const size_t*)(c2 - sh1 / 8); size_t x, y, m;
x = s2[0]; do { y = s2[1]; m = MERGE(x, sh1, y, sh2); if (*s1 != m) return memcmp_size_t(*s1, m); s1++; s2++; len--; x = y; } while (len); return 0; }
Where MERGE is already defined in string.c file, len is length in sizeof(size_t) blocks instead of bytes. It may be even better to switch to uint64_t instead of using size_t like in memset. You can also take a look on glibc platform independent implementation (it uses the MERGE trick + loop unrolling, according to some random benchmark it's on par with your implementation performance wise on i386/x86_64 and is much faster on arm).
Thanks, Piotr
FWIW for me and with a basic benchmark this version is 3x slower than Jan's version on 32bit x86, 2x slower on 64bit.
To make sure we're talking about the same thing - I've meant that glibc's (platform independent) version has similar performance. The code I have attached here was only added to show general idea while keeping the code as simple as possible.