On Tue, 14 Sep 2021, Rémi Bernon wrote:
For n larger than 16 we store 16 bytes on each end of the buffer, possibly overlapping, and then 16 additional bytes for n > 32.
Then we can find a 32-byte aligned range overlapping the remaining part of the destination buffer, which is filled 32 bytes at a time in a loop.
Signed-off-by: Rémi Bernon <rbernon@codeweavers.com>
dlls/msvcrt/string.c | 60 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 57 insertions(+), 3 deletions(-)
- volatile unsigned char *d = dst; /* avoid gcc optimizations */
- while (n--) *d++ = c;
+ uint64_t v = 0x101010101010101ull * (unsigned char)c;
+ unsigned char *d = (unsigned char *)dst;
+ size_t a = 0x20 - ((uintptr_t)d & 0x1f);
+ if (n >= 16)
+ {
+     *(uint64_t *)(d + 0) = v;
+     *(uint64_t *)(d + 8) = v;
+     *(uint64_t *)(d + n - 16) = v;
+     *(uint64_t *)(d + n - 8) = v;
FYI, this broke memset on ARM (32 bit) due to misaligned stores. ARM used to be quite alignment-picky in older versions, but since ARMv7, 32 bit register loads/stores can be unaligned. For 64 bit writes there is the STRD instruction, which does not work on unaligned addresses, yet the compiler is free to use it for stores like these.
The surprising thing about STRD is that it only requires 32 bit alignment, even though it writes 64 bits. First I tried to replace
*(uint64_t *)(d + 0) = v;
with
*(uint32_t *)(d + 0) = v; *(uint32_t *)(d + 4) = v;
hoping to use 32 bit stores (which work unaligned). However, after casting to uint32_t*, the compiler is free to assume that the resulting pointer is 32 bit aligned, and STRD only requires 32 bit alignment, so the compiler can still fuse these two stores into one single STRD.
By using
*(volatile uint32_t *)(d + 0) = v; *(volatile uint32_t *)(d + 4) = v;
the compiler emits them as two separate 32 bit stores though (which work fine with any alignment).
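As a rough illustration of the idea (not the actual patch), the four stores from the quoted hunk could be routed through a small helper. The names store64_as_2x32 and fill_ends are made up for this sketch. Strictly speaking, ISO C does not define accesses through a misaligned uint32_t pointer, so this relies on the compiler and target behaviour described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Wine patch: write the 64-bit
 * pattern v at p as two 32-bit stores, low half first. The volatile
 * qualifier is what keeps the compiler from merging them into a
 * single 64-bit store such as STRD. */
static void store64_as_2x32(unsigned char *p, uint64_t v)
{
    *(volatile uint32_t *)(p + 0) = (uint32_t)v;
    *(volatile uint32_t *)(p + 4) = (uint32_t)(v >> 32);
}

/* Fill the first and last 16 bytes of a buffer of n >= 16 bytes,
 * mirroring the four stores in the quoted hunk. For n < 32 the two
 * ends overlap, which is harmless because every byte written is c. */
static void fill_ends(unsigned char *d, size_t n, int c)
{
    uint64_t v = 0x0101010101010101ull * (unsigned char)c;

    store64_as_2x32(d + 0, v);
    store64_as_2x32(d + 8, v);
    store64_as_2x32(d + n - 16, v);
    store64_as_2x32(d + n - 8, v);
}
```

Calling fill_ends(buf + 1, 20, 0xab) on a zeroed buffer should set buf[1] through buf[20] and leave the neighbouring bytes alone, even though buf + 1 is not 32 bit aligned.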
I'll send a PoC patch that fixes things for me.
// Martin