On 8/21/20 1:51 PM, Gabriel Ivăncescu wrote:
> FWIW "rep movsb" is supposedly the fastest when transferring larger blocks (I think more than 128 bytes?) on recent CPUs. The cool thing is that the CPU handles everything, no matter the alignment or "memcpy vs memmove", so it's by far the simplest, and since it knows about the alignment requirements of that particular CPU it can optimize it internally itself.
> Same story with "rep stosb" for memset. Unfortunately these are very slow on older CPUs. I think there's a CPUID flag that says whether they are fast, we could use that.
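
For reference, the CPUID flag mentioned above is most likely ERMS (Enhanced REP MOVSB/STOSB), which is reported in CPUID leaf 7, subleaf 0, EBX bit 9. A minimal sketch of the check, assuming GCC or Clang and their <cpuid.h> header; cpu_has_erms is a hypothetical helper name used only for illustration:

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: returns true if CPUID reports ERMS
 * (leaf 7, subleaf 0, EBX bit 9). Returns false on non-x86 builds
 * or when leaf 7 is not supported. */
static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 if the requested leaf is unavailable. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 9) & 1;
#else
    return false;
#endif
}
```

Note that the flag only says the CPU advertises the feature. As the measurements below suggest, whether "rep movsb" actually wins on a given CPU still has to be benchmarked.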
"rep movsb" is ~3 times slower than "rep movsl" on my CPU (AMD Ryzen 7 2700X). Maybe the single-byte variant is better optimized on Intel CPUs.
Also, the "rep movsl" implementation is still almost 2 times slower than memmove from glibc (tested on 64MB data blocks).
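
To make the comparison concrete, here is a rough sketch of the two copy variants being compared, assuming GCC/Clang inline assembly in the default AT&T syntax. The function names are hypothetical and this is not the code that was benchmarked. Both routines copy forward only, so unlike memmove they do not handle overlapping buffers where the destination starts inside the source:

```c
#include <stddef.h>
#include <string.h>

/* Byte-wise copy with "rep movsb". Assumes the direction flag is clear,
 * as the ABI guarantees on function entry. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n); /* fallback for non-x86 builds */
#endif
}

/* Dword-wise copy with "rep movsl", followed by a byte-wise tail. */
static void copy_rep_movsl(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    size_t dwords = n / 4;
    size_t tail = n % 4;

    /* After this statement dst and src point just past the copied dwords. */
    __asm__ volatile ("rep movsl"
                      : "+D"(dst), "+S"(src), "+c"(dwords)
                      :
                      : "memory");
    /* Copy the remaining 0..3 bytes. */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(tail)
                      :
                      : "memory");
#else
    memcpy(dst, src, n); /* fallback for non-x86 builds */
#endif
}
```

Timing each of these against memmove on large buffers (for example the 64MB blocks used above) would be one way to reproduce the comparison on other CPUs.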
Thanks, Piotr