On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
> Do I understand correctly that `rep movsl` is faster than `rep movsb` even in the first test?
No, it was faster only in the "Non-aligned", "Aligned overlap" and "Non-aligned overlap" tests. In the "Aligned" case the performance was identical regardless of whether movsb or movsl was used.
I'm also attaching a simple SSE2 implementation for comparison. It's faster than the previous one on my machine. I'm also attaching the results from running the test on Windows (in a VM).
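For readers without the attachment, here is a rough sketch of what an SSE2-based memmove can look like. This is not the attached code: the name `sse2_memmove_sketch` is hypothetical, it uses unaligned loads and stores throughout, and it does no alignment tuning. It assumes an x86 target with SSE2 enabled (the default on x86-64, `-msse2` on 32-bit). It only illustrates the idea: copy 16 bytes at a time, and copy backward when the destination overlaps the tail of the source.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only, not the implementation attached to this mail. */
static void *sse2_memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d == s || n == 0) return dst;

    if ((uintptr_t)d - (uintptr_t)s >= n)
    {
        /* dst is below src, or the buffers do not overlap: copy forward */
        while (n >= 16)
        {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            _mm_storeu_si128((__m128i *)d, v);
            s += 16; d += 16; n -= 16;
        }
        while (n--) *d++ = *s++;
    }
    else
    {
        /* dst overlaps the tail of src: copy backward from the end */
        d += n; s += n;
        while (n >= 16)
        {
            __m128i v;
            s -= 16; d -= 16; n -= 16;
            v = _mm_loadu_si128((const __m128i *)s);
            _mm_storeu_si128((__m128i *)d, v);
        }
        while (n--) *--d = *--s;
    }
    return dst;
}
```

Each 16-byte block is loaded completely before it is stored, so the overlap cases stay correct as long as the copy direction is chosen as above. A tuned version would likely align the destination first and use aligned stores, but that is beyond this sketch.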
Thanks, Piotr