On 26/08/2020 17:01, Gabriel Ivăncescu wrote:
On 25/08/2020 20:15, Piotr Caban wrote:
On 8/22/20 5:10 PM, Gabriel Ivăncescu wrote:
Do I understand correctly that `rep movsl` is faster than `rep movsb` even in the first test?
No, it was faster in the "Non-aligned", "Aligned overlap" and "Non-aligned overlap" tests. In the "Aligned" case the performance was identical regardless of whether `movsb` or `movsl` was used.
I'm also attaching a simple SSE2 implementation for comparison. It's faster than the previous one on my machine. I'm also attaching results from running the test on Windows (in a VM).
Thanks, Piotr
In most cases the SSE version performs very well, in fact slightly better than the Windows implementation, and it does very well for small moves.
Unfortunately, for some reason, it seems to be significantly slower (20% or more) in the "non-overlapped" case only. Results attached.
Thanks, Gabriel
Also, sorry, I forgot to mention one small thing: is there a reason you're using movdq(a|u) instead of movaps/movups (which are also SSE1 rather than SSE2)? They have a shorter encoding, which should help the instruction cache very slightly, and no CPU cares about float vs. integer state when it is only doing moves. (Even if one did, most SSE operations tend to be on floats anyway, in case some broken CPU has a false dependency on them, but I doubt it.)
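To make the encoding point concrete, here is a minimal C sketch that tabulates the machine encodings of the four load forms with a plain register-indirect operand (`(%esi), %xmm0` in AT&T syntax). The table, the struct and the helper names (`encodings`, `encoding_len`, `print_encodings`) are hypothetical and only for illustration; they are not taken from the attached patch. Other addressing modes add displacement or SIB bytes equally to both variants, so the one-byte difference comes from the mandatory prefix alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical illustration table, not part of the patch under discussion.
 * Each entry is the encoding of "<insn> (%esi), %xmm0": an opcode
 * (with a mandatory prefix for the SSE2 forms) followed by ModRM 0x06. */
struct encoding {
    const char *mnemonic;
    unsigned char bytes[4];
    size_t len;
};

static const struct encoding encodings[] = {
    { "movaps", { 0x0f, 0x28, 0x06 },       3 }, /* SSE,  aligned   */
    { "movups", { 0x0f, 0x10, 0x06 },       3 }, /* SSE,  unaligned */
    { "movdqa", { 0x66, 0x0f, 0x6f, 0x06 }, 4 }, /* SSE2, aligned   */
    { "movdqu", { 0xf3, 0x0f, 0x6f, 0x06 }, 4 }, /* SSE2, unaligned */
};

/* Return the encoded length in bytes for a mnemonic, or 0 if unknown. */
static size_t encoding_len(const char *mnemonic)
{
    size_t i;

    assert(mnemonic != NULL);
    for (i = 0; i < sizeof(encodings) / sizeof(encodings[0]); i++)
        if (!strcmp(encodings[i].mnemonic, mnemonic))
            return encodings[i].len;
    return 0;
}

/* Print each mnemonic with its bytes and length. */
static void print_encodings(void)
{
    size_t i, j;

    for (i = 0; i < sizeof(encodings) / sizeof(encodings[0]); i++) {
        printf("%-6s", encodings[i].mnemonic);
        for (j = 0; j < encodings[i].len; j++)
            printf(" %02x", encodings[i].bytes[j]);
        printf("  (%u bytes)\n", (unsigned)encodings[i].len);
    }
}
```

Calling `print_encodings()` from a `main` lists the four forms side by side. The saving is one byte per instruction, so any effect on the instruction cache in a copy loop is likely to be small and would need to be measured.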