On 8/26/20 5:19 PM, Gabriel Ivăncescu wrote:
Also, sorry, I forgot to mention a small thing: is there a reason you're using movdq(a|u) instead of movaps/movups (which are SSE1, not SSE2)? They have a smaller encoding, which should very slightly help the instruction cache, and no CPU cares about floating-point vs. integer state when doing only moves. (Even if some broken CPU did have a false dependency on them, most SSE operations tend to be on floats anyway, but I doubt it.)
I was considering adding a "fast path" that uses movntdq for large moves. It generally speeds things up when the whole destination buffer doesn't fit into the cache. movntdq has no SSE1 equivalent, and I didn't want to mix SSE1 and SSE2 instructions. I was also planning to guard this code with the sse2_enabled variable so that it will not run on hardware without SSE2 (see the _set_SSE2_enable function).
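To make the idea concrete, here is a rough sketch of such a fast path written with SSE2 intrinsics instead of assembly. It is not the actual patch: the names copy_large_aware, copy_blocks_nt and NT_COPY_THRESHOLD are made up for illustration, the threshold value is a guess that would need tuning, the local sse2_enabled is only a stand-in for the real variable, and it handles just a forward copy of non-overlapping buffers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff; a real value would be tuned against the cache size. */
#define NT_COPY_THRESHOLD (256 * 1024)

/* Stand-in for the CRT's sse2_enabled variable (assumption for this sketch). */
static int sse2_enabled = 1;

/* Copy n bytes using non-temporal stores.
 * Requires dst to be 16-byte aligned and n to be a multiple of 16. */
static void copy_blocks_nt(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && (n & 15) == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i)); /* movdqu */
        _mm_stream_si128((__m128i *)(dst + i), v);               /* movntdq */
    }
    _mm_sfence(); /* order the weakly-ordered streaming stores before later stores */
}

/* Forward copy of non-overlapping buffers. Small copies, or any copy when
 * SSE2 is disabled, go through plain memcpy; large ones bypass the cache. */
static void *copy_large_aware(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (!sse2_enabled || n < NT_COPY_THRESHOLD)
        return memcpy(dst, src, n);

    /* Copy leading bytes until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Stream the 16-byte blocks, then finish the remaining tail bytes. */
    size_t body = n & ~(size_t)15;
    copy_blocks_nt(d, s, body);
    memcpy(d + body, s + body, n - body);
    return dst;
}
```

On 32-bit x86 this needs SSE2 enabled at compile time (e.g. -msse2 with GCC). A real memmove would also need a backward variant for overlapping buffers.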
I'm still not sure what the final implementation will look like. Maybe it will make sense to use movaps instead.
Thanks, Piotr