On 21/08/2020 20:21, Piotr Caban wrote:
> Hi Gabriel,
> I was experimenting with various attempts of implementing memmove. I'm attaching a modified version of Paul's test application. It compares memmove performance from ucrtbase, msvcr100 and msvcrt dlls. It also contains assembler (i386) implementation of the function.
> Thanks, Piotr
Hi Piotr,
Here are the results on a Haswell Xeon E3-1241 v3 CPU (all 32-bit to compare with your assembly implementation). I've also added an extra test (attached function) that simply uses `rep movsb`.
Quick summary: your assembler implementation is very good overall compared to the one from Windows 10 (ucrtbase). Among the large-copy tests, it is noticeably slower in the "aligned non-overlap" case (the first test), by about 10% (2940ms vs. 2659ms), and in the src==dst case (2813ms vs. 2345ms). In the non-aligned and overlap cases it performs just as well as ucrtbase.
The simple "rep movsb" function I added as a quick test is also faster than your assembly implementation for this case only (aligned, non-overlapped).
However, it is extremely slow in the overlap cases, where we copy backwards: about 5910ms, versus roughly 2800ms to 2900ms for every other implementation. I guess the CPU does not optimize `rep movsb` for backward copies. On your CPU, I understand `rep movsl` is faster than `rep movsb` even in the first test?
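To make the forward/backward distinction concrete, here is a rough, portable sketch in C of how a memmove can pick its copy direction. This is not the attached function: `sketch_memmove` is a made-up name, and the byte loops only stand in for `rep movsb` with the direction flag clear (forward) or set (backward).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only, not the attached rep movsb function. */
static void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Unsigned wrap-around: (d - s) >= n holds when dst lies below src,
     * or at least n bytes above it. In both situations a forward copy
     * never overwrites source bytes that have not been read yet. */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        while (n--)
            *d++ = *s++;        /* forward, like "rep movsb" with DF=0 */
    } else {
        d += n;
        s += n;
        while (n--)
            *--d = *--s;        /* backward, like "std; rep movsb; cld" */
    }
    return dst;
}
```

Note that with this particular check, src==dst (for n > 0) also takes the backward path. That would be consistent with the rep movsb test showing the same time for src==dst as for the overlap cases, but that is only a guess about how the direction is chosen there.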
One last thing worth mentioning is the "small moves" case: the older runtimes do much better here, at least in the first two runs (115ms and 196ms for msvcr100, 129ms and 197ms for msvcrt, against about 310ms for ucrtbase). I think we can handle those separately, without using movsb/movsl, which I understand need some startup time for the CPU to do alignment checks and so on before it reaches full copying bandwidth.
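One possible shape for such a separate small-size path, sketched in C under my own assumptions (the name `small_move` and the 16-byte cutoff are made up, and none of the tested runtimes necessarily work this way): read the whole block into temporaries with two possibly overlapping loads, then write it back with two stores. Because every load happens before any store, overlap needs no direction check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fast path for n <= 16. Returns 1 if the move was done,
 * 0 if n is too large and the caller should use the general path. */
static int small_move(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (n > 16)
        return 0;

    if (n >= 8) {
        uint64_t head, tail;
        memcpy(&head, s, 8);            /* first 8 bytes */
        memcpy(&tail, s + n - 8, 8);    /* last 8 bytes, may overlap head */
        memcpy(d, &head, 8);
        memcpy(d + n - 8, &tail, 8);
    } else if (n >= 4) {
        uint32_t head, tail;
        memcpy(&head, s, 4);
        memcpy(&tail, s + n - 4, 4);
        memcpy(d, &head, 4);
        memcpy(d + n - 4, &tail, 4);
    } else if (n >= 2) {
        uint16_t head, tail;
        memcpy(&head, s, 2);
        memcpy(&tail, s + n - 2, 2);
        memcpy(d, &head, 2);
        memcpy(d + n - 2, &tail, 2);
    } else if (n == 1) {
        *d = *s;
    }
    return 1;
}
```

The fixed-size `memcpy` calls into locals are just the portable way to spell unaligned loads and stores; compilers normally turn them into plain moves. In 32-bit code the 8-byte temporaries would take two registers or an SSE register, so the cutoff would need measuring rather than guessing.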
Here's the entire log:
Test ucrtbase implementation
Aligned Elapsed time 2659ms.
Non-aligned Elapsed time 3004ms.
Aligned overlap Elapsed time 2817ms.
Non-aligned overlap Elapsed time 2871ms.
src==dst Elapsed time 2345ms.
Small moves Elapsed time 310ms.
Small moves Elapsed time 313ms.
Small moves Elapsed time 308ms.
correctness test Elapsed time 2163ms.

Test msvcr100 implementation
Aligned Elapsed time 3674ms.
Non-aligned Elapsed time 2998ms.
Aligned overlap Elapsed time 2808ms.
Non-aligned overlap Elapsed time 2853ms.
src==dst Elapsed time 2397ms.
Small moves Elapsed time 115ms.
Small moves Elapsed time 196ms.
Small moves Elapsed time 328ms.
correctness test Elapsed time 2142ms.

Test msvcrt implementation
Aligned Elapsed time 3669ms.
Non-aligned Elapsed time 2967ms.
Aligned overlap Elapsed time 2829ms.
Non-aligned overlap Elapsed time 2872ms.
src==dst Elapsed time 2410ms.
Small moves Elapsed time 129ms.
Small moves Elapsed time 197ms.
Small moves Elapsed time 332ms.
correctness test Elapsed time 2168ms.

Test assembler implementation
Aligned Elapsed time 2940ms.
Non-aligned Elapsed time 2985ms.
Aligned overlap Elapsed time 2809ms.
Non-aligned overlap Elapsed time 2848ms.
src==dst Elapsed time 2813ms.
Small moves Elapsed time 271ms.
Small moves Elapsed time 491ms.
Small moves Elapsed time 292ms.
correctness test Elapsed time 2156ms.

Test rep movsb implementation
Aligned Elapsed time 2731ms.
Non-aligned Elapsed time 3042ms.
Aligned overlap Elapsed time 5910ms.
Non-aligned overlap Elapsed time 5910ms.
src==dst Elapsed time 5912ms.
Small moves Elapsed time 289ms.
Small moves Elapsed time 287ms.
Small moves Elapsed time 288ms.
correctness test Elapsed time 2181ms.

done