Wouldn't it make much more sense to simply copy the optimized copy routines from other libc implementations? They have implementations specialised for various architectures and microarchitectural details (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
Also worth noting is that Wine is licensed under the LGPL, which makes it license-compatible with most open-source libcs out there. Basically, all we would need are some ABI adaptations, such as calling convention adjustments and SEH support.
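To illustrate what I mean by a calling convention adjustment, here is a minimal sketch, assuming x86-64 and GCC/Clang. All names in it (`imported_memcpy`, `shim_memcpy`, `IMPORTED_ABI`, `EXPORTED_ABI`, `shim_selfcheck`) are hypothetical and not taken from Wine or any libc, and the byte loop only stands in for a real optimized routine. It does not cover SEH at all.

```c
#include <stddef.h>
#include <string.h>

/* ms_abi/sysv_abi are GCC/Clang attributes for x86-64 only; on any other
 * target these hypothetical macros expand to nothing. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define IMPORTED_ABI __attribute__((sysv_abi))
#define EXPORTED_ABI __attribute__((ms_abi))
#else
#define IMPORTED_ABI
#define EXPORTED_ABI
#endif

/* Hypothetical stand-in for a routine copied from another libc.
 * It uses the System V convention (arguments in rdi, rsi, rdx). */
static IMPORTED_ABI void *imported_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Hypothetical exported entry point using the Microsoft x64 convention
 * (arguments in rcx, rdx, r8). The compiler emits the register shuffling
 * needed to call the System V routine. */
EXPORTED_ABI void *shim_memcpy(void *dst, const void *src, size_t n)
{
    return imported_memcpy(dst, src, n);
}

/* Small self-check: returns 1 if the shim copied the buffer correctly. */
static int shim_selfcheck(void)
{
    char src[] = "wine";
    char dst[sizeof src] = {0};
    void *ret = shim_memcpy(dst, src, sizeof src);

    return ret == dst && memcmp(dst, src, sizeof src) == 0;
}
```

A hand-written assembly routine would need the register mapping done by hand instead, which is probably where most of the porting effort would go.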
Another option is to just call the system libc routines directly, although that might interfere with stack unwinding, the clean PE/unix separation, and msvcrt hotpatching.