I don't think this works. It should be possible to call LockRect() after the buffer has been locked; if you keep the surface locked, that won't work. I don't think we have tests for that, but that's what quick testing on Windows shows.
We could add a shortcut in MFCopyImage() first, so that it makes a single copy call when the strides match instead of one call per row. A next step could be SIMD variants with non-temporal copies, like the docs suggest. No idea how much this improves performance, but for large enough copies it's meant to bypass the cache at least, I think.
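For the shortcut, I mean something roughly like the sketch below. `copy_image` and its parameters are made-up names for illustration, not the actual MFCopyImage() code. I'm assuming the fast path should only be taken when both strides equal the row width in bytes, since with padded rows a single copy would also touch the padding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration, not the real MFCopyImage(). */
static void copy_image(unsigned char *dst, ptrdiff_t dst_stride,
                       const unsigned char *src, ptrdiff_t src_stride,
                       size_t width_bytes, size_t lines)
{
    /* Fast path (assumption): rows are tightly packed in both buffers,
     * so the whole image is one contiguous block. */
    if (dst_stride == src_stride && src_stride == (ptrdiff_t)width_bytes)
    {
        memcpy(dst, src, width_bytes * lines);
        return;
    }

    /* Generic path: copy row by row, honouring each buffer's stride. */
    while (lines--)
    {
        memcpy(dst, src, width_bytes);
        dst += dst_stride;
        src += src_stride;
    }
}
```

The per-row loop stays as the fallback, so padded or bottom-up (negative stride) layouts keep working as before.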