Would it help to return to the return address already on the PE stack?
Are we sure it's never clobbered?
I guess that moving the return address to rcx and then doing push %rcx; ret
might perform about the same as pushq 0x70(%rcx); ret.
Yes, skipping the rcx save will break existing tests.