I don't feel I have enough experience maintaining larger chunks of assembly code to be comfortable making specific suggestions, or to give a confident review / approval here. I'll stick to some general comments of uncertain value.

I'm not sure that going to those lengths to have a single asm version for both 32-bit and 64-bit x86 helps more than it hurts. The defines make it work, but the result doesn't look super nice, e.g. the first line of the asm for 64-bit ends up being `mov %ecx, %ecx`. Just splitting the two versions apart might help.

Otherwise, using compiler intrinsics could be an option. I guess it would leave some performance on the table by handing register allocation to the compiler, but maybe not that much? If I had to write it myself now, I'd probably write it in the style of https://gitlab.winehq.org/wine/wine/-/merge_requests/9588/diffs?commit_id=2a... but with some comments.

FWIW the SSE version of upsample seems to be generally doing what it's supposed to, though I haven't looked into it at the level of detail I'd like to.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/10716#note_137936