On Sat, Apr 2, 2022 at 11:09 PM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
It's not a real syscall per se; rather, it's more like a gate between the PE side (corresponding to Windows userspace) and the Unix side (Wine's pseudo kernel space, which interacts directly with the host OS). The PE/Unix separation is designed so that every interaction with the system goes through the syscall gate, just like on Windows (we're not there yet, but we'll get there eventually). This helps satisfy video game anti-cheat technologies and conceals the Unix (.so) code, which would otherwise confuse Win32 apps and debuggers tracing the execution path.
Ah. That makes sense. In this case I think Remi is correct that there's too much overhead.
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions. That's because the bottleneck isn't CPU power; the bottleneck is memory access.
It should be.
It's a margin of ~25%, versus a margin of ~500%. Unless you're moving gigabytes, it's unlikely to be noticeable.
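For concreteness, the two scalar variants in question look roughly like this (a sketch for illustration only, not code from Wine's tree; the vectorized counterpart is sketched after the next paragraph):

#include <stddef.h>
#include <stdint.h>

/* Byte-by-byte copy: the baseline. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--) *dst++ = *src++;
}

/* Dword-at-a-time copy (assumes 4-byte alignment and a size that's a
 * multiple of 4). Per the margins above, this only buys on the order
 * of 25%: both loops spend their time waiting on memory, not on how
 * much work each instruction does. */
static void copy_dwords(uint32_t *dst, const uint32_t *src, size_t count)
{
    while (count--) *dst++ = *src++;
}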
That said, another confounding issue is that a large number of small movements will have very different performance characteristics from a small number of large movements. It's possible there are cases where using, say, dwords would be much faster than trying to vectorize. I haven't found them in testing, but this is another argument for using someone else's code rather than trying to roll our own: a library dedicated to this purpose has likely done all kinds of profiling to find exactly where that threshold lies.
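To make the threshold idea concrete, a dispatch might look roughly like the following; the cutoff value is invented purely for illustration, and finding the real one is exactly the profiling work a dedicated library has already done (SSE2 is used here just as the x86-64 example):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical cutoff, purely for illustration; the right value can
 * only come from profiling on real workloads. */
#define SMALL_COPY_CUTOFF 64

static void copy_dispatch(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    if (n >= SMALL_COPY_CUTOFF)
    {
        /* Large moves: 16 bytes per iteration with unaligned SSE2
         * load/store; this is where the big (~500%) margin comes from. */
        for (; i + 16 <= n; i += 16)
            _mm_storeu_si128((__m128i *)(dst + i),
                             _mm_loadu_si128((const __m128i *)(src + i)));
    }

    /* Small moves, and the tail of large ones: plain scalar code avoids
     * the setup overhead that can make the vector path a loss here. */
    for (; i < n; i++)
        dst[i] = src[i];
}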
What you're thinking of is a SIMD abstraction library. I don't see it as strictly necessary, since we're okay with vendor-specific code blocks as long as they're justified. Note that we currently support only four architectures (IA-32, x86-64, ARM AArch32, and ARM AArch64).
Right. The reason I bring it up is that it would satisfy the portability requirement (as long as you stick to the abstraction library, you're writing regular C) and would get you close enough to the performance of real intrinsics that there should be no need for inline asm. So if we don't want to import another library, this may be the best compromise between speed and simplicity.
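As a sketch of what that buys us, here's the same 16-byte copy loop written against SIMDe's SSE2-compatible API; SIMDe is used here only as one example of such an abstraction layer, not as a concrete proposal:

#include <stddef.h>
#include <simde/x86/sse2.h>  /* SIMDe maps these calls to native SIMD (or scalar) code per target */

/* Same loop as the SSE2 version above, but written against the
 * abstraction layer: on x86 it compiles to real SSE2, on AArch64 to
 * NEON, and it stays plain, portable C everywhere. */
static void copy_abstracted(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;

    for (i = 0; i + 16 <= n; i += 16)
        simde_mm_storeu_si128((simde__m128i *)(dst + i),
                              simde_mm_loadu_si128((const simde__m128i *)(src + i)));

    for (; i < n; i++)  /* scalar tail */
        dst[i] = src[i];
}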