On Sun, Apr 3, 2022, 11:36 AM Elaine Lefler elaineclefler@gmail.com wrote:
On Sat, Apr 2, 2022 at 4:51 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
Wouldn't it make much more sense if we simply copied optimized copy
routines from other libc implementations? They have specialised implementations for various architectures and microarchitectures (e.g. cache line size), not to mention the performance enhancements that have accumulated over time.
I think this is a really good point.
Another option is to just call system libc routines directly, although
in this case it might interfere with stack unwinding, clear PE/unix separation, and msvcrt hotpatching.
Also a good idea, but the problem is that Windows dlls expect Windows calling conventions. There's no way (at least none I can immediately find) of wrapping a call to the system library without crashing.
It would of course need to move argument registers around and deal with caller/callee-saved registers; this is implied in "some ABI adaptations, such as calling convention adjustment and SEH."
On Sat, Apr 2, 2022 at 5:19 AM Rémi Bernon rbernon@codeweavers.com wrote:
Calling the system libc will need a "syscall", and will most likely defeat any performance improvement it could bring.
I don't think that works either, since these functions live in an .so and not in the kernel. Now, if it were possible, the system libraries are _significantly_ faster than anything Wine offers (even with SSE2 optimizations), so I think their raw speed would make up for any overhead.
It's not a real syscall per se; rather, it's more like a gate between the PE side (corresponding to Windows userspace) and the Unix side (Wine's pseudo-kernel space, which interacts directly with the host OS). The PE/Unix separation is designed so that every interaction with the system goes through the syscall gate, just like on Windows (we're not there yet, but we'll get there eventually). This helps satisfy video game anti-cheat technologies and conceals the Unix (.so) code, which would otherwise confuse Win32 apps and debuggers tracing the execution path.
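For illustration only, a self-contained sketch of the gate idea might look like this (made-up names, not Wine's actual unixlib interface): the PE side packs its arguments into a struct and crosses a single dispatch point, and only the Unix side ever touches the host libc.

#include <stddef.h>
#include <string.h>

/* Hypothetical argument block shared between the two sides. */
struct memcpy_params { void *dst; const void *src; size_t size; };

/* "Unix side" (.so): the only place allowed to call the host libc. */
static int unix_memcpy( void *args )
{
    struct memcpy_params *p = args;
    memcpy( p->dst, p->src, p->size );
    return 0;
}

/* A dispatch table stands in for the PE->Unix gate. */
enum { UNIX_MEMCPY, UNIX_FUNC_COUNT };
static int (*const unix_call_table[UNIX_FUNC_COUNT])( void * ) = { unix_memcpy };

/* "PE side" (Windows DLL): packs arguments and crosses the gate once. */
void *pe_memcpy( void *dst, const void *src, size_t n )
{
    struct memcpy_params params = { dst, src, n };
    unix_call_table[UNIX_MEMCPY]( &params );
    return dst;
}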
On Sat, Apr 2, 2022 at 7:24 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
As long as correctness and (any sort of) performance advantages are
preserved, no further maintenance effort would be _strictly_ necessary.
I agree with this. It's not terribly difficult to prove their correctness, and once that's done you should never need to update them. A new architecture might introduce instructions that are even more performant, but I don't think it's conceivable that vector instructions would ever become slower than non-vector ones. Doing so would cripple ~15 years of software development; nobody would buy a CPU that does that.
Here's how I see it: vector instructions were created specifically to solve this problem of operating on large regions of memory very quickly. Nearly every other program with similar requirements is either 1) Using these instructions, or 2) Relying on an external library that does so (note: that library is often msvcrt!). So I think Wine should do one of those two as well.
Also worth noting is that Wine already does this, with SSE2 memcpy and the like.
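For illustration, the core of such an SSE2 copy loop looks roughly like this (a minimal sketch, not Wine's actual implementation):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void *sse2_memcpy( void *dst, const void *src, size_t n )
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Main loop: move 16 bytes per iteration with unaligned SSE2 loads/stores. */
    while (n >= 16)
    {
        __m128i chunk = _mm_loadu_si128( (const __m128i *)s );
        _mm_storeu_si128( (__m128i *)d, chunk );
        d += 16; s += 16; n -= 16;
    }
    /* Tail: finish the remaining bytes one at a time. */
    while (n--) *d++ = *s++;
    return dst;
}

Alignment handling, prefetching, and non-temporal stores for large sizes are where real implementations squeeze out the rest of the speed; the sketch only shows the basic shape.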
On Sat, Apr 2, 2022 at 8:59 AM Piotr Caban piotr.caban@gmail.com wrote:
On 4/2/22 13:19, Rémi Bernon wrote:
(I personally believe that the efficient C implementation should come first, so that any non-supported hardware will at least benefit from it)
I also think that it would be good to add a more efficient C implementation first (it will also show whether an SSE2 implementation is really needed).
Thanks, Piotr
I can't speak definitively, because it looks a little different for every function. But, overwhelmingly, my experience has been that nothing will run measurably faster than byte-by-byte functions without using vector instructions, because the bottleneck isn't CPU power; the bottleneck is memory access.
It should be.
Like I said, vectors were created
specifically to solve this problem, and IME you won't find notable performance gains without using them.
I think Rémi is aware of that. However, an optimized plain C implementation is arguably applicable to a much broader range of (micro-)architectures.
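For example, a plain C memset along these lines (a rough sketch, not an actual patch; real code also has to mind the aliasing rules) already stores a machine word per iteration on any architecture:

#include <stddef.h>
#include <stdint.h>

void *c_memset( void *dst, int c, size_t n )
{
    unsigned char *d = dst;
    uintptr_t pattern = (unsigned char)c;

    /* Broadcast the byte into a full machine word (0x0101...01 * byte). */
    pattern *= (uintptr_t)-1 / 0xff;

    /* Head: byte stores until the pointer is word-aligned. */
    while (n && ((uintptr_t)d & (sizeof(pattern) - 1)))
    {
        *d++ = (unsigned char)c;
        n--;
    }
    /* Body: one aligned word store per iteration. */
    while (n >= sizeof(pattern))
    {
        *(uintptr_t *)(void *)d = pattern;
        d += sizeof(pattern);
        n -= sizeof(pattern);
    }
    /* Tail. */
    while (n--) *d++ = (unsigned char)c;
    return dst;
}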
Now, we CAN use #ifdefs and preprocessor macros to define a fake __m128i on systems that don't natively support it. Then write emulation for each operation so that GCC can compile real vector instructions when possible (x86-64) and fallback to smaller types on systems without vector support. That way we'd avoid large vendor-specific code blocks. But you're not going to escape this idea of "we need to read large chunks and operate on them all at once".
What you're thinking of is a SIMD abstraction library. I don't see it as strictly necessary, since we're okay with vendor-specific code blocks as long as they are justified. Note that we currently support only four architectures (IA-32, x86-64, ARM AArch32, and ARM AArch64).
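To make sure we're talking about the same thing, such an abstraction could look roughly like this (hypothetical names, not existing Wine code): one "wide chunk" type that maps to SSE2 on x86 and to a plain integer elsewhere, so the same copy loop compiles everywhere.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __SSE2__
#include <emmintrin.h>
typedef __m128i wide_t;
#define WIDE_LOAD( p )     _mm_loadu_si128( (const __m128i *)(p) )
#define WIDE_STORE( p, v ) _mm_storeu_si128( (__m128i *)(p), (v) )
#else
typedef uint64_t wide_t;
/* Fixed-size memcpy compiles down to a single unaligned load/store on
 * most compilers, so the fallback still moves one word per iteration. */
static inline wide_t wide_load( const void *p ) { wide_t v; memcpy( &v, p, sizeof(v) ); return v; }
static inline void wide_store( void *p, wide_t v ) { memcpy( p, &v, sizeof(v) ); }
#define WIDE_LOAD( p )     wide_load( p )
#define WIDE_STORE( p, v ) wide_store( p, v )
#endif

void *generic_memcpy( void *dst, const void *src, size_t n )
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= sizeof(wide_t))
    {
        WIDE_STORE( d, WIDE_LOAD( s ) );
        d += sizeof(wide_t);
        s += sizeof(wide_t);
        n -= sizeof(wide_t);
    }
    while (n--) *d++ = *s++;
    return dst;
}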
Personally I think Jinoh's suggestion to find a compatibly-licensed library and copy their code is best. Otherwise I sense this will become an endless cycle of "do we really need it?" (yes, but this type of code is annoying to review), and Wine could benefit from using an implementation that's already widely tested.
- Elaine