Re: [PATCH] msvcrt: SSE2 implementation of memcmp for x86_64.

2 Apr 2022


On 4/2/22 12:51, Jin-oh Kang wrote:
...
Wouldn't it make much more sense if we simply copied optimized copy
routines from other libc implementations? They have specialised
implementations for various architectures and microarchitectures (e.g.
cache line size), not to mention the performance enhancements that have
accumulated over time.
The question is: do we really need and want the complexity induced by
hand-crafted assembly (or intrinsics) routines?
* at build time, but also at run time, we'll need to carefully check hardware
capabilities,
* it increases the maintenance burden, as the routines may need to be updated
when hardware performance profiles change or when new features are added,
* other libc implementations may be hard to integrate into our code base,
especially if they rely on some dispatch mechanism or assembly source.
Or do we want to rely as much as possible on the compiler to do it for us?
I don't know the rationale behind the choices made by the other libcs, but as
far as I understand, for Wine an efficient C implementation is usually
preferred over assembly, unless a convincing argument is made that doing
it in assembly significantly improves things for some applications.
(I personally believe that the efficient C implementation should come
first, so that any unsupported hardware will at least benefit from it.)
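To illustrate what an "efficient C implementation" could look like, here is a minimal sketch that compares eight bytes at a time and falls back to a byte loop for the mismatching word and the tail. The names `portable_memcmp` and `memcmp_bytes` are hypothetical and this is not the code from the patch; it also assumes the compiler inlines the fixed-size `memcpy` calls into plain loads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise comparison, used for the mismatching word and the tail. */
static int memcmp_bytes(const unsigned char *p1, const unsigned char *p2, size_t n)
{
    for (; n; n--, p1++, p2++)
        if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;
    return 0;
}

/* Hypothetical name: compares 8 bytes per iteration in plain C. */
int portable_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;

    while (n >= sizeof(uint64_t))
    {
        uint64_t a, b;
        /* memcpy avoids unaligned-access and aliasing problems */
        memcpy(&a, p1, sizeof(a));
        memcpy(&b, p2, sizeof(b));
        if (a != b) break; /* the differing byte is inside this word */
        p1 += sizeof(a);
        p2 += sizeof(a);
        n -= sizeof(a);
    }
    /* Locate the exact differing byte, or finish the remaining tail. */
    return memcmp_bytes(p1, p2, n);
}
```

Because nothing here is architecture-specific, the same source would build for every target, which is the point being made about unsupported hardware.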
...
Also worth noting is that Wine is licensed under LGPL, which makes it
compatible with most open-source libcs out there. Basically what we would
need is some ABI adaptations, such as calling convention adjustment and SEH.
Another option is to just call system libc routines directly, although in
this case it might interfere with stack unwinding, clear PE/unix
separation, and msvcrt hotpatching.
Calling the system libc will need a "syscall", and will most likely 
defeat any performance improvement it could bring.
...
On Sat, Apr 2, 2022, 1:45 PM Elaine Lefler <elaineclefler@gmail.com> wrote:
...
On Fri, Apr 1, 2022 at 7:13 AM Jan Sikorski <jsikorski@codeweavers.com> wrote:
...
Signed-off-by: Jan Sikorski <jsikorski@codeweavers.com>
It's about 13x faster on my machine than the byte version.
memcmp performance is important to wined3d, where it's used to find
pipelines in the cache, and the keys are pretty big.
It should be noted that SSE2 also exists on 32-bit processors, and in
this same file you can find usage of "sse2_supported", which would
enable you to use this code path on i386. You can put
__attribute__((target("sse2"))) on the declaration of sse2_memcmp to
allow GCC to emit SSE2 instructions even when the file's target
architecture would otherwise forbid it.
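A rough sketch of that suggestion, under some assumptions: the body of `sse2_memcmp` below is illustrative and not the code from the patch, `byte_memcmp` and `dispatch_memcmp` are hypothetical names, and GCC's `__builtin_cpu_supports("sse2")` stands in for the file's `sse2_supported`, whose definition is not shown in this thread.

```c
#include <stddef.h>

/* Plain byte loop, usable on any hardware. */
static int byte_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;
    for (; n; n--, p1++, p2++)
        if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;
    return 0;
}

#if defined(__i386__) || defined(__x86_64__)
#include <emmintrin.h>

/* The target attribute lets GCC emit SSE2 in this one function even if
 * the rest of the file is compiled without -msse2 (e.g. on i386). */
__attribute__((target("sse2")))
static int sse2_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;

    while (n >= 16)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)p1);
        __m128i b = _mm_loadu_si128((const __m128i *)p2);
        /* one mask bit per byte, set when the two bytes are equal */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xffff) break;
        p1 += 16;
        p2 += 16;
        n -= 16;
    }
    /* find the differing byte in the current block, or compare the tail */
    return byte_memcmp(p1, p2, n);
}
#endif

/* Hypothetical dispatcher: take the SSE2 path only when the CPU has it. */
int dispatch_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
#if defined(__i386__) || defined(__x86_64__)
    if (__builtin_cpu_supports("sse2")) return sse2_memcmp(ptr1, ptr2, n);
#endif
    return byte_memcmp(ptr1, ptr2, n);
}
```

The run-time check matters on i386 because the attribute only changes what the compiler may emit; it does not guarantee that the CPU running the code supports those instructions.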
I think this could be even faster if you forced ptr1 to be aligned by
byte-comparing up to ((p1 + 15) & ~15) at the beginning. Can't
reasonably force-align both pointers, but aligning at least one should
give measurably better performance.
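The alignment idea could be sketched roughly as follows. This is an illustration, not the patch: `aligned_head_memcmp` is a hypothetical name, the code assumes SSE2 is actually available at run time on x86, and whether it is measurably faster would have to be benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <emmintrin.h>
#define HAVE_SSE2_PATH 1
#define SSE2_TARGET __attribute__((target("sse2")))
#else
#define HAVE_SSE2_PATH 0
#define SSE2_TARGET
#endif

/* Hypothetical name: byte-compare until p1 is 16-byte aligned, then use
 * aligned loads for p1 and unaligned loads for p2. */
SSE2_TARGET
int aligned_head_memcmp(const void *ptr1, const void *ptr2, size_t n)
{
    const unsigned char *p1 = ptr1, *p2 = ptr2;
    uintptr_t addr = (uintptr_t)p1;
    /* distance from p1 to ((p1 + 15) & ~15), the next 16-byte boundary */
    size_t head = (size_t)(((addr + 15) & ~(uintptr_t)15) - addr);

    if (head > n) head = n;
    n -= head;
    for (; head; head--, p1++, p2++)
        if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;

#if HAVE_SSE2_PATH
    while (n >= 16)
    {
        __m128i a = _mm_load_si128((const __m128i *)p1);  /* p1 is aligned now */
        __m128i b = _mm_loadu_si128((const __m128i *)p2); /* p2 may not be */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xffff) break;
        p1 += 16;
        p2 += 16;
        n -= 16;
    }
#endif
    /* mismatching block or remaining tail, byte by byte */
    for (; n; n--, p1++, p2++)
        if (*p1 != *p2) return *p1 < *p2 ? -1 : 1;
    return 0;
}
```

Only one pointer can be aligned this way, since the two buffers generally sit at different offsets from a 16-byte boundary.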
I have a similar patch (labelled 230501 on
https://source.winehq.org/patches/ - not sure how to link the whole
discussion, sorry) which triggered a discussion about duplication
between ntdll and msvcrt. memcmp is also a function that appears in
both dlls. Do you have any input on that? (sorry if I'm out of line
for butting in here. I just noticed we're working on the same basic
thing)

Elaine

-- 
Rémi Bernon <rbernon@codeweavers.com>
