Dec. 2, 2022
8:54 p.m.
On 12/2/22 14:39, Paul Gofman wrote:
> On 12/2/22 14:32, Zebediah Figura (@zfigura) wrote:
>> On Fri Dec 2 20:30:47 2022 +0000, **** wrote:
>>> Paul Gofman replied on the mailing list:
>>> ```
>>> On 12/2/22 14:25, Gabriel Ivăncescu (@insn) wrote:
>>>> On Fri Dec 2 18:57:30 2022 +0000, Jacek Caban wrote:
>>>>>> This should help a bit more, does it make a difference for you?
>>>>> My previous test wasn't really good for measuring it.
>>>>> I hacked a micro-benchmark, which confirms that the patch improves
>>>>> performance a lot. It was visible when doing "real" Vulkan
>>>>> vkGetPhysicalDeviceProperties calls in a loop, but even cleaner when I
>>>>> changed it further to make Unix side to be no-op. It closes most of
>>>>> the gap between direct call and __wine_unix_call_dispatcher. Times
>>>>> recorded for no-op calls:
>>>>> - direct call: 5761
>>>>> - unpatched Wine: 13933
>>>>> - ret.diff: 6823 (55% time spent in __wine_unix_call_dispatcher,
>>>>>   29% in PE vkGetPhysicalDeviceProperties)
>>>>> Looks impressive!
>>>> @gofman This isn't about setting it in rcx or not, it's about
>>>> mispairing `call`s and `ret`s, which basically means 100% mispredicted
>>>> because CPUs are optimized for it, so it couldn't do any speculative
>>>> execution past the return before.
>>> Yes, I figured that much. Yet the attached diff removes the return
>>> address from rcx in wine_syscall_dispatcher(), so I thought it makes
>>> sense to note that it will break things.
>>> ```
>> Would it help to return to the return address already on the PE stack?
>>
> I am sorry, am not sure if I understand... help perf or help anticheat,
> and how return address on PE stack is related? Also note that:
>
> - ret address in rcx relates to wine_syscall_dispatcher only, not
> __wine_unix_call_dispatcher, while it is __wine_unix_call_dispatcher is
> of the performance concern here;
>
> - I guess that moving the ret address to rcx and push rcx / ret might be
> the same performance-wise as pushq 0x70(%rcx), ret.
>

I mean something like the attached patch. I don't know enough about
modern x86 optimization to know if it would help, but it seems like it
would at least avoid a memory access?
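
As a rough check of the timings quoted above, the share of the dispatcher overhead that ret.diff removes can be worked out from the three no-op figures. This is only a sketch: the helper names `overhead` and `gap_closed_percent` are hypothetical, and the unit of the timings is not stated in the thread, so the numbers are treated as unitless.

```c
#include <assert.h>

/* Extra cost of a dispatched call over the direct-call baseline.
 * (Hypothetical helper; units are whatever the benchmark reported.) */
double overhead(double direct, double measured)
{
    return measured - direct;
}

/* Percentage of the unpatched overhead that the patched build removes.
 * (Hypothetical helper, not part of Wine.) */
double gap_closed_percent(double direct, double unpatched, double patched)
{
    double before = overhead(direct, unpatched);
    double after = overhead(direct, patched);

    /* The calculation only makes sense if unpatched is slower than direct. */
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

With the quoted figures, `gap_closed_percent(5761, 13933, 6823)` comes out at roughly 87, which matches the "closes most of the gap" description.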