Signed-off-by: Jacek Caban jacek@codeweavers.com
---
 dlls/ntdll/unix/signal_x86_64.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)
On 1/22/21 18:51, Jacek Caban wrote:
Signed-off-by: Jacek Caban jacek@codeweavers.com
dlls/ntdll/unix/signal_x86_64.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
This (together with saving all the basic and XMM registers) looks like a big overhead on every Nt function call. Is it maybe possible to do that only when explicitly requested (some option)?
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
Also, to save at least on this part, it is possible to use xsavec, which won't save anything (aside from the mask) if the ymm high parts are zero (that is, in the initial state, which is quite a common case when the ymm registers were not used before the call; compilers even tend to reset the upper halves of the ymm registers when done with them). There is user_shared_data->XState.EnabledFeatures, which tells whether xsave is supported at all, and user_shared_data->XState.CompactionEnabled, which tells whether xsavec is available.
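(For illustration only, a minimal C sketch of what such a runtime selection could look like; the function and parameter names are made up, and in ntdll the two flags would come from the user_shared_data->XState fields mentioned above.)

/* Hypothetical sketch: pick xsavec when compaction is available, fall back to
 * plain xsave, and skip the extended save entirely when xsave is unsupported.
 * The save area is assumed to be 64-byte aligned. */
static void save_extended_context( void *xsave_area, unsigned long long mask,
                                   int xsave_enabled, int compaction_enabled )
{
    unsigned int lo = (unsigned int)mask, hi = (unsigned int)(mask >> 32);

    if (!xsave_enabled) return;  /* nothing beyond fxsave to save */

    if (compaction_enabled)      /* xsavec skips components in their initial state */
        __asm__ volatile( "xsavec64 (%0)" :: "r" (xsave_area), "a" (lo), "d" (hi) : "memory" );
    else
        __asm__ volatile( "xsave64 (%0)" :: "r" (xsave_area), "a" (lo), "d" (hi) : "memory" );
}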
Hi Paul,
On 22.01.2021 17:21, Paul Gofman wrote:
On 1/22/21 18:51, Jacek Caban wrote:
Signed-off-by: Jacek Caban jacek@codeweavers.com
dlls/ntdll/unix/signal_x86_64.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
This (together with saving all the basic and XMM registers) looks like a big overhead on every Nt function call. Is it maybe possible to do that only when explicitly requested (some option)?
If you mean a user-configurable option, I do not think that's the right direction (just like any BreakXXXFeature option). I'd rather optimize this solution to make sure that its performance is acceptable. This series is what I considered good enough for the first iteration.
We may want an extension enabling a debugger to get a context inside a syscall. I've been thinking about a flag in the PEB that winedbg could set when desired.
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
xsave is part of SSE2, not AVX, and it should ignore unsupported requested features, so the patch should be fine as-is on hardware without AVX. xsave needs, however, to be enabled by the OS, so we may need a feature check if we want to support OSes without xsave enabled.
Also, to save at least on this part, it is possible to use xsavec, which won't save anything (aside from the mask) if the ymm high parts are zero (that is, in the initial state, which is quite a common case when the ymm registers were not used before the call; compilers even tend to reset the upper halves of the ymm registers when done with them). There is user_shared_data->XState.EnabledFeatures, which tells whether xsave is supported at all, and user_shared_data->XState.CompactionEnabled, which tells whether xsavec is available.
If I read the documentation right (and my testing confirms it), xsave does what you described as well: it will not store the high ymm part if it's in the initial state. The difference between xsave and xsavec is about the storage format, but that doesn't make a difference here. xsavec is also not exactly free: in addition to the feature check, it also requires the entire xsave header to be initialized.
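(A rough sketch of that difference, with an assumed area layout: xsave only takes the requested-feature mask in EDX:EAX, while xsavec additionally wants the 64-byte xsave header prepared, here simply by zeroing it.)

#include <string.h>

/* Assumed layout, following the xsave documentation: a 512-byte legacy
 * (FXSAVE) region, the 64-byte xsave header, then the extended region.
 * The whole area must be 64-byte aligned. */
struct xsave_area
{
    unsigned char legacy[512];
    unsigned char header[64];      /* XSTATE_BV, XCOMP_BV, reserved bytes */
    unsigned char extended[1024];  /* YMM_Hi128 etc. */
} __attribute__((aligned(64)));

#define XSTATE_MASK_LO 0x7u        /* x87 | SSE | AVX, for illustration */

static void save_with_xsave( struct xsave_area *area )
{
    /* Standard format: the header only needs to be prepared once, when the
     * area is set up, so nothing extra happens per call. */
    __asm__ volatile( "xsave64 %0" : "+m" (*area)
                      : "a" (XSTATE_MASK_LO), "d" (0) : "memory" );
}

static void save_with_xsavec( struct xsave_area *area )
{
    /* Compacted format: as noted above, the header has to be initialized for
     * the save, which is extra per-call work on top of the feature check. */
    memset( area->header, 0, sizeof(area->header) );
    __asm__ volatile( "xsavec64 %0" : "+m" (*area)
                      : "a" (XSTATE_MASK_LO), "d" (0) : "memory" );
}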
What seems to be more interesting is xsaveopt, which I think could make a difference. That would, however, need the xsave area to be at a constant address. I've been thinking about storing it next to the TEB, but we can't do that as long as winsock is called on the signal stack, so I left experimenting with it for the future.
Thanks,
Jacek
On 1/22/21 20:18, Jacek Caban wrote:
Hi Paul,
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
xsave is part of SSE2, not AVX, and it should ignore unsupported requested features, so the patch should be fine as-is on hardware without AVX. xsave needs, however, to be enabled by the OS, so we may need a feature check if we want to support OSes without xsave enabled.
There was a real bug with that: https://bugs.winehq.org/show_bug.cgi?id=50271.
Also, to save at least on this part, it is possible to use xsavec, which won't save anything (aside from the mask) if the ymm high parts are zero (that is, in the initial state, which is quite a common case when the ymm registers were not used before the call; compilers even tend to reset the upper halves of the ymm registers when done with them). There is user_shared_data->XState.EnabledFeatures, which tells whether xsave is supported at all, and user_shared_data->XState.CompactionEnabled, which tells whether xsavec is available.
If I read the documentation right (and my testing confirms it), xsave does what you described as well: it will not store the high ymm part if it's in the initial state. The difference between xsave and xsavec is about the storage format, but that doesn't make a difference here. xsavec is also not exactly free: in addition to the feature check, it also requires the entire xsave header to be initialized.
What seems to be more interesting is xsaveopt, which I think could make a difference. That would, however, need the xsave area to be at a constant address. I've been thinking about storing it next to the TEB, but we can't do that as long as winsock is called on the signal stack, so I left experimenting with it for the future.
xsavec also performs an optimization (it doesn't save state components that are in their initial state), and I think it should not depend on the save address
On 22.01.2021 18:44, Paul Gofman wrote:
What seems to be more interesting is xsaveopt, which I think could make a difference. That would, however, need the xsave area to be at a constant address. I've been thinking about storing it next to the TEB, but we can't do that as long as winsock is called on the signal stack, so I left experimenting with it for the future.
xsavec also performs an optimization (it doesn't save state components that are in their initial state),
As I said, xsave already does that optimization, there is no need for xsavec for that.
and I think it should not depend on the save address
A constant save address is for a different optimization: xsaveopt does not need to store parts that didn't change since the last xrstor. It means that it would save a lot of stores in applications rarely using AVX, which xsave/xsavec can't optimize.
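(A hypothetical illustration of why the address matters: xsaveopt may skip components that were not modified since the last xrstor from the same save area, so the area needs a stable location, e.g. next to the TEB as mentioned above. The names and layout here are assumptions.)

/* Hypothetical sketch: one fixed, 64-byte-aligned save area per thread, so
 * xsaveopt's "modified" optimization can kick in.  With a per-call area on
 * the stack (a different address every time), xsaveopt degrades to roughly
 * a plain xsave. */
static __thread unsigned char thread_xsave_area[512 + 64 + 1024]
        __attribute__((aligned(64)));

static void syscall_save_xstate( unsigned int lo, unsigned int hi )
{
    __asm__ volatile( "xsaveopt64 %0" : "+m" (thread_xsave_area)
                      : "a" (lo), "d" (hi) : "memory" );
}

static void syscall_restore_xstate( unsigned int lo, unsigned int hi )
{
    __asm__ volatile( "xrstor64 %0" : : "m" (thread_xsave_area),
                      "a" (lo), "d" (hi) : "memory" );
}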
Jacek
On 1/22/21 20:51, Jacek Caban wrote:
On 22.01.2021 18:44, Paul Gofman wrote:
What seems to be more interesting is xsaveopt, which I think could make a difference. That would, however, need the xsave area to be at a constant address. I've been thinking about storing it next to the TEB, but we can't do that as long as winsock is called on the signal stack, so I left experimenting with it for the future.
xsavec also performs an optimization (it doesn't save state components that are in their initial state),
As I said, xsave already does that optimization, there is no need for xsavec for that.
Are you sure? As I read the docs, there is no mention of xsave avoiding any saves (e.g., [1], [2]), and I probably even tested that (well, not 100% sure now): xsave zeroes the data while xsavec (expectedly) does not.
From [2] (xsavec):
TO_BE_SAVED ← RFBM AND XINUSE;
From [1]:
Only "RFBM ← XCR0 AND EDX:EAX;" is used as a flag controlling whether the save is performed or not.
On 22.01.2021 19:01, Paul Gofman wrote:
On 1/22/21 20:51, Jacek Caban wrote:
On 22.01.2021 18:44, Paul Gofman wrote:
What seems to be more interesting is xsaveopt, which I think could make a difference. That would, however, need the xsave area to be at a constant address. I've been thinking about storing it next to the TEB, but we can't do that as long as winsock is called on the signal stack, so I left experimenting with it for the future.
xsavec also performs an optimization (it doesn't save state components that are in their initial state),
As I said, xsave already does that optimization, there is no need for xsavec for that.
Are you sure? As I read the docs, there is no mention of xsave avoiding any saves (e.g., [1], [2]), and I probably even tested that (well, not 100% sure now): xsave zeroes the data while xsavec (expectedly) does not.
From [2] (xsavec):
TO_BE_SAVED ← RFBM AND XINUSE;
From [1]:
Only "RFBM ← XCR0 AND EDX:EAX;" is used as a flag controlling whether the save is performed or not.
Oh, right, I was confused by the fact that it does not set the bit in the mask:
XSTATE_BV field in XSAVE header ← (OLD_BV AND ~RFBM) OR (XINUSE AND RFBM);
So it may be interesting if xsaveopt doesn't work out.
Jacek
On 22.01.2021 18:44, Paul Gofman wrote:
xsave is part of SSE2, not AVX, and it should ignore unsupported requested features, so the patch should be fine as-is on hardware without AVX. xsave needs, however, to be enabled by the OS, so we may need a feature check if we want to support OSes without xsave enabled.
There was a real bug with that: https://bugs.winehq.org/show_bug.cgi?id=50271.
Looking at the bug report, I think it was about missing AVX support, not missing XSAVE support, so it should be fine with my patches.
I would need to do testing to be sure, but I noticed that our feature detection looks suspicious. I would expect that we should do something like the attached patch (but if it's right, it would also need fixes in ntdll).
Jacek
On 1/22/21 21:03, Jacek Caban wrote:
On 22.01.2021 18:44, Paul Gofman wrote:
xsave is part of SSE2, not AVX, and it should ignore unsupported requested features, so the patch should be fine as-is on hardware without AVX. xsave needs, however, to be enabled by the OS, so we may need a feature check if we want to support OSes without xsave enabled.
There was a real bug with that: https://bugs.winehq.org/show_bug.cgi?id=50271.
Looking at the bug report, I think it was about missing AVX support, not missing XSAVE support, so it should be fine with my patches.
It was faulting exactly on the xrstor instruction, with the mask being taken from what Linux gives us in its xsave structure. I didn't check the mask that time, so I can't fully exclude the possibility that it was the wrong mask somehow and it wouldn't fail with a zero mask, but I got the impression that the xrstor instruction itself was unsupported. And apparently the cpuid XSAVE flag wasn't set; that's how the fix there helps. Also, the xsave instruction documentation explicitly refers to the XSAVE CPUID feature flag.
On 22.01.2021 19:20, Paul Gofman wrote:
On 1/22/21 21:03, Jacek Caban wrote:
On 22.01.2021 18:44, Paul Gofman wrote:
xsave is part of SSE2, not AVX, and it should ignore unsupported requested features, so the patch should be fine as-is on hardware without AVX. xsave needs, however, to be enabled by the OS, so we may need a feature check if we want to support OSes without xsave enabled.
There was a real bug with that: https://bugs.winehq.org/show_bug.cgi?id=50271.
Looking at the bug report, I think it was about missing AVX support, not missing XSAVE support, so it should be fine with my patches.
It was faulting exactly on the xrstor instruction, with the mask being taken from what Linux gives us in its xsave structure. I didn't check the mask that time, so I can't fully exclude the possibility that it was the wrong mask somehow and it wouldn't fail with a zero mask, but I got the impression that the xrstor instruction itself was unsupported. And apparently the cpuid XSAVE flag wasn't set; that's how the fix there helps. Also, the xsave instruction documentation explicitly refers to the XSAVE CPUID feature flag.
It's mentioned in the exceptions section of the documentation that you linked:
"#UD If CPUID.01H:ECX.XSAVE[bit 26] = 0."
There is no AVX check. (There is also "If CR4.OSXSAVE[bit 18] = 0.", which is how the OS may decide whether to support it or not.)
I'm not sure why you conclude that the XSAVE CPUID flag wasn't set in the case of that bug. With the current code, both the AVX and XSAVE flags are required to set EnabledFeatures, so it wouldn't be set even if XSAVE is supported but AVX is not.
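(For reference, a minimal sketch of checking the relevant CPUID bits separately; the bit positions are the ones quoted above from the documentation, everything else is illustrative.)

#include <cpuid.h>
#include <stdbool.h>

/* CPUID.01H:ECX feature bits. */
#define CPUID_ECX_XSAVE   (1u << 26)  /* CPU supports xsave/xrstor */
#define CPUID_ECX_OSXSAVE (1u << 27)  /* OS enabled it (CR4.OSXSAVE) */
#define CPUID_ECX_AVX     (1u << 28)  /* AVX instructions */

static bool xsave_usable( bool *avx_supported )
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid( 1, &eax, &ebx, &ecx, &edx )) return false;

    /* XSAVE and AVX are separate bits: a CPU can support xsave without AVX.
     * A complete AVX check would also verify the YMM bits in XCR0 via xgetbv. */
    *avx_supported = (ecx & CPUID_ECX_AVX) != 0;
    return (ecx & CPUID_ECX_XSAVE) && (ecx & CPUID_ECX_OSXSAVE);
}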
Jacek
[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-3...
On 22/01/2021 17:21, Paul Gofman wrote:
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
Looking closer at this, we indeed need a feature check here. I will work on a new version (patches 1-11 in the series should not be affected).
Thanks,
Jacek
On Saturday, 23 January 2021 at 14:40:08 EAT, Jacek Caban wrote:
On 22/01/2021 17:21, Paul Gofman wrote:
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
Looking closer at this, we indeed need a feature check here. I will work on a new version (patches 1-11 in the series should not be affected).
I am curious whether you have tested the performance impact of this. While I agree that this change is the right thing to do, it would be nice to know the downsides.
On 24.01.2021 11:54, Stefan Dösinger wrote:
On Saturday, 23 January 2021 at 14:40:08 EAT, Jacek Caban wrote:
On 22/01/2021 17:21, Paul Gofman wrote:
I think we still support processors which don't have AVX and thus don't have the xsave instruction (which is reported as a separate cpuid bit).
Looking closer at this, we indeed need a feature check here. I will work on a new version (patches 1-11 in the series should not be affected).
I am curious whether you have tested the performance impact of this. While I agree that this change is the right thing to do, it would be nice to know the downsides.
It's not exactly clear to me what results you'd like to see. This is a similar operation to what Windows has to do in its syscalls, so real applications already take that into account and avoid unneeded syscalls on hot paths. That leaves us with microbenchmarks. I came up with the attached benchmark, which tries to show the impact on three types of Nt* functions in Wine (a rough sketch of such a loop is shown after the list below). It calls NtQueryInformationProcess with different arguments. Depending on the argument:
- ProcessIoCounters: Wine quickly returns some data. This is a typical thing that stubs do, but some implemented functions are like that as well.
- ProcessVmCounters: Wine does some stuff on client side, including Linux syscalls, to do its work.
- ProcessBasicInformation: Wine uses a server call to implement it.
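(The actual benchmark is an attachment not reproduced here; a minimal sketch of such a loop, with assumed iteration counts and rdtsc-based timing, could look like this.)

#include <stdio.h>
#include <windows.h>
#include <winternl.h>
#include <x86intrin.h>

/* Hypothetical microbenchmark: total cycles for a loop of
 * NtQueryInformationProcess calls with a given information class.
 * Iteration counts and timing method are assumptions, not the attachment. */
static unsigned long long bench( PROCESSINFOCLASS info_class, void *buf,
                                 ULONG size, unsigned int iters )
{
    ULONG ret_len;
    unsigned long long start = __rdtsc();
    unsigned int i;

    for (i = 0; i < iters; i++)
        NtQueryInformationProcess( GetCurrentProcess(), info_class, buf, size, &ret_len );
    return __rdtsc() - start;
}

int main( void )
{
    IO_COUNTERS io;
    VM_COUNTERS vm;
    PROCESS_BASIC_INFORMATION pbi;

    /* Note the first loop runs ten times more iterations than the others. */
    printf( "%llu %llu %llu\n",
            bench( ProcessIoCounters, &io, sizeof(io), 1000000 ),
            bench( ProcessVmCounters, &vm, sizeof(vm), 100000 ),
            bench( ProcessBasicInformation, &pbi, sizeof(pbi), 100000 ) );
    return 0;
}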
Here are my averaged results of a few runs, but I really don't want to read too much into them. I originally planned to send the result of a random run, but it showed that patched Wine is notably faster on server calls, so the variation was higher than the impact:
Current Wine: 310 17692 4748
Patched Wine: 2910 18243 4898
For the patched version, I used my local tree which has this series with additional runtime cpuid checks to use fxsave/xsavec/xsave depending on CPU capabilities. As expected, the impact on a plain stub call is large, but compared to a real load the impact seems marginal.
For comparison, Windows results are something like 2200, 140, 140.
Jacek
On 25.01.2021 at 22:05, Jacek Caban jacek@codeweavers.com wrote:
It's not exactly clear to me what results you'd like to see. This is a similar operation to what Windows has to do in its syscalls, so real applications already take that into account and avoid unneeded syscalls on hot paths. That leaves us with microbenchmarks. I came up with the attached benchmark, which tries to show the impact on three types of Nt* functions in Wine. It calls NtQueryInformationProcess with different arguments. Depending on the argument:
ProcessIoCounters: Wine quickly returns some data. This is a typical thing that stubs do, but some implemented functions are like that as well.
ProcessVmCounters: Wine does some stuff on client side, including Linux syscalls, to do its work.
ProcessBasicInformation: Wine uses a server call to implement it.
Here are my averaged results of a few runs, but I really don't want to read too much out of it. I originally planned to send result of a random run, but it showed that patched Wine is notably faster on server calls, so the variation was higher than the impact:
Current Wine: 310 17692 4748
Patched Wine: 2910 18243 4898
For the patched version, I used my local tree which has this series with additional runtime cpuid checks to use fxsave/xsavec/xsave depending on CPU capabilities. As expected, the impact on a plain stub call is large, but compared to a real load the impact seems marginal.
What I had in mind was running any kind of game benchmark to see if it has a noticeable impact, but I think your microbenchmark largely rules that out - thanks for looking into that. I am getting concerned that we're replacing something that used to be a regular call with a way more complicated process. Though I guess that's OK for ntdll, where applications expect expensive syscalls. We'd have to think twice about applying the same kind of syscall thunks for e.g. GL calls.
For comparison, Windows results are something like 2200, 140, 140.
I am surprised Windows makes syscalls cheaper than our original call-based Wine stub. Something seems odd.
On Tue, Jan 26, 2021 at 01:10:05PM +0300, Stefan Dösinger wrote:
On 25.01.2021 at 22:05, Jacek Caban jacek@codeweavers.com wrote:
It's not exactly clear to me what results you'd like to see. This is a similar operation to what Windows has to do in its syscalls, so real applications already take that into account and avoid unneeded syscalls on hot paths. That leaves us with microbenchmarks. I came up with the attached benchmark, which tries to show the impact on three types of Nt* functions in Wine. It calls NtQueryInformationProcess with different arguments. Depending on the argument:
ProcessIoCounters: Wine quickly returns some data. This is a typical thing that stubs do, but some implemented functions are like that as well.
ProcessVmCounters: Wine does some stuff on client side, including Linux syscalls, to do its work.
ProcessBasicInformation: Wine uses a server call to implement it.
Here are my averaged results of a few runs, but I really don't want to read too much out of it. I originally planned to send result of a random run, but it showed that patched Wine is notably faster on server calls, so the variation was higher than the impact:
Current Wine: 310 17692 4748
Patched Wine: 2910 18243 4898
For the patched version, I used my local tree which has this series with additional runtime cpuid checks to use fxsave/xsavec/xsave depending on CPU capabilities. As expected, the impact on a plain stub call is large, but compared to a real load the impact seems marginal.
What I had in mind was running any kind of game benchmark to see if it has a noticeable impact, but I think your microbenchmark largely rules that out - thanks for looking into that. I am getting concerned that we're replacing something that used to be a regular call with a way more complicated process. Though I guess that's OK for ntdll, where applications expect expensive syscalls. We'd have to think twice about applying the same kind of syscall thunks for e.g. GL calls.
For comparison, Windows results are something like 2200, 140, 140.
I am surprised Windows makes syscalls cheaper than our original call-based Wine stub. Something seems odd.
According to the attached code, the first loop is done ten times more than the others, so you should probably divide numbers in the first column by ten.
Huw.
On 26.01.2021 11:19, Huw Davies wrote:
On Tue, Jan 26, 2021 at 01:10:05PM +0300, Stefan Dösinger wrote:
On 25.01.2021 at 22:05, Jacek Caban jacek@codeweavers.com wrote:
It's not exactly clear to me what results you'd like to see. This is a similar operation to what Windows has to do in its syscalls, so real applications already take that into account and avoid unneeded syscalls on hot paths. That leaves us with microbenchmarks. I came up with the attached benchmark, which tries to show the impact on three types of Nt* functions in Wine. It calls NtQueryInformationProcess with different arguments. Depending on the argument:
ProcessIoCounters: Wine quickly returns some data. This is a typical thing that stubs do, but some implemented functions are like that as well.
ProcessVmCounters: Wine does some stuff on client side, including Linux syscalls, to do its work.
ProcessBasicInformation: Wine uses a server call to implement it.
Here are my averaged results of a few runs, but I really don't want to read too much out of it. I originally planned to send result of a random run, but it showed that patched Wine is notably faster on server calls, so the variation was higher than the impact:
Current Wine: 310 17692 4748
Patched Wine: 2910 18243 4898
For the patched version, I used my local tree which has this series with additional runtime cpuid checks to use fxsave/xsavec/xsave depending on CPU capabilities. As expected, the impact on a plain stub call is large, but compared to a real load the impact seems marginal.
What I had in mind was running any kind of game benchmark to see if it has a noticeable impact, but I think your microbenchmark largely rules that out - thanks for looking into that. I am getting concerned that we're replacing something that used to be a regular call with a way more complicated process. Though I guess that's OK for ntdll, where applications expect expensive syscalls. We'd have to think twice about applying the same kind of syscall thunks for e.g. GL calls.
For comparison, Windows results are something like 2200, 140, 140.
I am surprised Windows makes syscalls cheaper than our original call-based Wine stub. Something seems odd.
According to the attached code, the first loop is done ten times more than the others, so you should probably divide numbers in the first column by ten.
Yes, our stubs were too fast to measure otherwise.
I took another look at those results to see where the majority of the difference comes from. I found that my xsavec variant had a very bad effect on performance due to the additional jumps it needed. With a hacked version that avoids the additional jumps, the result for patched Wine is about 1800 (instead of the 2910 previously observed). We may want separate dedicated dispatchers to achieve that without compromising compatibility.
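(A hypothetical illustration of the "dedicated dispatchers" idea: choose one dispatcher variant once at startup so the per-syscall path contains no capability branches. The names and selection logic are made up; in ntdll the flags would come from user_shared_data->XState as discussed earlier in the thread.)

/* Hypothetical sketch; in the real series these would be assembly thunks. */
static void syscall_dispatcher_fxsave( void ) { /* fxsave-only save/restore */ }
static void syscall_dispatcher_xsave( void )  { /* standard-format xsave */ }
static void syscall_dispatcher_xsavec( void ) { /* compacted-format xsavec */ }

static void (*syscall_dispatcher)( void );

/* Run once at process start, so syscall_dispatcher() can be called directly
 * afterwards without any per-call feature checks or extra jumps. */
static void init_syscall_dispatcher( int xsave_enabled, int compaction_enabled )
{
    if (!xsave_enabled)          syscall_dispatcher = syscall_dispatcher_fxsave;
    else if (compaction_enabled) syscall_dispatcher = syscall_dispatcher_xsavec;
    else                         syscall_dispatcher = syscall_dispatcher_xsave;
}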
I also looked at how particular parts of the patchset affect performance by applying it incrementally. Out of the difference (1800 - 310), the breakdown is:
- store xmm: 9%
- store other int/control registers: 1%
- restore control registers: 15%
- restore xmm: 28%
- store ymm: 32%
- restore ymm: 14%
Jacek
On 1/26/21 13:10, Stefan Dösinger wrote:
On 25.01.2021 at 22:05, Jacek Caban jacek@codeweavers.com wrote:
It's not exactly clear to me what results you'd like to see. This is a similar operation to what Windows has to do in its syscalls, so real applications already take that into account and avoid unneeded syscalls on hot paths. That leaves us with microbenchmarks. I came up with the attached benchmark, which tries to show the impact on three types of Nt* functions in Wine. It calls NtQueryInformationProcess with different arguments. Depending on the argument:
ProcessIoCounters: Wine quickly returns some data. This is a typical thing that stubs do, but some implemented functions are like that as well.
ProcessVmCounters: Wine does some stuff on client side, including Linux syscalls, to do its work.
ProcessBasicInformation: Wine uses a server call to implement it.
Here are my averaged results of a few runs, but I really don't want to read too much out of it. I originally planned to send result of a random run, but it showed that patched Wine is notably faster on server calls, so the variation was higher than the impact:
Current Wine: 310 17692 4748
Patched Wine: 2910 18243 4898
For the patched version, I used my local tree which has this series with additional runtime cpuid checks to use fxsave/xsavec/xsave depending on CPU capabilities. As expected, the impact on a plain stub call is large, but compared to a real load the impact seems marginal.
What I had in mind was running any kind of game benchmark to see if it has a noticeable impact, but I think your microbenchmark largely rules that out - thanks for looking into that. I am getting concerned that we're replacing something that used to be a regular call with a way more complicated process. Though I guess that's OK for ntdll, where applications expect expensive syscalls. We'd have to think twice about applying the same kind of syscall thunks for e.g. GL calls.
For comparison, Windows results are something like 2200, 140, 140.
I am surprised Windows makes syscalls cheaper than our original call-based Wine stub. Something seems odd.
Our functions (even stubs with a debug trace) currently tend to do an (often, but not necessarily, redundant) save of the non-volatile xmm registers, which adds up to something. Also, I wouldn't be surprised if Windows has some fast syscall path which doesn't save the full context, maybe for syscalls that simply query some information available in memory and aren't expected to change the task's state. Given how many syscalls Windows has and how they are used, I think that would be a natural (if not a must-have) thing to do.