To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com --- v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Tested in windowed 1280x720, all graphics settings to low. I could also see a +10% CPU usage in perf, in the syscall dispatcher, as expected.
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I also wouldn't bet on it being easily possible and in a way that's acceptable for the code freeze (ie: without too many or ugly changes).
It would also imho impede an eventual Proton rebase, as we'll not want that kind of regression so we'd probably have to revert the changes, and not even try to fix the regression or make some ugly hacks to workaround it.
Cheers,
On 12/10/21 12:41, Rémi Bernon wrote:
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I also wouldn't bet on it being easily possible and in a way that's acceptable for the code freeze (ie: without too many or ugly changes).
It would also imho impede an eventual Proton rebase, as we'll not want that kind of regression so we'd probably have to revert the changes, and not even try to fix the regression or make some ugly hacks to workaround it.
Cheers,
Yeah, considering that:
- winevulkan, opengl and potentially a bit of other stuff unfortunately have more issues for 32 on 64 without explored solution yet and even in the most optimistic case won't be there around 7.0;
- there is an example of the game which shows measurable performance drop;
- it is the last day before the code freeze and details are not yet known and it is not clear what fixing the performance drop will take.
I'd also suggest to explore that first without a rush.
On 12/10/21 11:08, Paul Gofman wrote:
On 12/10/21 12:41, Rémi Bernon wrote:
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I also wouldn't bet on it being easily possible and in a way that's acceptable for the code freeze (ie: without too many or ugly changes).
It would also imho impede an eventual Proton rebase, as we'll not want that kind of regression so we'd probably have to revert the changes, and not even try to fix the regression or make some ugly hacks to workaround it.
Cheers,
Yeah, considering that:
- winevulkan, opengl and potentially a bit of other stuff unfortunately
have more issues for 32 on 64 without explored solution yet and even in the most optimistic case won't be there around 7.0;
there is an example of the game which shows measurable performance drop;
it is the last day before the code freeze and details are not yet
known and it is not clear what fixing the performance drop will take.
I'd also suggest to explore that first without a rush.
Also fwiw regarding perf tool, the recent ntdll.dll 64bit base address change makes it fail to match symbols correctly, it needs to be put in the 32bit address space again to make any meaningful measurement with perf.
On 12/10/21 10:41 AM, Rémi Bernon wrote:
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I don't think this is critical. The most important parts of the series were committed yesterday and with them, we no longer need old style Unix libs. The rest would be still nice to have. Even if we don't enable syscalls for everything, it would be nice to be prepared for them.
BTW, wow64 is not everything. For example, the patchset avoids Windows debuggers being confused by Vulkan calls.
I also wouldn't bet on it being easily possible and in a way that's acceptable for the code freeze (ie: without too many or ugly changes).
It would also imho impede an eventual Proton rebase, as we'll not want that kind of regression so we'd probably have to revert the changes, and not even try to fix the regression or make some ugly hacks to workaround it.
Note that only the last patch from the series is risky for performance. That patch is also small and easy to tweak or revert, if needed, so I'm sure we will not anything uglier than that. Maybe we could just disable it for command buffer (and possibly some other frequent calls) for now and have the codebase otherwise ready and tested.
Thanks,
Jacek
On 12/10/21 14:41, Jacek Caban wrote:
On 12/10/21 10:41 AM, Rémi Bernon wrote:
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I don't think this is critical. The most important parts of the series were committed yesterday and with them, we no longer need old style Unix libs. The rest would be still nice to have. Even if we don't enable syscalls for everything, it would be nice to be prepared for them.
BTW, wow64 is not everything. For example, the patchset avoids Windows debuggers being confused by Vulkan calls.
I completely understand that, and we definitely want the developers to be able to debug their games seamlessly.
However I think it's a specific and minority use case, and most people will just run the games and are only interested in having the best performance they can get with their hardware.
In the same way as debug builds could be worth for developers (debug symbols ofc, but we could go further than that), but are clearly not something we want to have for normal users, maybe we can consider having a compile-time switch for this kind of "full debugger support" feature?
I'm not completely aware of why the full FPU context needs to be saved and restored for instance, but if it's only for the debugging experience, could that simply be stripped in such builds?
Cheers,
On 12/10/21 17:00, Rémi Bernon wrote:
I'm not completely aware of why the full FPU context needs to be saved and restored for instance, but if it's only for the debugging experience, could that simply be stripped in such builds?
If you plainly just strip saving FPU context you will get NtGetContextThread broken without any debugger involved. I believe restoring the FPU context is optimized out already for the (majority of) cases when it is not needed due to setting FPU context for thread. Of course one can hack something around and then enable only for games which really need that, after spending time finding out that they do.
If we talk about AVX (ymm) registers with xsavec support that is only actually saved if there are nonzero registers YMM registers upon the call. And given those registers are volatile compilers tend to often do vzeroupper before function calls as far as I could see (probably exactly to avoid context saving overhead on the syscalls otherwise present both under Windows and Linux without Wine involved).
Then, there is a part of non-volatile ms_abi XMM registers which are volatile on sysv_abi and those are allways saved in compiler generated prologue once ms_abi function calls sysv_abi. So going through Wine->Unix gate just changes the place where those are saved and in general a clear PE - unix part separation should be removing a great amount of these saves across function calls.
The part which stays excessive is volatile XMM register saves, but that is probably relatively minor and might be subject for fine but ugly optimization if we ever to introduce a "lightweight" dispatcher for non-blocking call. But I'd expect this overhead to be less than what we gain over the split for removing extra non-volatile XMM register saves.
On 12/10/21 15:22, Paul Gofman wrote:
On 12/10/21 17:00, Rémi Bernon wrote:
I'm not completely aware of why the full FPU context needs to be saved and restored for instance, but if it's only for the debugging experience, could that simply be stripped in such builds?
If you plainly just strip saving FPU context you will get NtGetContextThread broken without any debugger involved. I believe restoring the FPU context is optimized out already for the (majority of) cases when it is not needed due to setting FPU context for thread. Of course one can hack something around and then enable only for games which really need that, after spending time finding out that they do.
If we talk about AVX (ymm) registers with xsavec support that is only actually saved if there are nonzero registers YMM registers upon the call. And given those registers are volatile compilers tend to often do vzeroupper before function calls as far as I could see (probably exactly to avoid context saving overhead on the syscalls otherwise present both under Windows and Linux without Wine involved).
Then, there is a part of non-volatile ms_abi XMM registers which are volatile on sysv_abi and those are allways saved in compiler generated prologue once ms_abi function calls sysv_abi. So going through Wine->Unix gate just changes the place where those are saved and in general a clear PE - unix part separation should be removing a great amount of these saves across function calls.
The part which stays excessive is volatile XMM register saves, but that is probably relatively minor and might be subject for fine but ugly optimization if we ever to introduce a "lightweight" dispatcher for non-blocking call. But I'd expect this overhead to be less than what we gain over the split for removing extra non-volatile XMM register saves.
Well as far as I could see, in all the measurements I've made for a long while xsavec64 / xrstor64 were the highest hitters, and something like 10% to 30% CPU time spent on these two instructions.
I don't know if that should be part of the "optimized vast majority of cases", but it's the same with Proton or with current upstream Wine. In any case it's way worse and nowhere near the overhead we had a year ago with SSE XMM register spilling from the ABI transitions generated by the compiler. -- Rémi Bernon rbernon@codeweavers.com
On 12/10/21 17:36, Rémi Bernon wrote:
Well as far as I could see, in all the measurements I've made for a long while xsavec64 / xrstor64 were the highest hitters, and something like 10% to 30% CPU time spent on these two instructions.
I don't know if that should be part of the "optimized vast majority of cases", but it's the same with Proton or with current upstream Wine. In any case it's way worse and nowhere near the overhead we had a year ago with SSE XMM register spilling from the ABI transitions generated by the compiler.
It is strange. Unlike the current Proton, xrstor64 should not appear at all with upstream Wine (unless the app is setting the context for threads constantly). Are you sure that is the case with upstream? In that case maybe there is some genuine bug or easy optimization opportunity involved.
Then, for xsavec64, it is doing saves in one instruction which otherwise split in many with compiler generated registers saves. Can it be it affects its place in perf top? Overall I don't think perf can show anything meaningful on such a fine measurement level due to sampling measurements specifics and CPU instruction flow specifics. A good and simple way would be to measure the time over a large amount of calls with rdtsc with and without xsavec, in the real game by instrumenting the syscall dispatcher with that measurement (fwiw I have such an instrumentation for relay stubs locally but not for syscall dispatcher atm).
On 12/10/21 15:43, Paul Gofman wrote:
On 12/10/21 17:36, Rémi Bernon wrote:
Well as far as I could see, in all the measurements I've made for a long while xsavec64 / xrstor64 were the highest hitters, and something like 10% to 30% CPU time spent on these two instructions.
I don't know if that should be part of the "optimized vast majority of cases", but it's the same with Proton or with current upstream Wine. In any case it's way worse and nowhere near the overhead we had a year ago with SSE XMM register spilling from the ABI transitions generated by the compiler.
It is strange. Unlike the current Proton, xrstor64 should not appear at all with upstream Wine (unless the app is setting the context for threads constantly). Are you sure that is the case with upstream? In that case maybe there is some genuine bug or easy optimization opportunity involved.
Then, for xsavec64, it is doing saves in one instruction which otherwise split in many with compiler generated registers saves. Can it be it affects its place in perf top? Overall I don't think perf can show anything meaningful on such a fine measurement level due to sampling measurements specifics and CPU instruction flow specifics. A good and simple way would be to measure the time over a large amount of calls with rdtsc with and without xsavec, in the real game by instrumenting the syscall dispatcher with that measurement (fwiw I have such an instrumentation for relay stubs locally but not for syscall dispatcher atm).
I'm sure perf is very able to tell us useful information. That it uses sampling doesn't matter as it's not something that only happens randomly and spuriously, but a constant overhead, and if 10% of the samples now fall into the __wine_syscall_dispatcher, it's just very likely that it just takes 10% more CPU time.
Now, because I'm a bit annoyed that perf usage is always questioned, I also did the test:
1) With just this patch series applied, I get ~125fps, ~10% global CPU time spent in __wine_syscall_dispatcher_prolog_end (highest perf top hitter), of which 76% are assigned directly to the xsavec64 instruction.
2) Commenting out xrstor64, in __wine_syscall_dispatcher, I get roughly the same thing, unsurprisingly.
3) Commenting out both xsavec64 and xrstor64, I get ~150fps, ~3% global CPU time spent in the dispatcher (still highest hitter but close to d3d12_desc_copy_range, which is the main one without this series), and of which it is spread quite evenly over all the dispatcher instructions.
Can we put a little bit more trust in the tool now? I'm not saying it's always right, and that CPU time wouldn't be spent or wasted by the game otherwise, but here, it's clearly telling us that we are adding some overhead and where it comes from.
On 12/10/21 18:13, Rémi Bernon wrote:
Can we put a little bit more trust in the tool now? I'm not saying it's always right, and that CPU time wouldn't be spent or wasted by the game otherwise, but here, it's clearly telling us that we are adding some overhead and where it comes from.
Did I ever say that perf spot is always wrong? Your present results confirms the spot here is not random. How do I tell if that is the same in arbitrary case before checking? It seems you are not saying that it is always right as well. And while the perf was and is the first thing I run (like probably most of other people), I don't see how this case change anything in my general trust to the tool or convince me that I should stop checking those spots and stop figuring the model explaining the behaviour (tbh I have no idea how anything like that can be sanely fixed at all without the latter).
On 12/10/21 3:00 PM, Rémi Bernon wrote:
On 12/10/21 14:41, Jacek Caban wrote:
On 12/10/21 10:41 AM, Rémi Bernon wrote:
On 12/10/21 10:37, Derek Lesho wrote:
On 12/10/21 10:26, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
Maybe this could wait after 7.0 and take some more time to explore possible mitigations?
Would sending patches during the code freeze optimize this path count as regression fixing? 😅
I guess, but I don't really see why this would be critical to include now, I believe Wine 7.0 will still be missing critical pieces to make winevulkan wow64 usable.
I don't think this is critical. The most important parts of the series were committed yesterday and with them, we no longer need old style Unix libs. The rest would be still nice to have. Even if we don't enable syscalls for everything, it would be nice to be prepared for them.
BTW, wow64 is not everything. For example, the patchset avoids Windows debuggers being confused by Vulkan calls.
I completely understand that, and we definitely want the developers to be able to debug their games seamlessly.
However I think it's a specific and minority use case, and most people will just run the games and are only interested in having the best performance they can get with their hardware.
Sure, that's why we want to make it both performant and correct, hopefully in the same build.
In the same way as debug builds could be worth for developers (debug symbols ofc, but we could go further than that), but are clearly not something we want to have for normal users, maybe we can consider having a compile-time switch for this kind of "full debugger support" feature?
That's something that projects like Proton can easily do on top of that series until we have a full solution. But those syscalls are not only about debuggers, so we will need a better solution sooner or later.
I'm not completely aware of why the full FPU context needs to be saved and restored for instance, but if it's only for the debugging experience, could that simply be stripped in such builds?
Mostly what Paul wrote. It's not really possible to handle everything properly without those stores. Hacking around it quickly gets ugly and hacks can't get all cases right anyway.
Jacek
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
This is related to a possibility of further optimizations. Majority of Vulkan calls are usually command buffer recording calls. We could record their parameters ourselves on PE side, with no syscall involved, and replay them on Unix side when needed (usually in vkEndCommandBuffer). That should cover majority of syscall overhead at expense of additional layer of command buffer recording.
Thanks,
Jacek
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=103842
Your paranoid android.
=== debiant2 (build log) ===
error: patch failed: dlls/winevulkan/make_vulkan:676 Task: Patch failed to apply
=== debiant2 (build log) ===
error: patch failed: dlls/winevulkan/make_vulkan:676 Task: Patch failed to apply
On 12/10/21 14:21, Jacek Caban wrote:
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
Hi Jacek, sorry but the patch doesn't help.
On 12/10/21 16:52, Rémi Bernon wrote:
On 12/10/21 14:21, Jacek Caban wrote:
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
Hi Jacek, sorry but the patch doesn't help.
FWIW I guess WINEDEBUG=+vulkan log plus the following processing has a good chance of telling what is involved:
cut log.txt -d':' -f2 | cut -d' ' -f1 | sort | uniq -c
On 12/10/21 2:52 PM, Rémi Bernon wrote:
On 12/10/21 14:21, Jacek Caban wrote:
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
Hi Jacek, sorry but the patch doesn't help.
I repeated your tests and for me, the attached patch is enough to make performance difference marginal. It adds direct calls for vkUpdateDescriptorSets compared to previous patch.
Thanks,
Jacek
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=103854
Your paranoid android.
=== debiant2 (build log) ===
error: patch failed: dlls/winevulkan/make_vulkan:151 Task: Patch failed to apply
=== debiant2 (build log) ===
error: patch failed: dlls/winevulkan/make_vulkan:151 Task: Patch failed to apply
On 10.12.21 14:21, Jacek Caban wrote:
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
This is related to a possibility of further optimizations. Majority of Vulkan calls are usually command buffer recording calls. We could record their parameters ourselves on PE side, with no syscall involved, and replay them on Unix side when needed (usually in vkEndCommandBuffer). That should cover majority of syscall overhead at expense of additional layer of command buffer recording.
In vkd3d-proton's case a big problem might be vkUpdateDescriptorSets, and unlike vkCmd* I don't see a way to avoid one unix call per vkUpdateDescriptorSets.
Thanks,
Georg
Thanks,
Jacek
On 12/10/21 4:06 PM, Georg Lehmann wrote:
On 10.12.21 14:21, Jacek Caban wrote:
On 12/10/21 10:26 AM, Rémi Bernon wrote:
On 12/10/21 03:07, Jacek Caban wrote:
To allow them being accessed from a struct.
Signed-off-by: Jacek Caban jacek@codeweavers.com
v2: make remaining direct calls more similar to __wine_unix_call
dlls/winevulkan/make_vulkan | 59 ++++++++++++++++++++----------------- 1 file changed, 32 insertions(+), 27 deletions(-)
Thanks, it indeed fixes the issue with Control DX12.
Now that it works I could measure that the series causes a ~25% fps drop in that same game, from an average of 165fps to 125fps measured with WINEDEBUG=+fps, while being steady near the beginning of the game.
That's interesting, it's worse than what I've seen in cases that seemed to be pretty bad examples wrt. an impact of those patches. I will look at it myself as well, but for comparison, could you please try the attached patch on top of the series? If it's similar to what I've seen so far, that should mitigate the problem.
This is related to a possibility of further optimizations. Majority of Vulkan calls are usually command buffer recording calls. We could record their parameters ourselves on PE side, with no syscall involved, and replay them on Unix side when needed (usually in vkEndCommandBuffer). That should cover majority of syscall overhead at expense of additional layer of command buffer recording.
In vkd3d-proton's case a big problem might be vkUpdateDescriptorSets, and unlike vkCmd* I don't see a way to avoid one unix call per vkUpdateDescriptorSets.
Yes, I was about to write that, vkUpdateDescriptorSets is the single most frequently used call in this case. If I disable syscalls for that, syscall overhead goes down drastically. I could adjust the patch to do direct call for that, but yeah, a solution like vkCmd* will not work here.
Thanks,
Jacek
One possible optimization that Paul and I had discussed a while back was to allow for the possibility of "masked" or "uninterruptible" syscalls, that is, syscalls that cannot be directly interrupted with SuspendThread(). The idea is that instead of saving the whole nonvolatile context up front, you instead set a flag in the syscall dispatcher (when SIGUSR1 is received) that tells the restore path to save the whole context and break into the suspend handler.
This would only be usable for calls which don't sleep (or don't sleep for a meaningful amount of time), but as I understand that could include most Vulkan calls, including vkUpdateDescriptorSets and other important ones.
On 12/10/21 18:10, Zebediah Figura wrote:
One possible optimization that Paul and I had discussed a while back was to allow for the possibility of "masked" or "uninterruptible" syscalls, that is, syscalls that cannot be directly interrupted with SuspendThread(). The idea is that instead of saving the whole nonvolatile context up front, you instead set a flag in the syscall dispatcher (when SIGUSR1 is received) that tells the restore path to save the whole context and break into the suspend handler.
This would only be usable for calls which don't sleep (or don't sleep for a meaningful amount of time), but as I understand that could include most Vulkan calls, including vkUpdateDescriptorSets and other important ones.
I think we should have that kind of optimization, and a better understanding of the xsavec64 issue and mitigation, before adding syscall overhead to more functions. Especially when we don't have a very clear view on on how they are used.
On 12/10/21 6:10 PM, Zebediah Figura wrote:
One possible optimization that Paul and I had discussed a while back was to allow for the possibility of "masked" or "uninterruptible" syscalls, that is, syscalls that cannot be directly interrupted with SuspendThread(). The idea is that instead of saving the whole nonvolatile context up front, you instead set a flag in the syscall dispatcher (when SIGUSR1 is received) that tells the restore path to save the whole context and break into the suspend handler.
This would only be usable for calls which don't sleep (or don't sleep for a meaningful amount of time), but as I understand that could include most Vulkan calls, including vkUpdateDescriptorSets and other important ones.
Note that it's not just SuspendThread() that would be affected, but also system APCs.
In cases like internal Vulkan syscalls we have a bit less compatibility constrains than in ntdll, because those are not syscalls called directly by applications. If xsave is causing a problem, we can put CPU into a state when it's cheaper. We discussed earlier doing vzeroupper, we could reset even more parts of the context to make xstor cheaper.
That, however, can help only with context store part of syscalls. There are other things like %fs base switching on the horizon.
Jacek
On 12/10/21 12:00, Jacek Caban wrote:
On 12/10/21 6:10 PM, Zebediah Figura wrote:
One possible optimization that Paul and I had discussed a while back was to allow for the possibility of "masked" or "uninterruptible" syscalls, that is, syscalls that cannot be directly interrupted with SuspendThread(). The idea is that instead of saving the whole nonvolatile context up front, you instead set a flag in the syscall dispatcher (when SIGUSR1 is received) that tells the restore path to save the whole context and break into the suspend handler.
This would only be usable for calls which don't sleep (or don't sleep for a meaningful amount of time), but as I understand that could include most Vulkan calls, including vkUpdateDescriptorSets and other important ones.
Note that it's not just SuspendThread() that would be affected, but also system APCs.
Right, system APCs would be delayed as well. The point is that regardless it wouldn't be holding anything up because it'd only be used for calls that complete immediately.
It's possible there's some problem that would prevent this from working, though. I can't say I've tried to implement it, or really thought through it exhaustively.
In cases like internal Vulkan syscalls we have a bit less compatibility constrains than in ntdll, because those are not syscalls called directly by applications. If xsave is causing a problem, we can put CPU into a state when it's cheaper. We discussed earlier doing vzeroupper, we could reset even more parts of the context to make xstor cheaper.
That, however, can help only with context store part of syscalls. There are other things like %fs base switching on the horizon.
Sure, but %fs switching is something that can't be avoided at all, as I understand, plus it's only relevant if using wow64 support.