On Mon Oct 31 15:20:31 2022 +0000, Jacek Caban wrote:
> Do you know the performance impact of this? While we will need to take some hit, it would be good to know how much, and perhaps have some plans for mitigations. We still have direct calls in winevulkan that we will need to get rid of (either by optimizing the syscall thunks further or by batching command buffers). For OpenGL, batching is more tricky, and there is a hypothesis that OpenGL generally requires more calls (so more syscall thunking). We may need to live with it, but it would be good to have some data instead of speculations for better judgement.
Well, this MR specifically should have very little effect, as it's not going through the syscall dispatcher yet.
Anyway, I don't have many numbers, but I ran a few tests with the Unigine Valley / Heaven benchmarks, running at Low settings and 1280x720 to try to make sure they are CPU-bound. This may not be very representative of the variety of games out there, but it's a starting point.
With current master (avg FPS / score / perf top highest hitter CPU %):
```
* Valley GL:    167 / 6987 / ~2% in Mesa
* Valley D3D9:  129 / 5388 / ~25% in wined3d_cs_run
* Heaven GL:    319 / 8032 / ~2% in Mesa
* Heaven D3D11: 113 / 2833 / ~15% in wined3d_device_context_emit_map + ~15% in wined3d_cs_mt_finish
```
With the OpenGL32 PE conversion from https://gitlab.winehq.org/wine/wine/-/merge_requests/1010:
```
* Valley GL (PE):    147 / 6127 / ~5-10% in __wine_syscall_dispatcher
* Valley D3D9 (PE):  132 / 5520 / ~15% in wined3d_cs_run + ~5-10% in __wine_syscall_dispatcher + ~5% in wined3d_device_context_emit_map
* Heaven GL (PE):    263 / 6645 / ~5-10% in __wine_syscall_dispatcher
* Heaven D3D11 (PE): 112 / 2820 / ~15% in wined3d_device_context_emit_map + ~10% in wined3d_cs_emit_present + ~5-10% in __wine_syscall_dispatcher + ~5% in wined3d_cs_mt_finish
```
I also quickly checked with the WINEWOW / wow64 support, and in GL mode the results are surprisingly similar to the win32 results, though I'm not sure how it copes with the wow64 buffer mapping.
The wow64 D3D results were OTOH completely horrible and rendering was broken, but that's probably because of some issues in my wow64 thunks, or caused by the buffer map copies; the sketch below shows the kind of copy I suspect is involved.
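For illustration only, this is roughly what such a mapping copy could look like. This is a hypothetical sketch, not actual Wine code: `wow64_glMapBuffer`, `alloc_low_mem` and `record_mapping` are made-up names, and GL prototype loading is elided.

```c
#include <string.h>
#include <stdint.h>
#include <GL/gl.h>
#include <GL/glext.h>   /* glMapBuffer / GL_BUFFER_SIZE; prototypes assumed available */

/* Hypothetical helpers, not Wine APIs: alloc_low_mem() would return 32-bit
 * addressable memory (e.g. NtAllocateVirtualMemory with zero_bits), and
 * record_mapping() would remember the pair so glUnmapBuffer can copy the
 * data back to the host pointer before really unmapping. */
void *alloc_low_mem( GLint size );
void record_mapping( GLenum target, void *host_ptr, void *low_ptr, GLint size );

void *wow64_glMapBuffer( GLenum target, GLenum access )
{
    void *host_ptr = glMapBuffer( target, access );
    void *low_ptr;
    GLint size;

    /* the host driver may return a pointer above 4GB, which the 32-bit
     * guest cannot address, so the data has to be mirrored into low memory */
    if (!host_ptr || !((uintptr_t)host_ptr >> 32)) return host_ptr;

    glGetBufferParameteriv( target, GL_BUFFER_SIZE, &size );
    low_ptr = alloc_low_mem( size );
    if (access != GL_WRITE_ONLY) memcpy( low_ptr, host_ptr, size );
    record_mapping( target, host_ptr, low_ptr, size );
    return low_ptr;
}
```

If that kind of round trip happens on every map, it would explain both the broken rendering (if the copy-back is wrong) and the horrible D3D numbers.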
FWIW I tried various tweaks to the syscall dispatcher, and all the FPU saving modes give roughly the same results. We can get a significant difference by:
1) Not saving the FPU state (a nop instead of xsavec reduces the dispatcher CPU usage down to 3-5%),
2) Avoiding the `rep movs` argument copy, and instead using something like https://gitlab.winehq.org/wine/wine/-/merge_requests/1074/diffs?commit_id=bb... (this further reduces the CPU usage down to 1-2%, possibly spreading it out, but still improving FPS in the benchmarks); a conceptual sketch of the idea follows below.
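To illustrate the second point: the actual change lives in the dispatcher assembly, but conceptually it replaces the generic `rep movs` (which has a significant startup cost for the small, bounded argument counts involved) with size-specialized unrolled moves. A made-up C equivalent, `copy_args_unrolled`, not Wine code:

```c
/* Conceptual C version of avoiding "rep movs" for the syscall arguments:
 * counts are small and known, so a computed jump into unrolled moves
 * avoids the rep movs startup latency. */
static void copy_args_unrolled( unsigned long long *dst,
                                const unsigned long long *src,
                                unsigned int count )
{
    switch (count)
    {
    case 8: dst[7] = src[7]; /* fall through */
    case 7: dst[6] = src[6]; /* fall through */
    case 6: dst[5] = src[5]; /* fall through */
    case 5: dst[4] = src[4]; /* fall through */
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    case 0: break;
    }
}
```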
IMHO trying to do some batching is risky, at the very least from the latency perspective, which is something games are very sensitive to. The host graphics drivers already go to great lengths to do that kind of thing internally in an optimal way, and I don't think we should add another layer.
Instead I think we should have a per-thread flag indicating whether we really need to save / restore the FPU state entirely (or just the ABI xmm registers). Then we should be able to enable that flag for any perf-critical, Wine-internal thread, such as the D3D ones, and provide a custom entry point for third parties such as DXVK to do the same for their internal threads.
If some games actually rely on the entire FPU state being saved and restored across syscalls, even for Wine-internal threads (like if some DRM somehow manages to check that, or when running under a debugger), we should have an optional global flag that forces it, but it should not be the default. Roughly, the idea looks like the sketch below.
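A sketch only: none of these names exist in Wine, and the real check would live in the dispatcher assembly rather than C.

```c
struct syscall_frame;   /* the existing dispatcher frame */

/* hypothetical per-thread field and helpers */
struct thread_data { unsigned int syscall_fpu_mode; };
extern struct thread_data *get_thread_data(void);
extern void save_full_xstate( struct syscall_frame *frame );  /* current xsavec path */
extern void save_abi_xmm( struct syscall_frame *frame );      /* stores xmm6-xmm15 only */

#define SYSCALL_FPU_SAVE_FULL     0   /* default: save the entire FPU state */
#define SYSCALL_FPU_SAVE_ABI_XMM  1   /* only the xmm regs the Windows x64 ABI preserves */

static int force_full_fpu_save;       /* optional global override, off by default */

/* made-up entry point that Wine-internal threads, or third parties like
 * DXVK for their own threads, could call */
void __wine_set_syscall_fpu_mode( unsigned int mode )
{
    get_thread_data()->syscall_fpu_mode = mode;
}

/* what the dispatcher would conceptually do on syscall entry */
static void save_fpu_state( struct syscall_frame *frame )
{
    if (force_full_fpu_save || get_thread_data()->syscall_fpu_mode == SYSCALL_FPU_SAVE_FULL)
        save_full_xstate( frame );
    else
        save_abi_xmm( frame );
}
```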