http://bugs.winehq.org/show_bug.cgi?id=59027
Bug ID: 59027
Summary: "Rebased" NTSync is broken: massive performance regression
Product: Wine
Version: 10.19
Hardware: x86-64
OS: Linux
Status: UNCONFIRMED
Severity: major
Priority: P2
Component: -unknown
Assignee: wine-bugs@list.winehq.org
Reporter: virtuousfox@gmail.com
Distribution: ---
Last time I had good performance in wine it was wine-staging-1.10 with the original NTSync patch (MR 7226). But after it was "rebased" in smaller chunks and officially adopted, it seems as if performance is even worse than before it existed (possibly due to losing esync too). Is it still not fully merged or something, left broken in a half-finished state?
This is evident in the biggest offender I've found: https://bugs.winehq.org/show_bug.cgi?id=54693 - Freedom Planet 2 (and its demo) is back to 20 fps (it should have no problem reaching 200 even with CPU-only rendering, like vulkan:llvmpipe). I also see that in Dishonored 2 fps is often stuck at around 20-40 as well (previously: 50-75) while the GPU is underloaded at 50-75% and the 12-core CPU is at <10%. At least it's not eating up 70% of all CPU cores, like it did before (or was that only esync's thing?).
But /dev/ntsync has 666 permissions and I don't see any obvious errors or warnings. Perhaps it's silently ignored altogether, or there is some other massive regression.
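For anyone reproducing this, the device-node side of that statement can be checked with a small script. This is only a sketch: `check_sync_device` is a hypothetical helper name, and it verifies nothing more than that the node exists, is a character device, and is readable and writable by the current user. It does not show whether wine actually uses the device.

```shell
# Report whether a sync device node exists and is usable by the current user.
# check_sync_device is an illustrative helper name, not part of Wine.
check_sync_device() {
    dev="${1:-/dev/ntsync}"
    if [ ! -e "$dev" ]; then
        echo "missing: $dev"
        return 1
    fi
    if [ ! -c "$dev" ]; then
        echo "not a character device: $dev"
        return 1
    fi
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "usable: $dev"
    else
        echo "no read/write access: $dev"
        return 1
    fi
}

# Check the node mentioned in this report; "|| true" keeps the script going on failure.
check_sync_device /dev/ntsync || true
```

A "usable" result only rules out a permissions problem; it says nothing about performance.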
Tested recently with dxvk+app-emulation/vkd3d-proton using
DXVK_HUD="devinfo,fps,frametimes,submissions,drawcalls,pipelines,memory,gpuload,api,scale=1.2"
but wine's native rendering with mesa's overlay should show the same, last time I checked. Mesa overlay can be used via:
VK_INSTANCE_LAYERS="VK_LAYER_MESA_overlay"
VK_LOADER_LAYERS_ENABLE+=",VK_LAYER_MESA_overlay"
VK_LAYER_MESA_OVERLAY_CONFIG="fps_sampling_period=80,width=480,position=top-left,submit,draw,pipeline_graphics,vert_invocations,geom_invocations,clip_invocations,frag_invocations,tess_eval_invocations,compute_invocations"
If everything works well, either your fps will be capped at the maximum or you should see CPU/GPU compute load or RAM/VRAM usage at near-100%, being the bottleneck. Otherwise, the system is underutilized due to bad timing of something. Here, the timing is particularly bad.
--- Comment #1 from FoX virtuousfox@gmail.com ---
After trying to figure this out for months I've just stumbled on a massive breakthrough: it appears that all sync methods in both wine and proton are severely crippled by threading - the more cores they get, the worse they perform but they always try to get all cores.
In place where I get 24-26 fps with NTsync on current wine-staging, I've tried:
1) WINE_CPU_TOPOLOGY=2 wine-proton FP2.exe
2) taskset -c 2-3 wine FP2.exe
3) WINE_CPU_TOPOLOGY=4 wine-proton FP2.exe
4) taskset -c 2-5 wine FP2.exe
The results are astonishing:
1) 110-120 fps
2) 55-65 fps
3) 50-60 fps
4) 24-26 fps
Meaning that 2 threads (1 core) was the sweet spot, despite that single core being maxed out on load. I have 12 cores, so you can imagine how bad it is by default. Ironically, proton aced the test in the end but it started with the worst results by default: 9 fps with default sync and <5 fps for esync & fsync.
However, limiting all wine processes and the apps themselves is a bad workaround in general. At the very least, there should be a way to limit only the sync processes. Even pinning the entire sync machinery onto a single thread by default does not seem like a bad idea.
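The taskset comparison above can be repeated over more CPU counts with a small helper. This is a minimal sketch, assuming the same first CPU index (2) used in the commands above; `cpu_range` is a hypothetical helper name, and the loop only prints the commands so that each run can be started by hand and its fps read from the HUD.

```shell
# Build a taskset CPU list such as "2-5" from a first CPU index and a CPU count.
# cpu_range is an illustrative helper name.
cpu_range() {
    first="$1"
    count="$2"
    if [ "$count" -le 1 ]; then
        # A single CPU is written as a bare index.
        echo "$first"
    else
        echo "${first}-$((first + count - 1))"
    fi
}

# Print one command per CPU count; FP2.exe is the executable named in this report.
for count in 1 2 4 8; do
    echo "taskset -c $(cpu_range 2 "$count") wine FP2.exe"
done
```

With counts 2 and 4 this yields the same `taskset -c 2-3` and `taskset -c 2-5` invocations tested above.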
Zeb Figura z.figura12@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
CC      |        |z.figura12@gmail.com
--- Comment #2 from Zeb Figura z.figura12@gmail.com ---
(In reply to FoX from comment #1)
> After trying to figure this out for months I've just stumbled on a massive breakthrough: it appears that all sync methods in both wine and proton are severely crippled by threading - the more cores they get, the worse they perform but they always try to get all cores.
> In place where I get 24-26 fps with NTsync on current wine-staging, I've tried:
> - WINE_CPU_TOPOLOGY=2 wine-proton FP2.exe
> - taskset -c 2-3 wine FP2.exe
> - WINE_CPU_TOPOLOGY=4 wine-proton FP2.exe
> - taskset -c 2-5 wine FP2.exe
This doesn't make any sense. Sync methods don't, by themselves, "try to get all cores". Applications might, but that shouldn't make ntsync worse.
I also can't reproduce these results. I have a fairly high-powered computer, and with ntsync I reach even the highest FPS limit available (288 FPS). But without ntsync, performance gets worse, and if I limit it to 2 cores with taskset, performance gets worse still. That's more or less what I'd expect.
Can you please test with unmodified upstream non-staging wine, in a fresh prefix, without any external components including dxvk?
--- Comment #3 from FoX virtuousfox@gmail.com ---
(In reply to Zeb Figura from comment #2)
> This doesn't make any sense. Sync methods don't, by themselves, "try to get all cores". Applications might, but that shouldn't make ntsync worse.
This is what doesn't make sense. No matter the application, all cores are always used in the background, judging by the core utilization graph in gkrellm. I doubt that every single Windows game has something explicitly coded to scale only synchronization across all cores but nothing else.
> I also can't reproduce these results. I have a fairly high-powered computer, and with ntsync I reach even the highest FPS limit available (288 FPS). But without ntsync, performance gets worse, and if I limit it to 2 cores with taskset, performance gets worse still. That's more or less what I'd expect.
Good for you but I've never seen such magic. And I did say that performance is worse without ntsync. It's just still bad with it (proton is an outlier in this). But I'm 90% sure performance in wine-staging was decent with the original ntsync merge request. The remaining 10% is the chance that I'm misremembering, since it was still way better than the complete slideshow without ntsync and core-limiting.
Make no mistake, in some scenes some games can also reach high fps for me. But when they are affected by this, it tanks hard. It took me a few levels to reach one where Freedom Planet 2 is comically slow.
> Can you please test with unmodified upstream non-staging wine, in a fresh prefix, without any external components including dxvk?
Did that, but I had to at least switch to the vulkan renderer, as using the default opengl one hung the whole of wine when I tried loading GALLIUM_HUD (which is opengl-only), so badly that I had to use `wineboot -k -f -e` to get wine unstuck. With vanilla wine's native vulkan renderer, performance is almost exactly the same as wine-staging with dxvk but way more stuttery. Same core load distribution too.
--- Comment #4 from FoX virtuousfox@gmail.com ---
Also it appears that kernel scheduling tuning affects fps significantly, likely due to the latency of thread switching. I noticed that on the last run the boost from taskset was too low; it turned out that tuned had silently failed to apply the profile on boot. After forcing it, fps returned to the previously stated values.
I suspect that these settings influence fps by up to 50-60% when constrained by taskset:
[cpu]
load_threshold=0.33
latency_low=1
latency_high=999
pm_qos_resume_latency_us=200
governor=schedutil
energy_perf_bias=performance
energy_performance_preference=performance
sampling_down_factor=3
min_perf_pct=63
[sysfs]
/sys/kernel/debug/sched/min_granularity_ns=2000
/sys/kernel/debug/sched/idle_min_granularity_ns=1000000
/sys/kernel/debug/sched/latency_ns=500000
/sys/kernel/debug/sched/wakeup_granularity_ns=1000
/sys/kernel/debug/sched/tunable_scaling=0
/sys/kernel/debug/sched/migration_cost_ns=4000
/sys/kernel/debug/sched/nr_migrate=1
/sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us=50
/sys/block/nvme*n*/queue/scheduler=kyber
/sys/block/nvme*n*/queue/nr_requests=512
/sys/block/nvme*n*/queue/max_sectors_kb=2048
/sys/block/nvme*n*/queue/read_ahead_kb=16384
/sys/block/nvme*n*/queue/rq_affinity=2
[sysctl]
kernel.sched_autogroup_enabled=0
kernel.sched_cfs_bandwidth_slice_us=1000
kernel.sched_deadline_period_max_us=100000
kernel.sched_deadline_period_min_us=1000
kernel.sched_rt_runtime_us=500000
kernel.sched_rt_period_us=1000000
kernel.sched_rr_timeslice_ms=2
kernel.sched_util_clamp_max=1000
kernel.sched_util_clamp_min=850
kernel.sched_util_clamp_min_rt_default=975
vm.admin_reserve_kbytes=262144
vm.compaction_proactiveness=9
vm.dirty_ratio=24
vm.dirty_background_ratio=16
vm.vfs_cache_pressure=133
vm.swappiness=66
vm.page-cluster=1
vm.watermark_scale_factor=333
My kernel is built with: CONFIG_PREEMPT_LAZY=y CONFIG_PREEMPT_RT=y
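Since the profile silently failed to apply once, it may be worth confirming the live values before each measurement. `tuned-adm active` shows the current profile and `tuned-adm verify` compares it against the system settings. The sketch below spot-checks individual values by hand. `check_setting` is a hypothetical helper name, the two paths are taken from the settings listed above, and reading the debugfs entry assumes root access with debugfs mounted.

```shell
# Compare one procfs/sysfs value with the value the profile is supposed to set.
# check_setting is an illustrative helper name, not a tuned command.
check_setting() {
    path="$1"
    expected="$2"
    if [ ! -r "$path" ]; then
        echo "unreadable: $path"
        return 1
    fi
    actual="$(cat "$path")"
    if [ "$actual" = "$expected" ]; then
        echo "ok: $path=$actual"
    else
        echo "mismatch: $path=$actual (expected $expected)"
        return 1
    fi
}

# Spot-check two of the settings; "|| true" keeps going after a failed check.
check_setting /proc/sys/kernel/sched_autogroup_enabled 0 || true
check_setting /sys/kernel/debug/sched/migration_cost_ns 4000 || true
```

A "mismatch" or "unreadable" line before a benchmark run would suggest the fps numbers are not comparable with the earlier ones.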