[Bug 59765] New: wineserver: performance regression in newer Wine versions (apparently related to Winsock/socket polling)
http://bugs.winehq.org/show_bug.cgi?id=59765 Bug ID: 59765 Summary: wineserver: performance regression in newer Wine versions (apparently related to Winsock/socket polling) Product: Wine Version: unspecified Hardware: x86-64 OS: Linux Status: UNCONFIRMED Severity: normal Priority: P2 Component: wineserver Assignee: wine-bugs@list.winehq.org Reporter: csoresz2@gmail.com Distribution: --- Created attachment 80967 --> http://bugs.winehq.org/attachment.cgi?id=80967 CPU usage on various Ubuntu versions Hello Wine developers, I would like to report what looks like a significant wineserver performance regression in newer Wine versions, apparently related to Winsock/socket polling. We run a Windows application stack under Wine on Linux. The architecture is roughly: - one main Windows service process running under Wine - many child worker processes launched by that service - each worker process receives RTSP video streams from IP cameras - the RTSP implementation uses live555 internally - the streams mostly use RTSP/RTP over TCP - a typical system runs around 50-60 worker processes under the same Wine prefix The same workload behaves very differently on two otherwise similar machines: Old system: Ubuntu 20.x + Wine 5.x New system: Ubuntu 26.x + Wine 11.8 On the old system, wineserver uses about 25% of one CPU core and the application runs well. On the new system, wineserver frequently reaches 100% of one CPU core and becomes a bottleneck. Basically anything above Ubuntu 20.x and Wine 5.x is using more CPU. Tested on roughly 50 Linux servers with exactly the same Windows application. See the attachment. Wine versions tested: - wine-5.0 (Ubuntu 5.0-3ubuntu1) - wine-9.0 (Ubuntu 9.0~repack-4build3) - wine-9.15 - wine-10.12 - wine-10.16 - wine-10.19 - wine-11.0 - wine-11.8 Ubuntu versions tested: - 20.04 - 22.04 - 24.04 - 25.10 - 26.04 Let's stick to the original lineup: Ubuntu 20.x + Wine 5.x VS Ubuntu 26.x + Wine 11.8 I collected strace -c data from the wineserver process on both systems under comparable load. On the newer Wine/Linux system, the wineserver syscall profile looks approximately like this: ioctl ~280,000 calls read ~250,000 calls write ~132,000 calls epoll_ctl ~142,000 calls getsockopt ~52,000 calls poll ~50,000 calls On the older Wine/Linux system, the same workload looks very different: write ~443,000 calls read ~462,000 calls epoll_wait ~18,000 calls epoll_ctl ~209 calls poll ~95 calls getsockopt ~6 calls The most striking difference is epoll_ctl: roughly 142,000 calls on the new system versus only about 200 on the old system. ioctl, getsockopt, and poll are also much higher on the newer Wine/Linux system. I also installed debug symbols and profiled wineserver with perf. The hot path on the new system is mostly: main_loop_epoll fd_poll_event thread_poll_event read_request call_req_handler req_ioctl poll_socket sock_poll_event sock_reselect sock_get_poll_events There is also time under: req_select select_on end_wait object_sync_satisfied async_satisfied async_set_result async_reselect sock_reselect_async sock_reselect sock_get_poll_events I also traced epoll_ctl calls. Almost all of them are EPOLL_CTL_MOD, not ADD/DEL: EPOLL_CTL_ADD 2 EPOLL_CTL_DEL 2 EPOLL_CTL_MOD 11625 The event masks are constantly being toggled, mostly between no events and readable/priority events: events=0 5842 events=EPOLLIN|EPOLLPRI 3711 events=EPOLLIN 1961 events=EPOLLOUT 100 events=EPOLLPRI 9 events=EPOLLPRI|EPOLLOUT 2 The affected file descriptors are mostly TCP sockets, including RTSP camera connections such as: TCP:[local_ip:port->camera_ip:554] This strongly suggests that the newer Wine version is doing much more wineserver-side socket re-selection / epoll event mask modification for the same Winsock workload. The application uses live555, so the socket behavior is probably based on normal Winsock select() / ioctlsocket() / recv() style logic. I tried several application-side experiments, including increasing the live555 select() timeout and changing scheduler behavior, but these did not produce a useful improvement. This makes me suspect the regression may be in the Wine socket / wineserver handling path rather than in the application logic itself. The main symptom is that the newer Wine version appears to generate a very large amount of: req_ioctl -> poll_socket -> sock_reselect -> sock_get_poll_events -> epoll_ctl(EPOLL_CTL_MOD) activity for many concurrent TCP RTSP streams, while Wine 5 on the older system does not. Could this be a known change or regression in wineserver socket polling, async socket handling, or Winsock select() / ioctlsocket() emulation between Wine 5 and newer Wine versions? If there is anything we can provide to help investigate this, please let us know. Thank you for your work on Wine. Best regards -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #1 from mcsordas <csoresz2@gmail.com> --- Created attachment 80968 --> http://bugs.winehq.org/attachment.cgi?id=80968 strace output for both systems -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 Zeb Figura <z.figura12@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |z.figura12@gmail.com --- Comment #2 from Zeb Figura <z.figura12@gmail.com> --- Usually wineserver per se isn't the bottleneck, only because RPC to wineserver is usually a bigger one. On old versions I believe an immediately satisfied recv() or send() takes only one server call; now two are necessary. Unfortunately avoiding these is hard. The fundamental problem is that I/O on the same socket has to be processed in the order that it's issued, which means we need something to be managing a queue, and since sockets can be cross-process, that needs to be the server. We need one server call to insert ourselves into the queue, and a second to remove ourselves after the I/O is satisfied. (Or, if we're not at the front of the queue, the first server call returns that status and we wait or return EWOULDBLOCK as appropriate.) Note that you can't special case nonblocking sockets, because you can still issue blocking I/O on a nonblocking socket using something like WSARecv(). In older versions of Wine, select() didn't take server RPC, but things have changed so now it's necessary for compatibility reasons. I don't remember the details offhand. I think an immediately satsifiable select takes one server call; one which has to wait takes two? Most applications aren't going to run into the corner cases that mattered, which is why things worked fine on 5.x. Most applications also don't care that much about socket I/O performance, I guess because the actual overhead of socket I/O is usually higher. If it's all loopback, I guess that might matter more. Unfortunately cutting the server out is going to be *very* hard. It'll take some thought. Still, it might be worth doing some more careful performance analysis on this, because the most obvious bottlenecks are also probably the hardest by far to remove. Counting calls can be misleading, so can perf, especially when there are multiple processes. I'd probably instead try to measure the time between cycles, and see how that's increased overall, as well as more fine grained differences e.g. how long select() takes including the sleep, how long send() takes, how long recv() takes... -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 Zeb Figura <z.figura12@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Component|wineserver |winsock Summary|wineserver: performance |Custom application using |regression in newer Wine |sockets performs worse on |versions (apparently |11.8 vs 5.0 |related to Winsock/socket | |polling) | Keywords| |performance, regression Version|unspecified |11.8 -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #3 from mcsordas <csoresz2@gmail.com> --- Hello Zeb, Thanks for your reply. We try to make some measures based on your advices. -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #4 from mcsordas <csoresz2@gmail.com> --- Hi, We added application-side timing around the Winsock calls used by the RTSP client: select(), recv(), and send(). The send() calls are negligible in our case, as expected. recv() is called frequently, and select() is also called frequently. We also measured the full live555 scheduler iteration, around BasicTaskScheduler::SingleStep(). One interesting result is that select() wall-clock time by itself does not show the regression clearly. The select() average time is in the same general range on Windows, Wine 5, and Wine 11.8, probably because the measurement includes normal waiting time. However, recv() shows a much larger difference. Native Windows system, about a 30-second interval: singlestep count=15751 avg=1904.36us over1ms=20.20% over5ms=13.03% over10ms=9.36% singlestep count=15335 avg=1955.60us over1ms=20.88% over5ms=13.45% over10ms=9.63% select count=15439 avg=1928.37us over1ms=20.02% over5ms=13.36% over10ms=9.59% recv count=32620 avg=1.13us over1ms=0.00% over5ms=0.00% over10ms=0.00% select count=16511 avg=1799.67us over1ms=19.10% over5ms=12.42% over10ms=8.90% recv count=35142 avg=1.29us over1ms=0.01% over5ms=0.00% over10ms=0.00% Wine 11.8 system, about a 30-second interval: singlestep count=14931 avg=2009.05us over1ms=55.91% over5ms=8.33% over10ms=5.75% singlestep count=13982 avg=2145.34us over1ms=57.36% over5ms=11.03% over10ms=7.67% select count=12104 avg=1407.79us over1ms=11.20% over5ms=8.88% over10ms=6.54% recv count=27682 avg=459.68us over1ms=0.04% over5ms=0.00% over10ms=0.00% select count=11466 avg=1665.23us over1ms=15.08% over5ms=11.71% over10ms=8.57% recv count=25371 avg=422.41us over1ms=0.03% over5ms=0.00% over10ms=0.00% Wine 5 system, about a 30-second interval: singlestep count=16461 avg=1821.97us over1ms=22.35% over5ms=16.72% over10ms=12.52% singlestep count=17383 avg=1725.04us over1ms=21.17% over5ms=15.73% over10ms=11.78% select count=17881 avg=1619.30us over1ms=20.14% over5ms=15.03% over10ms=11.28% recv count=36613 avg=23.89us over1ms=0.12% over5ms=0.00% over10ms=0.00% select count=16702 avg=1741.18us over1ms=21.58% over5ms=16.38% over10ms=12.27% recv count=33725 avg=21.25us over1ms=0.00% over5ms=0.00% over10ms=0.00% For a single stream, recv() is roughly: Windows: ~1.2 us average Wine 5: ~21-24 us average Wine 11.8: ~420-460 us average So compared to Wine 5, recv() is roughly 18-22x slower on Wine 11.8. Looking at the accumulated recv() time over roughly 30 seconds: Windows: ~0.037-0.045 seconds Wine 5: ~0.72-0.87 seconds Wine 11.8: ~10.7-12.7 seconds So even though almost all recv() calls are still below 1 ms individually, the accumulated cost becomes significant. The full live555 SingleStep() timing also changed. The average SingleStep() time is only moderately higher on Wine 11.8: Windows: ~1.90-1.96 ms Wine 5: ~1.73-1.82 ms Wine 11.8: ~2.01-2.15 ms However, the percentage of SingleStep() calls taking more than 1 ms is much higher on Wine 11.8: Windows: ~20-21% Wine 5: ~21-22% Wine 11.8: ~56-57% The per-call recv() cost is still small in absolute terms, but the call count is high enough that it adds up quickly. Please let us know if there is anything specific we should try or measure next. -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #5 from Zeb Figura <z.figura12@gmail.com> --- (In reply to mcsordas from comment #4)
For a single stream, recv() is roughly:
Windows: ~1.2 us average Wine 5: ~21-24 us average Wine 11.8: ~420-460 us average So compared to Wine 5, recv() is roughly 18-22x slower on Wine 11.8.
Hmm. I tested locally on this machine and I see roughly 10 ns with 5.0 vs 20 ns with 11.8. It's different hardware but I'd still expect to be able to reproduce something close to that 20x difference. What kind of parameters are you looking at, buffer size and whatnot? Is this just TCP loopback? -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #6 from mcsordas <csoresz2@gmail.com> --- (In reply to Zeb Figura from comment #5)
(In reply to mcsordas from comment #4)
For a single stream, recv() is roughly:
Windows: ~1.2 us average Wine 5: ~21-24 us average Wine 11.8: ~420-460 us average So compared to Wine 5, recv() is roughly 18-22x slower on Wine 11.8.
Hmm. I tested locally on this machine and I see roughly 10 ns with 5.0 vs 20 ns with 11.8. It's different hardware but I'd still expect to be able to reproduce something close to that 20x difference. What kind of parameters are you looking at, buffer size and whatnot? Is this just TCP loopback?
Hi, Thanks for checking. One important detail: these measurements were taken on a production-like/live system where around 50 streams were running concurrently. The timing data I sent is from one stream/client, but it was not measured in isolation. The system was under the normal multi-stream workload, with wineserver already under high load. Although I only included timing data from one stream, the other streams showed similar numbers. This is not TCP loopback. These are real RTSP-over-TCP camera streams over the network, using live555. The connections are typically to remote camera endpoints, mostly on port 554. I also checked the receive buffer sizes passed to recvfrom() in this live555 code path. The pattern is mostly small reads, for example: bs: 1452 bs: 1 bs: 1 bs: 1 ... The 1452-byte reads appear frequently, and there are also many 1-byte reads. This looks like the normal RTSP-over-TCP / interleaved stream parsing path. So the numbers are not from an isolated recv() microbenchmark. They are measured around the actual Winsock recvfrom() calls in the live555-based RTSP-over-TCP receive path, while many other similar streams are active at the same time. Please let us know if there is a specific parameter or test setup you would like us to try. -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #7 from mcsordas <csoresz2@gmail.com> --- Hello, We continued testing on a live RTSP-over-TCP workload with 43 concurrent streams. Without any application-side optimizations, wineserver CPU was around 95%. We then tried a few application-side workarounds to reduce the number of real Winsock calls reaching Wine. These are not fully clean or generally reliable solutions, but the results are interesting: Reducing real recv() calls with application-side buffering lowered wineserver CPU from ~95% to ~75%. About 80% of recv() calls were served from the application buffer. Then reduced real select() calls by using application-side readiness emulation when buffered data was already available. About 60% of select() calls were avoided/emulated. This lowered wineserver CPU further to ~63%. Keeping those changes and adding a small 1 ms delay before the remaining real select() calls lowered wineserver CPU to ~50%. This seems to support the idea that the bottleneck is strongly tied to the number of socket I/O / readiness operations that reach wineserver. However, these are application-side hacks involving buffering, readiness emulation, and throttling. They work in our tests, but they are not a clean or generally safe solution. -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59765 --- Comment #8 from Zeb Figura <z.figura12@gmail.com> --- Thanks for the information. I have an idea, but it'll take some time to implement. Frankly, it's probably not a bad idea to do some client side modifications at all possible. 1-byte reads are never going to perform well. -- Do not reply to this email, post in Bugzilla using the above URL to reply. You are receiving this mail because: You are watching all bug changes.
participants (1)
-
WineHQ Bugzilla