https://bugs.winehq.org/show_bug.cgi?id=57442
Bug ID: 57442
Summary: Several applications: abnormal input delay with Wine
Product: Wine
Version: 9.21
Hardware: x86-64
OS: Linux
Status: UNCONFIRMED
Severity: enhancement
Priority: P2
Component: win32u
Assignee: wine-bugs@winehq.org
Reporter: ksmnvsg@gmail.com
Distribution: ---
I've noticed a slight input delay in one Unreal Engine sample and decided to test things with my own application (a very simple C++ program that uses SDL2 to poll mouse events and log them). I built this application on Linux with g++, and cross-compiled it for Windows with MinGW. The native Linux build has around 1.05 ms of input delay, while the Windows build running under Wine 9.21 has 9.5 ms. I believe you can build any application that polls events and test this for yourself, as I observed the same input lag in Unreal Engine samples (though not as severe there, since it's bounded by the Game Thread time).

The way I tested this is a bit complicated, but it's necessary for my goal: I used Sunshine as a server and Moonlight as a client, and logged when Sunshine receives the input from Moonlight. We can think of that as the kernel receiving the events, since Sunshine creates a virtual device and injects the input there. Then I logged when the application receives the event.

I checked the source code, and I believe the new input throttling mechanism introduced in version 9.13 is the reason it's so slow. If you re-implement the previous mechanism, based on message count, the input delay is almost zero. I wonder whether CPU usage would spike in real games, though.
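For reference, a minimal sketch of the kind of test application described above, assuming SDL2; this is an illustration of the approach, not the reporter's actual program:

// Minimal SDL2 event-polling latency probe (illustrative sketch).
// Timestamps logged here can be compared against the injection times
// logged on the Sunshine side to estimate delivery latency.
#include <SDL2/SDL.h>
#include <cstdio>

int main(int argc, char **argv)
{
    (void)argc; (void)argv;
    if (SDL_Init(SDL_INIT_VIDEO) != 0) return 1;
    SDL_Window *window = SDL_CreateWindow("input-probe",
        SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 640, 480, 0);

    bool running = true;
    while (running)
    {
        SDL_Event event;
        while (SDL_PollEvent(&event))
        {
            if (event.type == SDL_QUIT) running = false;
            else if (event.type == SDL_MOUSEMOTION)
            {
                // High-resolution timestamp in milliseconds.
                double now_ms = SDL_GetPerformanceCounter() * 1000.0
                                / SDL_GetPerformanceFrequency();
                printf("motion %+d%+d at %.3f ms\n",
                       event.motion.xrel, event.motion.yrel, now_ms);
            }
        }
    }
    SDL_DestroyWindow(window);
    SDL_Quit();
    return 0;
}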
William Horvath wine@horvath.blog changed:
What                        |Removed                     |Added
----------------------------------------------------------------------------
CC                          |                            |wine@horvath.blog
--- Comment #1 from William Horvath wine@horvath.blog --- Regression commit: 54ca1ab607d3ff22a1f57a9561430f64c75f0916 "win32u: Simplify the logic for driver messages polling."
I noticed this issue from a PeekMessage event loop like this:
while (running)
{
    while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
        if (msg.message == WM_QUIT) { running = false; break; }
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    if (running)
    {
        QueryPerformanceCounter(&current_time);
        double elapsed = get_ms_delta(last_check, current_time);
        if (elapsed >= 0.1) /* 10kHz */
        {
            process_key_state_changes();
            last_check = current_time;
        }
        Sleep(0);
    }
}
If driver events (e.g. mouse motion, keypresses) are sent at a frequency greater than 1000 Hz, the processing loop above only receives them at ~10 ms intervals on average, because most of them seem to be dropped entirely. If the events are sent at a frequency less than or equal to 1000 Hz, then there's barely any added delay (~0.1 ms) and all of them are received, same as before the regression commit.
I only tested with `GetKeyboardState` + an external program sending X11 key events at a constant rate, but reverting the mentioned commit allows even 0.25ms keypress intervals (8000hz) to all be received with minimal delay.
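For context, a minimal sketch of the kind of external sender described above; the actual tool used isn't named, so this is an assumed setup that injects X11 key events at a fixed rate through the XTest extension (build with -lX11 -lXtst):

// Hypothetical X11 key-event generator using XTest; the key, count,
// and interval are illustrative assumptions, not the original test setup.
#include <X11/Xlib.h>
#include <X11/keysym.h>
#include <X11/extensions/XTest.h>
#include <time.h>

int main()
{
    Display *display = XOpenDisplay(nullptr);
    if (!display) return 1;
    KeyCode keycode = XKeysymToKeycode(display, XK_a);

    // 0.25 ms between successive key events, matching the interval above.
    struct timespec interval = { 0, 250000 };
    for (int i = 0; i < 10000; i++)
    {
        // Alternate press/release so each event lands at the chosen interval.
        XTestFakeKeyEvent(display, keycode, (i & 1) == 0, CurrentTime);
        XFlush(display);
        nanosleep(&interval, nullptr);
    }
    XCloseDisplay(display);
    return 0;
}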
Rémi Bernon rbernon@codeweavers.com changed:
What                        |Removed                     |Added
----------------------------------------------------------------------------
CC                          |                            |rbernon@codeweavers.com
--- Comment #2 from Rémi Bernon rbernon@codeweavers.com --- The issue is probably coming from the X11 driver throttle that is in place, which only allows peeking at X11 events every millisecond. When events are received faster than that, the X11 driver will start queuing them, and eventually merge input events together before sending the input through wineserver to the application.

The input event merging is probably also adding more effective latency, but removing it without checking for X11 events more often will simply cause batches of events to be sent every 1 ms, and won't make much difference.

Removing the throttle would cause applications that use a pattern like this to storm X11 with XCheckIfEvent calls for nothing. This has been causing heavy system loads, and we want to prevent that. We could allow slightly more frequent calls, but I don't think we can remove the throttle entirely; we may need a different user driver design instead [*].
[*] I had the idea to move the X11 input event polling elsewhere, possibly in wineserver, but this would be a very large architectural change.
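To make that mechanism concrete, here is a rough sketch of the pattern being described, with hypothetical names and simplified logic; it is not Wine's actual winex11/win32u code:

// Illustrative sketch of a time-based poll throttle with motion merging;
// all names here are hypothetical, not Wine's implementation.
#include <chrono>
#include <vector>

struct MotionEvent { int dx, dy; };

static std::vector<MotionEvent> queued_motions;  // events seen since last flush
static std::chrono::steady_clock::time_point last_poll;

void on_x11_motion(int dx, int dy)
{
    queued_motions.push_back({ dx, dy });
}

// Called from the message loop; returns without touching X11 if the
// 1 ms throttle window hasn't elapsed yet.
void poll_driver_events()
{
    auto now = std::chrono::steady_clock::now();
    if (now - last_poll < std::chrono::milliseconds(1)) return;
    last_poll = now;

    // ... drain pending X11 events here (XCheckIfEvent etc.) ...

    // Merge everything queued since the last poll into a single input.
    // This is where extra effective latency can come from: many distinct
    // hardware events collapse into one message.
    MotionEvent merged = { 0, 0 };
    for (const MotionEvent &e : queued_motions)
    {
        merged.dx += e.dx;
        merged.dy += e.dy;
    }
    queued_motions.clear();
    // send_hardware_message(merged);  // hypothetical wineserver hand-off
}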
--- Comment #3 from ksmnvsg@gmail.com --- (In reply to Rémi Bernon from comment #2)
I've replaced the tick-based throttle with the one Wine had before 9.13 (based on message count), and the latency is gone. It's hard to tell whether my CPU usage increased, since my application is pretty small, but at least it didn't skyrocket; it would be better to test with a CPU-heavy game.

I also wanted to look into the X11 code, but then I figured that even if I did find something, rebuilding X11 would be a nightmare for me.
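For illustration, the two throttle styles being compared look roughly like this; the constants and names are hypothetical, not the actual pre-9.13 or current Wine code:

// Hypothetical comparison of the two throttle predicates discussed here.
#include <windows.h>

// Pre-9.13 style: poll the driver once every N peek_message calls.
bool should_poll_count_based()
{
    static unsigned int peek_count;
    return (++peek_count % 200) == 0;
}

// Post-9.13 style: poll the driver at most once per tick of a
// millisecond-granularity counter.
bool should_poll_time_based()
{
    static DWORD last_tick;
    DWORD now = GetTickCount();
    if (now == last_tick) return false;
    last_tick = now;
    return true;
}

Once peek_message itself becomes extremely cheap, 200 calls pass almost instantly, so the count-based variant barely throttles X11 polling at all; comment #6 below makes exactly this point.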
--- Comment #4 from William Horvath wine@horvath.blog --- (In reply to ksmnvsg from comment #3)
> I've replaced the tick-based throttle with the one Wine had before 9.13 (based on message count), and the latency is gone.
If I understand Rémi correctly, the same queuing/latency issues were already there before the commit that simplified the driver event check; they just started to present themselves in a different way.
Using a higher-frequency 4000 Hz counter (e.g. with `NtQueryPerformanceCounter`) instead of the commit's 1000 Hz `NtGetTickCount` restored the consistent measurements of the old implementation in my test, until I started sending events at more than 4000 Hz, which makes sense.

I think a small increase in the frequency like this would go a long way toward accommodating quite a few applications/setups where this sort of behavior is noticeable; but in any case, it doesn't look like an easy problem to fix properly.
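As a rough user-space illustration of that idea, using the documented QueryPerformanceCounter rather than the internal NtQueryPerformanceCounter, and an assumed 0.25 ms window (~4000 Hz):

// Sketch of a sub-millisecond, time-based throttle built on the
// high-resolution performance counter.
#include <windows.h>

bool should_poll_qpc_based()
{
    static LARGE_INTEGER last;
    static LARGE_INTEGER frequency;
    if (!frequency.QuadPart) QueryPerformanceFrequency(&frequency);

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);

    // Allow one driver poll per 0.25 ms window (frequency / 4000 ticks).
    if (now.QuadPart - last.QuadPart < frequency.QuadPart / 4000) return false;
    last = now;
    return true;
}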
--- Comment #5 from ksmnvsg@gmail.com --- (In reply to William Horvath from comment #4)
Sorry, I'm fairly new to all of this and don't quite understand what you mean. Are you saying this latency issue is still there regardless of the throttling method on Wine's side, and that it's the X11 driver's throttling that is to blame?
--- Comment #6 from Rémi Bernon rbernon@codeweavers.com --- We cannot really use the message count as a throttle anymore because of how peeking for messages has been optimized. Restoring it is pretty much the same as removing the throttle entirely, and it puts the load on the X server. It may not be that visible, but it definitely adds some load and makes a difference in various cases.

The calls to peek_message are now extremely fast: in the most common case it's just about checking bits in memory shared with wineserver, so the 200 message count throttle (actually more of a peek_message call counter) isn't very effective.

That limit would need to be increased accordingly, but a time-based throttle is more deterministic, and yes, NtQueryPerformanceCounter could be an option.

The proper way to support such high-frequency X11 input events is to wait for them instead of polling the way we do, but that requires another large architectural redesign.

The input events have to go through wineserver, and with very high-frequency input, receiving them in the application process just to route them through wineserver is inefficient, which is why I think it should instead be done in wineserver. There, they could be waited for, and it would save a lot of IPCs (replacing X -> app -> wineserver -> app with X -> wineserver -> app).
--- Comment #7 from ksmnvsg@gmail.com --- (In reply to Rémi Bernon from comment #6) I see, that makes sense. I'm working on a project that requires very low input latency and potentially a very high response speed, so I'm trying my best to optimize whatever I can, but it's hard to measure potential drawbacks like CPU usage. I can't control the applications (for my testing I just mimicked a potential application), so my only options are either redesigning Wine to use interrupts instead of polling, or reducing the number of IPCs, right?

Could you also please explain how input is rerouted from the application to wineserver, and then back to the application? I didn't really understand why or how, and to be fair, I don't really understand what wineserver does to begin with.
--- Comment #8 from Rémi Bernon rbernon@codeweavers.com --- You can see wineserver as the core of Wine's NT kernel implementation; it implements the various bits that need to work across processes, although the delimitation is not very well defined.

Input received from the X server is sent through wineserver because wineserver implements the Win32 hardware input dispatching logic, which may differ from the host's dispatching, and because there are plenty of things that can and are supposed to happen to an input in relation to other processes, like hooks, capture, or rawinput.

This is done through NtUserSendHardwareMessage calls, a generic function for submitting "hardware" input to be processed and dispatched. Later on, the input is received back by applications through the expected Win32 calls, which can be window messages (through NtUserPeekMessage / NtUserGetMessage), rawinput buffers (NtUserGetRawInputData / NtUserGetRawInputBuffer), low-level hooks (through the registered ll-hook procedures), etc.
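As an example of one of those receive paths, a minimal sketch of the standard Win32 rawinput path: register for mouse raw input, then read each WM_INPUT message in the window procedure (standard API, trimmed for illustration; the surrounding window class and message loop are omitted):

// Sketch of the Win32 raw input receive path.
#include <windows.h>

LRESULT CALLBACK wndproc(HWND hwnd, UINT msg, WPARAM wparam, LPARAM lparam)
{
    if (msg == WM_INPUT)
    {
        RAWINPUT raw;
        UINT size = sizeof(raw);
        if (GetRawInputData((HRAWINPUT)lparam, RID_INPUT, &raw, &size,
                            sizeof(RAWINPUTHEADER)) != (UINT)-1 &&
            raw.header.dwType == RIM_TYPEMOUSE)
        {
            // Relative motion as dispatched back from wineserver.
            LONG dx = raw.data.mouse.lLastX;
            LONG dy = raw.data.mouse.lLastY;
            (void)dx; (void)dy;  // log or process here
        }
    }
    return DefWindowProc(hwnd, msg, wparam, lparam);
}

void register_raw_mouse(HWND hwnd)
{
    // Usage page 0x01 (generic desktop), usage 0x02 (mouse).
    RAWINPUTDEVICE rid = { 0x01, 0x02, 0, hwnd };
    RegisterRawInputDevices(&rid, 1, sizeof(rid));
}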
--- Comment #9 from ksmnvsg@gmail.com --- (In reply to Rémi Bernon from comment #8) I think I got it. So, if I have an application with a separate thread made specifically for input polling (so that its input polling rate is very high), the only way to reduce latency without adding CPU usage is to put input handling on the wineserver side and use interrupts there? This sounds like a lot of work.
Austin English austinenglish@gmail.com changed:
What                        |Removed                     |Added
----------------------------------------------------------------------------
Keywords                    |                            |regression
Regression SHA1             |                            |54ca1ab607d3ff22a1f57a9561430f64c75f0916
Rémi Bernon rbernon@codeweavers.com changed:
What                        |Removed                     |Added
----------------------------------------------------------------------------
Resolution                  |---                         |FIXED
Status                      |UNCONFIRMED                 |RESOLVED
Fixed by SHA1               |                            |b5a4c2f64ad07b0aaeddc2d8245bc79ddb33b1f5
--- Comment #10 from Rémi Bernon rbernon@codeweavers.com --- Should be fixed after b5a4c2f64ad07b0aaeddc2d8245bc79ddb33b1f5
Alexandre Julliard julliard@winehq.org changed:
What                        |Removed                     |Added
----------------------------------------------------------------------------
Status                      |RESOLVED                    |CLOSED
--- Comment #11 from Alexandre Julliard julliard@winehq.org --- Closing bugs fixed in 10.0-rc2.