On Mon, 16 Nov 2020, Francois Gouget wrote: [...]
So rather than just apply these patches and blindly increase the margin factor I dug some more to try to understand what is really going on. I'll try to get my theories and findings in shape in my next email.
So my understanding is that on Windows things look like this:
App -> Muxer -> driver -> soundcard
But in QEmu the sound card is in fact something like this:
QEmu -> Spice Server --network--> Spice client -> PulseAudio -> driver -> soundcard
And in particular the Spice server to client communication requires some buffering to deal with the network jitter. So I suspected that the 'QEMU soundcard' would initially consume some audio to fill its buffer and then resume a normal rate. So I tweaked the test so it looked like this:
    Start()
    Sleep(25)
    last = GetPosition()
    Sleep(100)
    pos = GetPosition()
    compare pos - last to the expected value for 100 ms.
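In terms of actual mmdevapi calls this roughly corresponds to the sketch below. This is only an illustration: error handling is omitted, 'client' and 'clock' stand for an already initialized IAudioClient and its IAudioClock with the render buffer pre-filled, ok() is the usual test macro, and the real render.c code differs in the details.

    UINT64 freq, last, pos;
    DWORD t1, t2;

    IAudioClock_GetFrequency(clock, &freq);   /* position units per second */

    IAudioClient_Start(client);
    Sleep(25);                                /* initial ramp-up */
    IAudioClock_GetPosition(clock, &last, NULL);
    t1 = GetTickCount();

    Sleep(100);
    IAudioClock_GetPosition(clock, &pos, NULL);
    t2 = GetTickCount();

    /* Convert the position delta to milliseconds and compare it to the
     * time actually slept (nominally 100 ms), with the 1.4 margin factor. */
    ok((pos - last) * 1000 / freq <= (t2 - t1) * 1.4,
       "position advanced more than 1.4x the elapsed time\n");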
The theory was that QEmu would fill its buffer during the initial 25 ms sleep and would play at a regular rate during the next 100 ms. That was indeed a bit better, but not by much. Increasing the initial sleep to 50 ms improved things a bit more.
So I tweaked the test further to map out the effect of the initial sleep (a rough sketch of the resulting loop follows the list):
* Increase the initial sleep (called RampUpDelay) from 0 to 200 ms in 5 ms increments.
* Run 40 tests for each RampUpDelay value.
* Trace the position in milliseconds (Pos0) at the end of the initial sleep. Also trace the actual time spent sleeping (Slept0).
* Then let the stream play for another 100 ms and measure how long we actually slept (Slept).
* Get the final position (Pos) and trace the Pos / Slept ratio, which we expect to be below 1.1 (or at the very least below 1.4).
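The attached patch is what actually implements this; the sketch below merely illustrates the general shape of the loop, reusing the client/clock setup from the previous snippet. The trace format, the use of GetTickCount() and the exact ratio computation are my shorthand, not the patch's code.

    UINT64 freq, pos0, pos;
    DWORD t0, t1, t2;
    int rampupdelay, i;

    IAudioClock_GetFrequency(clock, &freq);

    for (rampupdelay = 0; rampupdelay <= 200; rampupdelay += 5)
    {
        for (i = 0; i < 40; i++)
        {
            /* ... (re)start the stream and pre-fill its buffer here ... */
            t0 = GetTickCount();
            Sleep(rampupdelay);                          /* RampUpDelay */
            IAudioClock_GetPosition(clock, &pos0, NULL);
            t1 = GetTickCount();                         /* Slept0 = t1 - t0 */

            Sleep(100);
            IAudioClock_GetPosition(clock, &pos, NULL);
            t2 = GetTickCount();                         /* Slept = t2 - t1 */

            trace("rampup=%d slept0=%u pos0=%u slept=%u ratio=%.2f\n",
                  rampupdelay, (unsigned)(t1 - t0),
                  (unsigned)(pos0 * 1000 / freq), (unsigned)(t2 - t1),
                  (pos - pos0) * 1000.0 / freq / (t2 - t1));
            /* ... stop and reset the stream here ... */
        }
    }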
Then I parsed the log to produce graphs that show the effect of the RampUpDelay on Pos0 - Slept0 and on the Pos / Slept ratio.
I attached the corresponding patch which should be applied on top of the ones in the previous emails. You will also find all the data and graphs derived from these tests in the spreadsheet below:
http://fgouget.free.fr/tmp/mmdevapi-render.xls
And now here is an analysis of the results. Open the image attachments to follow along.
* Win10 on real hardware: cw-gtx560 + ich6 (Fig: render-cw.png)
- The first test configuration is what happens with real hardware, here the cw-gtx560 machine that runs WineTest daily. It has a very standard ich6 integrated audio soundcard which happens to match the sound card emulated by QEmu (ich6 or ich9).
- The top graph shows the Pos0-Slept0 values when playing audio straight to the soundcard, that is when share=0. In that case Pos0-Slept0 is essentially 0. The way I understand it is that the Windows API queries the soundcard driver to report the soundcard's playback position within the soundcard buffer.
- The other thing to notice is that although RampUpDelay is increased in 5 ms increments the top graph shows that the Slept0 values are really spread around. That proves Sleep() has at least a 5 ms jitter. In fact we already know from my past timers email that it's closer to 15 ms.
- The second graph shows the Pos0-Slept0 values again but when going through the Windows muxer, i.e. the equivalent of PulseAudio.
- This time Pos0-Slept0 stays close to -10 ms which presumably corresponds to the increments in which the muxer mixes the audio from the various sources. And there's a bit more variation (0 to -20 ms) when RampUpDelay is below 20 ms, most likely due to the muxer's buffering.
- There are also a few stray points near the -30 ms mark. Maybe these correspond to scheduling issues (cw-gtx560 was obviously idle but there may still have been some Windows services doing their thing, although most are normally disabled).
- Also there is less spread of the Slept0 values in the share=1 case: instead of a continuous blur one sees short, almost vertical lines. That contradicts the previous assertion about Sleep() jitter. Maybe the difference is that the Windows muxer changed the global Windows timer resolution; see timeBeginPeriod() and NtSetTimerResolution() (a small sketch follows at the end of this section) and the previous timer discussion: https://www.winehq.org/pipermail/wine-devel/2020-March/162846.html
On a related note, see this blog post which implies the timer management changed a lot in Windows 10 2004. All QEmu tests were done with Windows 10 2009 (aka 20H2) but the cw-gtx560 tests were run on Windows 10 <= 1909, so a global timer resolution change by the muxer could have impacted the mmdevapi_test process. https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-gr...
- The last graph shows that in the share=1 case the ratio stays very close to 1.0, except when RampUpDelay is less than 20 ms (i.e. the current test configuration) where it can be as low as 0.9 and as high as 1.1. These are probably the effect of the Windows muxer's initial buffering.
- In the share=0 case the ratio stays very close to 1.0 all along so I did not attach the corresponding graph.
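For reference, raising the timer resolution from a process looks roughly like this. That the muxer does something along these lines is only my guess, and the 1 ms value is just an example.

    #include <windows.h>
    #include <mmsystem.h>   /* timeBeginPeriod/timeEndPeriod, link with winmm */

    /* Ask for a 1 ms timer resolution around some timing-sensitive work.
     * At least before Windows 10 2004 this is effectively a system-wide
     * setting, so another process (e.g. the audio engine) doing this also
     * changes the Sleep() granularity seen by the test process. */
    static void timed_work(void)
    {
        timeBeginPeriod(1);
        /* ... Sleep() now wakes up with roughly 1 ms granularity ... */
        timeEndPeriod(1);
    }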
* Win10 on QEmu 5.0 + ich9 + localhost Spice client (Fig: render-spice-local.png)
- The test gives 500 ms worth of data to mmdevapi. But what we see in the top graph is that mmdevapi does not instantly slurp 40 ms worth of data. Rather Pos0-Slept0, i.e. the amount that is buffered, increases progressively, which means the audio is being pulled at a somewhat higher than expected playrate.
- Then after about 100 ms Pos0-Slept0 stabilizes between 25 and 45 ms. So in effect once the QEmu 'soundcard' has filled its buffer the playrate goes back to normal.
- Note how the above actually matches the 1.4 ratio we have had to use in the mmdevapi:render test.
- The graph in the middle is the same as the top one but indexed on Slept0, i.e. the real sleep wall time, rather than on RampUpDelay, i.e. the parameter passed to Sleep(). It is at once similar to and yet different from the graph above. In particular one can clearly see that initially Pos0-Slept0 is quantized. That probably has something to do with the audio initially being sent to the Spice client in chunks of the maximum size allowed by the Spice protocol. Then once the buffering target is reached, only enough data to refill the ~40 ms buffer is sent, and that is more dependent on the actual time intervals and scheduling of the Spice server and client; hence the more random distribution of points.
- The graph at the bottom shows the ratio. What it shows is that initially the playrate is 40% higher than it should be, which corresponds to the 40 ms of extra buffering: over the 100 ms measurement interval the position advances by roughly 100 + 40 = 140 ms, hence a 1.4 ratio. Then the ratio falls off progressively as more and more of our measurement interval is performed when the 40 ms buffer is already full.
- Once the 40 ms buffer is full, i.e. after the initial 100 ms, the ratio is very consistently between 1.0 and 1.1.
- The graphs when share=1 are very similar so I did not include them. The notable differences are that Pos0-Slept0 stays in a narrower band (25 to 40 ms), and that it takes an extra 20 ms for the Pos/Slept ratio to fall safely within the 1.0 to 1.1 range.
* Win10 on QEmu 3.1 + ich9 + remote Spice client
- If the buffering is meant to cover the network jitter, then more buffering would be needed for a remote client with a higher RTT, and presumably proportionally higher network jitter.
- So I repeated the test over a transatlantic connection with a 104 ms RTT with little jitter (rtt min/max/mdev = 103.934/104.858/0.253 ms).
- The Pos0-Slept0 graph did show a slightly increased amount of buffering in the share=0 case, but not by much. The Windows muxer still seems to insulate the tests from the buffering jitter and that graph did not change.
- Because there's not much change I did not include these graphs.
* Win10 on QEmu 5.0 + ich9 + no Spice client (Fig: render-spice-noclient.png)
- No client is connected to the Testbot VMs when they run Wine tests (except when the TestBot administrator tries to debug them). So it's interesting to test what happens when there is no Spice client. Also in that case there is no need to do any buffering to cover the network jitter or anything.
- The top graph shows Pos0-Slept0 in the share=0 case. Yikes!!! After buffering the initial 40 ms, Pos0-Slept0 takes a nose dive: the 'soundcard' just stalls. It seems like it's waiting for the Spice client to drain the buffer. But there is no Spice client.
- Predictably the Pos/Slept ratio falls to 0 in the share=0 case so I did not include this graph.
- The next graph shows what's happening with Windows' muxer. It's essentially the same graph except that we can see when the muxer gets scheduled and that the slope is not as steep, probably because the muxer drops some data when its buffer is full.
- In the share=1 case the Pos/Slept ratio stabilizes around 0.3 thanks to the Windows muxer dropping data.
- Notice how when there is no client the ratio can never be above 1.4. That means ok(pos*1000/freq <= slept*1.4,...) can never fail. And yet it does! See the next section for the explanation.
- Since the soundcard stalls, all the tests that expect Pos to keep increasing will fail (see the rough sketch of such a check at the end of this section). And indeed that's what we see with w10pro64:
render.c:1310: Test failed: No position increase at iteration 1
- This is a QEmu 5.0 bug but I have not yet figured out if it has already been reported.
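For context, the check behind that failure message has roughly the following shape. This is a reconstruction from the failure message, not the exact render.c code; 'clock' is the stream's IAudioClock as in the earlier sketches.

    UINT64 prev = 0, pos;
    int i;

    for (i = 0; i < 10; i++)
    {
        Sleep(100);
        IAudioClock_GetPosition(clock, &pos, NULL);
        ok(pos > prev, "No position increase at iteration %d\n", i);
        prev = pos;
    }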
* Win10 on QEmu 3.1 + ich9 + no Spice client
- The QEmu 5.0 results were very perplexing... until I realized that most of the TestBot VMs run on QEmu 3.1. So I redid the tests with a QEmu 3.1 Windows 10 VM.
- If you want to see the graphs just open render-spice-local.png. They're again very similar.
- And that's why we get the '1.4 ratio' failures on most TestBot VMs. But at least we avoid the 'No position increase' ones.
- The VM hosts running QEmu 3.1 have the same libspice-server1 package (0.14.0-1.3+deb10u1) as the ones running QEmu 5.0. And the Spice client was the same in both cases. That's why I think the bug is in QEmu 5.0 rather than in Spice (unless it's a QEmu change that revealed a Spice bug).
* Win10 on QEmu 5.0 + ich9 + VNC
- So then I tested what happens if the QEmu remote access method is switched to VNC.
- Again I did not include the graphs because they are very similar to the render-spice-local.png ones, whether a VNC client is connected or not (but you can find them in the spreadsheet). The timings are a bit looser: there's a 51 ms Pos0-Slept0 datapoint in the no-client case but the ratio curve still remains between 1.0 and 1.1 after the initial 100 ms.
- When a VNC client is connected the spread is a bit higher, with more Pos0-Slept0 datapoints close to the 50 ms mark and some stray ratios outside the 1.0 to 1.1 range below a 110 ms ramp-up delay. So there's some difference between the VNC and Spice audio streaming algorithms (and maybe a bit more buffering). But that would not be an issue for the TestBot since there's normally no client.
* WinePulse cf49617 + Debian 10 on QEmu 5.0 + ich6 + localhost Spice (Fig: render-spice-wine.png)
- Finally I looked at how the test behaves in Wine with PulseAudio. Unfortunately the QEmu emulated hardware configuration is a bit different (440fx+ich6 instead of q35+ich9) but I doubt it has an impact.
- First note that with PulseAudio there is no direct access to the soundcard so we only get results for share=1.
- The most striking thing about the Pos0-Slept0 graph is that, besides a few exceptions, it has only 4 values: -5, -6, -10 or -11 ms! I don't know if this should be attributed to Winepulse, PulseAudio and/or Linux's much better scheduling.
- In any case there is absolutely no trace of the QEmu soundcard buffering. So something is insulating the test from it. Again I'm not sure if that's Winepulse, PulseAudio or the kernel itself.
- This graph also confirms that Sleep() has much less jitter than on Windows, even when Windows runs on real hardware. Slept0 is always within 1 ms of the Sleep() value. So an OS running in QEmu can have pretty tight timings, at least more than 99% of the time.
- The ratio graph at the bottom is similar in that it has only three values: 0.9, 1.0 and 1.1. Given that the timings are much tighter than on Windows it's a bit strange that the ratio interval is not as tight as on Windows. But it is still well within the range accepted by this test.
- Not surprisingly the mmdevapi:render failures we used to get in Wine were not timing related.