On Mon, 16 Nov 2020, Francois Gouget wrote: [...]
So rather than just apply these patches and blindly increase the margin factor I dug some more to try to understand what is really going on. I'll try to get my theories and findings in shape in my next email.
So my understanding is that on Windows things look like this:
App -> Muxer -> driver -> soundcard
But in QEmu the sound card is in fact something like this:
QEmu -> Spice Server --network--> Spice client -> PulseAudio -> driver -> soundcard
And in particular the Spice server to client communication requires some buffering to deal with the network jitter. So I suspected that the 'QEMU soundcard' would initially consume some audio to fill its buffer and then resume a normal rate. So I tweaked the test so it looked like this:
    Start()
    Sleep(25)
    last = GetPosition()
    Sleep(100)
    pos = GetPosition()
    compare pos - last to the expected value for 100 ms.
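In terms of actual mmdevapi calls this roughly corresponds to the sketch below. This is only an illustration: error handling is omitted, 'client' and 'clock' stand for an already initialized IAudioClient and its IAudioClock with the render buffer pre-filled, ok() is the usual test macro, and the real render.c code differs in the details.

    UINT64 freq, last, pos;
    DWORD t1, t2;

    IAudioClock_GetFrequency(clock, &freq);   /* position units per second */

    IAudioClient_Start(client);
    Sleep(25);                                /* initial ramp-up */
    IAudioClock_GetPosition(clock, &last, NULL);
    t1 = GetTickCount();

    Sleep(100);
    IAudioClock_GetPosition(clock, &pos, NULL);
    t2 = GetTickCount();

    /* Convert the position delta to milliseconds and compare it to the
     * time actually slept (nominally 100 ms), with the 1.4 margin factor. */
    ok((pos - last) * 1000 / freq <= (t2 - t1) * 1.4,
       "position advanced more than 1.4x the elapsed time\n");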
The theory was that QEmu would fill its buffer during the initial 25 ms sleep and would play at a regular rate during the next 100 ms. That was indeed a bit better, but not by much. Increasing the initial sleep to 50 ms improved things a bit more.
So I tweaked the test further to map out the effect of the initial sleep (a rough sketch of the resulting loop follows the list):
* Increase the initial sleep (called RampUpDelay) from 0 to 200 ms in 5 ms increments.
* Run 40 tests for each RampUpDelay value.
* Trace the position in milliseconds (Pos0) at the end of the initial sleep. Also trace the actual time spent sleeping (Slept0).
* Then let the stream play for another 100 ms and measure how long we actually slept (Slept).
* Get the final position (Pos) and trace the Pos / Slept ratio, which we expect to be below 1.1 (or at the very least below 1.4).
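The attached patch is what actually implements this; the sketch below merely illustrates the general shape of the loop, reusing the client/clock setup from the previous snippet. The trace format, the use of GetTickCount() and the exact ratio computation are my shorthand, not the patch's code.

    UINT64 freq, pos0, pos;
    DWORD t0, t1, t2;
    int rampupdelay, i;

    IAudioClock_GetFrequency(clock, &freq);

    for (rampupdelay = 0; rampupdelay <= 200; rampupdelay += 5)
    {
        for (i = 0; i < 40; i++)
        {
            /* ... (re)start the stream and pre-fill its buffer here ... */
            t0 = GetTickCount();
            Sleep(rampupdelay);                          /* RampUpDelay */
            IAudioClock_GetPosition(clock, &pos0, NULL);
            t1 = GetTickCount();                         /* Slept0 = t1 - t0 */

            Sleep(100);
            IAudioClock_GetPosition(clock, &pos, NULL);
            t2 = GetTickCount();                         /* Slept = t2 - t1 */

            trace("rampup=%d slept0=%u pos0=%u slept=%u ratio=%.2f\n",
                  rampupdelay, (unsigned)(t1 - t0),
                  (unsigned)(pos0 * 1000 / freq), (unsigned)(t2 - t1),
                  (pos - pos0) * 1000.0 / freq / (t2 - t1));
            /* ... stop and reset the stream here ... */
        }
    }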
Then I parsed the log to produce graphs that show the effect of the RampUpDelay on Pos0 - Slept0 and on the Pos / Slept ratio.
I attached the corresponding patch which should be applied on top of the ones in the previous emails. You will also find all the data and graphs derived from these tests in the spreadsheet below:
http://fgouget.free.fr/tmp/mmdevapi-render.xls
And now here is an analysis of the results. Open the image attachments to follow along.
* Win10 on real hardware: cw-gtx560 + ich6 (Fig: render-cw.png)
- The first test configuration is what happens with real hardware, here the cw-gtx560 machine that runs WineTest daily. It has a very standard ich6 integrated audio soundcard which happens to match the sound card emulated by QEmu (ich6 or ich9).
- The top graph shows the Pos0-Slept0 values when playing audio straight to the soundcard, that is when share=0. In that case Pos0-Slept0 is essentially 0. The way I understand it is that the Windows API queries the soundcard driver to report the soundcard's playback position within the soundcard buffer.
- The other thing to notice is that although RampUpDelay is increased in 5 ms increments the top graph shows that the Slept0 values are really spread around. That proves Sleep() has at least a 5 ms jitter. In fact we already know from my past timers email that it's closer to 15 ms.
- The second graph shows the Pos0-Slept0 values again but when going through the Windows muxer, i.e. the equivalent of PulseAudio.
- This time Pos0-Slept0 stays close to -10 ms which presumably corresponds to the increments in which the muxer mixes the audio from the various sources. And there's a bit more variation (0 to -20 ms) when RampUpDelay is below 20 ms, most likely due to the muxer's buffering.
- There are also a few stray points near the -30 ms mark. Maybe these correspond to scheduling issues (cw-gtx560 was obviously idle but there may still have been some Windows services doing their thing, although most are normally disabled).
- Also there is less spread of the Slept0 values in the share=1 case: instead of a continuous blur one sees short, almost vertical lines. That contradicts the previous assertion about Sleep() jitter. Maybe the difference is that the Windows muxer changed the global Windows timer resolution; see timeBeginPeriod() and NtSetTimerResolution() (a small sketch follows at the end of this section) and the previous timer discussion: https://www.winehq.org/pipermail/wine-devel/2020-March/162846.html
On a related note, see this blog post which implies the timer management changed a lot in Windows 10 2004. All QEmu tests were done with Windows 10 2009 (aka 20H2) but the cw-gtx560 tests were run on Windows 10 <= 1909, so a global timer resolution change by the muxer could have impacted the mmdevapi_test process. https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-gr...
- The last graph shows that in the share=1 case the ratio stays very close to 1.0, except when RampUpDelay is less than 20 ms (i.e. the current test configuration) where it can be as low as 0.9 and as high as 1.1. These are probably the effect of the Windows muxer's initial buffering.
- In the share=0 case the ratio stays very close to 1.0 all along so I did not attach the corresponding graph.
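For reference, raising the timer resolution from a process looks roughly like this. That the muxer does something along these lines is only my guess, and the 1 ms value is just an example.

    #include <windows.h>
    #include <mmsystem.h>   /* timeBeginPeriod/timeEndPeriod, link with winmm */

    /* Ask for a 1 ms timer resolution around some timing-sensitive work.
     * At least before Windows 10 2004 this is effectively a system-wide
     * setting, so another process (e.g. the audio engine) doing this also
     * changes the Sleep() granularity seen by the test process. */
    static void timed_work(void)
    {
        timeBeginPeriod(1);
        /* ... Sleep() now wakes up with roughly 1 ms granularity ... */
        timeEndPeriod(1);
    }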
* Win10 on QEmu 5.0 + ich9 + localhost Spice client (Fig: render-spice-local.png)
- The test gives 500 ms worth of data to mmdevapi. But what we see in the top graph is that mmdevapi does not instantly slurp 40 ms worth of data. Rather Pos0-Slept0, i.e. the amount that is buffered, increases progressively, which means the audio is being pulled at a somewhat higher than expected playrate.
- Then after about 100 ms Pos0-Slept0 stabilizes between 25 and 45 ms. So in effect once the QEmu 'soundcard' has filled its buffer the playrate goes back to normal.
- Note how the above actually matches the 1.4 ratio we have had to use in the mmdevapi:render test.
- The graph in the middle is the same as the top one but indexed on Slept0, i.e. the real sleep wall time, rather than on RampUpDelay, i.e. the parameter passed to Sleep(). It is at once similar to and yet different from the graph above. In particular one can clearly see that initially Pos0-Slept0 is quantized. That probably has something to do with the audio initially being sent to the Spice client in chunks of the maximum size allowed by the Spice protocol. Then once the buffering target is reached, only enough data to refill the ~40 ms buffer is sent, and that is more dependent on the actual time intervals and scheduling of the Spice server and client; hence the more random distribution of points.
- The graph at the bottom shows the ratio. What it shows is that initially the playrate is 40% higher than it should be, which corresponds to the 40 ms of extra buffering: over the 100 ms measurement interval the position advances by roughly 100 + 40 = 140 ms, hence a 1.4 ratio. Then the ratio falls off progressively as more and more of our measurement interval is performed when the 40 ms buffer is already full.
- Once the 40 ms buffer is full, i.e. after the initial 100 ms, the ratio is very consistently between 1.0 and 1.1.
- The graphs when share=1 are very similar so I did not include them. The notable differences are that Pos0-Slept0 stays in a narrower band (25 to 40 ms), and that it takes an extra 20 ms for the Pos/Slept ratio to fall safely within the 1.0 to 1.1 range.
* Win10 on QEmu 3.1 + ich9 + remote Spice client
- If the buffering is meant to cover the network jitter, then more buffering would be needed for a remote client with a higher RTT, and presumably proportionally higher network jitter.
- So I repeated the test over a transatlantic connection with a 104 ms RTT with little jitter (rtt min/max/mdev = 103.934/104.858/0.253 ms).
- The Pos0-Slept0 graph did show a slightly increased amount of buffering in the share=0 case, but not by much. The Windows muxer still seems to insulate the tests from the buffering jitter and that graph did not change.
- Because there's not much change I did not include these graphs.
* Win10 on QEmu 5.0 + ich9 + no Spice client (Fig: render-spice-noclient.png)
- No client is connected to the Testbot VMs when they run Wine tests (except when the TestBot administrator tries to debug them). So it's interesting to test what happens when there is no Spice client. Also in that case there is no need to do any buffering to cover the network jitter or anything.
- The top graph shows Pos0-Slept0 in the share=0 case. Yikes!!! After buffering the initial 40 ms, Pos0-Slept0 takes a nose dive: the 'soundcard' just stalls. It seems like it's waiting for the Spice client to drain the buffer. But there is no Spice client.
- Predictably the Pos/Slept ratio falls to 0 in the share=0 case so I did not include this graph.
- The next graph shows what's happening with Windows' muxer. It's essentially the same graph except that we can see when the muxer gets scheduled and that the slope is not as steep, probably because the muxer drops some data when its buffer is full.
- In the share=1 case the Pos/Slept ratio stabilizes around 0.3 thanks to the Windows muxer dropping data.
- Notice how when there is no client the ratio can never be above 1.4. That means ok(pos*1000/freq <= slept*1.4,...) can never fail. And yet it does! See the next section for the explanation.
- Since the soundcard stalls, all the tests that expect Pos to keep increasing will fail (see the rough sketch of such a check at the end of this section). And indeed that's what we see with w10pro64:
render.c:1310: Test failed: No position increase at iteration 1
- This is a QEmu 5.0 bug but I have not yet figured out if it has already been reported.
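For context, the check behind that failure message has roughly the following shape. This is a reconstruction from the failure message, not the exact render.c code; 'clock' is the stream's IAudioClock as in the earlier sketches.

    UINT64 prev = 0, pos;
    int i;

    for (i = 0; i < 10; i++)
    {
        Sleep(100);
        IAudioClock_GetPosition(clock, &pos, NULL);
        ok(pos > prev, "No position increase at iteration %d\n", i);
        prev = pos;
    }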
* Win10 on QEmu 3.1 + ich9 + no Spice client
- The QEmu 5.0 results were very perplexing... until I realized that most of the TestBot VMs run on QEmu 3.1. So I redid the tests with a QEmu 3.1 Windows 10 VM.
- If you want to see the graphs just open render-spice-local.png. They're again very similar.
- And that's why we get the '1.4 ratio' failures on most TestBot VMs. But at least we avoid the 'No position increase' ones.
- The VM hosts running QEmu 3.1 have the same libspice-server1 package (0.14.0-1.3+deb10u1) as the ones running QEmu 5.0. And the Spice client was the same in both cases. That's why I think the bug is in QEmu 5.0 rather than in Spice (unless it's a QEmu change that revealed a Spice bug).
* Win10 on QEmu 5.0 + ich9 + VNC
- So then I tested what happens if the QEmu remote access method is switched to VNC.
- Again I did not include the graphs because they are very similar to the render-spice-local.png ones, whether a VNC client is connected or not (but you can find them in the spreadsheet). The timings are a bit looser: there's a 51 ms Pos0-Slept0 datapoint in the no-client case but the ratio curve still remains between 1.0 and 1.1 after the initial 100 ms.
- When a VNC client is connected the spread is a bit higher, with more Pos0-Slept0 datapoints close to the 50 ms mark and some stray ratios outside the 1.0 to 1.1 range below a 110 ms ramp-up delay. So there's some difference between the VNC and Spice audio streaming algorithms (and maybe a bit more buffering). But that would not be an issue for the TestBot since there's normally no client.
* WinePulse cf49617 + Debian 10 on QEmu 5.0 + ich6 + localhost Spice (Fig: render-spice-wine.png)
- Finally I looked at how the test behaves in Wine with PulseAudio. Unfortunately the QEmu emulated hardware configuration is a bit different (440fx+ich6 instead of q35+ich9) but I doubt it has an impact.
- First note that with PulseAudio there is no direct access to the soundcard so we only get results for share=1.
- The most striking thing about the Pos0-Slept0 graph is that, besides a few exceptions, it has only 4 values: -5, -6, -10 or -11 ms! I don't know if this should be attributed to Winepulse, PulseAudio and/or Linux's much better scheduling.
- In any case there is absolutely no trace of the QEmu soundcard buffering. So something is insulating the test from it. Again I'm not sure if that's Winepulse, PulseAudio or the kernel itself.
- This graph also confirms that Sleep() has much less jitter than on Windows, even when Windows runs on real hardware. Slept0 is always within 1 ms of the Sleep() value. So an OS running in QEmu can have pretty tight timings, at least more than 99% of the time.
- The ratio graph at the bottom is similar in that it has only three values: 0.9, 1.0 and 1.1. Given that the timings are much tighter than on Windows it's a bit strange that the ratio interval is not as tight as on Windows. But it is still well within the range accepted by this test.
- Not surprisingly the mmdevapi:render failures we used to get in Wine were not timing related.