So mmdevapi:render keeps failing randomly due to timing issues. All the random failures seem to follow this model:
render.c:1091: Test failed: Position 50880 too far after playing 100ms
1077     hr = IAudioClient_Start(ac); /* #1 */
1080     Sleep(100);
1081     slept += 100;
1087     hr = IAudioClock_GetPosition(acl, &pos, NULL);
1090     /* in rare cases is slept*1.1 not enough with dmix */
1091     ok(pos*1000/freq <= slept*1.4, "Position %u too far after playing %ums\n", (UINT)pos, slept);
There are a number of issues with that test:
* Sleep(100) never sleeps for 100 ms! It either sleeps for 93.6, 109.2 or even 124.8 ms depending on the Sleep() history. In any case it's always a multiple of 15.6 ms. So line 1081 is wrong.
* IAudioClient_Start() takes between 1.5 and 3 ms. There is no telling when the audio actually starts during that interval. So that adds up to 3 ms to the play time.
* In turn that means the 1.4 factor on line 1091 does not provide a 'safety margin' but really only compensates for the bug on line 1081.
* And in any case, given the very short duration, I feel this is going to test how the driver / hardware deals with buffering and latency more than anything else. (even if the driver claims the latency is 0)
So what is the goal of that test? I mean besides "Perform renderer padding & position tests".
Is the goal to check that an 8-bit stereo stream is not being played as a 16-bit stereo stream, i.e. twice as fast as it should?
Then, assuming we find a suitable duration for that test, wouldn't it make more sense to set the leeway factor to something between 1.5 (an even split between expected (1) and too fast (2)) and 1.8 (i.e. biased to account for buffering & co)?
And we should be measuring how much time elapsed between just before the Start() call and right after the GetPosition() one to set our expectations.
Or... scrap the test?
On Mon, Nov 09, 2020 at 02:04:18PM +0100, Francois Gouget wrote:
> So mmdevapi:render keeps failing randomly due to timing issues. All the random failures seem to follow this model:
> render.c:1091: Test failed: Position 50880 too far after playing 100ms
> 1077     hr = IAudioClient_Start(ac); /* #1 */
> 1080     Sleep(100);
> 1081     slept += 100;
> 1087     hr = IAudioClock_GetPosition(acl, &pos, NULL);
> 1090     /* in rare cases is slept*1.1 not enough with dmix */
> 1091     ok(pos*1000/freq <= slept*1.4, "Position %u too far after playing %ums\n", (UINT)pos, slept);
> There are a number of issues with that test:
> * Sleep(100) never sleeps for 100 ms! It either sleeps for 93.6, 109.2 or even 124.8 ms depending on the Sleep() history. In any case it's always a multiple of 15.6 ms. So line 1081 is wrong.
> * IAudioClient_Start() takes between 1.5 and 3 ms. There is no telling when the audio actually starts during that interval. So that adds up to 3 ms to the play time.
> * In turn that means the 1.4 factor on line 1091 does not provide a 'safety margin' but really only compensates for the bug on line 1081.
> * And in any case, given the very short duration, I feel this is going to test how the driver / hardware deals with buffering and latency more than anything else. (even if the driver claims the latency is 0)
> So what is the goal of that test? I mean besides "Perform renderer padding & position tests".
Some applications are really picky about the relationship between GetPosition, GetCurrentPadding, and actual wall clock time. I think those tests were added to try to show that Wine is meeting the timing requirements that Windows meets. But clearly they're not doing a good job.
> Is the goal to check that an 8-bit stereo stream is not being played as a 16-bit stereo stream, i.e. twice as fast as it should?
> Then, assuming we find a suitable duration for that test, wouldn't it make more sense to set the leeway factor to something between 1.5 (an even split between expected (1) and too fast (2)) and 1.8 (i.e. biased to account for buffering & co)?
> And we should be measuring how much time elapsed between just before the Start() call and right after the GetPosition() one to set our expectations.
I'm not strongly opposed to removing them, but how about we try to improve how they track wall time? Something like the attached patch.
Andrew
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=81628
Your paranoid android.
=== w10pro64_2scr (32 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 480, should be 0
render.c:1375: Test failed: Position 426240 at end vs. 53760 submitted frames

=== w10pro64_ar (32 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 960, should be 0
render.c:1375: Test failed: Position 430080 at end vs. 54720 submitted frames

=== w10pro64_he (32 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 960, should be 0
render.c:1375: Test failed: Position 460800 at end vs. 58560 submitted frames

=== w10pro64_ja (32 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 480, should be 0
render.c:1375: Test failed: Position 549120 at end vs. 69120 submitted frames
render.c:1334: Test failed: GetBuffer large (22500) at iteration 6
render.c:1334: Test failed: GetBuffer large (22500) at iteration 7

=== w864 (64 bit report) ===
mmdevapi:
render.c:1334: Test failed: GetBuffer large (20671) at iteration 2
render.c:1334: Test failed: GetBuffer large (20671) at iteration 5
render.c:1334: Test failed: GetBuffer large (20671) at iteration 8

=== w10pro64 (64 bit report) ===
mmdevapi:
render.c:1375: Test failed: Position 602880 at end vs. 80160 submitted frames

=== w10pro64_2scr (64 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 960, should be 0
render.c:1375: Test failed: Position 433920 at end vs. 55200 submitted frames

=== w10pro64_ar (64 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 480, should be 0
render.c:1375: Test failed: Position 449280 at end vs. 56640 submitted frames

=== w10pro64_ja (64 bit report) ===
mmdevapi:
render.c:1374: Test failed: GetCurrentPadding returned 960, should be 0
render.c:1375: Test failed: Position 691200 at end vs. 87360 submitted frames

=== w10pro64_zh_CN (64 bit report) ===
mmdevapi:
render.c:1375: Test failed: Position 602400 at end vs. 81540 submitted frames
render.c:1334: Test failed: GetBuffer large (22500) at iteration 8
On Mon, 9 Nov 2020, Andrew Eikum wrote: [...]
> I'm not strongly opposed to removing them, but how about we try to improve how they track wall time? Something like the attached patch.
It's better with that patch but still fails quite often.
To test this I focused on the first GetPosition() test, tweaked the ok() messages to show the position in milliseconds too, added a trace of the ratio of GetPosition() / slept (which we expect to be below 1.1, or at least below 1.4), added a patch to loop 100 times over this test, and did 3 runs. See the attached:
0001-mmdevapi-render-Trace-the-position-in-milliseconds-t.patch
0002-mmdevapi-render-Make-sure-to-get-an-upper-bound-on-t.patch
0003-mmdevapi-render-Loop-and-trace-ratio-between-positio.patch
On Windows 10 in QEmu this shows the test still often goes above the 1.4 limit.
See: since-win10+qemu+local.png
So then I tweaked the slept interval to start right before the Start() call (which takes 1.5 to 3 ms), and end right after GetPosition() returns. This way we are sure that slept is an upper bound on the pos value.
That improves the share=0 case but paradoxically the share=1 case sees more spread and higher values. There may have been more activity on the host but then I would have expected this to impact the share=0 case too. In each case we still get at least 1 value that's too high and many that are close:
See: upper-win10+qemu+local.png
(I also tested this with an ich6 virtual QEmu sound card but it made no difference)
The same test on real hardware (Windows 10 on cw-gtx560) shows radically different results, results that are actually in line with the test expectations:
See: upper-win10+gtx560.png
So rather than just apply these patches and blindly increase the margin factor I dug some more to try to understand what is really going on. I'll try to get my theories and findings in shape in my next email.
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=81913
Your paranoid android.
=== build (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:1090
Task: Patch failed to apply

=== debiant (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:1090
Task: Patch failed to apply

=== debiant (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:1090
Task: Patch failed to apply
On Mon, 16 Nov 2020, Francois Gouget wrote: [...]
> So rather than just apply these patches and blindly increase the margin factor I dug some more to try to understand what is really going on. I'll try to get my theories and findings in shape in my next email.
So my understanding is that on Windows things look like this:
App -> Muxer -> driver -> soundcard
But in QEmu the sound card is in fact something like this:
QEmu -> Spice Server --network--> Spice client -> PulseAudio -> driver -> soundcard
And in particular the Spice server to client communication requires some buffering to deal with the network jitter. So I suspected that the 'QEMU soundcard' would initially consume some audio to fill its buffer and then resume a normal rate. So I tweaked the test so it looked like this:
Start()
Sleep(25)
last = GetPosition()
Sleep(100)
pos = GetPosition()
compare pos - last to the expected value for 100 ms
The theory was that QEmu would fill its buffer during the initial 25 ms sleep and would play at a regular rate during the next 100 ms. That was indeed a bit better but not really. Increasing the initial sleep to 50 ms improved things a bit more.
So I tweaked the test further to map out the effect of the initial sleep:
* Increase the initial sleep (called RampUpDelay) from 0 to 200 ms in 5 ms increments.
* For each, run 40 tests.
* Trace the position in milliseconds (Pos0) at the end of the initial sleep. Also trace the actual time spent sleeping (Slept0).
* Then let the stream play for another 100 ms and measure how long we actually slept (Slept).
* Get the final position (Pos) and trace the Pos / Slept ratio, which we expect to be below 1.1 (or 1.4 at the very least).
Then I parsed the log to produce graphs that show the effect of the RampUpDelay on Pos0 - Slept0 and on the Pos / Slept ratio.
I attached the corresponding patch which should be applied on top of the ones in the previous emails. You will also find all the data and graphs derived from these tests in the spreadsheet below:
http://fgouget.free.fr/tmp/mmdevapi-render.xls
And now here is an analysis of the results. Open the image attachments to follow along.
* Win10 on real hardware: cw-gtx560 + ich6
  Fig: render-cw.png
- The first test configuration is what happens with real hardware, here the cw-gtx560 machine that runs WineTest daily. It has a very standard ich6 integrated audio soundcard which happens to match the sound card emulated by QEmu (ich6 or ich9).
- The top graph shows the Pos0-Slept0 values when playing audio straight to the soundcard, that is when share=0. In that case Pos0-Slept0 is essentially 0. The way I understand it is that the Windows API queries the soundcard driver to report the soundcard's playback position within the soundcard buffer.
- The other thing to notice is that although RampUpDelay is increased in 5 ms increments the top graph shows that the Slept0 values are really spread around. That proves Sleep() has at least a 5 ms jitter. In fact we already know from my past timers email that it's closer to 15 ms.
- The second graph shows the Pos0-Slept0 values again but when going through the Windows muxer, i.e. the equivalent to PulseAudio.
- This time Pos0-Slept0 stays close to -10 ms which presumably corresponds to the increments in which the muxer mixes the audio from the various sources. And there's a bit more variation (0 to -20 ms) when RampUpDelay is below 20 ms, most likely due to the muxer's buffering.
- There are also a few stray points near the -30 ms mark. Maybe these correspond to scheduling issues (cw-gtx560 was obviously idle but there may still have been some Windows services doing their thing, although most are normally disabled).
- Also there is less spread of the Slept0 values in the share=1 case: instead of a continuous blur one sees short almost vertical lines. That contradicts the previous assertion about Sleep() jitter. Maybe the difference is that the Windows muxer changed the global Windows timer resolution. See timeBeginPeriod() and NtSetTimerResolution() and the previous timer discussion: https://www.winehq.org/pipermail/wine-devel/2020-March/162846.html
On a related note, see this blog post which implies the timer management has changed a lot in Windows 10 2004. All QEmu tests were done with Windows 10 2009 (aka 20H2) but the cw-gtx560 tests were run in Windows 10 <= 1909 so a global timer change in the muxer could have impacted the mmdevapi_test process. https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-gr...
- The last graph shows that in the share=1 case the ratio stays very close to 1.0, except when RampUpDelay is less than 20 ms (i.e. the current test configuration) where it can be as low as 0.9 and as high as 1.1. These are probably the effect of the Windows muxer's initial buffering.
- In the share=0 case the ratio stays very close to 1.0 all along so I did not attach the corresponding graph.
* Win10 on QEmu 5.0 + ich9 + localhost Spice client
  Fig: render-spice-local.png
- The test gives 500 ms worth of data to mmdevapi. But what we see in the top graph is that mmdevapi does not instantly slurp 40 ms worth of data. Rather Pos0-Slept0, i.e. the amount that is buffered, increases progressively, which means the audio is being pulled at a somewhat higher than expected playrate.
- Then after about 100 ms Pos0-Slept0 stabilizes between 25 and 45 ms. So in effect once the QEmu 'soundcard' has filled its buffer the playrate goes back to normal.
- Note how the above actually matches the 1.4 ratio we have had to use in the mmdevapi:render test.
- The graph in the middle is the same as the top one but indexed on Slept0, i.e. the real sleep wall time, rather than RampUpDelay, i.e. the parameter passed to Sleep(). It is at once similar and yet different from the graph above. In particular one can clearly see that initially Pos0-Slept0 is quantized. That probably has something to do with the audio being sent to the Spice client using the maximum size allowed by the Spice protocol initially. Then once the buffering target is reached, only enough data to refill the ~40 ms buffer is sent and that is more dependent on the actual time intervals and scheduling of the Spice server and client; resulting in a more random distribution of points.
- The graph at the bottom shows the ratio. What it shows is that initially the playrate is 40% higher than it should be, which corresponds to the 40 ms of extra buffering. Then that falls down progressively as more and more of our measurement interval is performed when the 40 ms buffer is already full.
- Once the 40 ms buffer is full, i.e. after the initial 100 ms, the ratio is very consistently between 1.0 and 1.1.
- The graphs when share=1 are very similar so I did not include them. The notable differences are that Pos0-Slept0 stays in a narrower band: 25 to 40 ms. But it takes an extra 20 ms for the Pos/Slept ratio to fall safely within the 1.0 to 1.1 range.
* Win10 on QEmu 3.1 + ich9 + remote Spice client
- If the buffering is meant to cover the network jitter, then more buffering would be needed for a remote client with a higher RTT, and presumably proportionally higher network jitter.
- So I repeated the test over a transatlantic connection with a 104 ms RTT with little jitter (rtt min/max/mdev = 103.934/104.858/0.253 ms).
- The Pos0-Slept0 graph did show a slightly increased amount of buffering in the share=0 case but not by much. The Windows muxer still seems to insulate the tests from the buffering jitter and that graph did not change.
- Because there's not much change I did not include these graphs.
* Win10 on QEmu 5.0 + ich9 + no Spice client
  Fig: render-spice-noclient.png
- No client is connected to the Testbot VMs when they run Wine tests (except when the TestBot administrator tries to debug them). So it's interesting to test what happens when there is no Spice client. Also in that case there is no need to do any buffering to cover the network jitter or anything.
- The top graph shows Pos0-Slept0 in the share=0 case. Yikes!!! After buffering the initial 40 ms Pos0-Slept0 takes a nose dive: the 'soundcard' just stalls. It seems like it's waiting for the Spice client to drain the buffer. But there is no Spice client.
- Predictably the Pos/Slept ratio falls to 0 in the share=0 case so I did not include this graph.
- The next graph shows what's happening with Windows' muxer. It's essentially the same graph except that we can see when the muxer gets scheduled and that the slope is not as steep, probably because the muxer drops some data when its buffer is full.
- In the share=1 case the Pos/Slept ratio stabilizes around 0.3 thanks to the Windows muxer dropping data.
- Notice how when there is no client the ratio can never be above 1.4. That means ok(pos*1000/freq <= slept*1.4,...) can never fail. And yet it does! See the next section for the explanation.
- Since the soundcard stalls all the tests that expect Pos to continue increasing will fail. And indeed that's what we see with w10pro64:
render.c:1310: Test failed: No position increase at iteration 1
- This is a QEmu 5.0 bug but I have not yet figured out if it has already been reported.
* Win10 on QEmu 3.1 + ich9 + no Spice client
- The QEmu 5.0 results were very perplexing... until I realized that most of the TestBot VMs run in QEmu 3.1. So I redid the tests with a QEmu 3.1 Windows 10 VM.
- If you want to see the graphs just open spice-local.png. They're again very similar.
- And that's why we get the '1.4 ratio' failures on most TestBot VMs. But at least we avoid the 'No position increase' ones.
- The VM hosts running QEmu 3.1 have the same libspice-server1 package (0.14.0-1.3+deb10u1) as the ones running QEmu 5.0. And the Spice client was the same in both cases. That's why I think the bug is in QEmu 5.0 rather than in Spice (unless it's a QEmu change that revealed a Spice bug).
* Win10 on QEmu 5.0 + ich9 + VNC
- So then I tested what happens if the QEmu remote access method is switched to VNC.
- Again I did not include the graphs because they are very similar to the spice-local.png ones, whether a VNC client is connected or not (but you can find them in the spreadsheet). The timings are a bit looser: there's a 51 ms Pos0-Slept0 datapoint in the no-client case but the ratio curve still remains between 1.0 and 1.1 after the initial 100 ms.
- When a VNC client is connected the spread is a bit higher with more Pos0-Slept0 datapoints close to the 50 ms mark and some stray ratios outside the 1.0 to 1.1 range below the 110 ms ramp up delay. So there's some difference between the VNC and Spice audio streaming algorithms (and maybe a bit more buffering). But that would not be an issue for the TestBot since there's no client normally.
* WinePulse cf49617 + Debian 10 on QEmu 5.0 + ich6 + localhost Spice
  Fig: render-spice-wine.png
- Finally I looked at how the test behaves in Wine with PulseAudio. Unfortunately the QEmu emulated hardware configuration is a bit different (440fx+ich6 instead of q35+ich9) but I doubt it has an impact.
- First note that with PulseAudio there is no direct access to the soundcard so we only get results for share=1.
- The most striking thing about the Pos0-Slept0 graph is that, besides a few exceptions, it has only 4 values: -5, -6, -10 or -11 ms! I don't know if this should be attributed to Winepulse, PulseAudio and/or Linux's much better scheduling.
- In any case there is absolutely no trace of the QEmu soundcard buffering. So something is insulating the test from it. Again I'm not sure if that's Winepulse, PulseAudio or the kernel itself.
- This graph also confirms that Sleep() has much less jitter than on Windows, even when Windows runs on real hardware. Slept0 is always within 1ms of the Sleep() value. So an OS running in QEmu can have pretty tight timings, at least more than 9% of the time.
- The ratio graph at the bottom is similar in that it has only three values: 0.9, 1.0 and 1.1. Given that the timings are much tighter than on Windows it's a bit strange that the ratio interval is not as tight as on Windows. But it is still well within the range accepted by this test.
- Not surprisingly the mmdevapi:render failures we used to get in Wine were not timing related.
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=82396
Your paranoid android.
=== build (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:946
Task: Patch failed to apply

=== debiant (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:946
Task: Patch failed to apply

=== debiant (build log) ===
error: patch failed: dlls/mmdevapi/tests/render.c:946
Task: Patch failed to apply
Takeaway and todos from the previous email (TL;DR, sort of):
* QEmu 5.0 is buggy: audio playback stalls if no Spice client is connected.
  -> Report the bug to QEmu.
* This causes failures in the w10pro64 VM.
  -> QEmu 3.1 does not have this bug but w10pro64 was created with QEmu 5.0 so I'm not sure a downgrade can work (iirc it at least causes an error when I try to power it up).
  -> So I think I will switch w10pro64 to VNC instead since that works around the bug too.
* When there is a Spice or VNC client ~40 ms of audio is buffered when the playback starts. That's why we need to allow up to 140 ms of audio playback for our 100 ms playback test. Importantly this has nothing to do with scheduling delays caused by QEmu or whatever.
* The buffering seems unavoidable when a client is connected to the VM. But during normal TestBot operation no client is connected. In that case QEmu should be able to skip the buffering but does not.
  -> That could be a QEmu improvement. This may be too specific to our needs and thus may not be accepted upstream.
  -> Investigate anyway? Submit an RFE?
* After the initial buffering the playback proceeds at a normal rate.
  -> Either increase the initial margin from 1.4 to at least 1.5.
  -> Or perform the playback rate tests after a 120 ms 'ramp up delay'.
  -> The latter may not be practical when testing what happens when stopping and restarting playback.
* This analysis is applicable to all the audio playback tests that look at playback timings.
* The capture tests may run into similar issues too.
* Winepulse in QEmu seems to be immune to this initial buffering for unknown reasons.