TL;DR:
- Windows is bad at hitting precise timing targets.
- The 15.6 ms Windows timer granularity can have an impact on the tests' reliability.
- But some unknown factor is causing large delays which are the source of most timing-related failures.
- The same delays probably also cause the "wait timed out" failures.
- Avoiding these failures in the timer tests is tricky.
-> Find the source of these delays and figure out how to avoid them.
My test subject for this dive was test_tp_timer() in ntdll:threadpool, and more specifically the loop which we expect to run in 500 to 750 ms:
https://source.winehq.org/git/wine.git/blob/HEAD:/dlls/ntdll/tests/threadpoo...
* Recurring timers are nice because one knows exactly how often they should generate an event, which makes it easy to see how badly and how often they miss. This is unlike most WaitForSingleObject() calls, where one does not know how much time will elapse and all one gets is the unpredictable one-off timeout.
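To make this concrete, here is a minimal standalone sketch of such a measurement. It is not the actual Wine test: it assumes a plain waitable timer with a 200 ms period rather than the threadpool timer test_tp_timer() exercises.

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER due;
    HANDLE timer = CreateWaitableTimerA(NULL, FALSE, NULL);
    DWORD prev, now;
    int i;

    due.QuadPart = -2000000; /* first event after 200 ms, in 100 ns units */
    SetWaitableTimer(timer, &due, 200 /* ms period */, NULL, NULL, FALSE);

    prev = GetTickCount();
    for (i = 0; i < 14; i++)
    {
        WaitForSingleObject(timer, INFINITE);
        now = GetTickCount();
        printf("event %2d: %lu ms since the previous one\n", i + 1, now - prev);
        prev = now;
    }
    CloseHandle(timer);
    return 0;
}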
* Wine on Linux is waaay better than Windows at ensuring the timer events arrive precisely on time. Even in a QEmu VM on an otherwise busy machine, every event arrives within 1 ms of the expected time (so in this case, in Wine we go through the loop once every 200 ms and rarely 199 or 201 ms).
* On Windows the default timer resolution is 15.625 ms. So asking for a 200 ms timer interval as in this test is asking for trouble: Windows will round that up to 13 * 15.625 ms ≈ 203 ms and then occasionally make an interval shorter so the average stays around 200 ms. So we typically get a pattern of 203, 203, 203, 203, 188 ms and then 203 again... (four 203 ms intervals plus one 188 ms interval add up to 1000 ms, i.e. 200 ms on average).
* Even on real hardware (Windows 8.1 on cw-rx460), Windows can sometimes get into a rut, alternating between 172 ms and 219 ms for instance. So when I tweaked the test to go through more loops I would sometimes get only 4 out of 14 events arriving within 190-210 ms of the previous one! Whether this happens is heavily influenced by the other processes running on the machine, such as playing a video in IE.
* This means that based on just a few events one cannot expect the interval between most events to fall within a narrow range. So here for instance if the acceptable interval is 190-210 ms and the first interval is instead 237 ms, then the next one will necessarily be out of range too, and likely the one after that too. So expecting 2 out of 3 intervals to be within the range is no more reliable than checking just one interval.
* The timer resolution can be changed with NtSetTimerResolution() and increased to 0.5 ms. But that does not change the results: one still gets 203, 203, 203, 203, 188 ms. My understanding is that increasing the timer resolution really only impacts timers that request a higher resolution. But that means testing a different API.
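For reference, a hedged sketch of such a call; NtSetTimerResolution() is undocumented but well known, takes the desired resolution in 100 ns units (so 5000 is 0.5 ms), and must be loaded from ntdll by hand:

#include <stdio.h>
#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (WINAPI *NTSETTIMERRESOLUTION)(ULONG, BOOLEAN, ULONG *);

int main(void)
{
    NTSETTIMERRESOLUTION pNtSetTimerResolution = (NTSETTIMERRESOLUTION)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtSetTimerResolution");
    ULONG current;

    /* 5000 * 100 ns = 0.5 ms; TRUE requests the change, FALSE releases it */
    pNtSetTimerResolution(5000, TRUE, &current);
    printf("timer resolution is now %lu (in 100 ns units)\n", current);
    return 0;
}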
* At least increasing the timer resolution seems to prevent Windows from getting stuck alternating between 172 and 219 ms.
* Interestingly, if one starts a Disk Cleanup and then runs "Clean up system files", the analysis of the system files drops the timer resolution to 0.5 ms! So trying to see the impact of disk activity can modify the results in unexpected ways.
* One can check the current timer resolution with the ClockRes command-line tool: https://docs.microsoft.com/en-us/sysinternals/downloads/clockres
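Programmatically the same information is available through NtQueryTimerResolution(), which is presumably what ClockRes itself uses; a minimal sketch:

#include <stdio.h>
#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (WINAPI *NTQUERYTIMERRESOLUTION)(ULONG *, ULONG *, ULONG *);

int main(void)
{
    NTQUERYTIMERRESOLUTION pNtQueryTimerResolution = (NTQUERYTIMERRESOLUTION)
        GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtQueryTimerResolution");
    ULONG coarsest, finest, current;

    /* values in 100 ns units: 156250 = 15.625 ms, 5000 = 0.5 ms */
    pNtQueryTimerResolution(&coarsest, &finest, &current);
    printf("coarsest %lu, finest %lu, current %lu\n", coarsest, finest, current);
    return 0;
}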
* When the delay is greater than the timer interval, e.g. 437 ms, the timer does not try to compensate for the lost events. But it does try to compensate for the fractional interval remainder. So one sees:
       Expected  Actual
 Loop    Time     Time   Interval  Lateness
   1      200      437      437     200+37
   2      400      609      172     200+9
   3      600      812      203     200+12
   4      800     1000      188     200
* In QEmu, when the timer misses it often misses big: 437 ms, 687 ms, even 1469 ms. So most of the time expecting three events to take about 3 intervals does not help with reliability because the timer does not try to compensate for the missed events. So in the end it will still be off by one interval (200 ms) or more.
* I could not reproduce these big misses on the Windows 8.1 on cw-rx460 machine (i.e. real hardware).
* Conclusion:
- Because Windows constantly adjusts the timer intervals so that they are only accurate *on average*, testing the total time for N events will be more reliable than testing the events individually (see the sketch below).
- But the occasional big delays mean any timing tolerances below ~1700 ms will be somewhat unreliable :-(
- For timers one could check if a big delay happened in one of the loops and either account for it or reset the timer and retry in the hope no delay will occur then. But that's overly complex.
- The same big delays are probably also the cause of the WaitForSingleObject() failures when the timeout is < 2000 ms.
- The timing in Wine in Debian VMs is very reliable. Yet it is likely that QEmu is somehow responsible for the timing issues in the Windows VMs.
- The best lead at this point is to check whether some VM clock setting is causing this. But some settings are also needed to prevent reverts from taking forever [1].

  (w8)
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='delay'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>

  vs. (w1064)
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
  </hyperv>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup' track='guest'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>

  vs. (fgtbdebian10)
  <clock offset='utc'/>
- Otherwise try CPU pinning. But Linux seems to do really well without it...
[1] https://bugs.launchpad.net/qemu/+bug/1505041 https://bugs.launchpad.net/qemu/+bug/1174654
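To illustrate the first conclusion, here is a minimal sketch of the "total time for N events" approach. It again assumes a plain waitable timer with a 200 ms period instead of the threadpool timer the real test uses, and reuses the 500-750 ms range from the test:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER due;
    HANDLE timer = CreateWaitableTimerA(NULL, FALSE, NULL);
    DWORD start, elapsed;
    int i;

    due.QuadPart = -2000000; /* first event after 200 ms, in 100 ns units */
    SetWaitableTimer(timer, &due, 200 /* ms period */, NULL, NULL, FALSE);

    start = GetTickCount();
    for (i = 0; i < 3; i++)
        WaitForSingleObject(timer, INFINITE);
    elapsed = GetTickCount() - start;

    /* Only the total for the 3 events is checked, with a wide tolerance */
    printf("3 events took %lu ms: %s\n", elapsed,
           elapsed >= 500 && elapsed <= 750 ? "ok" : "out of range");
    CloseHandle(timer);
    return 0;
}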
On 3/24/20 10:48 PM, Francois Gouget wrote:
[...]
> This means that based on just a few events one cannot expect the interval between most events to fall within a narrow range. [...] So expecting 2 out of 3 intervals to be within the range is no more reliable than checking just one interval.
Allowing for more error than 10 ms seems reasonable to me, even by an order of magnitude.
[...]
> In QEmu, when the timer misses it often misses big: 437 ms, 687 ms, even 1469 ms. [...]
> I could not reproduce these big misses on the Windows 8.1 on cw-rx460 machine (i.e. real hardware).
This is the real problem, I guess. I mean, the operating system makes no guarantees about timers firing on time, of course, but when we try to wait for events to happen and they're frequently late by over a second, that makes things very difficult to test.
Is it possible the CPU is under heavy load?
On Tue, 24 Mar 2020, Zebediah Figura wrote: [...]
>> This means that based on just a few events one cannot expect the interval between most events to fall within a narrow range. [...] So expecting 2 out of 3 intervals to be within the range is no more reliable than checking just one interval.
> Allowing for more error than 10 ms seems reasonable to me, even by an order of magnitude.
The test tolerances are not that tight, as far as I know, and certainly not for this threadpool timer test. That was just me testing an alternative approach and finding it not to be viable. As I said, in this specific case the allowed range is 500-750 ms for an expected 600 ms (3 * 200 ms).
But there are cases in other tests where we do a TerminateProcess() or similar and expect the WaitForSingleObject() to return within 100 ms. I don't think those are correct. Even 1s feels short. The recent kernel32:process helper functions replaced a bunch of them with wait_child_process() calls, so now the timeout is 30s. I may align the remaining timeouts with that... though I feel 30s is a bit large. Surely 10s should be enough?
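Here is a hedged sketch of that pattern; it is not the actual kernel32:process helper and uses cmd.exe as a stand-in child, but it shows how a generous timeout absorbs the occasional multi-second delay:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "cmd /c pause"; /* stand-in for the real test child */
    DWORD ret;

    CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
    TerminateProcess(pi.hProcess, 1);

    /* 30s is overkill on a good day but absorbs the big scheduling delays */
    ret = WaitForSingleObject(pi.hProcess, 30000);
    printf("wait returned %lu (0 means the child terminated in time)\n", ret);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}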
[...]
>> In QEmu, when the timer misses it often misses big: 437 ms, 687 ms, even 1469 ms. [...]
>> I could not reproduce these big misses on the Windows 8.1 on cw-rx460 machine (i.e. real hardware).
> This is the real problem, I guess. I mean, the operating system makes no guarantees about timers firing on time, of course, but when we try to wait for events to happen and they're frequently late by over a second, that makes things very difficult to test.
> Is it possible the CPU is under heavy load?
Not really, no. There's really not much running on the VM hosts:
* VMs
  We run at most one VM at a time per host, precisely to make sure the activity in one VM does not interfere with the tests running in the other VM(s). Of course this makes the TestBot pretty inefficient, and it also does not prevent these delays :-(
* Unattended upgrades
  Once a day apt will check for security updates and install them. But on Debian stable that should not amount to much.
* Acts of the administrator
  Mostly VM backup/restore, debugging, and reconfiguring. But these are too infrequent to explain all the delays we get.
Also I'm not convinced CPU load on the host is the cause of these delays.
> I guess the first step would be to figure out what timer source Windows actually ends up using. As the configuration snippet shows, there are a couple of options it can choose from. (I seem to recall there was a way to figure this out, but I don't recall off the top of my head.)
I've read claims that compared to earlier versions Windows 10 will prefer the TSC over the HPET because it has better latency, but that this may not be the best choice for VMs. I also notice "hypervclock" in the configuration snippet above; I suspect Windows 10 will prefer that one if it's present, but I have no idea how stable that clock is.