On Tue, 24 Mar 2020, Zebediah Figura wrote: [...]
- This means that based on just a few events one cannot expect the interval between most events to fall within a narrow range. For instance, if the acceptable interval is 190-210 ms and the first interval is 237 ms instead, then the next one will necessarily be out of range too, and likely the one after that as well. So expecting 2 out of 3 intervals to be within range is no more reliable than checking a single interval.
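(To make that alternative concrete, here is a rough stand-alone sketch of such a 2-out-of-3 check; the names and the Sleep() stand-in for the timer callbacks are mine, not the actual test code.)

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        DWORD t[4], good = 0;
        int i;

        for (i = 0; i < 4; i++)
        {
            if (i) Sleep(200);      /* stand-in for a 200 ms timer callback */
            t[i] = GetTickCount();
        }
        for (i = 1; i < 4; i++)
        {
            DWORD interval = t[i] - t[i - 1];
            printf("interval %d: %lu ms\n", i, interval);
            if (interval >= 190 && interval <= 210) good++;
        }
        /* On a loaded VM the delay tends to hit several intervals at once,
         * so requiring 2 of 3 to be in range is barely more reliable than
         * requiring all of them. */
        printf("%lu of 3 intervals in 190-210 ms -> %s\n", good,
               good >= 2 ? "PASS" : "FAIL");
        return 0;
    }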
Allowing for more error than 10 ms seems reasonable to me, even by an order of magnitude.
The test tolerances are not that tight, as far as I know, and certainly not for this threadpool timer test. That was just me testing an alternative approach and finding it not to be viable. As I said, in this specific case the allowed range is 500-750 ms for an expected 600 ms (3 * 200 ms).
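For reference, here is a rough stand-alone sketch of that kind of check, using the public Win32 threadpool timer API; this is my illustration, not the actual Wine test code:

    #include <windows.h>
    #include <stdio.h>

    static HANDLE done_event;
    static LONG fired;

    static VOID CALLBACK timer_cb(PTP_CALLBACK_INSTANCE instance, PVOID context, PTP_TIMER timer)
    {
        /* signal the main thread once the third period has elapsed */
        if (InterlockedIncrement(&fired) == 3) SetEvent(done_event);
    }

    int main(void)
    {
        LARGE_INTEGER due;
        FILETIME ft;
        PTP_TIMER timer;
        DWORD start, elapsed;

        done_event = CreateEventW(NULL, TRUE, FALSE, NULL);
        timer = CreateThreadpoolTimer(timer_cb, NULL, NULL);

        due.QuadPart = (LONGLONG)-200 * 10000; /* negative = relative due time: 200 ms in 100 ns units */
        ft.dwLowDateTime = due.u.LowPart;
        ft.dwHighDateTime = due.u.HighPart;

        start = GetTickCount();
        SetThreadpoolTimer(timer, &ft, 200 /* ms period */, 0);
        WaitForSingleObject(done_event, 10000);
        elapsed = GetTickCount() - start;

        /* 3 * 200 ms nominal; anything in 500-750 ms is accepted */
        printf("3 periods took %lu ms -> %s\n", elapsed,
               (elapsed >= 500 && elapsed <= 750) ? "ok" : "flaky");

        SetThreadpoolTimer(timer, NULL, 0, 0);          /* cancel the timer */
        WaitForThreadpoolTimerCallbacks(timer, TRUE);   /* drain pending callbacks */
        CloseThreadpoolTimer(timer);
        CloseHandle(done_event);
        return 0;
    }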
But there are cases in other tests where we do a TerminateProcess() or similar and expect the WaitForSingleObject() to return within 100 ms. I don't think those are correct. Even 1 s feels short. The recent kernel32:process helper functions replaced a bunch of them with wait_child_process() calls, so now the timeout is 30 s. I may align the remaining timeouts with that... though I feel 30 s is a bit large. Surely 10 s should be enough?
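As an illustration, here is roughly what the safer pattern looks like; wait_child_process() is the real helper mentioned above, but this simplified stand-in (and its 10 s timeout) is just my sketch:

    #include <windows.h>
    #include <stdio.h>

    /* Illustrative only: the real helper is wait_child_process() in the
     * kernel32:process tests. This stand-in just shows why the timeout
     * should be generous rather than 100 ms. */
    static void terminate_and_wait(HANDLE process)
    {
        DWORD ret;

        TerminateProcess(process, 1);
        /* a loaded host can take a while to reap the child, so waiting
         * only 100 ms gives spurious failures; 10 s is far safer */
        ret = WaitForSingleObject(process, 10000);
        if (ret != WAIT_OBJECT_0)
            printf("child did not terminate in time (ret=%lu)\n", ret);
    }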
[...]
In QEMU, when the timer misses it often misses big: 437 ms, 687 ms, even 1469 ms. So most of the time expecting three events to take about three intervals does not help with reliability: the timer does not try to compensate for the missed events, so at the end the total will still be off by one interval (200 ms) or more.
I could not reproduce these big misses with Windows 8.1 on the cw-rx460 machine (i.e. real hardware).
This is the real problem, I guess. I mean, the operating system makes no guarantees about timers firing on time, of course, but when we try to wait for events to happen and they're frequently late by over a second, that makes things very difficult to test.
Is it possible the CPU is under heavy load?
Not really, no. There's really not much running on the VM hosts:
* VMs: We run at most one VM at a time per host, precisely to make sure the activity in one VM does not interfere with the tests running in the other VM(s). Of course this makes the TestBot pretty inefficient, and it also does not prevent these delays :-(
* Unattended upgrades: Once a day apt checks for security updates and installs them. But on Debian stable that should not amount to much.
* Acts of the administrator: Mostly VM backups/restores, debugging, reconfiguring. But these are too infrequent to explain all the delays we get.
Also I'm not convinced CPU load on the host is the cause of these delays.