On Thu, 28 Apr 2022, Zhiyi Zhang wrote: [...]
Could you make them go faster? Maybe balancing the load a bit or adding more hardware?
The issue is that WineTest takes time: new jobs do have priority over WineTest, but they still have to wait for the running tasks to complete before they get their turn.
I collected some data about the WineTest tasks (see attached spreadsheet) and they take between 25 minutes on Windows and 35 minutes on Linux. The main issue here is the VMs that have many test configurations, since a VM's configurations must be run sequentially. The three VMs with the longest chains are:
  Time    Configs  VM
  6.7 h   11       debian11
  6.7 h   16       w1064
  6.8 h   15       w10pro64
What this means is that no amount of rebalancing can get the tests to run in less than about 7 hours.
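As a rough check of that 7 hour floor, the figures from the table above can be plugged into a few lines of Python: since a VM runs its configurations one after another, the round cannot finish before the longest per-VM chain does. The per-configuration average below assumes every configuration of a VM takes about the same time, which the table does not actually say; the function names are purely illustrative.

```python
# Per-VM sequential chains from the table above: VM -> (total hours, configurations).
chains = {
    "debian11": (6.7, 11),
    "w1064": (6.7, 16),
    "w10pro64": (6.8, 15),
}

def makespan_lower_bound(chains):
    """A VM runs its configurations one at a time, so no amount of
    rebalancing can finish the round before the longest chain completes."""
    return max(hours for hours, _ in chains.values())

def average_config_minutes(hours, configs):
    """Average runtime of one configuration, assuming equal-length configurations."""
    return hours * 60 / configs

print(f"lower bound: {makespan_lower_bound(chains):.1f} h")
for vm, (hours, configs) in chains.items():
    print(f"{vm}: ~{average_config_minutes(hours, configs):.0f} min per configuration")
```

The averages (about 37 minutes for debian11, 25 to 27 minutes for the Windows VMs) are in line with the per-task durations quoted above.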
And here are the results at the VM host level:
   Time    Configs  Host
   7.2 h   12       vm1
   1.4 h    3       vm2
   7.7 h   18       vm3
  12.1 h   25       vm4
The issue is that vm2 is too old and slow to run most VMs nowadays. So moving some test configurations from vm4 to vm1 or vm3 will push those to 9 / 10 hours. So I'll restart the process of getting new hardware to replace vm2.
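The 9 / 10 hour figure can be sanity-checked from the host table: if vm2 cannot take on more work, the roughly 27 hours carried by vm1, vm3 and vm4 can at best be split evenly between those three hosts. This is a simplified sketch that assumes the work could be divided freely; in reality a VM's configurations all stay on one host, so the real split is lumpier than this bound.

```python
# Host-level totals from the table above: host -> hours of WineTest work.
host_hours = {"vm1": 7.2, "vm2": 1.4, "vm3": 7.7, "vm4": 12.1}

def best_even_split(host_hours, usable_hosts):
    """Busiest-host time if the usable hosts' work were spread perfectly
    evenly between them (ignores that VMs cannot be split across hosts)."""
    total = sum(host_hours[h] for h in usable_hosts)
    return total / len(usable_hosts)

print(f"busiest host today: {max(host_hours.values()):.1f} h")
even = best_even_split(host_hours, ["vm1", "vm3", "vm4"])
print(f"vm1 + vm3 + vm4 evenly split: {even:.1f} h")
```

Even the ideal split leaves each of those hosts at about 9 hours, which is why rebalancing alone cannot bring them back under 7.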
The other options:

* Fix the tests that get stuck: they waste 2 minutes each. But it looks like there are only two of those left, conhost.exe:tty and wscript.exe:run, so there's not much to gain.
* Speed up the slow tests, potentially by using multi-threading. What sucks is we have no way of tracking which tests are slow, which test configurations are slow, etc. It would be nice to have something like the patterns page but for runtime (and also for the tests output size).
* Getting hardware with faster single thread performance: over 90% of the tests are single-threaded. vm2 is meant to be the first step towards this.
* Splitting the VMs with many test configurations so the test load can be spread across multiple hosts. That is, instead of having a single VM with 15 test configurations that must run sequentially like w10pro64, have two VMs with 7 and 8 configurations each that can run in parallel. But that means an extra VM to manage and requires having hosts to spread them to :-(
* Load balancing could help, assuming the TestBot is smart enough.
That is, if it starts by running the debiant and w7u tasks on vm4, then by the time the other hosts are idle, all that's left to run is w10pro64's 15 test configurations, which must be run sequentially anyway. So the scheduler must give priority to the VMs with the highest count of pending tasks.
Load balancing could help reduce the latency by ensuring the builds are done earlier. Here's a worst case scenario right now:

  t=0   vm2 starts a WineTest job
  t=1   A developer submits a job. First comes the build step
  t=25  vm2 completes the WineTest job
  t=25  vm1, vm3 and vm4 each start a new WineTest job
  t=26  vm2 completes the developer's build task
  t=50  vm1, vm3 and vm4 complete their WineTest task
  t=51  vm1, vm3 and vm4 start the developer's Windows tasks

Having multiple build VMs would make it more likely that the blocking build step is completed before any other WineTest task gets started. This is also why it's good that vm2 is not too busy.
* Reducing the number of test configurations :-(
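To illustrate why the scheduler should favor the VMs with the most pending tasks, here is a toy single-host simulation. It is only a sketch under loud assumptions: the host can run a fixed number of VMs at once (the `slots` parameter is hypothetical), each VM runs one task at a time, every task takes one time unit, and the task counts for debiant and w7u are made up. It compares that policy against running the VMs with the fewest pending tasks first; none of this is the TestBot's actual scheduler code.

```python
import heapq

def simulate(pending, slots, pick):
    """Toy model of one host running WineTest tasks.

    pending: VM name -> number of pending tasks (each takes 1 time unit)
    slots:   how many VMs the host can run at once (hypothetical parameter)
    pick:    policy choosing the next VM among idle VMs that still have work
    Returns the time at which the last task completes.
    """
    pending = dict(pending)
    running = []   # heap of (finish_time, vm)
    busy = set()   # VMs currently running a task: a VM runs one task at a time
    now = 0
    while True:
        # Fill the free slots with idle VMs that still have pending tasks.
        while len(running) < slots:
            ready = [vm for vm, n in pending.items() if n > 0 and vm not in busy]
            if not ready:
                break
            vm = pick(ready, pending)
            pending[vm] -= 1
            busy.add(vm)
            heapq.heappush(running, (now + 1, vm))
        if not running:
            return now
        # Advance the clock to the next task completion.
        now, vm = heapq.heappop(running)
        busy.discard(vm)

def most_pending_first(ready, pending):
    return max(ready, key=lambda vm: pending[vm])

def fewest_pending_first(ready, pending):
    return min(ready, key=lambda vm: pending[vm])

# Hypothetical load: w10pro64's 15 configurations plus two short VMs.
load = {"w10pro64": 15, "debiant": 2, "w7u": 2}
print(simulate(load, 2, most_pending_first))    # 15
print(simulate(load, 2, fewest_pending_first))  # 17
```

With two slots, starting the long w10pro64 chain first lets the short VMs share the other slot, so the round ends when the 15-task chain does. Starting the short VMs first only delays the chain and stretches the whole round.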