So I have put w10pro64 into production.
As the name implies this is a 64-bit Windows 10 Professional VM. What the name does not say is that it runs the latest version of Windows 10: 2004. That means it has more failures than the others... for now.
The goal is to use it to balance the load across two VM hosts. So it will run the various language tests, always against the latest Windows 10 release, while w1064 will deal with the previous Windows 10 releases and other configurations such as dual-screen and (hopefully) PCI passthrough.
Right now w10pro64 also runs the dual-screen tests because it has a newer QXL driver that should have fewer failures (bug 48926) but that should change after I update w1064.
For those who are interested I did quite a few tests on w10pro64 before putting it in production to see the impact of the QEmu configuration.
One part of it was to see if it was possible to reduce the number of failures by tweaking the configuration. That did not yield any meaningful result.
The other part was to check various options' impact on performance.
CPU: IvyBridge * 3 cores
------------------------
IvyBridge is the baseline of our current VM hosts (vm1, vm3 and vm4). So it should be possible to move the VM from one host to the other without changing its configuration (and also without risking upsetting Windows' license checks).
Most of our tests are single threaded. But in order to root out race conditions I think all VMs should have at least 2 vcpus. The question was whether adding more would help.
So I used mpstat at a 5 second interval to trace the CPU usage on the host while WineTest ran in the VM. I mostly ran the tests with 4 vcpus (specifically 4 cores to avoid licensing issues). The host has 4 cores.
This showed that even when given 4 cores the VM spends 70% to 80% (depending on the run) of its time using less than one core, 97% using less than two cores and only 0.5% using more than 3 cores. So giving it two or three cores is plenty.
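A breakdown like the one above can be computed from the mpstat samples roughly as follows. This is only a minimal sketch: it assumes the %idle values of mpstat's "all" row have already been extracted into a list, the function name is hypothetical, and the sample values are made up for illustration (they are not the actual measurements).

```python
def core_usage_breakdown(idle_percentages, host_cores=4):
    """Return the fraction of samples falling under each core-usage threshold.

    idle_percentages: the %idle column of mpstat's "all" row, one value per
    5 second sample (hypothetical input; parsing mpstat output is left out).
    """
    # Convert each idle percentage into the equivalent number of busy cores.
    busy_cores = [(100.0 - idle) / 100.0 * host_cores for idle in idle_percentages]
    total = len(busy_cores)
    return {
        "under_1_core": sum(1 for b in busy_cores if b < 1) / total,
        "under_2_cores": sum(1 for b in busy_cores if b < 2) / total,
        "over_3_cores": sum(1 for b in busy_cores if b > 3) / total,
    }


# Made-up samples for illustration only, not the real measurements.
samples = [90.0, 80.0, 85.0, 60.0, 20.0]
breakdown = core_usage_breakdown(samples)
print(breakdown)
```

Note that this counts everything running on the host, not just the VM, which matches how the numbers above were collected.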
So what is the CPU doing when not running the VM / tests? The stats show it waits for IO only 3% of the time, which is as it should be given the caching available on the host and the SSD disk. System and user CPU usage are also pretty low, so most of the time the CPU is just idle. More specifically the host is 75% idle (i.e. uses less than 1 core) more than 50% of the time.
The why is still somewhat of a mystery to me. Idle time can result from the audio tests (waiting for the buffered sound to play) and network tests (waiting for network data). There are also a few places where we wait for some operation to time out but surely not that many? So how can we eliminate this idle time and speed up the tests?
Memory : 4GB
------------
A test with 8GB shows that adding memory does not help the tests or allow them to run faster.
I prefer limiting how much memory the VMs use because I expect it to result in smaller live snapshots: w10pro64's disk image grew from 14GB to 53GB when I added the 13 live snapshots. That works out to about 3GB per live snapshot (disk COW+RAM). Interestingly it's less than the VM's amount of memory, which means QEmu does not save the unused memory. But I suspect QEmu still saves Windows' disk cache, so increasing the memory would result in bigger snapshots.
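For reference, the per-snapshot figure is just the disk image growth divided by the number of snapshots. A small sketch using only the numbers quoted above (the variable names are mine):

```python
# Figures taken from the paragraph above.
size_before_gb = 14
size_after_gb = 53
snapshot_count = 13

# Average growth of the disk image per live snapshot (disk COW + RAM).
per_snapshot_gb = (size_after_gb - size_before_gb) / snapshot_count
print(f"{per_snapshot_gb:.1f} GB per live snapshot")
```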
Clock : HPET
------------
Initially the guest was using a significant amount of CPU on the host even when Windows was doing nothing. It turns out this is because by default libvirt does not add the HPET timer. Adding the following line fixed this:
<clock offset='localtime'>
  [...]
  <timer name='hpet' present='yes'/>
</clock>
Disk: Virtio SCSI + unmap
-------------------------
The SCSI Virtio driver is the recommended configuration and I manually set the discard mode to unmap to prevent qcow2 bloat (is that QEmu's default?).
Then I tested the disk performance with ATTO. https://www.atto.com/disk-benchmark/
* In its default configuration ATTO uses a small 128 MB test file. Since such a small file easily fits in the OS' cache ATTO uses fsync-like functionality to ensure it tests the disk performance rather than the memory's.
* But in the default QEmu configuration (writeback mode) caching still occurs outside the VM, which fools ATTO and results in read and write speeds in the GB/s range on a SATA SSD (see w10pro64_scsi+default+unmap.png). But then our tests don't write all that much to disk so this test is quite realistic. All in all this means the default configuration should provide more than fast enough disk access.
* The results are the same when caching is explicitly set to writeback (i.e. it's QEmu's default). (see wtbw10pro64_scsi+writeback+unmap.png)
* I also ran an ATTO test with a bigger file size (see w10pro64_scsi+default+unmap+4GB.png). We then clearly see writes being capped by the SSD speed while reads still benefit from the host cache. This shows that disk performance is still ok even when writing more data.
* Some sites recommend setting io.mode=native but that forces cache.mode=none or directsync. That prevents the host from doing extra caching and then we find the true underlying disk performance in ATTO. I think that configuration makes sense when one wants to be sure the VM's filesystem will remain in a consistent state in case of a host crash or power outage. But in such a case we would just revert the VM to the last snapshot and continue. So the default configuration provides us with better disk performance. (see w10pro64_scsi+directsync+native+unmap.png and, for comparison, directsync alone: w10pro64_scsi+directsync+unmap.png)
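To double check that a VM actually has the settings discussed in this mail, something along these lines could be run on its libvirt domain XML. This is a sketch under assumptions: the embedded XML is a heavily trimmed, hypothetical example (a real one would come from `virsh dumpxml`), and the function name is made up. It only reads the `discard` and `cache` attributes of each disk's `<driver>` element and the HPET `<timer>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed domain XML for illustration only.
DOMAIN_XML = """
<domain type='kvm'>
  <clock offset='localtime'>
    <timer name='hpet' present='yes'/>
  </clock>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' discard='unmap'/>
      <target dev='sda' bus='scsi'/>
    </disk>
  </devices>
</domain>
"""


def summarize_domain(xml_text):
    """Report the HPET timer and per-disk bus/discard/cache settings."""
    root = ET.fromstring(xml_text)
    hpet = root.find("./clock/timer[@name='hpet']")
    disks = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        disks.append({
            "bus": target.get("bus") if target is not None else None,
            "discard": driver.get("discard") if driver is not None else None,
            # A missing cache attribute means the default is in effect.
            "cache": driver.get("cache", "default") if driver is not None else None,
        })
    return {
        "hpet": hpet is not None and hpet.get("present") == "yes",
        "disks": disks,
    }


summary = summarize_domain(DOMAIN_XML)
print(summary)
```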