So I have put w10pro64 into production.
As the name implies this is a 64-bit Windows 10 Professional VM. What the name does not say is that it runs the latest version of Windows 10: 2004. That means it has more failures than the others... for now.
The goal is to use it to balance the load across two VM hosts. So it will run the various language tests, always against the latest Windows 10 release, while w1064 will deal with the previous Windows 10 releases and other configurations such as dual-screen and (hopefully) PCI passthrough.
Right now w10pro64 also runs the dual-screen tests because it has a newer QXL driver that should have fewer failures (bug 48926) but that should change after I update w1064.
For those who are interested I did quite a few tests on w10pro64 before putting it in production to see the impact of the QEmu configuration.
One part of it was to see if it was possible to reduce the number of failures by tweaking the configuration. That did not yield any meaningful result.
The other part was to check various options' impact on performance.
CPU: IvyBridge * 3 cores
------------------------
IvyBridge is the baseline of our current VM hosts (vm1, vm3 and vm4). So it should be possible to move the VM from one host to another without changing its configuration (and also without risking upsetting Windows' license checks).
Most of our tests are single threaded. But in order to root out race conditions I think all VMs should have at least 2 vcpus. The question was whether adding more would help.
So I used mpstat at a 5 second interval to trace the CPU usage on the host while WineTest ran in the VM. I mostly ran the tests with 4 vcpus (specifically 4 cores to avoid licensing issues). The host has 4 cores.
This showed that even when given 4 cores the VM spends 70% to 80% (depending on the run) of its time using less than one core, 97% using less than two cores and only 0.5% using more than 3 cores. So giving it two or three cores is plenty.
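For reference, this is roughly what the corresponding part of the libvirt domain XML looks like with a 3-core configuration (a sketch rather than a copy of the production XML, so treat the attribute values as illustrative):

  <vcpu placement='static'>3</vcpu>
  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>IvyBridge</model>
    <!-- expose the vcpus as cores of a single socket so Windows'
         licensing checks don't get in the way -->
    <topology sockets='1' cores='3' threads='1'/>
  </cpu>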
So what is the CPU doing when not running the VM / tests? The stats show it waits for IO only 3% of the time, which is as it should be given the caching available on the host and the SSD disk. System and user CPU usage are also pretty low so most of the time the CPU is just idle. More specifically the host is 75% idle (i.e. uses less than 1 core) more than 50% of the time.
The why is still somewhat of a mystery to me. Idle time can result from the audio tests (waiting for the buffered sound to play) and network tests (waiting for network data). There are also a few places where we wait for some operation to time out but surely not that many? So how can we eliminate this idle time and speed up the tests?
Memory: 4 GB
------------
A test with 8 GB shows that adding memory does not help the tests or allow them to run faster.
I prefer limiting how much memory the VMs use because I expect it to result in smaller live snapshots: w10pro64's disk image shot from 14 GB to 53 GB when I added the 13 live snapshots. That works out to about 3 GB per live snapshot (disk COW + RAM). Interestingly that's less than the VM's amount of memory, which means QEmu does not save the unused memory. But I suspect QEmu still saves Windows' disk cache, so increasing the memory would result in bigger snapshots.
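In the domain XML this is just the two memory elements; as far as I can tell the 8 GB test only meant bumping these values (again a sketch, not the exact production XML):

  <memory unit='GiB'>4</memory>
  <currentMemory unit='GiB'>4</currentMemory>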
Clock: HPET
-----------
Initially the guest was using a significant amount of CPU on the host even when Windows was doing nothing. It turns out this is because by default libvirt does not add the HPET timer. Adding the following line fixed this:
  <clock offset='localtime'>
    [...]
    <timer name='hpet' present='yes'/>
  </clock>
Disk: Virtio SCSI + unmap
-------------------------
The SCSI Virtio driver is the recommended configuration and I manually set the discard mode to unmap to prevent qcow2 bloat (is that QEmu's default?).
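For the record the disk configuration looks something like this (the image path is just a placeholder and the exact controller/target details may differ on the production VM):

  <controller type='scsi' model='virtio-scsi'/>
  <disk type='file' device='disk'>
    <!-- discard='unmap' passes the guest's TRIM commands through so the
         qcow2 image does not keep growing -->
    <driver name='qemu' type='qcow2' discard='unmap'/>
    <source file='/var/lib/libvirt/images/w10pro64.qcow2'/>
    <target dev='sda' bus='scsi'/>
  </disk>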
Then I tested the disk performance with ATTO. https://www.atto.com/disk-benchmark/
* In its default configuration ATTO uses a small 128 MB test file. Since such a small file easily fits in the OS' cache ATTO uses fsync-like functionality to ensure it tests the disk performance rather than the memory's.
* But in the default QEmu configuration (writeback mode) caching still occurs outside the VM, which fools ATTO and results in read and write speeds in the GB/s range on a SATA SSD (see w10pro64_scsi+default+unmap.png). But then our tests don't write all that much to disk, so this cached scenario is actually quite representative of our workload. All in all this means the default configuration should provide more than fast enough disk access.
* The results are the same when caching is explicitly set to writeback, so writeback is indeed QEmu's default. (see wtbw10pro64_scsi+writeback+unmap.png)
* I also ran an ATTO test with a bigger file size (see w10pro64_scsi+default+unmap+4GB.png). We then clearly see writes being capped by the SSD speed while reads still benefit from the host cache. This shows that disk performance is still ok even when writing more data.
* Some sites recommend setting io.mode=threads but that forces cache.mode=none or directsync. That prevents the host from doing extra caching, and then ATTO shows the true underlying disk performance. I think that configuration makes sense when one wants to be sure the VM's filesystem will remain in a consistent state after a host crash or power outage. But in such a case we would just revert the VM to the last snapshot and continue. So the default configuration gives us better disk performance. (see w10pro64_scsi+directsync+native+unmap.png and, for comparison, directsync alone in w10pro64_scsi+directsync+unmap.png; the corresponding libvirt driver lines are sketched below)
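For comparison, here is roughly what the driver line looks like in the default configuration versus the directsync variants from the screenshots (just a sketch of the relevant attributes, not copied from the test VMs):

  <!-- production configuration: QEmu's default writeback caching plus discard -->
  <driver name='qemu' type='qcow2' discard='unmap'/>

  <!-- the directsync and directsync+native variants from the screenshots;
       they bypass the host page cache so ATTO sees the raw SSD speed -->
  <driver name='qemu' type='qcow2' cache='directsync' discard='unmap'/>
  <driver name='qemu' type='qcow2' cache='directsync' io='native' discard='unmap'/>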
Hi Francois,
Francois Gouget <fgouget@codeweavers.com> wrote:
> So I have put w10pro64 into production.
> As the name implies this is a 64-bit Windows 10 Professional VM. What the name does not say is that it runs the latest version of Windows 10: 2004. That means it has more failures than the others... for now.
> The goal is to use it to balance the load across two VM hosts. So it will run the various language tests, always against the latest Windows 10 release, while w1064 will deal with the previous Windows 10 releases and other configurations such as dual-screen and (hopefully) PCI passthrough.
> Right now w10pro64 also runs the dual-screen tests because it has a newer QXL driver that should have fewer failures (bug 48926) but that should change after I update w1064.
Thanks for working on this. Are there any plans to add Kerberos and LDAP/Active Directory (using Samba or native AD) to the testing infrastructure?
On Wed, 23 Sep 2020, Dmitry Timoshkov wrote:
[...]
> Thanks for working on this. Are there any plans to add Kerberos and LDAP/Active Directory (using Samba or native AD) to the testing infrastructure?
Not at this time.