I upgraded and synchronized the configuration of all four VM hosts. So now they either all work great or are all broken ;-)
They are now all on Debian 8.6 with three pieces from backports:
* The Linux kernel is now 4.8.11-1~bpo8+1.
* The new kernel required linux-base to be upgraded to 4.3~bpo8+1.
* QEMU is now 1:2.7+dfsg-3~bpo8+2.
* libvirt is still 1.2.9-9+deb8u3 as there is nothing newer for Debian 8. This required using a small script to get it to play nice with QEMU 2.7 (see attached file).
I synchronized their configuration by diffing their /etc directories and editing the files to remove differences (Cluster SSH is really nice for that). They should now essentially be identical (there are obvious differences in hostnames, ssh server keys, etc).
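For readers who want to repeat that kind of comparison, here is a minimal sketch of diffing two copies of /etc with Python's standard filecmp module. It is not the method I actually used (that was plain diffing plus Cluster SSH); the function name diff_trees and the idea of having both trees available locally are assumptions made for the illustration.

```python
import filecmp
import os


def diff_trees(left, right):
    """Return a sorted list of (kind, relative_path) differences between two trees."""
    differences = []

    def walk(rel):
        cmp = filecmp.dircmp(os.path.join(left, rel), os.path.join(right, rel))
        for name in cmp.left_only:
            differences.append(("only in left", os.path.join(rel, name)))
        for name in cmp.right_only:
            differences.append(("only in right", os.path.join(rel, name)))
        # shallow=False forces a byte-by-byte comparison instead of trusting
        # matching size and modification time.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            differences.append(("differs", os.path.join(rel, name)))
        for name in cmp.common_dirs:
            walk(os.path.join(rel, name))

    walk("")
    return sorted(differences)
```

Files that are expected to differ, such as the hostname or the ssh server keys, would still show up in the result and have to be filtered out by hand.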
The upgrade and configuration syncing did not solve the performance issue on vm1 so that mystery is still intact.
The build VM went offline a few times. Fortunately it turns out it was my fault. What happened is that on December 16 the TestBot failed to recreate the wtb live snapshot. Given that the VM was in an unknown state, I reverted it to the wtbbase8 powered off snapshot and recreated the live snapshot from that. However wtbbase8 still had the old 3.16 kernel that was causing the build VM to regularly go offline. So this time I went back to wtbbase8, upgraded the kernel again, took a new wtbbase8.1 powered off snapshot for the next time I need one, and then recreated the wtb live snapshot.
Now the question is why did the TestBot fail to recreate the wtb snapshot? The only theory I have right now is that the network glitched somewhere between deleting the old snapshot and creating the new one. It's the first time this happened so hopefully it won't happen again any time soon. Still, at some point it would be nice to change the procedure to one that can just be re-run if it fails. That means reverting to a different snapshot than the one we delete and recreate.
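To make the idea concrete, here is a rough sketch of what a re-runnable procedure could look like. It is not the TestBot's code: FakeHost is a purely hypothetical in-memory stand-in for the hypervisor, and recreate_live_snapshot is a name invented for the example. Steps such as booting the VM and waiting for it to settle before taking the live snapshot are left out.

```python
class FakeHost:
    """Hypothetical in-memory stand-in for a VM host, not a real hypervisor API."""

    def __init__(self, snapshots):
        self.snapshots = set(snapshots)
        self.current = None

    def revert(self, name):
        if name not in self.snapshots:
            raise KeyError(name)
        self.current = name

    def delete(self, name):
        # Deleting a snapshot that is already gone is not an error.
        self.snapshots.discard(name)

    def create(self, name):
        self.snapshots.add(name)
        self.current = name


def recreate_live_snapshot(host, base, live):
    """Rebuild the live snapshot from a base snapshot that is never deleted."""
    # Start from the base snapshot so a missing live snapshot cannot block us.
    host.revert(base)
    # Harmless if an earlier, interrupted run already removed the live snapshot.
    host.delete(live)
    host.create(live)
```

Because the procedure never depends on the live snapshot existing, a run that dies between the delete and the create can simply be started again.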
Also as you can see from the WineTest results page the 64 bit Windows 8 and Windows 10 VMs no longer crash while running WineTest.
* On w1064 the Windows crash and reboot was caused by ntdll:exception. The workaround is to tell kvm to ignore accesses to unsupported MSRs. See the links in TestBot bug 40240 for more details. Unfortunately this needs to be set after every boot and I forgot to do so when I upgraded the hosts, so there's a gap of a few days. But I have now added an init script that should take care of that automatically on boot. http://bugs.winehq.com/show_bug.cgi?id=40240
* On w864 the Windows freeze was caused by rasapi32:rasapi. The workaround is to configure access to the VM through Spice rather than VNC. Somehow this makes a difference for a bunch of tests even though no client connects to the VM while it's running the tests. See TestBot bug 42185. https://bugs.winehq.org/show_bug.cgi?id=42185
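For the w1064 workaround, the sketch below shows roughly what a boot-time helper could do. It assumes the setting in question is the kvm module's ignore_msrs parameter, exposed at /sys/module/kvm/parameters/ignore_msrs. The actual init script is not shown here and may well differ; set_ignore_msrs is a name made up for the example.

```python
from pathlib import Path

# Assumed location of the kvm module parameter; writing to it requires root.
IGNORE_MSRS = Path("/sys/module/kvm/parameters/ignore_msrs")


def set_ignore_msrs(param=IGNORE_MSRS):
    """Enable the parameter and report whether it reads back as enabled."""
    param.write_text("1\n")
    # Boolean module parameters read back as "Y"; a plain file reads back "1".
    return param.read_text().strip() in ("1", "Y")
```

The path is a parameter only so that the function can be tried against an ordinary file without touching the real host.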
I also updated the TestBot test suite and put it up on GitHub. The wtbsuite, as I call it, is a set of patches that apply on top of Wine and which can be submitted in bulk to the TestBot to verify that it works as expected. The patches strive to exercise all the situations the TestBot can run into, like patches that don't apply, build failures, timeouts, tests that crash, patch sets, etc. When appropriate the patches contain a reference to the relevant TestBot bug. You can find the test suite here: https://github.com/fgouget/wine/tree/wtbsuite