My apologies for this last bout of WineTestBot brokenness.
It all started when the VM host froze late last Friday and could not be remotely rebooted, so Newman power-cycled it Monday morning. He also suggested upgrading the kernel in the hope that this would avoid further crashes, which I agreed to. Then I decided to upgrade to QEMU 1.7 in the hope of fixing the Dr6 ntdll:exception failures, or at least of being in a better position to do so. Of course that entails redoing all the live VM snapshots (1) but I was prepared to do so. Then things went south.
The issues
----------
I initially ran into some SELinux incompatibilities and then into QEMU/Libvirt incompatibilities (2). Then the VMs were suspending after a few seconds, which turned out to be because the disk was full. My fault: I keep too many VM backups so the new VM backups I created finished filling it. That corrupted some VMs which, ironically, I was able to restore from backup after deleting older backups. The host now has a sane amount of free disk space and I monitor it closely.
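For what it's worth, the kind of free-space check I mean can be sketched in a few lines of Python. This is only an illustration, not what actually runs on the VM host: the function names, the 10% threshold and the path are all made up for the example.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path="/", min_free=0.10):
    """Return True if less than min_free of the filesystem is free.

    The 10% default is an arbitrary illustrative threshold.
    """
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Hypothetical usage: warn before creating yet another VM backup.
    if is_low_on_space("/"):
        print("Low on disk space: delete old VM backups first")
    else:
        print("Free space: %.0f%%" % (100 * free_fraction("/")))
```

Running something like this before each new backup would have caught the problem before the disk filled up.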
But the real problem is that now the VMs get corrupted after a bit of use. This manifests through several symptoms:
* Sometimes the build VM will detect EXT4 filesystem corruption, remount '/' as read-only and obviously stop working properly.
* Sometimes no filesystem corruption is detected but the content of files gets corrupted. For instance this is what caused all the 'Missing build status line' errors when half of the WineTestBot Build.pl script got lost. I suspect this also caused a round of build failures related to memcmp() being missing.
* Sometimes the wtbbuild ends up at the grub prompt and complains that it cannot find the filesystem. Since these are live snapshots this is presumably preceded by a crash+reboot of the guest.
* Sometimes, after qemu has been properly stopped, a 'qemu-img check' finds a lot of errors. Sometimes it finds none despite the VM being broken.
* The Windows 2000 VM also sometimes goes south: it resets and fails to reboot, complaining that some checksum is corrupted (probably that of the Windows boot loader).
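As an aside, the 'qemu-img check' step mentioned above could be scripted along the following lines. This is a rough sketch, not what I actually run: the helper names are made up, and the exit-code table is the one from the qemu-img man page of recent releases, so I'm assuming it also applies to the version used here. Also, as noted above, a clean result does not prove the VM is healthy.

```python
import subprocess

# Exit codes of 'qemu-img check' as listed in the qemu-img man page of
# recent releases (assumption: older versions may differ).
CHECK_STATUS = {
    0: "consistent",
    1: "check not completed (internal error)",
    2: "corrupted",
    3: "leaked clusters, not corrupted",
    63: "check not supported by this image format",
}


def describe_check_status(returncode):
    """Translate a 'qemu-img check' exit code into a short description."""
    return CHECK_STATUS.get(returncode, "unknown exit code %d" % returncode)


def check_image(path):
    """Run 'qemu-img check' on path and return (exit code, description).

    Only run this while the VM using the image is stopped.
    """
    result = subprocess.run(["qemu-img", "check", path],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return result.returncode, describe_check_status(result.returncode)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: pass the disk image paths on the command line.
    for image in sys.argv[1:]:
        code, text = check_image(image)
        print("%s: %s" % (image, text))
```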
Of course QEMU behaves while I'm updating a VM's live snapshot. It's only after I've spent time doing so and creating a backup that it breaks the VM (thus also casting doubt on the trustworthiness of the new backup).
Further compounding the problems, once the WineTestBot is on a tear compiling a bunch of patches on the lone build VM it's unstoppable (3). It also tends to get stuck whenever I restart libvirt or the VM host (a known issue I was working on before this episode).
The strange thing is that QEMU 1.7 seems to work fine on my test environment. However it now appears that QEMU has at least three very different codepaths:
* User-mode emulation which is slow and should not be used.
* kvm_intel, uses the Intel VMX instructions, is used on my (Intel) test environment, has the icebp bug, but seems to otherwise be reliable.
* kvm_amd, uses the AMD SVM instructions, is used by the WineTestBot's (AMD) VM host, has the Dr6 bug, and has been unreliable since last weekend.
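To illustrate the codepath split above: on Linux the CPU flags in /proc/cpuinfo show which virtualization extension a host has ('vmx' for Intel VMX, 'svm' for AMD SVM), and thus which KVM module applies. Here is a minimal sketch; the function name is made up and it only looks at the first 'flags' line that carries one of the two flags.

```python
def kvm_module_for(cpuinfo_text):
    """Guess the KVM kernel module from /proc/cpuinfo-style text.

    Returns 'kvm_intel' if the 'vmx' flag is present, 'kvm_amd' if the
    'svm' flag is present, and None if neither is found (in which case
    QEMU would have to fall back to slow emulation).
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "kvm_intel"
            if "svm" in flags:
                return "kvm_amd"
    return None


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            print(kvm_module_for(cpuinfo.read()))
    except OSError:
        print("cannot read /proc/cpuinfo (not a Linux host?)")
```

On my test environment this would report kvm_intel, and on the WineTestBot's VM host kvm_amd, which is exactly why the two behave so differently.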
As kvm_amd and kvm_intel are kernel modules, the kernel version might actually be more important than the QEMU one. Indeed my latest tests seem to indicate that reverting the kernel from 3.13 to 3.2.0 fixes the VM corruption issues (I also tested 3.14rc7 which did not help).
A corollary is that any QEMU tests I can do in my home environment are likely to be poor predictors for what will happen on the production VM host. Indeed it's because QEMU 1.7.0 seems to work fine here that I decided to upgrade the VM host.
Short term goal
---------------
Restore the WineTestBot to a working state!
The current hope is that just reverting to a pre-3.13 kernel, maybe 3.2.0, will do the trick. That would make it possible to stick with QEMU 1.7.0 (Debian 7.0 (Stable/Wheezy) has 1.1.2 which is really too old but Wheezy-Backports has moved on to 1.7.0 so there are no easily accessible 1.6.0 packages to go back to).
Longer term goals
-----------------
* Solve the VM host crashes. The memory has been tested quite extensively already, and I did a badblocks pass on the hard drive. Neither found anything. I could then test the processor using PrimeNet, but the MRTG graphs did not indicate a tendency to overheat or other such problems. It now seems the crashes may be caused by the 3.2.0 kernel, and probably specifically by the kvm/kvm_amd modules. So finding a more recent kernel that actually works might help.
* While I hope the VM host will never crash again, it would be nice to be able to remotely power-cycle it.
* It will also be necessary to fix the stability issues in the 3.13 and 3.14 kvm_amd module. However given that I don't have an AMD box at hand and already have way too many other things to do, I don't see how that's going to happen. Maybe through a bug report once I get a better handle on this. Still, the VM host cannot remain stuck on 3.2.0 indefinitely.
* Given that on AMD the Dr6 bug still seems to be present in QEMU 2.0/Linux 3.13, it still needs to be fixed. (And fixing the icebp one on Intel would be nice too.)
* Fix the mysterious 'network timeout' errors we get while waiting for a WineTest task to complete. Unfortunately the first set of patches to tackle them was not really conclusive. So maybe switch to plan B, i.e. blindly reconnect to work around them. That would be quite unsatisfying though.
* Then resume work on making the WineTestBot Engine (more) resilient to network outages and VM host crashes. I started patches for that but while working on them it became clear that this was entangled with proper handling and diagnostics after 'network timeout' errors.
* Then resume work on all the other features and bugs of the WineTestBot.
(1) QEmu 1.7.0 cannot restore a 1.6.0 live snapshot made in qemu-system-x86_64 https://bugs.launchpad.net/qemu/+bug/1259499
(2) A known issue caused by QEMU changing the 'qemu-system-x86_64 -cpu help' output format which is parsed by Libvirt to figure out which kinds of CPUs can be emulated.
(3) Bug 35946 - Cannot mark a VM for maintenance if it is running a task http://bugs.winehq.org/show_bug.cgi?id=35946