My apologies for this last bout of WineTestBot brokenness.
It all started when the VM host froze late last Friday and could not be remotely rebooted. So Newman power-cycled it Monday morning. He also suggested upgrading the kernel in the hope that this would avoid further crashes, which I agreed to. Then I decided to upgrade to QEMU 1.7 in the hope of fixing the Dr6 ntdll:exception failures, or at least of being in a better position to do so. Of course that entails redoing all the live VM snapshots (1) but I was prepared to do so. Then things went south.
The issues
----------
I initially ran into some SELinux incompatibilities and then into QEMU/Libvirt incompatibilities (2). Then the VMs were suspending after a few seconds, which turned out to be because the disk was full. My fault: I keep too many VM backups, so the new VM backups I created finished filling it. That corrupted some VMs which, ironically, I was able to restore from backup after deleting older backups. The host now has a sane amount of free disk space and I monitor it closely.
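For illustration, a free-space check of the kind mentioned above could look like the following Python sketch. It is not the monitoring actually used on the VM host: the function names and the 15% threshold are assumptions made for this example.

```python
import shutil


def has_enough_free(total_bytes, free_bytes, min_free_fraction=0.15):
    """Return True if at least min_free_fraction of the filesystem is free.

    The 15% default is an arbitrary value chosen for this sketch.
    """
    if total_bytes <= 0:
        # Some pseudo-filesystems report a size of zero; treat them as not OK.
        return False
    return free_bytes / total_bytes >= min_free_fraction


def free_space_ok(path, min_free_fraction=0.15):
    """Check the filesystem that holds `path`, e.g. the VM backup directory."""
    usage = shutil.disk_usage(path)
    return has_enough_free(usage.total, usage.free, min_free_fraction)
```

Running `free_space_ok()` on the directory that holds the VM images before creating a new backup would catch the "disk full" case before QEMU starts suspending the guests.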
But the real problem is that now the VMs get corrupted after a bit of use. This manifests through several symptoms:
* Sometimes the build VM will detect EXT4 filesystem corruption, remount '/' as read-only and obviously stop working properly.
* Sometimes no filesystem corruption is detected but the content of files gets corrupted. For instance this is what caused all the 'Missing build status line' errors when half of the WineTestBot Build.pl script got lost. I suspect this also caused a round of build failures related to memcmp() being missing.
* Sometimes the wtbbuild ends up at the grub prompt and complains that it cannot find the filesystem. Since these are live snapshots this is presumably preceded by a crash+reboot of the guest.
* Sometimes, after qemu has been properly stopped, a 'qemu-img check' finds a lot of errors. Sometimes it finds none despite the VM being broken.
* The Windows 2000 VM also sometimes goes south: it resets and fails to reboot, complaining that some checksum is corrupted (probably that of the Windows boot loader).
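For reference, the 'qemu-img check' step above can be scripted. The Python sketch below runs the check on a stopped VM's disk image and interprets the exit code. The exit-code meanings are taken from the qemu-img man page of recent QEMU releases and may differ on versions as old as 1.7. The helper names are made up for this example.

```python
import subprocess

# Exit codes of 'qemu-img check' as listed in the qemu-img man page
# (assumption: older releases may not use all of these).
CHECK_RESULTS = {
    0: "no errors found",
    1: "check could not be completed (internal error)",
    2: "image is corrupted",
    3: "leaked clusters, but no corruption",
    63: "checks are not supported by this image format",
}


def describe_check(returncode):
    """Translate a 'qemu-img check' exit code into a short message."""
    return CHECK_RESULTS.get(returncode, "unknown exit code %d" % returncode)


def check_image(image_path):
    """Run 'qemu-img check' on a disk image; the VM must be stopped first."""
    proc = subprocess.run(
        ["qemu-img", "check", image_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, describe_check(proc.returncode), proc.stdout
```

As noted in the list, a clean result is not proof that the guest is healthy: the check only validates the image format's own metadata, not the filesystem inside it.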
Of course QEMU behaves while I'm updating a VM's live snapshot. It's only after I've spent time doing so and creating a backup that it breaks the VM (thus also casting doubt on the trustworthiness of the new backup).
Further compounding the problems, once the WineTestBot is on a tear compiling a bunch of patches on the lone build VM it's unstoppable (3). It also tends to get stuck whenever I restart libvirt or the VM host (a known issue I was working on before this episode).
The strange thing is that QEMU 1.7 seems to work fine on my test environment. However it now appears that QEMU has at least three very different codepaths:
* User-mode emulation, which is slow and should not be used.
* kvm_intel, which uses the Intel VMX instructions, is used on my (Intel) test environment, has the icebp bug, but seems to otherwise be reliable.
* kvm_amd, which uses the AMD SVM instructions, is used by the WineTestBot's (AMD) VM host, has the Dr6 bug, and has been unreliable since last weekend.
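To tell which of these codepaths a given host would use, one can look at which KVM kernel module is loaded. The Python sketch below parses the text of /proc/modules, where the first field of each line is the module name. The function names are made up for this example.

```python
def loaded_kvm_backend(proc_modules_text):
    """Return 'kvm_intel', 'kvm_amd', or None given the text of /proc/modules."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field is the module name.
        if fields and fields[0] in ("kvm_intel", "kvm_amd"):
            return fields[0]
    # Neither module is loaded, so hardware virtualization is unavailable.
    return None


def read_loaded_kvm_backend(path="/proc/modules"):
    """Convenience wrapper that reads the module list from a Linux host."""
    with open(path) as modules_file:
        return loaded_kvm_backend(modules_file.read())
```

Comparing the result on a test box and on the production host makes it obvious when the two are not exercising the same kernel code.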
As kvm_amd and kvm_intel are kernel modules, the kernel version might actually be more important than the QEMU one. Indeed my latest tests seem to indicate that reverting the kernel from 3.13 to 3.2.0 fixes the VM corruption issues (I also tested 3.14rc7 which did not help).
A corollary is that any QEMU tests I can do in my home environment are likely to be poor predictors of what will happen on the production VM host. Indeed it's because QEMU 1.7.0 seems to work fine here that I decided to upgrade the VM host.
Short term goal
---------------
Restore the WineTestBot to a working state!
The current hope is that just reverting to a pre-3.13 kernel, maybe 3.2.0, will do the trick. That would make it possible to stick with QEMU 1.7.0 (Debian 7.0 (Stable/Wheezy) has 1.1.2 which is really too old, but Wheezy-Backports has moved on to 1.7.0 so there are no easily accessible 1.6.0 packages to go back to).
Longer term goals
-----------------
* Solve the VM host crashes. The memory has been tested quite extensively already, and I did a badblocks pass on the hard drive. Neither found anything. I could then test the processor using PrimeNet, but the MRTG graphs did not indicate a tendency to overheat or other such problems. It now seems the crashes may be caused by the 3.2.0 kernel, and probably specifically by the kvm/kvm_amd modules. So finding a more recent kernel that actually works might help.
* While I hope the VM host will never crash again, it would be nice to be able to remotely power-cycle it.
* It will also be necessary to fix the stability issues in the 3.13 and 3.14 kvm_amd module. However, given that I don't have an AMD box at hand and already have way too many other things to do, I don't see how that's going to happen. Maybe through a bug report once I get a better handle on this. Still, the VM host cannot remain stuck on 3.2.0 indefinitely.
* Given that on AMD the Dr6 bug still seems to be present in QEMU 2.0/Linux 3.13, it still needs to be fixed. (And the icebp one on Intel would be nice too).
* Fix the mysterious 'network timeout' errors we get while waiting for a WineTest task to complete. Unfortunately the first set of patches to tackle them were not really conclusive. So maybe switch to plan B, i.e. blindly reconnect to work around them. That would be quite unsatisfying though.
* Then resume work on making the WineTestBot Engine (more) resilient to network outages and VM host crashes. I started patches for that but working on them it became clear that this was entangled with proper handling and diagnostics after 'network timeout' errors.
* Then resume work on all the other features and bugs of the WineTestBot.
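To make the 'plan B' for the network timeouts concrete, here is a minimal Python sketch of blindly reconnecting and retrying. It is not WineTestBot code. The `operation` and `reconnect` callables, the retry count and the delay are all placeholders chosen for this example.

```python
import time


def call_with_reconnect(operation, reconnect, attempts=3, delay=1.0,
                        sleep=time.sleep):
    """Call operation(); on a network error, reconnect and try again.

    operation and reconnect are hypothetical callables supplied by the
    caller. The last error is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError) as error:
            last_error = error
            if attempt + 1 < attempts:
                # Wait a bit, then re-establish the connection blindly,
                # without trying to understand why the timeout happened.
                sleep(delay)
                reconnect()
    raise last_error
```

This shows why the approach is unsatisfying: it hides the timeouts instead of explaining them, and it is only safe if repeating the operation after a reconnect has no side effects.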
(1) QEmu 1.7.0 cannot restore a 1.6.0 live snapshot made in qemu-system-x86_64 https://bugs.launchpad.net/qemu/+bug/1259499
(2) A known issue caused by QEMU changing the 'qemu-system-x86_64 -cpu help' output format which is parsed by Libvirt to figure out which kinds of CPUs can be emulated.
(3) Bug 35946 - Cannot mark a VM for maintenance if it is running a task http://bugs.winehq.org/show_bug.cgi?id=35946
Francois,
We can get an Intel based motherboard rapidly - would that be a faster path forward than trying to understand the AMD failures?
Cheers,
Jeremy
On Wed, 9 Apr 2014, Francois Gouget wrote: [...]
As kvm_amd and kvm_intel are kernel modules, the kernel version might actually be more important than the QEMU one. Indeed my latest tests seem to indicate that reverting the kernel from 3.13 to 3.2.0 fixes the VM corruption issues (I also tested 3.14rc7 which did not help).
This is confirmed:
* The 3.2.0 kernel does not corrupt the VMs but might be causing the VM host crash that happens once in a while.
* The 3.12, 3.13.1 and 3.14rc7 kernels corrupt the VMs in short order and thus are not usable.
Anyway, I have reverted the VM host to the 3.2.0 kernel and restored the build and Windows 2000 VMs. I'm also hoping the git apply failures will be fixed after doing tonight's Git update.
The tests also confirm that the Dr6 bug is present all the way to the 3.13.1 kernel and is fixed in 3.14rc7. Huw probably found the commit that fixed it so there's potentially hope of backporting it:
73aaf249ee2287b4686ff079dcbdbbb658156e64 http://o.cs.uvic.ca:20810/perl/cid.pl?cid=73aaf249ee2287b4686ff079dcbdbbb658...
I could theoretically restore all the VMs, redo their live snapshots for QEMU 1.7.0, and probably also do a Windows update for XP and greater since the most up-to-date configurations are now essentially a year old, but that will take a day or two. The live snapshots would also have to be redone if we switch to Intel (but the powered-off post-Windows update snapshots should be reusable).
The advantage of switching to Intel is that it seems to be more tested upstream, and that both Huw and I have Intel systems. So it would let us detect and potentially fix such issues before changing the WineTestBot configuration. It may also fix the occasional host crashes. But this state of affairs is pretty disappointing.
So if the Intel config can be put together in a few days it may be best to restore a minimal set of VMs, and restore the full set after the new hardware has proven itself.
Newman is quick; I think it's reasonable to believe it will have an Intel motherboard by this time tomorrow.
Cheers,
Jeremy
The new VM box is operational. It has an Intel Xeon E3-1230 and thus relies on kvm_intel instead of kvm_amd. But with the 3.13.1 kernel the VMs are still getting corrupted after a few hours :-( I have no idea why and I've never seen that issue on my personal box so I also have no idea how to figure out what's really going on.
So the VM host is back to the 3.2.0 kernel. Since then I have not had a VM go bad.
Then I moved on to restoring and updating the VMs. That's now done except for the two Vista VMs which I'll get to next. Here's what's new:
* w2000pro - Windows 2000 Professional
  IPX (bug 35239) and AppleTalk support has been activated for its network card. I also 'inserted' the MemTest86 CD for the kernel32:volume test (bug 31780).
* wxppro - Windows XP Professional
  It has all the updates it will ever get since its support period has now expired. As before it has WinPcap installed.
* w2008s64 - Windows 2008 Server (64-bit)
  This VM now has two network cards.
* w7u - Windows 7 Ultimate
  As with the other VMs I applied all Windows updates up to 2014/04/15, which means it now has Internet Explorer 11. As it's an Ultimate version I installed all the languages. Currently it's set up for Japanese but I can set it up for another language if desired. I also 'inserted' the MemTest86 CD so one can compare it with the equivalent in the Windows 2000 VM.
* w7pro64 - Windows 7 Professional (64-bit)
  It has all the latest Windows updates so it has Internet Explorer 11 too. However it's a plain English environment so you can detect locale issues by comparing its results with those of the w7u VM.
* w8 - Windows 8
  This has all the latest updates which means it's now running Windows 8.1 plus the Windows 8.1 Update. As before it also has the optional DirectX components, msxml4, and the Visual C++ runtime environments. And of course it now has Internet Explorer 11.
* w864 - Windows 8 (64-bit)
  Like the previous VM it has all the updates and thus is running the Windows 8.1 Update and Internet Explorer 11, but without all the extras.
If there's interest it's possible to run a VM from another snapshot, to test Windows 8.1 pre-Windows 8.1 for instance.
Hi Francois,
First of all, thanks for the hard work to get WineTestBot working.
On 04/18/14 23:25, Francois Gouget wrote:
- w7pro64 - Windows 7 Professional (64-bit) It has all the latest Windows updates so it has Internet Explorer 11 too.
FWIW, I think upgrading to IE11 was a bad decision. Existing test results made it clear that this would cause new failures on those VMs. While ideally we should fix the tests on IE11 (I already fixed most of them, but quite a few still remain), the upgrade should have been better coordinated. Recovering from a bad breakage of WineTestBot seems like the worst possible moment for increasing the number of failures. This also makes the base VMs unreliable for quite a few tests again, for no good reason. We already knew about those failures, so the upgrade could have waited a bit longer. We're now a lot further from having at least one succeeding win7 box again.
Jacek