Last week the power supply of the gateway to the TestBot VMs blew up. Fortunately Newman was able to quickly replace that box.
Then I rebooted vm1 but it remained stuck at grub stage 1 due to a grub regression. I would normally recover from such situations by booting off the KVM's virtual USB CD drive, but vm1 would not boot from it. In the end I got vm1 booting off PXE, which is going to be my go-to option for such situations because it really works everywhere [1]. Then I rebooted all the other VM hosts to fix the grub issue in case they were impacted.
But on the 18th ntdll:exception started causing the Windows 10 VM to crash (trick: check the screenshot). That matched the vm3 reboot so I first suspected BIOS settings but I have now confirmed that the regression was caused by the upgrade from the 4.19.0-9 kernel to 4.19.0-10!
Even with 4.19.0-9 each run of ntdll:exception generates the following kernel messages:
[ 295.717619] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de713 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
[ 295.717650] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de77c ignored rdmsr: 0x1c9
[ 295.717666] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de78a ignored rdmsr: 0x40
[ 295.717682] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de7a1 ignored rdmsr: 0x60
[... last 4 lines repeated two more times ...]
[ 295.717859] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de713 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
[ 295.717884] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de77c ignored rdmsr: 0x1c9
[ 295.717907] kvm [9344]: vcpu0, guest rIP: 0xfffff803be7de713 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop
[... last line repeated 5 more times ...]
With 4.19.0-10 the Windows 10 VM crashes on the first trace. The workaround is to add "options kvm ignore_msrs=1" to a .conf file in /etc/modprobe.d/. One still gets the same set of kernel traces but no crash anymore [2]. So that lets w1064 run ntdll:exception and WineTest successfully again.
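A quick way to confirm that the option actually took effect after reloading the module or rebooting is to read the parameter back from sysfs. The following is only a minimal sketch: it assumes kvm is built as a module and exposes its parameters under /sys/module/kvm/parameters/, and the helper name ignore_msrs_enabled is made up for illustration.

```python
from pathlib import Path

# Assumption: kvm is loaded as a module and exposes its parameters in sysfs.
KVM_IGNORE_MSRS = Path("/sys/module/kvm/parameters/ignore_msrs")


def ignore_msrs_enabled(param_file=KVM_IGNORE_MSRS):
    """Return True or False for the current setting, None if the file is missing."""
    try:
        value = Path(param_file).read_text().strip()
    except FileNotFoundError:
        # kvm is not loaded, or it is not exposing this parameter.
        return None
    # Boolean module parameters read back as Y or N; accept 1 as well.
    return value in ("Y", "y", "1")


if __name__ == "__main__":
    print("kvm ignore_msrs:", ignore_msrs_enabled())
```

Running it on each VM host after the reboot should be enough to spot a host where the modprobe configuration was not picked up.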
Various web sources describe ignore_msrs=1 as a workaround until the kernel kvm module is fixed, which I don't find very satisfying. vm4 is not doing much these days (it runs wtbdebian10 and wxppro which are both deprecated) so I experimented with the Debian Backports on it (like all VM hosts it's running Debian 10). That allowed me to upgrade the kernel to 5.7.0-0.bpo.2 and QEMU from 3.1+dfsg-8+deb10u7 to 5.0-14~bpo10+1. And that allowed me to confirm that this MSR issue is still present :-( At least the same workaround still works.
It also prepares the ground for getting a newer QEMU without having to move to Debian Testing. Maybe that will prove useful in time.
vm1 and vm2 also have slightly different rdmsr / wrmsr messages:
Aug 18 13:29:02 vm1 kernel: [ 1684.092430] kvm [4996]: vcpu0, guest rIP: 0xffffffff8645eb63 ignored rdmsr: 0x606
Aug 18 13:29:02 vm1 kernel: [ 1684.229925] kvm [4996]: vcpu0, guest rIP: 0xffffffff8645eb63 ignored rdmsr: 0x611
Aug 18 13:29:02 vm1 kernel: [ 1684.229950] kvm [4996]: vcpu0, guest rIP: 0xffffffff8645eb63 ignored rdmsr: 0x639
Aug 18 13:29:02 vm1 kernel: [ 1684.229967] kvm [4996]: vcpu0, guest rIP: 0xffffffff8645eb63 ignored rdmsr: 0x641
Aug 18 13:29:02 vm1 kernel: [ 1684.229983] kvm [4996]: vcpu0, guest rIP: 0xffffffff8645eb63 ignored rdmsr: 0x619

Aug 18 08:21:50 vm2 kernel: [2464330.778885] kvm [21199]: vcpu0, guest rIP: 0xffffffff81a5eb63 ignored rdmsr: 0xc0010048
Aug 18 09:49:30 vm2 kernel: [2469591.803298] kvm [8420]: vcpu0, guest rIP: 0xffffffffad25eb63 ignored rdmsr: 0xc0010048
Aug 18 11:33:54 vm2 kernel: [2475855.580083] kvm [29506]: vcpu0, guest rIP: 0xffffffff9885eb63 ignored rdmsr: 0xc0010048
Aug 18 13:53:22 vm2 kernel: [ 1063.562138] kvm [18219]: vcpu0, guest rIP: 0xffffffff8ea5eb63 ignored rdmsr: 0xc0010048
Aug 19 03:38:16 vm2 kernel: [50561.578179] kvm [25600]: vcpu0, guest rIP: 0xffffffffabe5eb63 ignored rdmsr: 0xc0010048
And nothing for the 20th, 21st or 22nd. That means they don't happen every day and thus are not triggered by WineTest. So for now I'm ignoring them.
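Checking which days have these messages is easy to script. The following is a rough sketch, assuming syslog-style lines in the format quoted above; the regular expression and the function name tally_ignored_rdmsr are illustrative and not part of any TestBot tooling. It counts "ignored rdmsr" messages per host, day and MSR number.

```python
import re
from collections import Counter

# Assumption: lines look like the syslog excerpts quoted above, e.g.
# "Aug 18 13:29:02 vm1 kernel: [ 1684.092430] kvm [4996]: ... ignored rdmsr: 0x606"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\S+\s+(?P<host>\S+)\s+kernel:"
    r".*?ignored rdmsr: (?P<msr>0x[0-9a-fA-F]+)"
)


def tally_ignored_rdmsr(lines):
    """Count ignored rdmsr messages keyed by (host, 'Mon D', msr)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            day = f"{m['month']} {int(m['day'])}"
            counts[(m["host"], day, m["msr"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: feed the kernel log on stdin (which file holds it depends on the host).
    for (host, day, msr), n in sorted(tally_ignored_rdmsr(sys.stdin).items()):
        print(f"{host} {day} {msr}: {n}")
```

Days that do not show up in the output had no such messages, which makes it easy to compare against the WineTest schedule.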
And that's it for today.
[1] And despite what some web pages claim, you don't need access to the DHCP server to set up a PXE server. https://manski.net/2016/09/pxe-server-on-existing-network-dhcp-proxy-on-ubun...
[2] There's also a kvm option for silencing the traces but I don't think there's much point: better know what's going on.