On 8/31/19 11:47 AM, Francois Gouget wrote:
On Fri, 30 Aug 2019, Rémi Bernon wrote:
On 8/30/19 3:03 PM, Marvin wrote:
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=56052
Your paranoid android.
=== build (build log) ===
Task errors: BotError: The VM is not powered on
I did a successful run with the same patch here: https://testbot.winehq.org/JobDetails.pl?Key=56051
Yes, here's what happened:
When it has nothing to do, the TestBot picks some VMs and starts them up in advance, in the hope that they will be needed by the next job.
Because the build VM provides the Windows binaries for testing on Windows, it's needed by almost every job. So it's given a high priority, ends up being prepared in advance, and is thus recorded by the TestBot as being in the idle state.
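As a toy illustration of that pre-warming idea (not the real Engine code, which works quite differently; the priority field and the max_idle limit are just assumptions for the sketch):

  from dataclasses import dataclass

  @dataclass
  class VM:
      name: str
      priority: int        # higher = more likely to be needed, e.g. the build VM
      status: str = "off"

  def prewarm_idle_vms(vms, max_idle=2):
      # When the TestBot has nothing to do, start the most useful powered-off
      # VMs in advance so the next job does not have to wait for a boot/revert.
      candidates = sorted((vm for vm in vms if vm.status == "off"),
                          key=lambda vm: vm.priority, reverse=True)
      for vm in candidates[:max_idle]:
          vm.status = "idle"   # in reality this kicks off a revert/boot of the VM
      return vms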
But then there was a power outage so all the VMs got powered off.
But the TestBot server is in a separate location and was not powered off, so it was not aware that the VMs had lost power. The thing is, these days the Engine never uses libvirt directly because those calls are blocking: if it tries to communicate with a dead VM host, or one where libvirt is hosed, the calls can block for a long time (up to 10 minutes), which would block the Engine for all that time. Instead it assumes the information it has in its database about the VM is accurate, and forks a process whenever it needs to perform an operation on a VM, whether that's running a task, shutting it down or reverting it.
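Very roughly, the idea looks like this (just a Python sketch of the general approach, not the actual Engine code; the libvirt URI and the "vmhost" name are made up for the example):

  import os
  import sys

  def check_vm_powered_on(vm_name):
      # The Engine itself never calls libvirt: it forks a child so that a
      # blocking call to a dead or hosed VM host can only stall the child.
      pid = os.fork()
      if pid != 0:
          return pid           # parent returns immediately and keeps scheduling
      # Child: this is the part that may block for several minutes.
      try:
          import libvirt       # assumes the libvirt Python bindings are available
          conn = libvirt.open("qemu+ssh://vmhost/system")   # made-up URI
          dom = conn.lookupByName(vm_name)
          if not dom.isActive():
              print("BotError: The VM is not powered on", file=sys.stderr)
              os._exit(1)
      except Exception as e:
          print(f"BotError: {e}", file=sys.stderr)
          os._exit(1)
      os._exit(0)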
So it just scheduled the task on the build VM as usual. But the child process could not communicate with the VM, checked its state and complained that there was an error: "The VM is not powered on".
What went wrong is that it marked the task as failed. A better recovery mechanism would have been to mark the VM as either "dirty" or "offline" and put the task back in the queued state so the TestBot tries running it again.
The risk is that if the VM is unusable for a reason that is not an external factor (as it was here), the next round is likely to produce the same result, leading the TestBot to try running the same highest-priority task again and again on the one borked VM.
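Something along these lines, for instance (again just a sketch, not a patch; the error-count threshold is an arbitrary assumption to cap the retry loop described above):

  from dataclasses import dataclass

  @dataclass
  class Task:
      status: str = "queued"

  @dataclass
  class VMState:
      status: str = "idle"
      error_count: int = 0

  MAX_VM_ERRORS = 3   # arbitrary cap so one borked VM cannot loop forever

  def handle_task_error(task, vm, message):
      if not message.startswith("BotError"):
          task.status = "failed"       # a real build/test failure: report it
          return
      # A TestBot-side error: blame the VM, not the patch.
      vm.error_count += 1
      vm.status = "offline" if vm.error_count >= MAX_VM_ERRORS else "dirty"
      task.status = "queued"           # retry later, ideally once the VM recovers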
Finally, the reason you won't see that job as failed if you look at it now is that I restarted it. The user who submitted a job that failed due to a TestBot error gets a button to restart it. A user can only restart his own jobs, and I'm not sure that would have been possible in this case since the job came from a wine-devel email (but the administrator gets to restart anyone's jobs ;-).
Anyway I'll see about tweaking the task scripts to avoid this situation in the future.
Thanks for the details!