https://bugs.winehq.org/show_bug.cgi?id=44688
Bug ID: 44688
Summary: Detect stuck processes
Product: Wine-Testbot
Version: unspecified
Hardware: x86
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: unknown
Assignee: wine-bugs@winehq.org
Reporter: fgouget@codeweavers.com
Distribution: ---
Sometimes a TestBot worker process can get stuck.
This can happen to LibvirtTool.pl, particularly when dealing with offline VMs.
But it can also happen to regular scripts like WineRunTask.pl when using TestAgent to send or retrieve a file.
In both cases the TestBot Engine should have a way to detect stuck processes and simply kill them.
To detect stuck processes, add two fields to the VM table:
ChildStarted - The current child process start timestamp.
ChildTimeout - How long the current child process is allowed to run.
Most of our tasks already have timeouts so it's just a matter of reusing this timeout and adding some leeway. For the revert and offline tasks we could use 5 and 60 minutes respectively. Then the Jobs::_CheckAndClassifyVMs() method can check those fields and kill the stuck processes. This works because the Engine's SafetyNet() method schedules jobs every 10 minutes as a fallback.
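As a rough illustration of the check described above, here is a minimal sketch in Python (the TestBot itself is written in Perl). Only the ChildStarted and ChildTimeout field names come from the proposal; the `VM` class, the `is_stuck()` helper and the 60-second leeway are assumptions made for this example.

```python
import time
from dataclasses import dataclass
from typing import Optional

LEEWAY = 60  # seconds of slack added to the timeout (assumed value)


@dataclass
class VM:
    """Hypothetical stand-in for a row of the VM table."""
    name: str
    child_started: Optional[float] = None  # ChildStarted, in epoch seconds
    child_timeout: Optional[float] = None  # ChildTimeout, in seconds


def is_stuck(vm: VM, now: Optional[float] = None) -> bool:
    """Return True if the VM's child process outlived its timeout plus leeway."""
    if vm.child_started is None or vm.child_timeout is None:
        return False  # no child process is running for this VM
    if now is None:
        now = time.time()
    return now > vm.child_started + vm.child_timeout + LEEWAY
```

A periodic pass over the VMs could call `is_stuck()` on each one and kill the child process of any VM for which it returns True.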
The reason for using two fields instead of a single ChildDeadline one is that the ChildStarted field could be useful to know which period to analyze when collecting the Munin statistics (currently we analyze an arbitrary period of time that's supposed to cover the worst case).
François Gouget fgouget@codeweavers.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
--- Comment #1 from François Gouget fgouget@codeweavers.com ---
This is done. The TestBot now also detects stuck revert processes, retries the revert and avoids infinite loops.
commit fca14eaa91f0ba811c5b90c028f6e824d07f8742
Author: Francois Gouget fgouget@codeweavers.com
Date: Thu Jun 7 00:29:38 2018 +0200
testbot: Also mark the VM for maintenance if the reverts get stuck.
When a VM takes a long time to revert the LibvirtTool.pl process typically remains stuck in the Sys::Virt::DomainSnapshot::revert_to() call and cannot enforce the timeout itself, thus causing the timeout to be detected at the TestBot Engine level.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
commit 02681e7b8fd3186f446add2391e69c87ca3a00df
Author: Francois Gouget fgouget@codeweavers.com
Date: Wed May 16 11:22:10 2018 +0200
testbot: Detect VM revert loops.
VM revert loops typically happen when a VM is misconfigured such that the TestBot fails to access the TestAgent daemon after reverting it. This results in the VM being put offline until it is accessible again through Libvirt. Since that is already the case, the VM is immediately put back online and reverted again, leading to a new error. With this patch, a VM that has too many consecutive errors is put in maintenance mode for an administrator to look at.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
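The revert loop protection described in this commit could be sketched as follows. This is a Python illustration, not the actual Perl patch: the `VMState` class, the `record_result()` helper, the status strings and the threshold of 3 are all assumptions.

```python
MAX_CONSECUTIVE_ERRORS = 3  # assumed threshold, not taken from the patch


class VMState:
    """Hypothetical per-VM bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.status = "online"
        self.errors = 0  # consecutive errors since the last success


def record_result(vm, success):
    """Update the VM after a revert attempt and return its new status."""
    if vm.status == "maintenance":
        return vm.status  # only an administrator takes it out of maintenance
    if success:
        vm.errors = 0
        vm.status = "online"
    else:
        vm.errors += 1
        if vm.errors >= MAX_CONSECUTIVE_ERRORS:
            vm.status = "maintenance"  # break the offline/revert loop
        else:
            vm.status = "offline"
    return vm.status
```

Resetting the counter on success means that only an unbroken run of errors sends the VM to maintenance, so occasional transient failures do not add up over time.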
commit b65f81e0770dab83249bd472c7c5feb3e57267ec
Author: Francois Gouget fgouget@codeweavers.com
Date: Mon May 14 13:21:49 2018 +0200
testbot: Tweak the 'Putting VM offline' email.
Emphasize that the TestBot is still monitoring the VM.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
commit 129508e7e60130f979b6715db00114d5100011a4
Author: Francois Gouget fgouget@codeweavers.com
Date: Mon May 14 13:21:30 2018 +0200
testbot: Requeue the task in case the script gets stuck.
Count how many times the task has been requeued to avoid infinite loops, just like the scripts themselves normally do.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
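The bounded requeue idea from this commit can be illustrated with a small Python sketch. The task dictionary layout, the `handle_stuck_task()` name and the limit of 2 requeues are assumptions for the example, not values from the TestBot sources.

```python
MAX_REQUEUES = 2  # assumed limit on how often a stuck task is retried


def handle_stuck_task(task):
    """Requeue a task whose script was killed, or give up after too many tries."""
    if task["requeues"] < MAX_REQUEUES:
        task["requeues"] += 1
        task["status"] = "queued"  # will be picked up again by the scheduler
    else:
        task["status"] = "failed"  # stop retrying to avoid an infinite loop
    return task["status"]
```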
commit 2cd1475bf6269db4979499e8a63a6873f4e74362
Author: Francois Gouget fgouget@codeweavers.com
Date: Fri May 11 00:15:22 2018 +0200
testbot: Reschedule at the latest when the next task times out.
This ensures we catch stuck tasks in a timely fashion. Note that we still reschedule every 10 minutes to catch any issues but the scheduler handles this itself instead of relying on SafetyNet().
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
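One way to picture the rescheduling rule from this commit is the following Python sketch. The `next_reschedule()` function and its arguments are hypothetical; it also assumes that deadlines which have already passed are dealt with by the current scheduling pass.

```python
SAFETY_INTERVAL = 10 * 60  # the 10 minute fallback, in seconds


def next_reschedule(now, deadlines):
    """Return the time of the next scheduling pass.

    `deadlines` holds the times at which running tasks time out. The next
    pass happens at the earliest upcoming deadline, and no later than
    `now + SAFETY_INTERVAL`.
    """
    fallback = now + SAFETY_INTERVAL
    upcoming = [d for d in deadlines if d > now]
    return min(upcoming + [fallback])
```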
commit a5d7bc263b1e355ee8b522812c8a0961d1d9d116
Author: Francois Gouget fgouget@codeweavers.com
Date: Fri May 11 00:14:52 2018 +0200
testbot/Engine: Let event handlers add / remove events.
This makes it possible to handle events that happen at irregular intervals: the event is created as non-repeating and the event handler computes when the next event should happen and adds it.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
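The pattern of a non-repeating event whose handler schedules its own successor might look like this in Python. The `EventQueue` class and `make_irregular_handler()` are invented for the illustration and do not mirror the Engine's actual Perl API.

```python
import heapq
import itertools


class EventQueue:
    """Hypothetical minimal event queue ordered by due time."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker so handlers are never compared

    def add(self, when, handler):
        heapq.heappush(self._heap, (when, next(self._order), handler))

    def run_until(self, end):
        """Run every event due at or before `end`, in time order."""
        while self._heap and self._heap[0][0] <= end:
            when, _, handler = heapq.heappop(self._heap)
            handler(self, when)  # the handler may call self.add() again


def make_irregular_handler(delays, fired):
    """Build a handler that re-adds itself after each delay in `delays`."""
    remaining = iter(delays)

    def handler(queue, now):
        fired.append(now)
        delay = next(remaining, None)
        if delay is not None:
            queue.add(now + delay, handler)  # schedule the next occurrence

    return handler
```

Because each occurrence is added only when the previous one runs, the interval between occurrences can change every time, which a fixed repeating event cannot express.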
commit 3ce81c0c6cf9e9829608441ab943da0503492d1d
Author: Francois Gouget fgouget@codeweavers.com
Date: Wed May 9 02:45:31 2018 +0200
testbot: Detect and kill stuck task scripts.
The tasks themselves have a timeout which the corresponding scripts enforce. However, the scripts themselves may get stuck, typically due to network problems. When that happens they can end up blocking the whole TestBot. So make sure the TestBot engine itself can detect stuck scripts and take corrective action. Note that the detection is not very timely but will happen at the latest in the SafetyNet() function. This means there will be at most a 10-minute delay.
Signed-off-by: Francois Gouget fgouget@codeweavers.com
Signed-off-by: Alexandre Julliard julliard@winehq.org
Austin English austinenglish@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED
--- Comment #2 from Austin English austinenglish@gmail.com ---
Closing.