https://bugs.winehq.org/show_bug.cgi?id=44688
Bug ID: 44688
Summary: Detect stuck processes
Product: Wine-Testbot
Version: unspecified
Hardware: x86
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: unknown
Assignee: wine-bugs@winehq.org
Reporter: fgouget@codeweavers.com
Distribution: ---
Sometimes a TestBot worker process can get stuck.
This can happen to LibvirtTool.pl, particularly when dealing with offline VMs.
But it can also happen to regular scripts like WineRunTask.pl when using TestAgent to send or retrieve a file.
In both cases the TestBot Engine should have a way to detect stuck processes and simply kill them.
To detect stuck processes, add two fields to the VM table:
  ChildStarted - The current child process start timestamp.
  ChildTimeout - How long the current child process is allowed to run.
Most of our tasks already have timeouts so it's just a matter of reusing this timeout and adding some leeway. For the revert and offline tasks we could use 5 and 60 minutes respectively. Then the Jobs::_CheckAndClassifyVMs() method can check those fields and kill the stuck processes. This works because the Engine's SafetyNet() method schedules jobs every 10 minutes as a fallback, so the check still runs periodically even if no other event triggers the scheduler.
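As a rough illustration of the check described above, here is a minimal Python sketch (the TestBot itself is written in Perl, so this is not the actual implementation). The VM class, the is_stuck() and kill_stuck() helpers and the injectable kill parameter are all hypothetical names; only ChildStarted, ChildTimeout and the 5 and 60 minute values come from the proposal. ChildTimeout is assumed to already include the leeway.

```python
import os
import signal
from dataclasses import dataclass
from typing import Optional

# Timeouts suggested in the proposal, in seconds.
REVERT_TIMEOUT = 5 * 60
OFFLINE_TIMEOUT = 60 * 60


@dataclass
class VM:
    """Hypothetical stand-in for a row of the VM table."""
    name: str
    child_pid: Optional[int] = None
    child_started: Optional[float] = None  # ChildStarted, epoch seconds
    child_timeout: Optional[float] = None  # ChildTimeout, seconds incl. leeway


def is_stuck(vm: VM, now: float) -> bool:
    """True if the VM has a child that has run longer than allowed."""
    if vm.child_pid is None or vm.child_started is None or vm.child_timeout is None:
        return False
    return now - vm.child_started > vm.child_timeout


def kill_stuck(vms, now, kill=os.kill):
    """Kill every stuck child and clear its fields; return the VM names."""
    killed = []
    for vm in vms:
        if is_stuck(vm, now):
            kill(vm.child_pid, signal.SIGKILL)
            vm.child_pid = vm.child_started = vm.child_timeout = None
            killed.append(vm.name)
    return killed
```

Because the check would only run when jobs get scheduled, a stuck process could survive for up to one scheduling interval past its timeout; with the 10 minute SafetyNet() fallback that seems acceptable.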
The reason for using two fields instead of a single ChildDeadline one is that the ChildStarted field could be useful to know which period to analyze when collecting the Munin statistics (currently we analyze an arbitrary period of time that's supposed to cover the worst case).
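To make the trade-off concrete, here is a small Python sketch (again only an illustration, with hypothetical helper names): the deadline can always be derived from the two fields, whereas the start time could not be recovered from a lone ChildDeadline value.

```python
def child_deadline(child_started: float, child_timeout: float) -> float:
    """The deadline a single ChildDeadline field would have stored."""
    return child_started + child_timeout


def analysis_period(child_started: float, now: float) -> tuple:
    """Hypothetical: the (start, end) period to analyze in the statistics."""
    return (child_started, now)
```

For example, a child started at t=1000 with a 300 second timeout has a deadline of 1300, and at t=1200 the period to analyze would be (1000, 1200).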