I'm sending this patchset here for review and to provide an overview of what's to come. It's quite large so it shouldn't be committed in one go on the TestBot production system (hence the lack of the 'PATCH' prefix and signing off, those will go with the actual submission of the final patches).
The main goal of this patchset is to reduce the "new error" false positives caused by intermittent failures (bug 47998). But fixing this has a number of impacts which makes the patchset larger.
Some context: the TestBot detects new failures introduced by a patch by comparing its result to that of the last WineTest run. Any failure not present in the WineTest run is considered to be caused by the patch. This breaks down when a failure is intermittent and just happens on the patch's test run.
To avoid this the pathset modifies the TestBot to use all past WineTest runs as the comparison point. Only failures not present in any of them are considered new and attributed to the patch.
But the TestBot does the new failures detection when generating the JobDetails page (and also when sending the emails). This means parsing the WineTest results (600KB to 1.5MB) to extract failures, the same for the test results, and diffing them, for each test configuration (typically about 20), every time the page is loaded or refreshed. This results in typical load times between 0.2 and 2 seconds on my test system (up to 5 seconds on the official TestBot) (bug 48035). Annoying but not a major issue. But these can put quite a load on the web server when a rogue spider decides to crawl the TestBot site as happened a few times in december. Using all WineTest results on record instead of just the latest would increase the average load times by a factor of 20. This would make the page load time unacceptable and put too much load on the web server.
This means the JobDetails performance issues must be tackled first. Initially my idea was to just cache the errors lists. But in fact it's just as easy to also cache whether each error is new and thus do the detection only once. So the patchset introduces .errors files which contain only the errors extracted from a given task's log, whether each error is new, and its line number in the full report which allows for some future enhancements. With this change, generating the JobDetails page takes 0.01 to 0.15 seconds on my test system.
The structures used to manipulate the error lists were not practical and hardly extensible. In particular there was not a single list with a flag denoting new errors, but two lists, one with all the errors and another with only the new errors (see TagNewErrors()). Furthermore the error groups and errors in each group were returned separately (see GetLogErrors()), making them annoying to pass around and even harder to extend with more information such as line numbers. So the early patches rework the way errors are stored in memory to simplify usage and make their representation more extensible.
Before the patchset the WineRun* scripts would parse the test reports and store the extra validation errors in .err files (e.g. the too much output errors). But it's just as easy to store those in the .errors files, making the .err ones redundant. So they are removed on upgrade.
The web server is the wrong place to generate these 'cache' .errors files, in part because it does not have write access to the jobs/ directory. It's more logical to have each WineRun* script generate those when it completes. Then the .errors files are available when WineSendLog and JobDetails need them.
Because the new failures detection was performed each time JobDetails was run, it could happen long after the job completed. By that time there could be new WineTest results so simply using the latest ones would have changed the list of new failures of past jobs which would have made no sense. With the .errors files the new failures detection is performed only once when a task completes. So keeping a copy of the WineTest reference files is not really necessary anymore. But it's still useful to have them to diagnose issues.
The WineRunBuild script used to copy the latest WineTest reports to each of the job's step directories when the build completed. Instead it's much more logical to have each WineRun* script capture the WineTest reports it needs when it completes its task and to store them in the task's directory.
The reference WineTest results are stored in the latest/ directory so each needs a unique name. That name used to be of the form '<vmname>_<reportname>', for instance debian10_win32_fr_FR.report'. So this naming scheme was updated to allow for more than one report per test configuration. Specifically it now includes the job id since there's only one report per job for a given test configuration and this makes it easy to know where the reference report is coming from. Also underscores are allowed in both the VM names and report names which makes it hard to split the filename into its components. So the underscore was replaced by dashes. So that gives 'debian10-job63174-win32_fr_FR.report'.
The Engine used to perform the renaming / upgrade operations on startup. But generating the .errors files for every single job takes several minutes which makes it inappropriate as an Engine startup task. Also in order to work on the patchset it is necessary to be able to test code both before and after files were moved around and .errors files added.
So a new tool, UpdateTaskLogs was added that can rebuild all generated files from scratch and allows dealing with upgrades and downgrades. It's introduced before any changes to the way the logs are handled and updated to perform the required renames and to rebuild the generated files. This allows upgrading the content of the latest/ and jobs/ directories when needed. Also when reverting to an earlier version of the code it can mostly restore the content of these directories to their original state, thus making it possible to test the old code. And should things go wrong on the production server this will also provide a way to roll back the state of the TestBot.
The TestBot engine redirects the WineRun* scripts stderr to log.err in order to capture Perl errors. This makes log.err different from all other .err files since it cannot be rebuilt from the content of the corresponding log file. So the patchset renames log and log.err to task.log and testbot.log respectively. Of course UpdateTaskLogs deals with the needed renames on upgrade. This is also the reason for the 'mostly' in the previous section: UpdateTaskLogs does not revert these renames on downgrades (but that's easy to deal with with a find -exec command).
Another issue is that the WineRun* scripts obviously cannot generate the .errors file for testbot.log. So instead this file is generated by WineSendLog which is called as soon as the overall job is complete. Normally testbot.log is empty and thus ignored anyway. Also the JobDetails page needs to deal with cases where the .errors file is missing in any case. So when that happens it simply points the user to the full log.
Finally, if we have two WineTest reports, one with errors A and B, and an older one with only error B, then the second report is useless as far as detecting new failures go. So one can speed up new failure detection by only keeping the useful reports. With the new/old status being stored in the .errors file this is not really crucial since it only slows down the WineRun* scripts. The patchset still includes a patch to prune the redundant WineTest reports. However it's not very effective because some tests produce errors that change with every run. Those will clearly need to be fixed.
Francois Gouget (25): testbot: Standardize building the reference report filenames. testbot/LogUtils: Don't rename the old logs on Engine startup. testbot/UpdateTaskLogs: Delete, rebuild or add reference reports. testbot/LogUtils: Tag new errors instead of filtering them. testbot/LogUtils: Return the log errors as a single object. testbot/LogUtils: Add _AddLogError() and _AddLogGroup(). testbot/LogUtils: Rename the $Summary variables to $LogInfo. testbot/LogUtils: Tweak ParseTaskLog() to reduce the indentation level. testbot/LogUtils: Return the ParseWineTestReport() results as a single object. testbot/LogUtils: Add BadRef/NoRef to TagNewErrors(). testbot/LogUtils: Rename the ParseTaskLog() NoLog error field to BadLog. testbot/JobDetails: Highlight new errors in the full log too. testbot/LogUtils: Keep track of the first line of each error group. testbot/LogUtils: Store the filename in the $LogInfo structures. testbot: Share code to access the latest reference WineTest reports. testbot/UpdateTaskLogs: Delete reference reports that are not supposed to exist. testbot: Move reference logs to the task and prepare for having more than one. testbot: Separate the per-task task and testbot logs. testbot/LogUtils: Reorder the functions to get a more logical grouping. testbot: Cache the log error lists in .errors files. testbot/JobDetails: Improve the error reporting. testbot/LogUtils: Share code between _SaveLogErrors() and _DumpErrors(). testbot/LogUtils: Return the errors in the Parse*() $LogInfo structure. testbot: Take into account multiple WineTest reports. testbot/LogUtils: Deduplicate the latest WineTest reports.
testbot/bin/Engine.pl | 14 - testbot/bin/Janitor.pl | 9 +- testbot/bin/UpdateTaskLogs | 559 +++++++++++++ testbot/bin/WineRunBuild.pl | 49 +- testbot/bin/WineRunReconfig.pl | 24 +- testbot/bin/WineRunTask.pl | 78 +- testbot/bin/WineRunWineTest.pl | 118 +-- testbot/bin/WineSendLog.pl | 143 ++-- testbot/lib/WineTestBot/LogUtils.pm | 1161 ++++++++++++++++++--------- testbot/lib/WineTestBot/Tasks.pm | 50 +- testbot/web/JobDetails.pl | 146 +++- testbot/web/WineTestBot.css | 2 + 12 files changed, 1700 insertions(+), 653 deletions(-) create mode 100755 testbot/bin/UpdateTaskLogs