https://bugs.winehq.org/show_bug.cgi?id=48166
Bug ID: 48166
Summary: test.winehq.org Provide a way to track individual failures
Product: WineHQ.org
Version: unspecified
Hardware: x86
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: www-unknown
Assignee: wine-bugs@winehq.org
Reporter: fgouget@codeweavers.com
Distribution: ---
A test unit may have multiple unrelated test failures. Some may fail on recent Windows 10 machines while others only happen in certain locales or with specific graphics cards. Untangling these can actually be automated.
The core idea is to merge the failures of two reports together while associating each failure with the tag of the report(s) it originates from. To do so, diff the two failure lists:
* Failures that are present in both reports get both tags.
* Failures that are only present in the first report get only that tag.
* Likewise for failures that are present only in the second report.
All failures are integrated into the merged list in the order they are returned by Algorithm::Diff.
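The merge step above can be sketched as follows. This is a minimal illustration using Python's difflib in place of Algorithm::Diff; the function and sample messages are hypothetical, not taken from the actual site code.

```python
from difflib import SequenceMatcher

def merge_failures(list1, tag1, list2, tag2):
    """Merge two failure lists; each entry becomes (message, set_of_tags)."""
    merged = []
    sm = SequenceMatcher(None, list1, list2, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            # Present in both reports: the failure gets both tags.
            merged += [(msg, {tag1, tag2}) for msg in list1[i1:i2]]
        else:
            # Only in the first report, only in the second, or a
            # 'replace' block: keep each side with its own tag.
            merged += [(msg, {tag1}) for msg in list1[i1:i2]]
            merged += [(msg, {tag2}) for msg in list2[j1:j2]]
    return merged

merged = merge_failures(['a', 'b', 'c'], 'vm1', ['b', 'd'], 'vm2')
for msg, tags in merged:
    print(msg, sorted(tags))
```

Failure 'b' ends up with both tags while the others keep only the tag of the report they came from.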
Once two failure lists have been merged together, more failure lists can be merged in the same way. This makes it possible to get a unified list of the failures for a given commitid; and appending that commitid to the tags makes it possible to build a complete list covering all the available history.
One can then group together all failures that have the exact same set of tag+commitid combinations. Since the failures in different groups do not all happen together, they must depend on different factors.
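The grouping step can be sketched like this. The data layout (a mapping from failure message to the set of "tag:commitid" strings where it occurred) and the sample messages are illustrative assumptions.

```python
from collections import defaultdict

def group_failures(failures):
    """failures: dict mapping failure message -> frozenset of tag:commitid
    strings where it occurred. Returns groups of messages that share the
    exact same occurrence set."""
    groups = defaultdict(list)
    for msg, occurrences in failures.items():
        groups[frozenset(occurrences)].append(msg)
    return list(groups.values())

failures = {
    'console.c: got 16, expected 6': frozenset({'vm1:c1', 'vm1:c2'}),
    'console.c: got 17, expected 6': frozenset({'vm1:c1', 'vm1:c2'}),
    'file.c: sharing violation':     frozenset({'vm2:c2'}),
}
print(group_failures(failures))
```

The two console.c failures always occur together, so they land in one group; the file.c failure depends on a different factor and forms its own group.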
Intermittent failures, timeouts, crashes
----------------------------------------
If two failures are related but a random timeout or crash sometimes occurs between them they might end up being incorrectly split in two separate groups.
So if a crash or a timeout occurs, any other failure in that run should be ignored when grouping failures together.
This can be achieved by prefixing the tag with a '*' if a crash or timeout occurred. Then, when grouping failures together, ignore the entries where the tag starts with a '*'. But when building the occurrence pattern, do use the entries starting with a '*' so the pattern shows all the test runs where at least one of the failures in the group occurred.
Because entries starting with a '*' are ignored when building failure groups, failures that cause a crash/timeout, as well as the crash/timeout line itself, will not be part of any failure group. So do a second pass over the unassigned failures, this time not ignoring the entries with a '*'. This will create groups composed of the crash/timeout and any related failures.
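The two-pass scheme can be sketched as below. This is a minimal illustration assuming each failure maps to a set of tags, with '*'-prefixed tags marking runs where a crash or timeout occurred; all names and sample data are hypothetical.

```python
def group_two_pass(failures):
    """failures: dict mapping failure message -> set of tags.
    Tags starting with '*' mark runs where a crash/timeout occurred."""
    def key(tags, ignore_crashed):
        if ignore_crashed:
            tags = {t for t in tags if not t.startswith('*')}
        return frozenset(tags)

    groups = {}
    unassigned = {}
    # Pass 1: ignore crashed runs when computing the grouping key.
    for msg, tags in failures.items():
        k = key(tags, True)
        if k:
            groups.setdefault(k, []).append(msg)
        else:
            # Only ever seen in crashed runs: defer to the second pass.
            unassigned[msg] = tags
    # Pass 2: group the leftovers (crash lines and the failures that only
    # happen alongside them) using the full, unfiltered tag sets.
    for msg, tags in unassigned.items():
        groups.setdefault(key(tags, False), []).append(msg)
    return list(groups.values())

failures = {
    'test failed A':     {'vm1', '*vm2'},
    'test failed B':     {'vm1', '*vm2'},
    'crash line':        {'*vm2'},
    'pre-crash failure': {'*vm2'},
}
print(group_two_pass(failures))
```

Failures A and B are grouped by their non-crashed run despite also appearing in the crashed one, and the second pass pairs the crash line with the failure that only ever accompanies it.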
New failures
------------
This analysis indicates where and when a given failure group happened. This means it can also detect new failures.
It would not be useful to define a new failure as one that never happened before the latest commit: blink and you might miss it. Instead the definition should be expanded to all failures that only happened in the most recent half of the available history. This may sometimes falsely treat rare intermittent failures as new, but that should be rare enough.
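The check can be sketched as a simple set intersection, taking "new" to mean the failure group only occurred in the more recent half of the known history. The function name and commit ids are illustrative.

```python
def is_new(group_commits, history):
    """group_commits: set of commitids where the failure group occurred.
    history: list of commitids, oldest first.
    A group is 'new' if it never occurred in the older half."""
    older_half = set(history[:len(history) // 2])
    return not (group_commits & older_half)

history = ['c1', 'c2', 'c3', 'c4']
print(is_new({'c3', 'c4'}, history))  # True: only in the recent half
print(is_new({'c1', 'c4'}, history))  # False: already seen long ago
```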
Presentation of the results
---------------------------
The results can be presented on a page with one box per test unit, like the 'Full Report' pages. A 'details' link under the test unit name on the test failures pattern page would link to the relevant section of the full page.
Inside each box there would be a sequence of lines showing the failures in a group, followed by the usual pattern showing which machines the failure happens on; then the next failure group, etc. For instance:
console.c:270: Test failed: got 16, expected 6
console.c:275: Test failed: got 16, expected 6
.....F..F...F..F.mmm Win8 vm1
......FF.e...FF.e..F Win8 vm1-ja
096c:console: unhandled exception c0000005 at 6F384E33
.....CC Win8 vm2-new
As usual the items in the pattern would link to the relevant pages, allowing one to dig deeper into the issue. The same color coding would be used for the pattern, but since all the runs in a failure group have the same set of failures, only one color would be needed for the F code.
The failure line number will change from one run to the next, so either zero it out or retain a single value picked at random. Similarly, if the failure message contains timing information (see bug 48094), remove it.
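A message-normalization step along those lines could look like this. The exact regular expressions, in particular the timing format, are assumptions for illustration; see bug 48094 for the real timing messages.

```python
import re

def normalize(msg):
    # Zero out the line number: "console.c:270:" -> "console.c:0:"
    msg = re.sub(r'^([a-z0-9_.]+):\d+:', r'\1:0:', msg)
    # Strip a trailing "in 12.3s"-style timing (hypothetical format).
    msg = re.sub(r' in \d+(\.\d+)?s$', '', msg)
    return msg

print(normalize('console.c:270: Test failed: got 16, expected 6'))
```

With line numbers and timings normalized away, the same logical failure produces identical messages across runs, so the diff-based merge can match them up.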
If a failure is identified as new, put its lines in bold orange, like on the TestBot. This will allow quick identification of the new failures on the page.
--- Comment #1 from François Gouget <fgouget@codeweavers.com> ---
The proposed merge algorithm does not quite work.
The problem is that when looking at the diff we don't know if the '-' lines come before or after the '+' ones, or if they are interleaved.
To simplify things, assume each failure message is a single digit. Then if we have rep1 = [1 2 3] and rep2 = [5], we don't know whether the merge should be:

rep1+2 = [1 2 3 5]
     or = [1 2 5 3]
     or = [1 5 2 3]
     or = [5 1 2 3]
The impact is on later merges: if rep1+2 = [1 2 3 5] and rep3 = [5 3], then the diff will give us:

-1
-2
-3
 5
+3

and the merge would be something like [1 2 3 5 3].
So now failure 3 is in two places, making it appear as if those were two separate failures. Future merges will match either one or the other, so that the analysis will get an incomplete picture of when and where the failure happened.
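The problem is easy to reproduce. In the sketch below, Python's difflib stands in for Algorithm::Diff (the tags are dropped to keep it short); its diff happens to anchor on a different common element than in the example above, but the outcome is the same: one failure ends up duplicated in the merged list.

```python
from difflib import SequenceMatcher

def merge(a, b):
    """Naive diff-based merge of two failure lists, tags omitted."""
    out = []
    sm = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            out += a[i1:i2]
        else:
            out += a[i1:i2] + b[j1:j2]
    return out

m = merge([1, 2, 3, 5], [5, 3])
print(m)  # one of the failures now appears twice
```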
Having more context could help avoid these issues, but only so long as the extra context does not itself change, so that approach may not work well in practice.
Another approach would be to consider that failure messages are unique. But that assumption is not really all that true:

egrep '(Test failed|Test succeeded)' report | \
  sed -Ee 's/^([a-z0-9.]*):[0-9]*:/\1:/' | \
  sort | uniq -c | sort -n | egrep -v '^ *1 '