https://bugs.winehq.org/show_bug.cgi?id=48164
Bug ID: 48164
Summary: test.winehq.org should provide an efficient way to detect new failures
Product: WineHQ.org
Version: unspecified
Hardware: x86
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: www-unknown
Assignee: wine-bugs@winehq.org
Reporter: fgouget@codeweavers.com
Distribution: ---
Problem
-------
test.winehq.org does not allow performing the following tasks efficiently:

1. Detecting when a new failure slips past the TestBot. One can detect new
   failures on the per-test-unit page when specific columns turn red. But
   quite often the test unit already has failures, so one has to look at the
   specific number of failures. Furthermore, many test units currently have
   failures, so this requires checking 80+ pages individually.
2. Detecting when the results on a VM degrade. After upgrading a machine it's
   useful to compare it to its previous results. But the results for each
   date are on separate pages, so again it's necessary to check the
   per-test-unit result pages.

3. Comparing the results of two machines of different platforms. For
   instance, comparing the results of running Windows 8 to those of running
   Windows 10 on the same hardware.
Other things that have been asked for:

4. Sometimes it would be nice to have only the failures, without all the
   lines about skipped tests and todos.
5. In some cases it would also be nice to have pages with only the failures
   that happen on TestBot VMs, since these are usually easier to reproduce.
Jeremy's page
-------------
Jeremy's test summary page can help with some of that: https://www.winehq.org/~jwhite/latest.html
But:
* It's not integrated with test.winehq.org, which makes it hard to find.
* There are only two states, Success and Failed. So it does not help when a
  test goes from having 2 failures to 4, or when it has a set of systematic
  failures and a set of intermittent ones.
* The failed / success pattern is not per VM, which masks some patterns and
  does not help with point 2.
Proposal
--------
A modified version of Jeremy's page could be integrated with test.winehq.org:
* It would actually be a pair of 'Failures' pages, one for TestBot VMs and one for all test results. Both would be linked to from the top of the main index page, for instance using the same type of 'prev | next' text links used on the other pages.
* Jeremy's result matrix would be extended from three to four dimensions:
  test units, test results, time, and number/type of failures.
* As before the results would be grouped per test unit in alphabetical order.
  Only the test units having at least one error, recent or not, would be
  shown. This could again be in the form of an array (like the 'full report'
  pages on test.winehq.org) or simply of test unit titles (in the style of
  the TestBot jobDetails pages) with the information about each test unit
  inside. Clicking on the test unit name would link to its 'test runs' page
  on test.winehq.org.
* For each test unit there would be one line per test result having errors.
  The first part of the line would have one character per commit for the
  whole history available on test.winehq.org. That character would indicate
  whether the test failed, and more. The second part of the line would be
  the test result's platform and tag. The lines would be sorted by platform
  and then alphabetically.
* Each test result would get a one character code:

    .    Success
    F    Failure
    C    Crash
    T    Timeout
    m    Missing dll (foo=missing or other error code)
    e    Other dll error (foo=load error 1359 and equivalent)
    _    No test (the test did not exist)
    ' '  No result (the machine did not run the tests that day)
* These codes would be shown using a monospace font so they would form a
  pattern across time and test results:

    .....F..F...F..F.mmm  Win8  vm1
    .....FFFFeFFFFFFeFFF  Win8  vm1-ja
                 ...TTCC  Win8  vm2-new
    ......eF...F...F..F.  Win10 vm3
* Each character would have a tooltip containing details like the meaning of
  the letter, the number of failures, or the dll error message. Each
  character would also link to the corresponding section of the test report.
* In addition to the character, the background would be color coded to make
  patterns more visible (a rendering sketch follows this item):

    .    Green
    F    Green to yellow to red gradient
    C    Dark red
    T    Purple/pink
    m    Cyan
    e    Dark blue
    _    Light gray
    ' '  White
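  To make this concrete, here is a hypothetical sketch of how one pattern
  character could be emitted as HTML (Python; the exact RGB values, the
  helper name and the URL scheme are assumptions, not part of the proposal):

    # Fixed background colors per result code. 'F' is absent on purpose:
    # it gets the per-test-unit gradient described in the next item.
    CODE_COLORS = {
        ".": "#00ff00",   # Success: green
        "C": "#8b0000",   # Crash: dark red
        "T": "#cc66cc",   # Timeout: purple/pink
        "m": "#00ffff",   # Missing dll: cyan
        "e": "#00008b",   # Other dll error: dark blue
        "_": "#d3d3d3",   # No test: light gray
        " ": "#ffffff",   # No result: white
    }

    def format_cell(code, color, tooltip, report_url):
        """Emit one pattern character with its tooltip, color and link."""
        return ('<a href="%s" title="%s" style="background-color: %s">%s</a>'
                % (report_url, tooltip, color, code))

  For 'F' cells the color would come from the per-test-unit gradient
  described below instead of CODE_COLORS.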
* The green-yellow-red gradient is what would allow detecting changes in the
  number of test failures (see the sketch below). That gradient must be
  consistent for all lines of a given test unit's pattern. Furthermore the
  gradient must not be computed based on the test result's raw number of
  failures: if a test unit has either 100 or 101 failures, those must not
  have nearly indistinguishable colors. Instead the set of all different
  failure counts for the test unit should be collected, zero should be added
  to that set, the values should be sorted, and a color attributed to each
  *index*. The background color is then selected based on the index of that
  result's failure count. Each set is expected to be relatively small, so the
  colors will be reasonably far apart, making it easy to distinguish a shift
  from 4 to 6 failures even if there are 100 failures from time to time.
  Also note that adding zero to the set essentially reserves green for
  successful results.
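  A minimal sketch of that index-based color selection, in Python (the
  function name and RGB endpoints are illustrative assumptions):

    def failure_color(failure_counts, count):
        """Pick the background color for a result with 'count' failures.

        failure_counts holds all the distinct failure counts seen in one
        test unit's pattern. The color depends on the count's *index* in
        the sorted set, not on its value, so a shift from 4 to 6 failures
        stays visible even if another result has 100 failures.
        """
        # Adding zero reserves pure green for successful results.
        levels = sorted(set(failure_counts) | {0})
        if len(levels) == 1:
            return "#00ff00"              # no failure anywhere: green
        ratio = levels.index(count) / (len(levels) - 1)
        # Green (#00ff00) -> yellow (#ffff00) -> red (#ff0000).
        if ratio <= 0.5:
            red, green = int(510 * ratio), 255
        else:
            red, green = 255, int(510 * (1 - ratio))
        return "#%02x%02x00" % (red, green)

  For a test unit with failure counts {4, 6, 100}, levels becomes
  [0, 4, 6, 100], so 4 and 6 land a third of the gradient apart instead of
  being nearly the same shade.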
Implementation feasibility
--------------------------
* No changes in dissect.
* In gather, generate a new testunits.txt file containing one line per test
  unit:
  - The data source would be the per-report summary.txt files.
    -> These don't indicate when a timeout has occurred, so timeouts will
       appear as F instead, which is acceptable for a first implementation.
  - The first line would contain a star followed by the tags of all the test
    runs used to build the file.
  - The other lines would contain the name of the test unit followed by
    space-separated pairs of result code/failure count and result tag
    (including the platform).
  - A line would be put out even if the test unit had no failure.
  For instance, the commit1 testunits.txt file could contain:

    * win8_vm1 win8_vm1-ja win8_vm2-new win10_vm3
    foo:bar 43 win8_vm1-ja C win8_vm2-new e win10_vm3
    foo:bar2
  - In the example above win8_vm1 only appears on the first line. This means
    WineTest was run on that machine but had no failure at all.
  - If the results for commit2 refer to a win8_vm4 machine, we will know
    that the reason win8_vm4 does not appear in the commit1 file is not that
    all the tests succeeded, but that WineTest was not run on win8_vm4 for
    commit1. This means that the result code for win8_vm4 for commit1 should
    be ' ', not '.', for all test units.
  - If commit2 has results for the foo:bar3 test unit, then we will know
    that the reason it is not present in the commit1 file is not that all
    the test runs were successful, but that foo:bar3 did not exist yet. So
    its result code would be '_', not '.'.
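  As an illustration, here is a small Python sketch of how one commit's
  testunits.txt could be parsed with those rules in mind (the function name
  and the returned layout are assumptions):

    def parse_testunits(path):
        """Parse one commit's testunits.txt file.

        Returns (tags, results) where tags lists every test run tag for
        this commit and results maps each test unit name to a list of
        (code_or_failure_count, tag) pairs for the runs that had errors.
        """
        tags, results = [], {}
        with open(path) as testunits:
            for line in testunits:
                fields = line.split()
                if not fields:
                    continue
                if fields[0] == "*":
                    tags = fields[1:]   # first line: every test run tag
                else:
                    # "foo:bar 43 win8_vm1-ja C win8_vm2-new" becomes
                    # [("43", "win8_vm1-ja"), ("C", "win8_vm2-new")]
                    results[fields[0]] = list(zip(fields[1::2], fields[2::2]))
        return tags, results

  A tag present on the first line but absent from a test unit's pairs then
  yields '.', a tag absent from the first line yields ' ', and a test unit
  with no line at all in that commit's file yields '_'.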
* Add a new build-failures script to generate both failures pages.
  - This script will need to read the testunits.txt files of all the
    commits. The simplest implementation will be to read all the data into
    memory before generating the page. This avoids having to keep the
    current test unit synchronized across all the testunits.txt files when
    a new test unit has been added.
  - The combined size of the testunits.txt files is expected to be
    reasonable, within a factor of 3 of the summary.txt files. For
    reference, here is some data about the sizes involved:

    $ du -sh data
    21G     data
    $ ls data/*/*/report | wc -l
    2299
    $ cat data/*/*/report | wc
    34,087,987 231,694,407 2,104,860,095
    $ cat data/*/*/report | egrep '(: Test failed:|: Test succeeded inside todo block:|done [(]258[)]|Unhandled exception:)' | wc
    567,158 6,275,504 53,202,999
    $ cat data/*/summary.txt | wc
    186,219 3,046,363 30,596,901
  - Having a function to generate the page will allow calling it twice in a
    row to generate both pages without having to load and parse the
    testunits.txt files twice.
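  A possible outline of build-failures along those lines, reusing the
  hypothetical parse_testunits() sketched above (the page generation itself
  is elided):

    def build_failures_pages(commit_dirs, testbot_tags):
        # Read everything up front so nothing has to keep several
        # testunits.txt files synchronized on the current test unit.
        data = {commit: parse_testunits("%s/testunits.txt" % commit)
                for commit in commit_dirs}

        def generate_page(filename, keep_tag):
            # Walk the test units alphabetically and emit one pattern
            # line per (platform, tag) pair that had at least one error.
            pass

        # The same function is called twice on the in-memory data, so the
        # testunits.txt files are only loaded and parsed once.
        generate_page("failures-testbot.html",
                      lambda tag: tag in testbot_tags)
        generate_page("failures-all.html", lambda tag: True)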