I have been reviewing the TestBot and GitLab CI test results for the merged MRs. While doing that I updated the TestBot's known failures list (https://testbot.winehq.org/FailuresList.pl) in order to drive down the false positive rate.
Incidentally I also collected the list of test units causing false positives, so I'll start with that. Specifically, here are the bugs to fix to help the GitLab CI:
* Bug 53433 - mmdevapi:capture - impacted 18 MRs
* Bug 54064 - ntdll:threadpool - impacted 15 MRs
* Bug 54078 - ntdll:pipe - impacted 11 MRs
* Bug 54140 - mmdevapi:render - impacted 5 MRs
* Bug 54005 - ole32:clipboard - impacted 5 MRs
* Bug 54037 - user32:msg - impacted 5 MRs
* Bug 54074 - ws2_32:sock - impacted 5 MRs
I classified the TestBot / GitLab CI results as follows:
* False positive
Cases where the CI system incorrectly claimed the MR introduces new failures. This is typically the case when the failures are already present in the nightly WineTest results.
* Bad merge
MRs that break a test and got merged anyway.
* Collateral Damage from a bad merge
The false positives caused by one of the bad merges above.
* Outside interference
This identifies false positives that are not random and intrinsic to the test but that result from changes outside the Wine infrastructure, for instance certificates that expire, or configuration changes on servers that break the tests that depend on them.
Of those, the only ones that a CI can really avoid are the first type, aka "false positives". So I calculated the corresponding weekly rate:
Adjusted False Positive rate
Week       | TestBot | GitLab CI
2022-11-14 |   21.9% |      8.3%
2022-11-21 |    8.0% |     21.6%
2022-11-28 |   14.7% |     28.4%
2022-12-05 |    8.5% |     24.5%
2022-12-12 |    0.0% |     20.0%
Note that the TestBot's 8% rate for the 11-21 week is not representative because Wine was broken that week (collateral damage), which prevented the tests from running in Wine and thus from contributing real "false positives". Also, the 12-12 week is obviously still incomplete.
Even so I think this shows the TestBot is improving.
Here's a list of the incidents for the weeks above:
* 11-14 An external certificate revocation issue caused crypt32:cert to fail systematically. This impacted 14 merge requests and was fixed in MR!1360.
* 11-17 MR!1399 got merged despite the TestBot detecting that it prevented 32-bit Wine tests from running to completion. This impacted 39 merge requests. I could have reduced that number if I had been faster to reconfigure the TestBot to stop running the full 32-bit Wine test suite. This was fixed in MR!1524.
* 11-17 MR!1398 got merged despite the TestBot detecting that it broke ntoskrnl.exe:ntoskrnl on Windows 7. This was fixed in MR!1803.
* 11-22 MR!1495 got merged despite the TestBot detecting that it broke vbscript:run on Windows *. I don't have a record of the impacted MRs or of when it was fixed.
* 11-23 The b00a831d direct commit broke kernel32:process in Wine. This has since been fixed.
* 12-07 MR!1732 got merged despite the TestBot detecting that it broke taskschd:scheduler on Windows *. I immediately added a known failure entry and no MR got impacted. This was fixed in MR!1736.
Without filtering out the failures caused by these incidents, the false positive rates are:
Raw False Positive rate
Week       | TestBot | GitLab CI
2022-11-14 |   52.1% |     27.1%
2022-11-21 |   50.0% |     29.5%
2022-11-28 |   20.0% |     33.7%
2022-12-05 |   19.1% |     57.4%
2022-12-12 |    0.0% |     20.0%
I think that also shows that the TestBot is improving.
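As an illustration of how the adjusted and raw rates relate, here is a minimal sketch of the computation. The record format and values below are made up for illustration; the real data and extraction snippets are in the attached files.

    from collections import defaultdict

    # Hypothetical per-MR records: (week, got a false positive, failure was
    # caused by a known incident). The real data is in failures-mr.txt.
    records = [
        ("2022-11-14", True,  False),
        ("2022-11-14", False, False),
        ("2022-11-14", True,  True),   # incident-caused failure
        ("2022-11-21", False, False),
    ]

    totals = defaultdict(int)
    raw = defaultdict(int)
    adjusted = defaultdict(int)
    for week, is_fp, from_incident in records:
        totals[week] += 1
        if is_fp:
            raw[week] += 1
            # The adjusted rate ignores the failures caused by the incidents
            # listed above (bad merges, expired certificates, etc.).
            if not from_incident:
                adjusted[week] += 1

    for week in sorted(totals):
        print("%s | raw %.1f%% | adjusted %.1f%%" % (
              week, 100.0 * raw[week] / totals[week],
              100.0 * adjusted[week] / totals[week]))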
I have attached the raw data I collected and shell snippets to extract various statistics (failures-mr.txt) as well as a spreadsheet import (failures.xls).
In parallel with the MR test results I analyzed the "new" test failures in each nightly WineTest run and updated the TestBot known failures accordingly.
A "new" failure in the WineTest results is equivalent to a "false positive" in the MR tests: a lot of them are not new but rare failures or ones where the messages changes in every run due to the presence of random values (e.g. pointers, handles).
Analyzing the WineTest results is particularly important for the Windows VMs:
* The TestBot runs the full 32-bit test suite in Wine for every MR, which ensures good coverage of the tests and many opportunities to detect flaky tests.
* But on Windows the TestBot only runs the tests directly impacted by the MR, which provides fewer opportunities to detect which tests are flaky on Windows.
So analyzing the nightly Windows WineTest results provides more opportunity for detecting which tests are unreliable on Windows before they cause trouble for MRs.
The results show improvements: they are now regularly below 10 new failures whereas before 11-29 there were over 20 in each run (and almost 20 when deduplicated, see the attached spreadsheet). There's quite a bit of noise so we'll know more in the coming weeks.
Also the Windows 11 test configurations don't have too many new failures anymore so it should be possible to test the MRs against them soon.
I attached the raw data in failures-winetest.txt and an updated spreadsheet (failures.xls).
The false positive rate remained pretty low for the TestBot over the past week, both in terms of impact on the merge requests and in the nightly WineTest results.
Despite the TestBot having a lot more test configurations, last week only about 6% of MRs got a false positive from it, versus 15% that got one from the GitLab CI's single Wine test environment.
That means fewer annoying emails from Marvin, but also that when the TestBot reports an issue it's really best to pay attention to it.
Here's an updated list of the test bugs causing the most false positives in the GitLab CI:
18 mmdevapi:capture -> bug 53433
15 ntdll:threadpool -> bug 54064
11 ntdll:pipe       -> bug 54078
 5 ws2_32:sock      -> bug 54074
 5 wldap32:parse    -> bug 54075
 5 user32:msg       -> bug 54037
 5 ole32:clipboard  -> bug 54005
 5 mmdevapi:render  -> bug 54140
And here's an update to last week's statistics (I attached the updated data files):
Adjusted MR False Positive rate
Week       | TestBot | GitLab CI
2022-11-14 |   21.9% |      8.3%
2022-11-21 |    8.0% |     21.6%
2022-11-28 |   14.7% |     28.4%
2022-12-05 |    8.5% |     24.5%
2022-12-12 |    6.1% |     15.2%
2022-12-19 |    0.0% |     15.0% (incomplete week)
Raw MR False Positive rate
Week       | TestBot | GitLab CI
2022-11-14 |   52.1% |     27.1%
2022-11-21 |   50.0% |     29.5%
2022-11-28 |   20.0% |     33.7%
2022-12-05 |   19.1% |     57.4%
2022-12-12 |    6.1% |     15.2%
2022-12-19 |    0.0% |     15.0% (incomplete week)
Test Units with "new" failures in the nightly WineTest results
Week       | Total | Deduplicated
2022-11-21 |    96 |           51
2022-11-28 |    82 |           67
2022-12-05 |   108 |           53
2022-12-12 |    47 |           42
2022-12-19 |    14 |           14
Here's an update on the false positive rates for the merge request tests and the nightly Wine test runs.
Reminder: A false positive (FP) is when the TestBot or GitLab CI says a failure is new when it is not.
* TestBot
The FP rate is still between 5 and 10% (see the attached graphs). Now that we have more historical data we can see that the FP rate went steadily down from 20% to 5% in December, i.e. during the freeze and while I was first populating the TestBot's list of known failures:
https://testbot.winehq.org/FailuresList.pl
Then in January the average rate gradually went back up to about 10%. I chalk it up to more risky commits being allowed again. It would be nice for the FP rate to go back down to 5% but it's not clear if that will happen.
* GitLab CI
The GitLab CI's FP rate also went down in December, hitting a low of 10% around the new year. But in January it immediately went up again. Combined with the high November FP rate, the December dip is not really visible on the 5-week average.
As I said, the FP rate has been going up since the new year. Again I think that's the effect of more risky commits going in. That shows on the 5-week average, which is now between 25% and 30%, higher than ever before :-(
Unlike on the TestBot, the GitLab CI has no way to ignore known false positives. So if you don't want the GitLab CI claiming your merge requests introduce new failures, the only way is to fix the tests. And I guess that's not a bug. It's a feature [1].
Where to start, you may ask?
A good place would be the test units that cause the most false positives:
22 dinput:device8
17 ntdll:threadpool
16 user32:msg
 9 d3d11:d3d11
 7 ws2_32:afd
 6 ws2_32:sock
 6 user32:win
 6 ole32:clipboard
And among those, some failure modes are particularly troublesome:
17 dinput:device8   -> bug 54594
16 ntdll:threadpool -> bug 54064
 9 d3d11:d3d11      -> bug 54510
 8 user32:msg       -> bug 54037
 7 ws2_32:afd       -> bug 54113
 6 ole32:clipboard  -> bug 54005
That is, a test unit like user32:msg can fail in many different ways, but of the 16 times it caused a false positive (first list), 8 were caused by the specific failure described in bug 54037 (second list).
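To make the relationship between the two lists concrete, here is a small sketch, with made-up records, of how the per-unit and per-failure-mode counts are tallied:

    from collections import Counter

    # Hypothetical (test unit, normalized failure message) false positives.
    fps = [
        ("user32:msg", "did not get message <id>"),
        ("user32:msg", "hwnd <hex> is still alive"),
        ("user32:msg", "did not get message <id>"),
    ]

    per_unit = Counter(unit for unit, _ in fps)  # first list
    per_mode = Counter(fps)                      # second list

    print(per_unit.most_common())
    print(per_mode.most_common())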
[1] Not that it ever worked for the TestBot.
It's been about a year since I started collecting data, and also since the GitLab CI was introduced. So here's an update on the false positive rates for the merge request tests and the nightly Wine test runs.
Reminder: A false positive (FP) is when the TestBot or GitLab CI says a failure is new when it is not.
* TestBot
The FP rate stayed around 10% until the end of August, when the GitLab bridge to the mailing list got broken (see the attached graphs). Looking at it differently, except in June, on any given day there was a better than 40% chance that fewer than 10% of the MRs would get a false positive (and a more than 70% chance of fewer than 25%).
But with the bridge gone the TestBot failures are not relayed to the MRs anymore and thus collecting data is impractical and quite irrelevant too.
* GitLab CI
The GitLab CI's FP rate stayed below 30% until mid-May but has been clearly above that mark since then. The 5-week average even reached a peak of 60% in early August and it's not really getting better.
Changing perspective, since March less than 20% of the days had a false positive rate below 10%. And in August and September every single day had a false positive rate above 10%.
Also, before August the chances of having an FP rate lower than 25% were much greater, usually 40% or more. But those chances have plummeted and are now below 10%.
The 50% FP line shows great swings, which I think are caused by periods where one or more tests have a 100% failure rate and do not get fixed for weeks. Still, in early 2023 it was at 85% or more, but since then there has been a clear downward trend where both the peaks and the troughs keep getting lower.
Conclusions:
* I hoped the TestBot FP rate would improve but it has only held steady. It may be that this 10% FP rate is incompressible because of the delay between when a new failure pops up and when the TestBot knows how to identify it (i.e. when I add it to the known failures page: https://testbot.winehq.org/FailuresList.pl).
Stemming the flow of new failures introduced by bad MRs may help lower that rate. But new failures can also happen when a certificate expires, when a test server goes down, or when changing the build platform for instance. So there will likely always be a residual FP rate.
* The GitLab CI seemed to make progress at first but since mid-March it has been moving away from the goal of having no false positives.
Notes:
* Comparing the TestBot and GitLab CI failure rates is akin to comparing apples and oranges.
The GitLab CI does a single run of the full test suite in Wine (except for a handful of tests), plus a single 64-bit test.
The TestBot does:
* 1 full 64-bit run in Wine (no exceptions),
* 1 run of the modified tests in a Windows-on-Windows Wine environment,
* 1 run of all the tests of the modified modules in Wine,
* 7 plain 32-bit Wine runs in various locales,
* 24 tests in various Windows, locale, GPU and screen layout configurations.
And it still gets 1/2 to 1/3 the false positive rate.
* Improving the false positive rate does not mean that the Wine tests have fewer failures. But getting reliable results from the CI was deemed to be a necessary step for developers to trust it and know they need to rework their MR when the results are bad.
It also means less work for the maintainer to discriminate between MRs that introduce new failures and those that don't. And less chance to make mistakes too.
* Conversely, improving the tests does not necessarily improve the false positive rate. We have 230 failing test units, so one could fix 229 of them, but if the last one fails systematically the false positive rate will stay pegged at 100%.
Reducing the number of false positives requires either focusing on the tests that cause them, or having countermeasures built into the CI... as is the case for the TestBot.
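For illustration, here is a minimal sketch of such a countermeasure, in the spirit of (but much simpler than) the TestBot's known failures list: any failure matching a known pattern is not blamed on the merge request. The patterns below are hypothetical.

    import re

    # Hypothetical patterns, one per known flaky failure. The TestBot's
    # real list is at https://testbot.winehq.org/FailuresList.pl
    KNOWN_FAILURES = [
        re.compile(r"capture\.c:\d+: Test failed: got 0x[0-9a-f]+"),
    ]

    def new_failures(failures):
        # Report only the failures that match no known pattern.
        return [f for f in failures
                if not any(p.search(f) for p in KNOWN_FAILURES)]

    print(new_failures(["capture.c:42: Test failed: got 0xdeadbeef",
                        "sock.c:7: Test failed: unexpected timeout"]))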
There have been quite a lot of new failures in the past few builds, which makes it easy to forget that the Wine test results are getting better.
The progress is not actually visible on the regular patterns page because it is too gradual to show up within the two months of results displayed there. It's only when looking at 8 months of results that one can see it [1]:
http://fgouget.free.fr/wtb/patterns.html
There one can see that back in July we had 9 test configurations which sometimes got under 10 failures while now we have over 40 that achieve it pretty consistently! And where the record low was 8 failing test units, now it is 3.
In more detail:
* The Debian 11 VMs went from about 12 failures to a mere 4 now, not just in the English locale but in a couple of others too.
* The Debian Testing VM went from about 20 failures to about 8 now.
* The GitLab CI also went from about 15 failures to about 4 now.
* Windows 11 went from 50+ failures to about 25. Big progress there! (But the AMD GPU still has an extra 10.)
* Windows 10 22H2 went from over 10 failures in November to about 5 now.
* Windows 10 21H1 went from about 20 failures to just under 10, most of the time.
* The Windows UTF-8 configurations went from about 35 failures to around 20 now.
* The full Wine test suite ran successfully not once but 8 times on Windows 10 1507 (that's still only ~7% of successful runs since the first success).
* Windows 10 1809 has been consistently below the 10 failures threshold since mid-October, and 1709 and 1909 joined it mid-January.
But as I said progress is slow and without everyone's efforts the results could very well regress.
[1] Note that a lot of links on that page are broken because they point to the main test.winehq.org site, which does not have 8 months of history. Also I could not duplicate the 97 GB of pages to my website.