It's been about a year since I started collecting data, and also since GitLab CI was introduced. So here's an update on the false positive rates of the merge request and nightly Wine test runs.
Reminder: A false positive (FP) is when the TestBot or GitLab CI says a failure is new when it is not.
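To make the definition concrete, here is a minimal sketch (with hypothetical data and function names, not the actual TestBot or CI code) of how a daily false positive rate could be computed: the fraction of MRs where a pre-existing failure was reported as new.

```python
def daily_fp_rate(mr_results):
    """mr_results: list of (flagged_as_new, actually_new) pairs, one per MR.

    A false positive is an MR where a failure was flagged as new
    even though it was pre-existing.
    """
    if not mr_results:
        return 0.0
    fps = sum(1 for flagged, actually_new in mr_results
              if flagged and not actually_new)
    return fps / len(mr_results)

# Hypothetical day: 10 MRs, 2 of them got a pre-existing failure
# reported as new.
results = [(True, False)] * 2 + [(False, False)] * 8
print(daily_fp_rate(results))  # → 0.2
```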
* TestBot
The FP rate stayed around 10% until the end of August, when the GitLab bridge to the mailing list broke (see graphs). Looking at it differently, except in June, on any given day there was a better than 40% chance that fewer than 10% of the MRs would get a false positive (and a more than 70% chance of fewer than 25%).
But with the bridge gone the TestBot failures are no longer relayed to the MRs, so collecting data is impractical, and fairly irrelevant too.
* GitLab CI
The GitLab CI's FP rate stayed below 30% until mid-May but has remained clearly above that since then. The 5-week average even peaked at 60% in early August and is not really getting better.
Changing perspective, since March fewer than 20% of the days had a false positive rate below 10%. And in August and September every single day had a false positive rate above 10%.
Also, before August the chances of having an FP rate lower than 25% were much greater, usually 40% or more. But that rate has plummeted and is now below 10%.
The 50% FP line shows great swings, which I think are caused by periods where one or more tests have a 100% failure rate and do not get fixed for weeks. Still, in early 2023 it was at 85% or more, but since then there has been a clear downward trend where both the peaks and troughs keep getting lower.
Conclusions:
* I hoped the TestBot FP rate would improve but it has only held steady. It may be that this 10% failure rate is irreducible because of the delay between when a new failure first pops up and when the TestBot knows how to identify it (i.e. when I add it to the known failures page: https://testbot.winehq.org/FailuresList.pl).
Stemming the flow of new failures introduced by bad MRs may help lower that rate. But new failures can also happen when a certificate expires, a test server goes down, or the build platform changes, for instance. So there will likely always be a residual FP rate.
* The GitLab CI seemed to make progress at first, but since mid-March it has been drifting further from the goal of having no false positives.
Notes:
* Comparing the TestBot and GitLab CI failure rates is akin to comparing apples and oranges.
The GitLab CI does a single run of the full test suite in Wine (except for a handful of tests), plus a single 64-bit test.
The TestBot does:
* 1 full 64-bit run in Wine (no exceptions),
* 1 run of the modified tests in a Windows-on-Windows Wine environment,
* 1 run of all the tests of the modified modules in Wine,
* 7 plain 32-bit Wine runs in various locales,
* 24 test runs in various Windows, locale, GPU and screen layout configurations.
And it still gets one half to one third of the GitLab CI's false positive rate.
* Improving the false positive rate does not mean that the Wine tests have fewer failures. But getting reliable results from the CI was deemed a necessary step for developers to trust it and to know they need to rework their MR when the results are bad.
It also means less work for the maintainer to discriminate between the MRs that introduce new failures and those that don't. And less chance of making mistakes too.
* Conversely, improving the tests does not necessarily improve the false positive rate. We have 230 failing test units, so one can fix 229 of them, but if the last one fails systematically the false positive rate will stay pegged at 100%.
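The arithmetic above can be sketched as follows (hypothetical numbers and function names, purely for illustration): a single test that fails on every run flags every MR, no matter how many other tests were fixed.

```python
def mr_has_false_positive(failure_probs, rng):
    # An MR gets a false positive if any pre-existing flaky or broken
    # test fires on its run. failure_probs holds the per-run failure
    # probability of each remaining failing test unit.
    return any(rng.random() < p for p in failure_probs)

import random
rng = random.Random(0)

# 229 of the 230 failing test units fixed; the last one fails on every run.
remaining = [1.0]
runs = 1000
fp_rate = sum(mr_has_false_positive(remaining, rng) for _ in range(runs)) / runs
print(fp_rate)  # → 1.0, regardless of the 229 fixes
```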
Reducing the number of false positives requires either focusing on the tests that cause them, or having countermeasures built into the CI... as is the case for the TestBot.