TL;DR: Making lasting progress on fixing the Wine tests requires TBD policy changes to incentivize working on them.
While there has definitely been progress on the Wine tests front lately (and I'm grateful to all the Wine developers who made it possible), I think there are structural problems that will prevent us from ever getting close to zero failures.
Why would a developer work on fixing Wine test bugs?
The naive thinking was that they would do so out of pride in the Wine code. Or that they would jump at the chance of fixing issues where all the source is available. But if those were the only factors we would not still have 200 failing test units after two decades.
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
The inconvenient truth is there are forces that discourage working on the Wine tests:
* Most developers come to Wine to fix a game or application they want to run, or to add some sexy new functionality. They will prefer to work on these over fixing tests.
* Professional developers will be asked to work on issues for paying customers, not to fix tests.
* There is a big exception to the source availability: Windows. So figuring out the logic behind the Windows behavior can still be maddeningly frustrating.
* Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
* And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment entrails (short of writing a dedicated program to do the poking), etc. In short it's remote debugging.
* The CI itself can be an obstacle when it systematically reports failures unrelated to the current MR (a warning light that is on all the time is no warning light at all). [3]
* The GitLab CI has been allowed to have a 99% failure rate for days on end (see failures-mr.txt in my 2023-10-20 email). That tells the developers that fixing tests is unimportant.
There is also nothing to counterbalance that and push developers to work on Wine tests. A developer will not see their patches reverted if they don't fix the failures they introduce. In fact I'm not sure introducing new failures in Wine has any negative consequences. And that's assuming the commit causing the failures is ever identified. Developers fixing tests don't get rewards either.
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
* Revert commits that introduced new failures.
  - Do it the very next day if the failure is systematic?
  - What if the failure is random and only happens once a day? Or once a week?
  - What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
  - Should only the test part of the MR be reverted? (if that's the cause of the failure)
  - Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
* Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
  - This has the potential to slow down Wine development.
  - Or the author could request their previous commit to be reverted to get unblocked.
* If the CI shows failures, block the MR.
  - That can still cause Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
  - So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
* If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
  - The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR.
    https://testbot.winehq.org/FailuresList.pl
  - This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
  - Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
  - Determining if the CI results contain new failures would no longer fall on Alexandre alone.
* Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
* Do away with direct commits? (see the 'New failures analysis' email)
* Use the Wine party fund to pay developers to fix test bugs.
* Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
* Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the progress made on the tests [4] and thus is less discouraging than the main index page.
  https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
Suggestions welcome. Hopefully we can figure something out that will get us on the path to zero failures.
[1] There are still a few infrastructure improvements that could help:
    - Fix the TestBot integration with GitLab so its reports don't go into a black hole. At least, while I cannot fix the bridge, I think I can replace it with a better solution.
    - Allow running the tests with WINEDEBUG.
    - Take screenshots while the test runs.
    - Make it easier for developers to replicate the GitLab CI Debian environment, build their Wine in it, run the tests, interact with the virtual X server, etc.
[2] I will very humbly claim that the wt-test-bisect tool represents a huge step forward on that front when the failure only happens in full test suite runs.
    https://gitlab.winehq.org/fgouget/wt-daily/-/blob/bisectors/wt-test-bisect?r...
[3] Though the GitLab CI false positive rate needs to get better. However my understanding is that the strategy there is to not have the CI try to distinguish false positives from the real issues. Rather, it's a two-pronged approach:
    1. Summarily hide unreliable tests by marking them as flaky.
    2. Let all the systematic failures annoy the developers so they fix them out of frustration.
    However this is essentially the naive strategy that was used for years with the TestBot and that only resulted in the developers dismissing and ignoring the CI results.
[4] For instance the patterns page makes it clear that there has been progress for Windows 10 1809 whereas that is invisible on the index page.