Hi!
On 10/23/23 14:29, Francois Gouget wrote:
TL;DR: Making lasting progress on fixing the Wine tests requires TBD policy changes to incentivize working on them.
While there has definitely been progress on the Wine tests front lately (and I'm grateful to all the Wine developers who made it possible), I think there are structural problems that will prevent us from ever getting close to zero failures.
Why would a developer work on fixing Wine test bugs?
The naive thinking was that they would do so out of pride in the Wine code. Or that they would jump at the chance of fixing issues where all the source is available. But if those were the only factors we would not still have some 200 failing test units after two decades.
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
The inconvenient truth is there are forces that discourage working on the Wine tests:
- Most developers come to Wine to fix a game or application they want to run, or to add some sexy new functionality. They will prefer to work on these over fixing tests.
Speaking for myself, I'm generally annoyed by failing tests and red MR statuses, but fixing tests *is* time- and energy-consuming, especially when you're fixing tests unrelated to your current work just to get the green status back.
- Professional developers will be asked to work on issues for paying customers, not fixing tests.
Fixing tests is an essential part of fixing issues. The tests are our behavior measurement tool. If that tool is broken we are simply unable to measure Windows behavior and can't do any meaningful work. I think writing and fixing tests is all the more part of the job when people get paid to implement a feature?
There is a big exception to the source availability: Windows. So figuring out the logic behind the Windows behavior can still be maddeningly frustrating.
Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress, but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment's entrails (short of writing a dedicated program to do the poking), etc. In short, it's remote debugging.
It would be convenient if we could run a winetest.exe command from a patch on the TestBot, instead of the automatically decided test. We can run a locally built winetest.exe, but that's not as flexible as the custom Wine build that a patch allows. This would be useful for debugging some bad test interactions.
The CI itself can be an obstacle when it systematically reports failures unrelated to the current MR (a warning light that is on all the time is no warning light at all). [3]
The GitLab CI has been allowed to have a 99% failure rate for days on end (see failures-mr.txt in my 2023-10-20 email). That tells the developers that fixing tests is unimportant.
There is also nothing to counterbalance that and push developers to work on Wine tests. A developer will not see their patches reverted if they don't fix the failures they introduce. In fact I'm not sure introducing new failures in Wine has any negative consequences. And that's assuming the commit causing the failures is ever identified. Developers fixing tests don't get rewards either.
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
- Revert commits that introduced new failures.
- Do it the very next day if the failure is systematic?
- What if the failure is random and only happens once a day? Or once a week?
- What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
- Should only the test part of the MR be reverted? (if that's the cause of the failure)
- Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
I don't think reverting commits is a good solution. We try to avoid it as much as possible in general, as it makes bisection more difficult. If we can find another way, it's probably better.
- Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
- This has the potential to slow down Wine development.
- Or the author could request their previous commit to be reverted to get unblocked.
- If the CI shows failures, block the MR.
- That can still cause the Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
- So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
- If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
- The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR. https://testbot.winehq.org/FailuresList.pl
- This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
- Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
- Determining whether the CI results contain new failures would no longer fall on Alexandre alone.
The move to GitLab and merge requests was also supposed to involve relying more on the CI status, and MRs were supposedly not going to be merged on failure. I think it's the way to go, except that it was only held to for some time, with a few exceptions to the rule whenever something broke all MRs.
IMO that rule was good, and we should stick to it more closely. The issue is that recently all MRs have started to systematically fail, so following that rule we would not merge anything.
A good solution to that, in order not to block further progress, would be to exclude the systematically failing tests from the GitLab CI if nobody steps up to fix them, the same way the d3d tests were excluded for a time.
We also now miss the Windows TestBot feedback, which was definitely useful and is now silently gone (as if there were no failures anymore!). It didn't show up as obviously on the MR page since it wasn't impacting the MR status. It would be nice to have it back, and I think it should change the MR status too (which would maybe require a different and better GitLab integration).
- Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
- Do away with direct commits? (see the 'New failures analysis' email)
- Use the Wine party fund to pay developers to fix test bugs.
- Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
- Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the tests' progress [4] and thus is less discouraging than the main index page. https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
Suggestions welcome. Hopefully we can figure something out that will get us on the path to zero failures.
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases us toward fixing them during the code freeze. This might be alright, but it also means it will take a long time before some are fixed.
Also, I'm not completely sure the regression list rank is very effective; as I see it, it only means: small regressions are okay, though please fix them once a year.
If we mean to say that test failures are not okay, especially systematic ones, we should make it more obvious.
An idea would be to have a separate wall of shame for test failures. The module maintainers would be responsible for any failure that has not been blamed on someone else, and, as you suggested above, that could then be used to decide to hold future contributions (on a module? for someone?) until the failures get fixed.
Additionally / alternatively, instead of hinting that we should fix them during the code freeze, we could also have a dedicated time (one day? two?) on every release (month?) where only test-fix changes would be merged.
Cheers,