Hi!
On 10/23/23 14:29, Francois Gouget wrote:
TL;DR: Making lasting progress on fixing the Wine tests requires TBD policy changes to incentivize working on them.
While there has definitely been progress on the Wine tests front lately (and I'm grateful to all the Wine developers who made it possible), I think there are structural problems that will prevent us from ever getting close to zero failures.
Why would a developer work on fixing Wine test bugs?
The naive thinking was that they would do so out of pride in the Wine code. Or that they would jump at the chance of fixing issues where all the source is available. But if those were the only factors we would not still have some 200 failing test units after two decades.
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
The inconvenient truth is there are forces that discourage working on the Wine tests:
- Most developers come to Wine to fix a game or application they want to run, or to add some sexy new functionality. They will prefer to work on these over fixing tests.
Speaking for myself, I'm generally annoyed by failing tests and red MR statuses, but fixing tests *is* time- and energy-consuming, especially when you're fixing tests unrelated to your current work just to get the green status back.
- Professional developers will be asked to work on issues for paying customers, not fixing tests.
Fixing tests is an essential part of fixing issues. The tests are our behavior measurement tool. If that tool is broken we are simply unable to measure Windows behavior and can't do any meaningful work. I think writing and fixing tests is all the more part of the job when people get paid to implement a feature?
There is a big exception to the source availability: Windows. So figuring out the logic behind the Windows behavior can still be maddeningly frustrating.
Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress, but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment's entrails (short of writing a dedicated program to do the poking), etc. In short, it's remote debugging.
It would be convenient if we could run a winetest.exe command from a patch on the TestBot, instead of the automatically decided test. We can run a locally built winetest.exe, but that's not as flexible as the custom Wine build that a patch allows. This would be useful for debugging some bad test interactions.
The CI itself can be an obstacle when it systematically reports failures unrelated to the current MR (a warning light that is on all the time is no warning light at all). [3]
The GitLab CI has been allowed to have a 99% failure rate for days on end (see failures-mr.txt in my 2023-10-20 email). That tells the developers that fixing tests is unimportant.
There is also nothing to counterbalance that and push developers to work on Wine tests. A developer will not see their patches reverted if they don't fix the failures they introduce. In fact I'm not sure introducing new failures in Wine has any negative consequences. And that's assuming the commit causing the failures is ever identified. Developers fixing tests don't get rewards either.
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
- Revert commits that introduced new failures.
- Do it the very next day if the failure is systematic?
- What if the failure is random and only happens once a day? Or once a week?
- What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
- Should only the test part of the MR be reverted? (if that's the cause of the failure)
- Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
I don't think reverting commits is a good solution. We try to avoid it as much as possible in general, as it makes bisection more difficult. If we can find another way, it's probably better.
- Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
- This has the potential to slow down Wine development.
- Or the author could request their previous commit to be reverted to get unblocked.
- If the CI shows failures, block the MR.
- That can still cause the Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
- So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
- If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
- The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR. https://testbot.winehq.org/FailuresList.pl
- This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
- Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
- Determining whether the CI results contain new failures would no longer fall on Alexandre alone.
The move to GitLab and merge requests was also supposed to involve relying more on the CI status, and MRs were supposedly not going to be merged on failure. I think it's the way to go, except that it was only held to for some time, with a few exceptions to the rule whenever something broke all MRs.
IMO that rule was good, and we should stick to it more closely. The issue is that recently all MRs have started to systematically fail, so following that rule we would not merge anything.
A good solution to that, in order not to block further progress, would be to exclude the systematically failing tests from the GitLab CI if nobody steps up to fix them, the same way the d3d tests were excluded for a time.
We also now miss the Windows TestBot feedback, which was definitely useful and is now silently gone (as if there were no failures anymore!). It didn't show up as obviously on the MR page since it wasn't impacting the MR status. It would be nice to have it back, and I think it should change the MR status too (which would maybe require a different and better GitLab integration).
- Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
- Do away with direct commits? (see the 'New failures analysis' email)
- Use the Wine party fund to pay developers to fix test bugs.
- Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
- Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the tests' progress [4] and thus is less discouraging than the main index page. https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
Suggestions welcome. Hopefully we can figure something out that will get us on the path to zero failures.
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases us toward fixing them during the code freeze. This might be alright, but it also means it will take a long time before some are fixed.
Also, I'm not completely sure the regression list rank is very effective; as I see it, it only means: small regressions are okay, though please fix them once a year.
If we mean to say that test failures are not okay, especially systematic ones, we should make it more obvious.
An idea would be to have a separate wall of shame for test failures. The module maintainers would be responsible for any failure that has not been blamed on someone else, and, as you suggested above, that could then be used to decide to hold future contributions (on a module? for someone?) until the failures get fixed.
Additionally / alternatively, instead of hinting that we should fix them during the code freeze, we could also have a dedicated time (one day? two?) on every release (month?) where only test-fix changes would be merged.
Cheers,