A case for policy changes to get to zero test failures

23 Oct 2023


TL;DR: Making lasting progress on fixing the Wine tests requires TBD
policy changes to incentivize working on them.

While there has definitely been progress on the Wine tests front lately
(and I'm grateful to all the Wine developers who made it possible), I
think there are structural problems that will prevent us from ever
getting close to zero failures.

Why would a developer work on fixing Wine test bugs?

The naive thinking was that they would do so out of pride in the Wine
code. Or that they would jump at the chance of fixing issues where all
the source is available. But if those were the only factors we would
not still have 200 failing test units after two decades.

Another part of the naive thinking was that if developers did not work
on the tests it was just because the tools needed to be improved or the
test environments were too unreliable. So for a long time all efforts
have centered on that. But I think the CI is good enough now to not be
the main obstacle [1].

The inconvenient truth is that there are forces that discourage working
on the Wine tests:
* Most developers come to Wine to fix a game or application they want to 
  run, or to add some sexy new functionality. They will prefer to work 
  on these over fixing tests.
* Professional developers will be asked to work on issues for paying 
  customers, not fixing tests.
* There is a big exception to the source availability: Windows. So 
  figuring out the logic behind the Windows behavior can still be 
  maddeningly frustrating.
* Replicating failures can sometimes be frustratingly hard. Even if you 
  happen to have access to the CI environment! [2]
* And usually one does not have access to the CI environment. That's the
  case for the Windows environments for licensing reasons. On the
  Linux side the GitLab CI should mean progress but replicating it needs 
  to be made much simpler before that's effective. In the meantime that 
  means no debugger, no watching what happens on screen while the test 
  runs, little poking into the test environment entrails (short of 
  writing a dedicated program to do the poking), etc. In short it's 
  remote debugging.
* The CI itself can be an obstacle when it systematically reports 
  failures unrelated to the current MR (a warning light that is on all 
  the time is no warning light at all). [3]
* The GitLab CI has been allowed to have a 99% failure rate for days on
  end (see failures-mr.txt in my 2023-10-20 email). That tells the
  developers that fixing tests is unimportant.

There is also nothing to counterbalance that and push developers to work
on Wine tests. A developer will not see their patches reverted if they
don't fix the failures they introduce. In fact I'm not sure introducing
new failures in Wine has any negative consequences. And that's assuming
the commit causing the failures is ever identified. Developers fixing
tests don't get rewards either.

So the conclusion I have come to is that making further and lasting
progress will require policy changes to incentivize work on the tests.
This touches on the domain of social sciences so there will be no
obvious 'right fix'. It's also something I cannot do myself since I
don't have any authority to make policy changes.

Anyway, here are a few ideas, including some extreme ones, and hopefully
we can decide on something that works for Wine:
* Revert commits that introduced new failures.
  - Do it the very next day if the failure is systematic?
  - What if the failure is random and only happens once a day? Or once a 
    week?
  - What if the failure does not impact the CI? For instance if the CI 
    has no macOS test configuration and the failure only happens on 
    macOS.
  - Should only the test part of the MR be reverted? (if that's the 
    cause of the failure)
  - Who makes the decision to revert? Alexandre? A dedicated person who 
    will catch a lot of flak?
* Block an author's new MRs if they did not fix failures introduced by
  one of their previous commits.
  - This has the potential to slow down Wine development.
  - Or the author could request their previous commit to be reverted to 
    get unblocked.
* If the CI shows failures, block the MR.
  - That can still cause the Wine development to halt when the CI has a 
    100% failure rate (as has been the case for the GitLab CI recently).
  - So it's only doable if the false positive rate is pretty low. But 
    then it's likely to just result in the developer trying their luck 
    with the next CI run.
* If the CI shows failures, require that the author explain why they are 
  not caused by their MR before it can be considered for merging.
  - The TestBot's new failure lookup tool would be a good way to prove 
    the failure pre-existed the MR.
    https://testbot.winehq.org/FailuresList.pl
  - This is a softer version of the previous option and should not block 
    Wine development. It may also push developers to fix the failures 
    out of frustration at having to explain them away again and again 
    since they cannot ignore them.
  - Reviewers would also be responsible for verifying that the 
    explanations are accurate and for objecting to the merge if not.
  - Determining if the CI results contain new failures would no longer
    fall on Alexandre alone.
* Have someone dedicated to tracking the source of new failures, 
  reporting them to the authors, following up on the failures, asking 
  for progress on a fix, etc. This would be a developer role bordering 
  on community manager.
* Do away with direct commits? (see the 'New failures analysis' email)
* Use the Wine party fund to pay developers to fix test bugs.
* Send swag to developers who fixed 10 or more test failures. Or set 
  rewards for fixing specific test units like user32:msg, d3d11:d3d11, 
  etc.
* Point test.winehq.org/ to the patterns page instead of the index page:
  the patterns page better reflects the tests' progress [4] and thus is
  less discouraging than the main index page.
  https://gitlab.winehq.org/winehq/tools/-/merge_requests/71

Suggestions welcome. Hopefully we can figure something out that will get
us on the path to zero failures.

[1] There are still a few infrastructure improvements that could help:
    - Fix the TestBot integration with GitLab so its reports don't go
      into a black hole. At least, while I cannot fix the bridge, I
      think I can replace it with a better solution.
    - Allow running the tests with WINEDEBUG set.
    - Take screenshots while the test runs.
    - Make it easier for developers to replicate the GitLab CI Debian
      environment, build their Wine in it, run the tests, interact with
      the virtual X server, etc.
[2] I will very humbly claim that the wt-test-bisect tool represents a
    huge step forward on that front when the failure only happens in
    full test suite runs.
    https://gitlab.winehq.org/fgouget/wt-daily/-/blob/bisectors/wt-test-bisect?r...
[3] Though the GitLab CI false positive rate needs to get better.
    However my understanding is that the strategy there is to not have
    the CI try to distinguish false positives from the real issues.
    Rather my understanding is that it's a two-pronged approach:
    1. Summarily hide unreliable tests by marking them as flaky.
    2. Let all the systematic failures annoy the developers so they fix 
       them out of frustration. However this is essentially the naive 
       strategy that was used for years with the TestBot and that only 
       resulted in the developers dismissing and ignoring the CI 
       results.
[4] For instance the patterns page makes it clear that there has been 
    progress for Windows 10 1809 whereas that is invisible on the index 
    page.
-- 
Francois Gouget fgouget@free.fr              http://fgouget.free.fr/
             Theory is where you know everything but nothing works.
            Practice is where everything works but nobody knows why.
      Sometimes they go hand in hand: nothing works and nobody knows why.
