TL;DR: Making lasting progress on fixing the Wine tests requires TBD policy changes to incentivize working on them.
While there has definitely been progress on the Wine tests front lately (and I'm grateful for all the Wine developers who made it possible), I think there are structural problems that will prevent us from ever getting close to zero failures.
Why would a developer work on fixing Wine test bugs?
The naive thinking was that they would do so out of pride in the Wine code. Or that they would jump at the chance of fixing issues where all the source is available. But if those were the only factors we would not still have some 200 failing test units after two decades.
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
The inconvenient truth is there are forces that discourage working on the Wine tests:
* Most developers come to Wine to fix a game or application they want to run, or to add some sexy new functionality. They will prefer to work on these over fixing tests.
* Professional developers will be asked to work on issues for paying customers, not fixing tests.
* There is a big exception to the source availability: Windows. So figuring out the logic behind the Windows behavior can still be maddeningly frustrating.
* Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
* And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment entrails (short of writing a dedicated program to do the poking), etc. In short it's remote debugging.
* The CI itself can be an obstacle when it systematically reports failures unrelated to the current MR (a warning light that is on all the time is no warning light at all). [3]
* The GitLab CI has been allowed to have a 99% failure rate for days on end (see failures-mr.txt in my 2023-10-20 email). That tells the developers that fixing tests is unimportant.
There is also nothing to counterbalance that and push developers to work on Wine tests. A developer will not see their patches reverted if they don't fix the failures they introduce. In fact I'm not sure introducing new failures in Wine has any negative consequences. And that's assuming the commit causing the failures is ever identified. Developers fixing tests don't get rewards either.
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
* Revert commits that introduced new failures.
  - Do it the very next day if the failure is systematic?
  - What if the failure is random and only happens once a day? Or once a week?
  - What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
  - Should only the test part of the MR be reverted? (if that's the cause of the failure)
  - Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
* Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
  - This has the potential to slow down Wine development.
  - Or the author could request their previous commit to be reverted to get unblocked.
* If the CI shows failures, block the MR.
  - That can still cause the Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
  - So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
* If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
  - The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR. https://testbot.winehq.org/FailuresList.pl
  - This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
  - Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
  - Determining if the CI results contain new failures would no longer fall on Alexandre alone.
* Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
* Do away with direct commits? (see the 'New failures analysis' email)
* Use the Wine party fund to pay developers to fix test bugs.
* Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
* Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the tests progress [4] and thus is less discouraging than the main index page. https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
Suggestions welcome. Hopefully we can figure something out that will get us on the path to zero failures.
[1] There are still a few infrastructure improvements that could help:
  - Fix the TestBot integration with GitLab so its reports don't go into a black hole. At least, while I cannot fix the bridge, I think I can replace it with a better solution.
  - Allow running the tests with WINEDEBUG.
  - Take screenshots while the test runs.
  - Make it easier for developers to replicate the GitLab CI Debian environment, build their Wine in it, run the tests, interact with the virtual X server, etc.
[2] I will very humbly claim that the wt-test-bisect tool represents a huge step forward on that front when the failure only happens in full test suite runs. https://gitlab.winehq.org/fgouget/wt-daily/-/blob/bisectors/wt-test-bisect?r...
[3] Though the GitLab CI false positive rate needs to get better. However my understanding is that the strategy there is to not have the CI try to distinguish false positives from the real issues. Rather my understanding is that it's a two-pronged approach:
  1. Summarily hide unreliable tests by marking them as flaky.
  2. Let all the systematic failures annoy the developers so they fix them out of frustration.
However this is essentially the naive strategy that was used for years with the TestBot and that only resulted in the developers dismissing and ignoring the CI results.
[4] For instance the patterns page makes it clear that there has been progress for Windows 10 1809 whereas that is invisible on the index page.
Hi!
On 10/23/23 14:29, Francois Gouget wrote:
TL;DR: Making lasting progress on fixing the Wine tests requires TBD policy changes to incentivize working on them.
While there has definitely been progress on the Wine tests front lately (and I'm grateful for all the Wine developers who made it possible), I think there are structural problems that will prevent us from ever getting close to zero failures.
Why would a developer work on fixing Wine test bugs?
The naive thinking was that they would do so out of pride in the Wine code. Or that they would jump at the chance of fixing issues where all the source is available. But if those were the only factors we would not still have some 200 failing test units after two decades.
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
The inconvenient truth is there are forces that discourage working on the Wine tests:
- Most developers come to Wine to fix a game or application they want to run, or to add some sexy new functionality. They will prefer to work on these over fixing tests.
Speaking for myself I'm generally annoyed by failing tests and red MR statuses, but fixing tests *is* time and energy consuming, especially when you're fixing tests unrelated to your current work, just to get the green status back.
- Professional developers will be asked to work on issues for paying customers, not fixing tests.
Fixing tests is an essential part of fixing issues. The tests are our behavior measurement tool. If it's broken we are simply unable to measure Windows behavior and can't do any meaningful work. I think writing and fixing tests is even more part of the job when people get paid for implementing a feature?
There is a big exception to the source availability: Windows. So figuring out the logic behind the Windows behavior can still be maddeningly frustrating.
Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment entrails (short of writing a dedicated program to do the poking), etc. In short it's remote debugging.
It would be convenient if we could run a winetest.exe command from a patch on the TestBot, instead of the automatically decided test. We can run a locally built winetest.exe, but it's not as flexible as building a custom Wine, which a patch allows. This would be useful for debugging some bad test interactions.
The CI itself can be an obstacle when it systematically reports failures unrelated to the current MR (a warning light that is on all the time is no warning light at all). [3]
The GitLab CI has been allowed to have a 99% failure rate for days on end (see failures-mr.txt in my 2023-10-20 email). That tells the developers that fixing tests is unimportant.
There is also nothing to counterbalance that and push developers to work on Wine tests. A developer will not see their patches reverted if they don't fix the failures they introduce. In fact I'm not sure introducing new failures in Wine has any negative consequences. And that's assuming the commit causing the failures is ever identified. Developers fixing tests don't get rewards either.
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
- Revert commits that introduced new failures.
- Do it the very next day if the failure is systematic?
- What if the failure is random and only happens once a day? Or once a week?
- What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
- Should only the test part of the MR be reverted? (if that's the cause of the failure)
- Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
I don't think reverting commits is a good solution. We try to avoid it as much as possible in general as it makes bisection more difficult. If we could find another way it's probably better.
Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
- This has the potential to slow down Wine development.
- Or the author could request their previous commit to be reverted to get unblocked.
If the CI shows failures, block the MR.
- That can still cause the Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
- So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
- The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR. https://testbot.winehq.org/FailuresList.pl
- This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
- Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
- Determining if the CI results contain new failures would no longer fall on Alexandre alone.
The move to GitLab and merge requests supposedly also involved relying more on the CI status, and MRs were supposedly not going to be merged on failure. I think it's the way to go, except it only held for some time, with a few exceptions to the rule that broke all MRs from time to time.
IMO that rule was good and we should stick to it more closely. The issue is that recently all MRs have started to systematically fail, so following that rule we'd not merge anything.
A good solution to that, in order not to block further progress, would be to exclude the systematically failing tests from the Gitlab CI if nobody is stepping up to fix them, the same way the d3d tests were excluded for a time.
We also now miss the Windows TestBot feedback, which was definitely useful and is silently gone (as if there were no failures anymore!). It didn't show up as obviously on the MR page as it wasn't impacting the MR status. It would be nice to have it back, and I think it should change the MR status too (which would maybe require a different and better GitLab integration).
Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
Do away with direct commits? (see the 'New failures analysis' email)
Use the Wine party fund to pay developers to fix test bugs.
Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the tests progress [4] and thus is less discouraging than the main index page. https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
Suggestions welcome. Hopefully we can figure something out that will get us on the path to zero failures.
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases things toward fixing them during the code freeze. This might be alright but it also means it will take a long time before some are fixed.
Also I'm not completely sure the regression list rank is very effective; as I see it, it only means: small regressions are okay, though please fix them once a year.
If we mean to say that test failures are not okay, especially systematic ones, we should make it more obvious.
An idea would be to have a separate wall of shame for test failures. The module maintainers would be responsible for any failure that has not been blamed onto someone else, and, as you suggested above, that could then be used to decide to hold future contributions (on a module? for someone?) until the failures get fixed.
Additionally / alternatively, instead of hinting that we should fix them during the code freeze, we could also have a dedicated time (one day? two?) on every release (month?), where only test-fix changes would be merged.
Cheers,
On Monday, 23 October 2023 17:58:50 CEST Rémi Bernon wrote:
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases things toward fixing them during the code freeze. This might be alright but it also means it will take a long time before some are fixed.
While we're at it, maybe a new "wine test suite bug" for the bugtracker? "testcase" is similar, but not quite the same.
Regards, Fabian Maurer
On 23/10/2023 18:58, Rémi Bernon wrote:
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases things toward fixing them during the code freeze. This might be alright but it also means it will take a long time before some are fixed.
Also I'm not completely sure the regression list rank is very effective; as I see it, it only means: small regressions are okay, though please fix them once a year.
I think this is the wrong approach. I personally take regressions very seriously, and as a priority if it's something I screwed up.
Regressions can be pretty serious (unless it's some flaky test I guess), I think they should *always* be prioritized, not just deferred to code freeze, test failures or not.
It's not just for the end users who have to suffer with the regression or have to revert the blamed commit if they want to keep updating Wine. It's also for you, because usually the faster you get to the regression, the better your mind is set to that code (assuming it was filed relatively recently after it's been broken), so you have a better idea at that point how it works, and maybe why it got broken, than several months later down the line.
And also it's important indirectly for users to update (or downstream) so they can report regressions earlier than once a year. Otherwise too much gets piled up. And I know some people are scared to update because of regressions and "developers" not taking the time to fix them after breaking it, so they have reports sitting for months, which is extremely frustrating (and well, developers can be users too, maybe not in the same area/module of what was broken though).
On 10/23/23 20:44, Gabriel Ivăncescu wrote:
On 23/10/2023 18:58, Rémi Bernon wrote:
You started filing bugs for new test failures, as regressions (and I got more than a few myself). I think this could be a good incentive, but as they are mixed with code regressions it also biases things toward fixing them during the code freeze. This might be alright but it also means it will take a long time before some are fixed.
Also I'm not completely sure the regression list rank is very effective; as I see it, it only means: small regressions are okay, though please fix them once a year.
I think this is the wrong approach. I personally take regressions very seriously, and as a priority if it's something I screwed up.
Regressions can be pretty serious (unless it's some flaky test I guess), I think they should *always* be prioritized, not just deferred to code freeze, test failures or not.
It's not just for the end users who have to suffer with the regression or have to revert the blamed commit if they want to keep updating Wine. It's also for you, because usually the faster you get to the regression, the better your mind is set to that code (assuming it was filed relatively recently after it's been broken), so you have a better idea at that point how it works, and maybe why it got broken, than several months later down the line.
And also it's important indirectly for users to update (or downstream) so they can report regressions earlier than once a year. Otherwise too much gets piled up. And I know some people are scared to update because of regressions and "developers" not taking the time to fix them after breaking it, so they have reports sitting for months, which is extremely frustrating (and well, developers can be users too, maybe not in the same area/module of what was broken though).
Of course, I didn't mean that regressions should not be taken seriously. I also try to fix them as soon as possible if it's obviously broken and unusable.
I only meant that for the most elusive, less impactful ones, we have a dedicated time to spend fixing them. And random / rare test failures could probably be seen as such. If all regressions were considered the highest priority, code freeze would be unnecessary.
FWIW my motivation for working on user32:msg is as follows:
I want to implement MSAA events because I am using them in a program I'm developing separately, that I want to work in Wine. For that, I need to work with the user32:msg tests. So by doing this work, I end up with a cleaner test unit to work with, and I demonstrate that I'm willing to put in the energy to fix the things I'm working on.
Trouble is, that only makes sense if I'm going to be sticking with it long-term. There's an initial learning investment before I can really work on it. I've found that it's rarely useful for me to look at tests in areas where I'm not especially knowledgeable. I simply don't know enough to understand them and work on a credible solution. So from my perspective it's not an incentive problem as I've not even gotten to the point of sending changes in the past.
On Monday, 23 October 2023 07:29:50 CDT Francois Gouget wrote:
Another part of the naive thinking was that if developers did not work on the tests it was just because the tools needed to be improved or the test environments were too unreliable. So for a long time all efforts have centered on that. But I think the CI is good enough now to not be the main obstacle [1].
I would agree that the tools are perfectly fine nowadays (well, except for the abandonment of the TestBot... but in terms of finding bugs to fix, the tools are fine).
I do quite like the patterns page. Though at this point, when I take time to fix tests, I find what's most helpful is the already filed bugs. How much effort do those bugs take to file? Is that sustainable (and considering other tasks on your plate)? Is that a job you're comfortable with doing?
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
Revert commits that introduced new failures.
- Do it the very next day if the failure is systematic?
- What if the failure is random and only happens once a day? Or once a week?
- What if the failure does not impact the CI? For instance if the CI has no macOS test configuration and the failure only happens on macOS.
- Should only the test part of the MR be reverted? (if that's the cause of the failure)
- Who makes the decision to revert? Alexandre? A dedicated person who will catch a lot of flak?
Block an author's new MRs if they did not fix failures introduced by one of their previous commits.
- This has the potential to slow down Wine development.
- Or the author could request their previous commit to be reverted to get unblocked.
If the CI shows failures, block the MR.
- That can still cause the Wine development to halt when the CI has a 100% failure rate (as has been the case for the GitLab CI recently).
- So it's only doable if the false positive rate is pretty low. But then it's likely to just result in the developer trying their luck with the next CI run.
I think we need something along the lines of blocking new patches, or reverting blamed commits. Nothing less drastic has worked.
There are a lot of test failures with unknown causes, though. Whose responsibility is it to fix them? Blocking *all* merge requests on those grounds doesn't solve that problem, but we will need a strong enough incentive to make sure that whoever is responsible fixes the bugs.
- If the CI shows failures, require that the author explain why they are not caused by their MR before it can be considered for merging.
- The TestBot's new failure lookup tool would be a good way to prove the failure pre-existed the MR. https://testbot.winehq.org/FailuresList.pl
- This is a softer version of the previous option and should not block Wine development. It may also push developers to fix the failures out of frustration at having to explain them away again and again since they cannot ignore them.
- Reviewers would also be responsible for verifying that the explanations are accurate and for objecting to the merge if not.
- Determining if the CI results contain new failures would no longer fall on Alexandre alone.
I think we had an (unofficial?) policy along these lines for some time. It may have helped avoid regressions, but it didn't result in existing failures getting fixed.
- Have someone dedicated to tracking the source of new failures, reporting them to the authors, following up on the failures, asking for progress on a fix, etc. This would be a developer role bordering on community manager.
I suppose you've been doing at least part of that, but even with a stronger role, I suspect we also need a way to prevent developers from saying "I don't have time to fix that".
Use the Wine party fund to pay developers to fix test bugs.
Send swag to developers who fixed 10 or more test failures. Or set rewards for fixing specific test units like user32:msg, d3d11:d3d11, etc.
At least personally this would be no motivation at all.
I do really want to fix tests, but I find it hard to justify spending work time to fix them in most cases, and while I do work on Wine in my free time, tests are not the only thing I want to spend time on.
- Point test.winehq.org/ to the patterns page instead of the index page: the patterns page better reflects the tests progress [4] and thus is less discouraging than the main index page. https://gitlab.winehq.org/winehq/tools/-/merge_requests/71
I think this should be done, it's a more useful view in general.
--Zeb
On Mon, 23 Oct 2023, Zeb Figura wrote: [...]
I do quite like the patterns page. Though at this point, when I take time to fix tests, I find what's most helpful is the already filed bugs. How much effort do those bugs take to file?
It can be quite time consuming depending on the volume of new failures and whether they are hard to track down. The procedure I go through looks something like this:
* First I try to identify a group of related failures. Usually that's easy but it can be confusing when there are a lot of non-systematic new failures mixed with lots of pre-existing failures. Also I sometimes don't know enough about the test to know if it will be possible to fix all the failures in one go or if some will require a separate fix. I usually try to err on the side of not mixing things up. Developers should feel free to mark bugs as duplicates when appropriate.
* How to reproduce the bug.
  - That can be tricky when the test does not fail on its own because then I have to figure out which other test is interfering (and that can be a dead end).
  - Also when the test does not always fail bisects get more complicated.
* Identify the commit that caused the test to fail.
  - Only doable for the machines I have access to. That makes macOS failures, for instance, easier to deal with since I can just skip this step (and many others).
  - But identifying the commit helps figure out who is most likely to know what's going on and how to fix the issue so I feel it's an important step.
* Identify the date of the first failure.
  - Sometimes it's obvious from the test pattern page.
  - But when the test unit already has lots of failures I grep a mirror of the test.winehq.org reports (sorted by date). (I also use the mirror to build myself a patterns page with 8 months of history.)
* And then there is the question of identifying which tests need to be looked at:
  - I scan all the TestBot's WineTest job reports (ideally daily) and update failures-winetest.txt. The TestBot is now quite good at identifying the new failures so on good days that's fast. On bad days there are a lot of reports to look at.
That's the most efficient way to get a list of new failures, but only for those happening _in the TestBot_.
I usually try to file a bug as soon as possible so I can update the failures page and be sure the TestBot will not report the failure as new again.
Also the TestBot automatically identifies unchanging failure messages and does not report them as new on the following days. That can lead one to think a failure was a one-off when in fact it is happening systematically.
- I also scan the last job of all MRs to identify which failures were present (and update failures-mr.txt in the process): those are the failures that are not considered to be new (otherwise the MR should not have been merged). When it's all green this is obviously fast but otherwise it requires looking at all the logs. If a failure happens only once it may not be worth reporting. But in failures-mr.txt I can see which ones are most common and I try to report those first.
This also allows me to identify failures that only happen in the GitLab CI and not in full WineTest runs.
- And from time to time I just go through the patterns page to identify non-TestBot, non-GitLab CI new failures such as those that happen on Remi's boxes or mine.
Scanning the pattern page takes more time so I don't do it as often.
* That's all for reporting new failures but sometimes failures get fixed without the bug being closed (which is quite understandable: the developer may just not be aware of the bug).
That does not have much of a negative impact on the TestBot so I give closing bugs a much lower priority (it can artificially inflate the failure modes number on the patterns page though).
Closing these mostly involves looking at the TestBot's failures page and checking those that have not been matched in a while (or ever).
There's also a more interesting reason to look at those: identifying the entries where I got the regexps wrong so the failures may still be reported as new (which I should notice in failures-winetest.txt, but only if the TestBot does not already identify them as old).
* Finally, from time to time, less than once a month, I go through the failures page to identify the entries that are not needed because the failure has been fixed.
Is that sustainable (and considering other tasks on your plate)?
When I have to focus on other things I generally have to stop looking at the tests for a while. So it's not totally sustainable.
I believe in fixing my own regressions when it is clear that my patch introduced them.
As one of "those" developers whom happen to be on that list which caused the domdoc test failure. The patch passed on all the test boxes when submitted, just the CI failed. A test in the patch just happen to highlight an issue in another part of wine.
It took a long time to track the issue down and understand what was going on. Unless there is a CI testbox we can (easily) test against, some failures are going to have to slip through, as some patches fix more than they hurt.
Whose responsibility was it to fix? (MR pending, by the way.)
Stopping future MRs from being committed due to CI failures will just drive potential developers to other projects. It's one thing to fix your own patch up; it's another to find/fix a completely different issue (maybe related). This scenario can lead to frustration when your initial 2-line patch requires another 10 patches before it can be accepted. Many will give up ("too hard" basket) and the bug will live for another 10 years.
Reverting patches isn't the answer and will just cause more pain than it's worth.
1. Bugzilla tracking: fixed in x, reverted in y, now broken again (what nightmares are made of).
2. It will make reading the git commit log more confusing.
Other options to consider:
* Have a regular, regression/test-fixes-only release? (every quarter?)
* Have the paid Wine developers work on them? (*ducks* for cover)
* There seem to be more and more paid Wine developers appearing every week, so in the short term it might be possible to make that work.
Regards Alistair.
On Mon, Oct 23, 2023 at 6:31 AM Francois Gouget fgouget@free.fr wrote:
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests. This touches on the domain of social sciences so there will be no obvious 'right fix'. It's also something I cannot do myself since I don't have any authority to make policy changes.
Anyway, here are a few ideas, including some extreme ones, and hopefully we can decide on something that works for Wine:
- Use the Wine party fund to pay developers to fix test bugs.
Ooh, I like that idea. Where do I sign up?
-Alex
Other suggestions I got:
* When a test fails systematically in the GitLab CI, stop running it when testing MRs (i.e. add it to the EXCLUDE_TESTS list). It won't help get it fixed but at least it will stop it from polluting the CI results.
The downside is we won't notice if the excluded test starts getting even more failures.
* One could also block commits to that module on the basis that it's impossible to know if they are good since they don't go through that module's tests. But that would probably impact development too much.
* Systematically CC a module's maintainer on bugs about test failures.
* Also CC the "Other knowledgeable persons" if any?
On Thu, Oct 26, 2023 at 8:38 AM Francois Gouget fgouget@codeweavers.com wrote:
- One could also block commits to that module on the basis that it's impossible to know if they are good since they don't go through that module's tests. But that would probably impact development too much.
This might be reasonable for modules with active maintainers only - if a maintainer is reviewing and approving MRs for the component they should be willing to fix the CI.
Systematically CC a module's maintainer on bugs about test failures.
Also CC the "Other knowledgeable persons" if any?
I see no reason not to do that. Again, maintainers should be willing to fix the CI. If it's really unfixable, they can make a case for excluding the test unit. "Other knowledgeable persons" may not have an expectation to fix it but will likely want to know anyway.
On Mon, 23 Oct 2023, Francois Gouget wrote: [...]
So the conclusion I have come to is that making further and lasting progress will require policy changes to incentivize work on the tests.
* The goal is to get rid of the TestBot so the bridge to GitLab will not be restored in any form. There will also be no further TestBot development.
* Wine developers are on their own to figure out if their commits cause new failures, and if they do they are on their own to figure out how to reproduce the issues.
* The other policy changes are still to be determined.
Francois Gouget fgouget@codeweavers.com writes:
- The other policy changes are still to be determined.
Thank you for raising the issue. I understand your frustration that after all these years we still can't achieve zero failures, but I'm not envisioning drastic policy changes at this point, particularly not extra constraints on developers.
We all know that fixing tests can be hard even for the best developers, so punishing people for not fixing tests, by reverting their changes or blocking their MRs, would be counterproductive IMHO.
There's a balance to strike between the efforts needed to fix the tests, and the efforts needed to develop new features and fix bugs in real apps. A successful test suite is important, but it's not important enough to sacrifice progress in other areas.
We also have to look at the positive side: we have millions of tests that are being executed and succeed on every commit, and they are catching many potential regressions. While the failures are annoying, they are only a tiny percentage of the tests, and they don't make the whole test suite useless. That's also why I don't want to disable an entire test file when only a couple of tests fail. However, it's OK to disable specific tests, using flaky() or todo(), if they are too hard to fix properly.
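For illustration, here is a minimal sketch of what such annotations typically look like in a Wine conformance test, assuming the ok(), todo_wine and flaky statement prefixes provided by include/wine/test.h; the window and focus checks themselves are made up for the example and are not taken from an actual test:

    /* Hypothetical checks, only meant to show the annotation syntax. */
    #include <windows.h>
    #include "wine/test.h"

    static void test_focus_example(void)
    {
        HWND hwnd = CreateWindowA("static", "test", WS_POPUP | WS_VISIBLE,
                                  0, 0, 100, 100, NULL, NULL, NULL, NULL);
        ok(hwnd != NULL, "CreateWindowA failed, error %lu\n", GetLastError());

        /* Behavior not implemented in Wine yet: the failure is reported as a
         * "todo" instead of a plain failure, and it becomes a failure again
         * if the check unexpectedly starts succeeding on Wine. */
        todo_wine
        ok(GetFocus() == hwnd, "window %p does not have focus\n", hwnd);

        /* Check known to fail randomly (e.g. window manager races): the
         * failure is still printed but marked as flaky so it is not counted
         * as a new failure. */
        flaky
        ok(GetForegroundWindow() == hwnd, "window %p is not in the foreground\n", hwnd);

        DestroyWindow(hwnd);
    }

    START_TEST(focus_example)
    {
        test_focus_example();
    }

The failing check stays visible in the logs, but it no longer counts as a plain failure, so the rest of the test unit keeps its value.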
So I think we have to accept that test failures and regressions are the price to pay for making progress. We all need to do our best to minimize them and keep them under control, but given the complexity of what we are doing with Wine, we can't always achieve perfection. I'm also hoping that, as the pool of paid developers grows, we will have more room to allocate some paid developer time to fixing the tests.
Hi,
On 23/10/23 14:29, Francois Gouget wrote:
Replicating failures can sometimes be frustratingly hard. Even if you happen to have access to the CI environment! [2]
And usually one does not have access to the CI environment. That's the case for the Windows environments for licensing reasons. On the Linux side the GitLab CI should mean progress but replicating it needs to be made much simpler before that's effective. In the meantime that means no debugger, no watching what happens on screen while the test runs, little poking into the test environment entrails (short of writing a dedicated program to do the poking), etc. In short it's remote debugging.
This is indeed a pain point. So much that at some point I even began to write a tentative solution to that, which so far I have only shared inside CodeWeavers. There is in fact (as far as I am aware) no reason why it should not be made available publicly, so here it is, I just released it:
https://gitlab.winehq.org/giomasce/minitestbot
Here is the train of thought that led to MiniTestBot:
* As you say, accessing the TestBot/CI environment is hard: the interface is limited and people have to share the available time, so if many people are hitting the TestBot/CI at the same time they have to wait for each other's job, which can be frustrating.
* It would be much better if you could run the TestBot/CI's environment directly on your hardware. Then you don't compete with others anymore and you can interact with your VM however you please, including changing settings and installing stuff.
* As you say, the problem is that we can't just distribute Windows images. But we can distribute the scripts used to generate them, which is essentially what MiniTestBot does! The MiniTestBot user needs to independently provide the Windows ISO (Microsoft distributes them for free, at least for some Windows versions) and is directly responsible for complying with licenses, while distributing the scripts does not infringe any copyright, as they are all free software and free of Microsoft intellectual property. Though I am not a lawyer, so don't take this as legal advice! :-)
* If the TestBot/CI used images generated with these scripts (or, rather, what they would become after being improved) we'd have both the possibility to run tests in the cloud, which is handy for many reasons, and the possibility to run them on premises in the same environment, which is handy in other cases.
I hope that would help to remove at least some of the frustration of debugging tests on Windows.
In practice MiniTestBot is a few scripts that take a Windows image and (nearly) automatically create an image out of it, together with other scripts to submit jobs to be automatically executed inside that image, TestBot style.
It is really a proof of concept at this stage, but I'm already finding it useful for development. At least another person in CodeWeavers told me they find it useful. It would require some more work to be integrated in the TestBot/CI, but that work shouldn't be impossible to do.
What do you think of this approach?
Also, I am happy to receive comments, suggestions and MRs (yes, those too!) for MiniTestBot.
Gio.