So I'm resubmitting this patch hoping for some feedback since the last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
Here are some examples:
* user32:msg - Test failed: hwnd 0028050C message 0738
The message value is important: an unexpected 0738 message has a completely different cause than an unexpected WM_MOUSEFIRST message. So the TestBot should not consider a failure about message 0738 to be identical to a failure about message 0200.
But the value of the window handle does not change the nature of the message. So a failure about hwnd 0028050C is identical to one about hwnd 0043050E.
* kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting
WaitCommEvent() took longer than expected and knowing by how much can help determine the right approach to deal with it. But a failure with a 1594 ms overrun is the same as one with an 869 ms overrun.
* comctl32:datetime - Test failed: Expected 24/03/2020, got 2020/03/24
The failure message obviously changes every day (because this test can only be about the current date), and yet it is the same failure.
Because a basic string comparison of past failures to the current one always shows they are different, the TestBot systematically claims that any patch involving these test units is bad. Like anyone who cries wolf all the time, it ends up being ignored and thus is not serving its purpose.
So the solution I propose is to add some delimiter around the variable parts so the TestBot can identify and ignore them when comparing failures. This way it will be able to identify the really meaningful changes.
For instance one could enclose the irrelevant parts in double-parentheses [2] as follows:
Test failed: hwnd ((0028050C)) message 0738
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
Pros:
* This provides a quick way to significantly improve the TestBot results. Where fixing each and every one of these failures would likely take weeks if not months, adding parentheses where relevant could likely be done by a single developer (probably me) in a week or so.
* With the "always new" failures out of the picture the Wine developers can focus on the rare intermittent ones, for which there is no quick fix. (Rare intermittent failures are those that happen less than 5% of the time and which are thus not present in a given test configuration's WineTest results history, but which still have a siginificant chance of happening when a patch is tested against 12 or more configurations: 3% < 5% but 12 * 3% = 36%);
* If the only way to prevent these failures from showing up as "always new" is to fix the failure, it can be tempting to just remove the test or skip it in the cases where it fails (e.g. specific locales). This could mean losing valuable information about how Windows behaves.
* We will get a steady stream of "always new" failures as new tests are added and the test configurations change (e.g. new Windows versions). Delimiting the variable parts provides a timely way to deal with these new issues.
* The TestBot patch is really simple (a one-liner; 4 lines with comments).
Cons:
* Once developers no longer get blamed all the time for these failures they may have less incentive to fix them.
-> I don't really buy that argument. Wine developers have shown they can very well ignore these test failures for years and there is no consequence to doing so anyway.
* There is a risk of failure messages containing the chosen delimiters for unrelated reasons. So using double-quotes as the delimiters as in "0028050C" would be a bad idea.
-> But a carefully chosen delimiter greatly reduces that risk and in the worst case it will only cause a few failure messages to not be detected as new. [2]
* This requires adding 'tagging' to (many) failure messages.
-> While failure messages can be tagged preventively, only those that actually fail must be tagged. Also I feel the 'tagging' is pretty minimal.
[1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html
[2] I'm not dead set on double parentheses. I'm fine with anything that has a low false-positive probability. Note that this eliminates double-quotes, single parentheses "(360, 200)", curly brackets "{360, 200}", single angle brackets "exp<0000000000>", tildes (short pathnames), etc. But pretty much double-anything would work. Maybe double-angle brackets <<0028050C>>?
[3] In this last example I feel that comparing the failure based on just "Expected, got" makes it too generic.
diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
index 27d38567ea..842ebc8808 100644
--- a/testbot/lib/WineTestBot/LogUtils.pm
+++ b/testbot/lib/WineTestBot/LogUtils.pm
@@ -1086,6 +1086,10 @@ sub _GetLineKey($)
   my ($Line) = @_;
   return undef if (!defined $Line);

+  # Remove variable parts.
+  # Use a non-greedy match to ensure the ignored part does not contain '))'.
+  $Line =~ s/\(\(.+?\)\)/(())/g;
+
   # Remove the line number
   $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/
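To illustrate the effect (this snippet is just an illustration, not part of the patch), here is how two runs of the user32:msg example above end up with the same key once the handle is delimited:

use strict;
use warnings;

# Same non-greedy substitution as in _GetLineKey() above.
sub NormalizeFailure
{
  my ($Line) = @_;
  $Line =~ s/\(\(.+?\)\)/(())/g;
  return $Line;
}

my $Run1 = "Test failed: hwnd ((0028050C)) message 0738";
my $Run2 = "Test failed: hwnd ((0043050E)) message 0738";
# Both normalize to "Test failed: hwnd (()) message 0738",
# so the second run is not reported as a new failure.
print NormalizeFailure($Run1) eq NormalizeFailure($Run2) ? "same failure\n" : "new failure\n";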
On 3/27/2020 11:53 AM, Francois Gouget wrote:
The issue is that some test failures include values that change on every run which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
So you're wanting to add a sort of comment syntax so that TestBot can ignore parts of failure messages referring to transient data.
For instance one could enclose the irrelevant parts in double-parentheses [2] as follows:
Test failed: hwnd ((0028050C)) message 0738
For this case in particular, I don't believe it's useful to output the hwnd at all, unless that's a useful piece of information in diagnosing the test failure.
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
I agree that the delimiter must be chosen carefully. I would prefer something like TESTBOT_IGNORE(...) which would never show up in a failure message otherwise, and would even have some possibility of being discovered by grepping the source code.
-Jefferson
On Fri, 27 Mar 2020, Jefferson Carpenter wrote: [...]
Test failed: hwnd ((0028050C)) message 0738
For this case in particular, I don't believe it's useful to output the hwnd at all, unless that's a useful piece of information in diagnosing the test failure.
I prefer removing useless information from the failure messages too but it's not always easy to determine what's useful.
Here the test traces some window handles (parent and child 1 to 4) so having the hwnd of the window the unexpected message is sent to can allow matching it to one of those... or not, which I guess is useful.
Maybe not specifically in the above message but at least in the "wrong parent/child" failure messages below.
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
I agree that the delimiter must be chosen carefully. I would prefer something like TESTBOT_IGNORE(...) which would never show up in a failure message otherwise, and would even have some possibility of being discovered by grepping the source code.
That's a good point. "TESTBOT_IGNORE" is relatively clear but it's also really quite verbose:
ok(!IsChild(desktop, parent), "wrong parent/child %p/%p\n", desktop, parent);
->
ok(!IsChild(desktop, parent), "wrong parent/child TESTBOT_IGNORE(%p/%p)\n", desktop, parent);
ok(!strcmp(buff, time), "Expected %s, got %s\n", time, buff);
->
ok(!strcmp(buff, time), "Expected TESTBOT_IGNORE(%s), got TESTBOT_IGNORE(%s)\n", time, buff);
None of '((', '<<', '[[' or '{{' is very greppable because of uses in C, js and the tests. '<[' is somewhat common in XML files, so maybe a mixed form like '<{' would be the best option: git grep '<[{]' */*/tests/* returns nothing.
So we'd have:
ok(!IsChild(desktop, parent), "wrong parent/child <{%p/%p}>\n", desktop, parent);
ok(!strcmp(buff, time), "Expected <{%s}>, got <{%s}>\n", time, buff);
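And if we went with that, the matching TestBot substitution would presumably just swap in the new delimiters (a sketch mirroring the patch at the top of the thread; the sample message is only illustrative):

my $Line = "Test failed: wrong parent/child <{00660590/01D90478}>";
# Non-greedy match so the ignored part cannot contain '}>'.
$Line =~ s/<\{.+?\}>/<{}>/g;
# $Line is now "Test failed: wrong parent/child <{}>"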
On 3/27/20 12:53 PM, Francois Gouget wrote:
So I'm resubmitting this patch hoping for some feedback since the last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
Here are some examples:
user32:msg - Test failed: hwnd 0028050C message 0738
The message value is important: an unexpected 0738 message has a completely different cause than an unexpected WM_MOUSEFIRST message. So the TestBot should not consider a failure about message 0738 to be identical to a failure about message 0200.
But the value of the window handle does not change the nature of the message. So a failure about hwnd 0028050C is identical to one about hwnd 0043050E.
kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting
WaitCommEvent() took longer than expected and knowing by how much can help determine the right approach to deal with it. But a failure with a 1594 ms overrun is the same as one with an 869 ms overrun.
comctl32:datetime - Test failed: Expected 24/03/2020, got 2020/03/24
The failure message obviously changes every day (because this test can only be about the current date), and yet it is the same failure.
Because a basic string comparison of past failures to the current one always shows they are different, the TestBot systematically claims that any patch involving these test units is bad. Like anyone who cries wolf all the time, it ends up being ignored and thus is not serving its purpose.
So the solution I propose is to add some delimiter around the variable parts so the TestBot can identify and ignore them when comparing failures. This way it will be able to identify the really meaningful changes.
For instance one could enclose the irrelevant parts in double-parentheses [2] as follows:
Test failed: hwnd ((0028050C)) message 0738
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
Pros:
This provides a quick way to significantly improve the TestBot results. Where fixing each and every one of these failures would likely take weeks if not months, adding parentheses where relevant could likely be done by a single developer (probably me) in a week or so.
With the "always new" failures out of the picture the Wine developers can focus on the rare intermittent ones, for which there is no quick fix. (Rare intermittent failures are those that happen less than 5% of the time and which are thus not present in a given test configuration's WineTest results history, but which still have a siginificant chance of happening when a patch is tested against 12 or more configurations: 3% < 5% but 12 * 3% = 36%);
If the only way to prevent these failures from showing up as "always new" is to fix the failure, it can be tempting to just remove the test or skip it in the cases where it fails (e.g. specific locales). This could mean losing valuable information about how Windows behaves.
We will get a steady stream of "always new" failures as new tests are added and the test configurations change (e.g. new Windows versions). Delimiting the variable parts provides a timely way to deal with these new issues.
The TestBot patch is really simple (a one-liner; 4 lines with comments).
Cons:
Once developers no longer get blamed all the time for these failures they may have less incentive to fix them.
-> I don't really buy that argument. Wine developers have shown they can very well ignore these test failures for years and there is no consequence to doing so anyway.
There is a risk of failure messages containing the chosen delimiters for unrelated reasons. So using double-quotes as the delimiters as in "0028050C" would be a bad idea.
-> But a carefully chosen delimiter greatly reduces that risk and in the worst case it will only cause a few failure messages to not be detected as new. [2]
This requires adding 'tagging' to (many) failure messages.
-> While failure messages can be tagged preventively, only those that actually fail must be tagged. Also I feel the 'tagging' is pretty minimal.
[1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html
[2] I'm not dead set on double parentheses. I'm fine with anything that has a low false-positive probability. Note that this eliminates double-quotes, single parentheses "(360, 200)", curly brackets "{360, 200}", single angle brackets "exp<0000000000>", tildes (short pathnames), etc. But pretty much double-anything would work. Maybe double-angle brackets <<0028050C>>?
[3] In this last example I feel that comparing the failure based on just "Expected, got" makes it too generic.
diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
index 27d38567ea..842ebc8808 100644
--- a/testbot/lib/WineTestBot/LogUtils.pm
+++ b/testbot/lib/WineTestBot/LogUtils.pm
@@ -1086,6 +1086,10 @@ sub _GetLineKey($)
   my ($Line) = @_;
   return undef if (!defined $Line);

+  # Remove variable parts.
+  # Use a non-greedy match to ensure the ignored part does not contain '))'.
+  $Line =~ s/\(\(.+?\)\)/(())/g;
+
   # Remove the line number
   $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/
IIUC the fix also involves updating the failing tests message to mark some information as irrelevant to the failure itself. In this case, why not just remove that information from the message?
Because the test failure messages are used to identify individual tests and track their results over time, they should probably only include meaningful information; if some additional information is still useful for debugging, add a trace right before the test to print it.
I think the main problem is that it's difficult to uniquely identify a given test, while keeping it easy to modify existing or add new ones.
Maybe using the surrounding function name and the test line number relative to it would be a good unique identifier instead. I think it would reduce the possible false positives to when tests within a function are added or reordered, which would be a good occasion to fix the failing tests.
That would require some wrapper around test functions as well to track relative line numbers, but it's maybe not as bad as going over every test message to check and remove irrelevant information.
A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD
(I haven't thought a lot about it and there's maybe some obvious traps I missed.)
Cheers,
On Fri, 27 Mar 2020, Rémi Bernon wrote: [...]
IIUC the fix also involves updating the failing tests message to mark some information as irrelevant to the failure itself. In this case, why not just remove that information from the message?
Sometimes it's needed to diagnose the failure.
Because the test failure messages are used to identify individual tests and track their results over time, they should probably only include meaningful information; if some additional information is still useful for debugging, add a trace right before the test to print it.
That makes the tests more complicated: lots of ok() calls will need an associated trace() which will either push the WineTest reports above the 1.5 MB limit, or will require only tracing the value if the test fails:
else ok(!strcmp(buff, time), "Expected %s, got %s\n", time, buff);
->
else
{
    ok(!strcmp(buff, time), "Unexpected LOCALE_USER_DEFAULT date / format\n");
    trace("Expected %s, got %s\n", time, buff);
}
or
else if (strcmp(buff, time))
{
    ok(0, "Unexpected LOCALE_USER_DEFAULT date / format\n");
    trace("Expected %s, got %s\n", time, buff);
}
I guess one could cheat by putting a linefeed before the variable parts.
else ok(!strcmp(buff, time), "Unexpected date / format\nExpected %s, got %s\n", time, buff);
It's kind of ugly and requires formulating the failure message in a specific way but it would work I guess.
[...]
Maybe using the surrounding function name and the test line number relative to it would be a good unique identifier instead. I think it would reduce the possible false positives to when tests within a function are added or reordered, which would be a good occasion to fix the failing tests.
This has some drawbacks:
* It makes it hard for developers to jump to the right line based on the failure message.
* It would be even harder for test.winehq.org to link to the right line in the source.
* The relative line numbers would change precisely when the test function is patched, which is when we would like to avoid false positives.
That would require some wrapper around test functions as well to track relative line numbers, but it's maybe not as bad as going over every test message to check and remove irrelevant information.
A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD
(I haven't thought a lot about it and there's maybe some obvious traps I missed.)
We have cases where test_other_function() calls test_some_function() so declare_winetest() would need to handle nesting. And some functions, but not all, take a line number as a parameter. See for instance check_unicode_string_() in advapi32:lsa. I suspect the combination of both cases could be tricky to handle.
On 3/27/20 5:13 PM, Francois Gouget wrote:
On Fri, 27 Mar 2020, Rémi Bernon wrote: [...]
IIUC the fix also involves updating the failing tests message to mark some information as irrelevant to the failure itself. In this case, why not just remove that information from the message?
Sometimes it's needed to diagnose the failure.
Because the test failure messages are used to identify individual tests and track their results over time, they should probably only include meaningful information; if some additional information is still useful for debugging, add a trace right before the test to print it.
That makes the tests more complicated: lots of ok() calls will need an associated trace() which will either push the WineTest reports above the 1.5 MB limit, or will require only tracing the value if the test fails:
else ok(!strcmp(buff, time), "Expected %s, got %s\n", time, buff);
->
else
{
    ok(!strcmp(buff, time), "Unexpected LOCALE_USER_DEFAULT date / format\n");
    trace("Expected %s, got %s\n", time, buff);
}
or
else if (strcmp(buff, time))
{
    ok(0, "Unexpected LOCALE_USER_DEFAULT date / format\n");
    trace("Expected %s, got %s\n", time, buff);
}
I guess one could cheat by putting a linefeed before the variable parts.
else ok(!strcmp(buff, time), "Unexpected date / format\nExpected %s, got %s\n", time, buff);
It's kind of ugly and requires formulating the failure message in a specific way but it would work I guess.
[...]
Maybe using the surrounding function name and the test line number relative to it would be a good unique identifier instead. I think it would reduce the possible false positives to when tests within a function are added or reordered, which would be a good occasion to fix the failing tests.
This has some drawbacks:
It makes it hard for developers to jump to the right line based on the failure message.
It would be even harder for test.winehq.org to link to the right line in the source.
You could also include the file and absolute line numbers as well; this was just meant to be used as a unique test identifier instead of its message.
- The relative line numbers would change precisely when the test function is patched, which is when we would like to avoid false positives.
I thought most of the time the test failures were unrelated to the changes (well the test file may be modified, but not necessarily the function where the failure happens). Adding new tests after the existing ones within a function is also a possible mitigation to prevent changing ids.
That would require some wrapper around test functions as well to track relative line numbers, but it's maybe not as bad as going over every test message to check and remove irrelevant information.
A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD
(I haven't thought a lot about it and there's maybe some obvious traps I missed.)
We have cases where test_other_function() calls test_some_function() so declare_winetest() would need to handle nesting. And some functions, but not all, take a line number as a parameter. See for instance check_unicode_string_() in advapi32:lsa. I suspect the combination of both cases could be tricky to handle.
Yeah of course it's a bit more tricky, not unfeasible though.
Another idea would be to use the stringified condition (combined with the surrounding context, as is currently the case) as a test identifier instead of its message.
Okay, so, some thoughts:
(1) How many of these are there? Have you compiled a list? Maybe we should just bite the bullet and add broken() for them instead, and defer figuring out why it's actually broken for later.
(2) Remi's comment inspired another option, though it's definitely more work for François: look up the line number in the relevant source file and compare the source line.
On Fri, 27 Mar 2020, Zebediah Figura wrote:
Okay, so, some thoughts:
(1) How many of these are there? Have you compiled a list? Maybe we should just bite the bullet and add broken() for them instead, and defer figuring out why it's actually broken for later.
I'll start with that part.
I only checked the failures that happen in the base 32 bit test configurations so far (anything that's highlighted in "32 bit WineTest: base VMs" jobs). A brief look leads me to expect finding essentially the same set of issues in the 64 bit tests ("64 bit WineTest: base VMs" jobs), but the Wine jobs may have quite a different set ("WineTest: debiant VM" jobs).
When attacking the "always new" failures I think the top priority is with those:
* comctl32:datetime: Fails in the Arabic locale
  datetime.c:796: Test failed: Expected 24/03/2020, got 2020/03/24
  https://bugs.winehq.org/show_bug.cgi?id=47866

* user32:cursorinfo "wrong info cursor" failure in Chinese locale on Windows 10
  cursoricon.c:2366: Test failed: wrong info cursor 003C04BB/00010003
  https://bugs.winehq.org/show_bug.cgi?id=48808

* user32:win "unexpected 0x738 message" Windows 10 failures
  win.c:3858: Test failed: message 0738 available
  win.c:3852: Test failed: hwnd 00210500 message 0738
  https://bugs.winehq.org/show_bug.cgi?id=48815

* user32:win "unexpected 0x0282 message" Chinese Windows 10 failures
  win.c:3849: Test failed: hwnd 00660590 message 0282
  win.c:3928: Test failed: hwnd 00660590/01D90478 message 0282
  win.c:3931: Test failed: hwnd 00660590/01D90478 message 0282
  https://bugs.winehq.org/show_bug.cgi?id=48819

* user32:win "unexpected 0x7fff message" Japanese Windows 10 failures
  win.c:3849: Test failed: hwnd 00660590 message 7fff
  win.c:3928: Test failed: hwnd 00660590/01D90478 message 7fff
  win.c:3931: Test failed: hwnd 00660590/01D90478 message 7fff
  https://bugs.winehq.org/show_bug.cgi?id=48820

* Error dialogs cause the user32:winstation tests to fail
  winstation.c:973: Test failed: unexpected foreground window 00070064
  https://bugs.winehq.org/show_bug.cgi?id=47894

* dxgi:dxgi
  dxgi.c:5185: Test failed: Got unexpected message 0x31f, hwnd 003101F4, wparam 0x1, lparam 0.
  dxgi.c:5185: Test failed: Got unexpected message 0x46, hwnd 003101F4, wparam 0, lparam 0x28fc30.
  dxgi.c:5185: Test failed: Got unexpected message 0x24, hwnd 003101F4, wparam 0, lparam 0x28f998.

* kernel32:debugger
  debugger.c:142: Test failed: unable to open 'C:\Users\winetest\AppData\Local\Temp\wt2983.tmp'
  debugger.c:142: Test failed: failed to open: C:\Users\winetest\AppData\Local\Temp\wt2983.tmp
  debugger.c:654: Test failed: the child and debugged pids don't match: 4780 != 6452
  -> These are actually quite rare (12 instances in all of test.winehq.org's reports).
I'm deferring investigating the following two because it looks like they may be caused by the "big delay" issues. So I think it makes more sense to revisit them after I've had a look at it from the QEmu side.
* kernel32:comm
  comm.c:886: Test failed: WaitCommEvent used 1594 ms for waiting
  The cause for this one may have more to do with the lack of serial baud rate emulation in QEmu.
  https://bugs.winehq.org/show_bug.cgi?id=48108

* mmdevapi:render
  render.c:1025: Test failed: Position 6592 too far after playing 100ms
The remaining new failures in that TestBot report look to be more in the "rare failure" category. Sometimes the issue is made worse by the lack of reference results due to WineTest timing out (e.g. on w1064v1809-he; the TestBot cannot currently use incomplete test reports, but I have an idea for solving that).
On Fri, 27 Mar 2020, Zebediah Figura wrote: [...]
(2) Remi's comment inspired another option, though it's definitely more work for François: look up the line number in the relevant source file and compare the source line.
That's pretty hard for three reasons:
* The test reports are analyzed on the TestBot server which does not have access to the Wine source (giving it access to a Wine checkout is on the todo list for other reasons).
* What it needs is the patched source code whereas what it would have access to is the unpatched source.
* On the server multiple processes may need access to the patched source code at the same time.
Also a number of tests loop over the test cases so that they all involve the same lines but the failures are still distinct. For instance (from the set of "new" failures in a recent "32 bit WineTest: base VMs" job):
d3d11.c:2729: Test failed: Test 22: Got unexpected hr 0x8007000e (format 0x9).
d3d11.c:2729: Test failed: Test 23: Got unexpected hr 0x8007000e (format 0x9).
d3d11.c:2729: Test failed: Test 24: Got unexpected hr 0x8007000e (format 0xf).
d3d11.c:2729: Test failed: Test 25: Got unexpected hr 0x8007000e (format 0x13).
These all come from the same line but it would be wrong to say that a failure on iteration 22 is equal to the one on iteration 25.
On Fri, 27 Mar 2020, Zebediah Figura wrote:
Okay, so, some thoughts:
(1) How many of these are there? Have you compiled a list? Maybe we should just bite the bullet and add broken() for them instead, and defer figuring out why it's actually broken for later.
I missed the broken() aspect initially. The problem is broken() calls are not meant to be revisited. So we'd either need a /* FIXME */ comment next to them, or use an unreliable() macro, or, better, some sort of todo(unreliable).
It seems the first two options would make the test failure disappear entirely, so I'd prefer the third one, which could issue some sort of message that would not be counted as a failure.
But I'm not sure any of this is simpler or better than what I proposed.
On Fri, 27 Mar 2020 at 16:24, Francois Gouget fgouget@codeweavers.com wrote:
So I'm resubmitting this patch hoping for some feedback since the last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
If the main goal is to stop the testbot from being ignored, and to limit the number of new failures sneaking in, would it make sense to start with something fairly blunt, like ignoring failures for tests on unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed on w1064v1507, but not w1064v1809, you'd then blacklist all of ddraw:ddraw7 on w1064v1809. That means you potentially ignore some ddraw:ddraw7 tests that are reliable, but it would still be an improvement over effectively ignoring everything. Perhaps more importantly, it also means you can incrementally work on reducing the size of the blacklist, instead of having to do that work upfront.
On 3/27/20 12:14 PM, Henri Verbeet wrote:
On Fri, 27 Mar 2020 at 16:24, Francois Gouget fgouget@codeweavers.com wrote:
So I'm resubmitting this patch hoping for some feedback since the last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
If the main goal is to stop the testbot from being ignored, and to limit the number of new failures sneaking in, would it make sense to start with something fairly blunt, like ignoring failures for tests on unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed on w1064v1507, but not w1064v1809, you'd then blacklist all of ddraw:ddraw7 on w1064v1809. That means you potentially ignore some ddraw:ddraw7 tests that are reliable, but it would still be an improvement over effectively ignoring everything. Perhaps more importantly, it also means you can incrementally work on reducing the size of the blacklist, instead of having to do that work upfront.
For what it's worth, I also think this is a good idea. That way we don't have to pollute the Wine source, but it's still obvious what tests need fixing.
On Fri, 27 Mar 2020, Henri Verbeet wrote: [...]
If the main goal is to stop the testbot from being ignored, and to limit the number of new failures sneaking in, would it make sense to start with something fairly blunt, like ignoring failures for tests on unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed on w1064v1507, but not w1064v1809, you'd then blacklist all of ddraw:ddraw7 on w1064v1809. That means you potentially ignore some ddraw:ddraw7 tests that are reliable, but it would still be an improvement over effectively ignoring everything.
So that would mean maintaining a set of (test:unit, testbot-vm) tuples where the TestBot should ignore new failures.
I'm not very fond of the blacklist approach. Once it's in place it may be very tempting to just put every flaky test into it rather than fixing it. This will lead to a long list of exceptions which will have to be maintained. In particular knowing when to remove an entry will be very important.
I also worry that once the test failures are papered over there won't be much incentive to fix them. To be fair that risk is not really different from what could happen with my patch but the scale would be larger.
But it could work with the rare intermittent failures too which would be valuable. And it could be useful when introducing new test configurations that have new intermittent / variable issues. So there could be value in doing this anyway.
Maybe with some safeguards it can be made to work.
* I think I'd want a Wine bug describing the issue to be associated with each blacklist entry. That bug should provide some minimal diagnosis: whether it's a new Windows behavior, a race condition or some issue that was reported to QEmu. That would ensure we know why the blacklist entry was added. One could also check the status of the bug when reviewing the blacklist entries. A closed bug would be a strong hint that the blacklist entry is no longer needed.
* And I think it would be better to have a regexp that matches only the troublesome failures rather than to blacklist the whole test unit. Besides being finer grained this would be useful for cases like user32:win which has different issues depending on the locale and where each should be associated to a different bug (bugs 48815, 48819 and 48820).
* I think I'd also want to record the time when the blacklist entry was last used. This relies on having the above regular expression since without it the TestBot would not know anything beyond 'the test unit was run and had failures'. Also the regular expression would only be used against *new* failures. So this would really record the last time the blacklist entry was actually useful.
An entry that was unused for a long time would be a prime candidate for reviewing the corresponding bug and for removal. (Note: The blacklist would also be used on WineTest reports so it would get a chance of matching its target at least 5 days / week).
* I'd want a page listing the blacklisted entries so developers have a good starting point to work on them.
* Ideally the blacklist page would also point to the tasks where the blacklist was last used. I think this would also be useful for developers trying to fix the issues, particularly for the rare intermittent kind.
Note that Wine VMs often test in multiple configurations per task (e.g. wow32 and wow64, different locales), each producing its own test report. So pointing at just the task would leave the developer guessing which report should be looked at. But that's probably ok.
More importantly, (test:unit, testbot-vm) tuples make it impossible to blacklist a specific Wine test configuration such as a specific locale since they all run on the same VM. Similarly it would make blacklisting bitness-blind on Windows VMs.
If necessary the tuple could maybe be extended with the specific mission the blacklist applies to. But I'm not sure on the specific impacts and it may not be worth it.
* Pseudo database schema and sample use:
FailureBlacklists
-----------------
PK  Bug            48815
PK  TestModule     user32
PK  TestUnit       win
    Name           0x738 message
    FailureRegExp  Test failed: hwnd [0-9A-F]{8,16} message 0738
    LastUse        2020-03-27

FailureBlacklistVMs
-------------------
PK  Bug            48815
PK  TestModule     user32
PK  TestUnit       win
PK  VMName         Entries for w1064v1709, w1064v1809, etc.

(48815, user32, win, w1064v1709)
(48815, user32, win, w1064v1809)
(48815, user32, win, w1064v1809_2scr)
...

FailureBlacklistUses (optionally)
---------------------------------
PK  Bug
PK  TestModule
PK  TestUnit
PK  JobId
PK  StepNo
PK  TaskNo

(48815, user32, win, 68507, 1, 7)
(48815, user32, win, 68508, 1, 7)
...
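To make the intent concrete, here is a rough sketch of how the TestBot could use such an entry when analyzing a task's new failures. None of these variable or field names are actual TestBot code, they just follow the schema above (and a real implementation would also check the FailureBlacklistVMs entries):

use strict;
use warnings;

# One blacklist entry, as it could come out of the FailureBlacklists table.
my @Blacklist = (
  { Bug => 48815, TestModule => "user32", TestUnit => "win",
    FailureRegExp => qr/Test failed: hwnd [0-9A-F]{8,16} message 0738/ },
);

# New failures extracted from a task's report (sample data).
my @NewFailures = (
  { TestModule => "user32", TestUnit => "win",
    Line => "win.c:3852: Test failed: hwnd 00210500 message 0738" },
);

foreach my $Failure (@NewFailures)
{
  foreach my $Entry (@Blacklist)
  {
    next if ($Failure->{TestModule} ne $Entry->{TestModule} or
             $Failure->{TestUnit} ne $Entry->{TestUnit});
    next if ($Failure->{Line} !~ $Entry->{FailureRegExp});
    # Don't blame the patch author for this known issue...
    $Failure->{IgnoredByBug} = $Entry->{Bug};
    # ...and record that the blacklist entry is still useful.
    $Entry->{LastUse} = time();
    last;
  }
}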
On 3/31/20 7:25 AM, Francois Gouget wrote:
On Fri, 27 Mar 2020, Henri Verbeet wrote: [...]
If the main goal is to stop the testbot from being ignored, and to limit the number of new failures sneaking in, would it make sense to start with something fairly blunt, like ignoring failures for tests on unreliable configurations? E.g., suppose ddraw:ddraw7 reliably passed on w1064v1507, but not w1064v1809, you'd then blacklist all of ddraw:ddraw7 on w1064v1809. That means you potentially ignore some ddraw:ddraw7 tests that are reliable, but it would still be an improvement over effectively ignoring everything.
So that would mean maintaining a set of (test:unit, testbot-vm) tuples where the TestBot should ignore new failures.
I'm not very fond of the blacklist approach. Once it's in place it may be very tempting to just put every flaky test into it rather than fixing it. This will lead to a long list of exceptions which will have to be maintained. In particular knowing when to remove an entry will be very important.
I also worry that once the test failures are papered over there won't be much incentive to fix them. To be fair that risk is not really different from what could happen with my patch but the scale would be larger.
I'm inclined to think this will happen with any approach that silences failing tests, including your original proposition.
We wouldn't do this for test.winehq.org, presumably (and as I understand, the two systems are different enough that it's not difficult).
But it could work with the rare intermittent failures too which would be valuable. And it could be useful when introducing new test configurations that have new intermittent / variable issues. So there could be value in doing this anyway.
Maybe with some safeguards it can be made to work.
- I think I'd want a Wine bug describing the issue to be associated with each blacklist entry. That bug should provide some minimal diagnosis: whether it's a new Windows behavior, a race condition or some issue that was reported to QEmu. That would ensure we know why the blacklist entry was added. One could also check the status of the bug when reviewing the blacklist entries. A closed bug would be a strong hint that the blacklist entry is no longer needed.
That seems reasonable regardless of what approach we take.
And I think it would be better to have a regexp that matches only the troublesome failures rather than to blacklist the whole test unit. Besides being finer grained this would be useful for cases like user32:win which has different issues depending on the locale and where each should be associated to a different bug (bugs 48815, 48819 and 48820).
I think I'd also want to record the time when the blacklist entry was last used. This relies on having the above regular expression since without it the TestBot would not know anything beyond 'the test unit was run and had failures'. Also the regular expression would only be used against *new* failures. So this would really record the last time the blacklist entry was actually useful.
An entry that was unused for a long time would be a prime candidate for reviewing the corresponding bug and for removal. (Note: The blacklist would also be used on WineTest reports so it would get a chance of matching its target at least 5 days / week).
I'd want a page listing the blacklisted entries so developers have a good starting point to work on them.
Ideally the blacklist page would also point to the tasks where the blacklist was last used. I think this would also be useful for developers trying to fix the issues, particularly for the rare intermittent kind.
Note that Wine VMs often test in multiple configurations per task (e.g. wow32 and wow64, different locales), each producing its own test report. So pointing at just the task would leave the developer guessing which report should be looked at. But that's probably ok.
More importantly, (test:unit, testbot-vm) tuples make it impossible to blacklist a specific Wine test configuration such as a specific locale since they all run on the same VM. Similarly it would make blacklisting bitness-blind on Windows VMs.
If necessary the tuple could maybe be extended with the specific mission the blacklist applies to. But I'm not sure on the specific impacts and it may not be worth it.
Pseudo database schema and sample use:

FailureBlacklists
-----------------
PK  Bug            48815
PK  TestModule     user32
PK  TestUnit       win
    Name           0x738 message
    FailureRegExp  Test failed: hwnd [0-9A-F]{8,16} message 0738
    LastUse        2020-03-27

FailureBlacklistVMs
-------------------
PK  Bug            48815
PK  TestModule     user32
PK  TestUnit       win
PK  VMName         Entries for w1064v1709, w1064v1809, etc.

(48815, user32, win, w1064v1709)
(48815, user32, win, w1064v1809)
(48815, user32, win, w1064v1809_2scr)
...

FailureBlacklistUses (optionally)
---------------------------------
PK  Bug
PK  TestModule
PK  TestUnit
PK  JobId
PK  StepNo
PK  TaskNo

(48815, user32, win, 68507, 1, 7)
(48815, user32, win, 68508, 1, 7)
...
I think all of this looks reasonable to me as well.
On Tue, 31 Mar 2020, Zebediah Figura wrote: [...]
We wouldn't do this for test.winehq.org, presumably (and as I understand, the two systems are different enough that it's not difficult).
Yes. test.winehq.org is a completely different codebase.