On 3/27/20 12:53 PM, Francois Gouget wrote:
So I'm resubmitting this patch in the hope of getting some feedback; last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run, which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
Here are some examples:
user32:msg - Test failed: hwnd 0028050C message 0738
The message value is important: an unexpected 0738 message has a completely different cause than an unexpected WM_MOUSEFIRST message. So the TestBot should not consider a failure about message 0738 to be identical to a failure about message 0200.
But the value of the window handle does not change the nature of the message. So a failure about hwnd 0028050C is identical to one about hwnd 0043050E.
kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting
WaitCommEvent() took longer than expected, and knowing by how much can help in choosing the right approach to deal with it. But a failure with a 1594 ms overrun is the same as one with an 869 ms overrun.
comctl32:datetime.c - Test failed: Expected 24/03/2020, got 2020/03/24
The failure message obviously changes every day (because this test can only be about the current date), and yet the failure itself is the same.
Because a basic string comparison of past failures to the current one always shows they are different, the TestBot systematically claims that any patch touching these test units is bad. Like anyone who cries wolf all the time, it ends up being ignored and thus no longer serves its purpose.
So the solution I propose is to add some delimiter around the variable parts so the TestBot can identify and ignore them when comparing failures. This way it will be able to identify the really meaningful changes.
For instance one could enclose the irrelevant parts in double-parentheses [2] as follows:
Test failed: hwnd ((0028050C)) message 0738
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
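On the test side this would just mean wrapping the variable field of the ok() message in the chosen delimiter. Here is a minimal made-up sketch for the comm timing case (the timing logic and the names are only for illustration, not the real kernel32:comm test):

/* A made-up sketch of how a test could tag the run-to-run variable part
 * of its failure message; the TestBot then ignores whatever is between
 * the ((...)) delimiters when comparing failures. */
#include <windows.h>
#include "wine/test.h"

static void test_wait_timing(void)
{
    DWORD start, elapsed;

    start = GetTickCount();
    Sleep(1500);                       /* stand-in for the real WaitCommEvent() wait */
    elapsed = GetTickCount() - start;

    ok(elapsed < 1000, "WaitCommEvent used ((%u)) ms for waiting\n",
       (unsigned int)elapsed);
}

START_TEST(comm)
{
    test_wait_timing();
}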
Pros:
This provides a quick way to significantly improve the TestBot results. Where fixing each and every one of these failures would likely take weeks if not months, adding parentheses where relevant could likely be done by a single developer (probably me) in a week or so.
With the "always new" failures out of the picture the Wine developers can focus on the rare intermittent ones, for which there is no quick fix. (Rare intermittent failures are those that happen less than 5% of the time and which are thus not present in a given test configuration's WineTest results history, but which still have a siginificant chance of happening when a patch is tested against 12 or more configurations: 3% < 5% but 12 * 3% = 36%);
If the only way to prevent these failures from showing up as "always new" is to fix the failure, it can be tempting to just remove the test or skip it in the cases where it fails (e.g. specific locales). This could mean losing valuable information about how Windows behaves.
We will get a steady stream of "always new" failures as new tests are added and the test configurations change (e.g. new Windows versions). Delimiting the variable parts provides a timely way to deal with these new issues.
The TestBot patch is really simple (a one-liner; 4 lines counting the comments).
Cons:
Once developers no longer get blamed all the time for these failures they may have less incentive to fix them.
-> I don't really buy that argument. Wine developers have shown that they can perfectly well ignore these test failures for years, and there is no consequence to doing so anyway.
There is a risk of failure messages containing the chosen delimiters for unrelated reasons. So using double-quotes as the delimiters as in "0028050C" would be a bad idea.
-> But a carefully chosen delimiter greatly reduces that risk and in the worst case it will only cause a few failure messages to not be detected as new. [2]
This requires adding 'tagging' to (many) failure messages.
-> While failure messages can be tagged preventively, only the messages of tests that actually fail really need to be tagged. Also I feel the 'tagging' is pretty minimal.
[1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html
[2] I'm not dead set on double parentheses. I'm fine with anything that has a low false-positive probability. Note that this eliminates double-quotes, single parentheses "(360, 200)", curly brackets "{360, 200}", single angle brackets "exp<0000000000>", tildes (short pathnames), etc. But pretty much double-anything would work. Maybe double-angle brackets <<0028050C>>?
[3] In this last example I feel that comparing failures based on just "Expected ..., got ..." (which is all that would remain once both values are delimited) would be too generic, hence the reworded message.
diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
index 27d38567ea..842ebc8808 100644
--- a/testbot/lib/WineTestBot/LogUtils.pm
+++ b/testbot/lib/WineTestBot/LogUtils.pm
@@ -1086,6 +1086,10 @@ sub _GetLineKey($)
   my ($Line) = @_;
   return undef if (!defined $Line);
 
+  # Remove variable parts.
+  # Use a non-greedy match to ensure the ignored part does not contain '))'.
+  $Line =~ s/\(\(.+?\)\)/(())/g;
+
   # Remove the line number
   $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/;
IIUC the fix also involves updating the failing tests' messages to mark some information as irrelevant to the failure itself. In that case, why not just remove that information from the message?
Because the test failure messages are used to identify individual tests and track their results over time, they should probably only include meaningful information; if some additional information is still useful for debugging, a trace right before the test could print it.
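For instance (again a made-up sketch rather than the real kernel32:comm test), the variable value would move to a trace() while the ok() message stays constant from run to run:

#include <windows.h>
#include "wine/test.h"

static void test_wait_timing(void)
{
    DWORD start, elapsed;

    start = GetTickCount();
    Sleep(1500);                       /* stand-in for the real wait */
    elapsed = GetTickCount() - start;

    /* The timing is still available for debugging, but it is no longer
     * part of the message used to identify the failure. */
    trace("WaitCommEvent used %u ms for waiting\n", (unsigned int)elapsed);
    ok(elapsed < 1000, "WaitCommEvent took longer than expected\n");
}

START_TEST(comm)
{
    test_wait_timing();
}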
I think the main problem is that it's difficult to uniquely identify a given test while keeping it easy to modify existing tests or add new ones.
Maybe using the surrounding function name and the test's line number relative to it would make a good unique identifier instead. I think it would reduce the possible false positives to the cases where tests within a function are added or reordered, which would be a good occasion to fix the failing tests.
That would require some wrapper around the test functions to track relative line numbers, but it may not be as bad as going over every test message to check and remove irrelevant information.
A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD
(I haven't thought a lot about it and there may be some obvious traps I missed.)
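For what it's worth, here is a rough standalone sketch of the idea, independent of the linked PoC (all names are made up): a check is identified by its enclosing function and its line offset from the start of that function.

#include <stdio.h>

/* Record the enclosing function's name and starting line so later checks
 * can report a position relative to it instead of an absolute line number. */
#define START_CHECKS  const char *check_func_ = __func__; \
                      const int check_base_line_ = __LINE__

#define check(cond, msg)                                                  \
    do {                                                                  \
        if (!(cond))                                                      \
            printf("%s+%d: Test failed: %s\n",                            \
                   check_func_, __LINE__ - check_base_line_, (msg));      \
    } while (0)

static void test_something(void)
{
    START_CHECKS;
    check(1 + 1 == 2, "arithmetic");
    /* Adding or removing lines elsewhere in the file does not change this
     * check's identifier; only edits within test_something() do. */
    check(0, "this one always fails");
}

int main(void)
{
    test_something();
    return 0;
}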
Cheers,