On 3/27/20 12:53 PM, Francois Gouget wrote:
So I'm resubmitting this patch in the hope of getting some feedback; last time it just dropped off the mailing list [1].
The issue is that some test failures include values that change on every run, which prevents the TestBot from knowing whether they are new or not. In turn this means the TestBot systematically blames the author of whichever patch is being tested for these failures.
Here are some examples:
user32:msg - Test failed: hwnd 0028050C message 0738
The message value is important: an unexpected 0738 message has a completely different cause than an unexpected WM_MOUSEFIRST message. So the TestBot should not consider a failure about message 0738 to be identical to a failure about message 0200.
But the value of the window handle does not change the nature of the message. So a failure about hwnd 0028050C is identical to one about hwnd 0043050E.
kernel32:comm - Test failed: WaitCommEvent used 1594 ms for waiting
WaitCommEvent() took longer than expected, and knowing by how much can help in choosing the right approach to deal with it. But a failure with a 1594 ms overrun is the same as one with an 869 ms overrun.
comctl32:datetime.c - Test failed: Expected 24/03/2020, got 2020/03/24
The failure message obviously changes every day (because this test can only be about the current date), and yet the failure itself is the same.
Because a basic string comparison of past failures to the current one always shows they are different, the TestBot systematically claims that any patch touching these test units is bad. Like anyone who cries wolf all the time, it ends up being ignored and thus no longer serves its purpose.
So the solution I propose is to add some delimiter around the variable parts so the TestBot can identify and ignore them when comparing failures. This way it will be able to identify the really meaningful changes.
For instance one could enclose the irrelevant parts in double-parentheses [2] as follows:
Test failed: hwnd ((0028050C)) message 0738
Test failed: WaitCommEvent used ((1594)) ms for waiting
Test failed: Unexpected date value ((24/03/2020)), got ((2020/03/24)) [3]
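On the test side this would just mean wrapping the variable field of the ok() message in the chosen delimiter. Here is a minimal made-up sketch for the comm timing case (the timing logic and the names are only for illustration, not the real kernel32:comm test):

/* A made-up sketch of how a test could tag the run-to-run variable part
 * of its failure message; the TestBot then ignores whatever is between
 * the ((...)) delimiters when comparing failures. */
#include <windows.h>
#include "wine/test.h"

static void test_wait_timing(void)
{
    DWORD start, elapsed;

    start = GetTickCount();
    Sleep(1500);                       /* stand-in for the real WaitCommEvent() wait */
    elapsed = GetTickCount() - start;

    ok(elapsed < 1000, "WaitCommEvent used ((%u)) ms for waiting\n",
       (unsigned int)elapsed);
}

START_TEST(comm)
{
    test_wait_timing();
}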
Pros:
This provides a quick way to significantly improve the TestBot results. Where fixing each and every one of these failures would likely take weeks if not months, adding parentheses where relevant could likely be done by a single developer (probably me) in a week or so.
With the "always new" failures out of the picture the Wine developers can focus on the rare intermittent ones, for which there is no quick fix. (Rare intermittent failures are those that happen less than 5% of the time and which are thus not present in a given test configuration's WineTest results history, but which still have a siginificant chance of happening when a patch is tested against 12 or more configurations: 3% < 5% but 12 * 3% = 36%);
If the only way to prevent these failures from showing up as "always new" is to fix the failure, it can be tempting to just remove the test or skip it in the cases where it fails (e.g. specific locales). This could mean losing valuable information about how Windows behaves.
We will get a steady stream of "always new" failures as new tests are added and the test configurations change (e.g. new Windows versions). Delimiting the variable parts provides a timely way to deal with these new issues.
The TestBot patch is really simple (a one-liner; 4 lines counting the comments).
Cons:
Once developers no longer get blamed all the time for these failures they may have less incentive to fix them.
-> I don't really buy that argument. Wine developers have shown that they can perfectly well ignore these test failures for years, and there is no consequence to doing so anyway.
There is a risk of failure messages containing the chosen delimiters for unrelated reasons. So using double-quotes as the delimiters as in "0028050C" would be a bad idea.
-> But a carefully chosen delimiter greatly reduces that risk and in the worst case it will only cause a few failure messages to not be detected as new. [2]
This requires adding 'tagging' to (many) failure messages.
-> While failure messages can be tagged preventively, only the messages of tests that actually fail really need to be tagged. Also I feel the 'tagging' is pretty minimal.
[1] https://www.winehq.org/pipermail/wine-devel/2019-December/157168.html
[2] I'm not dead set on double parentheses. I'm fine with anything that has a low false-positive probability. Note that this eliminates double-quotes, single parentheses "(360, 200)", curly brackets "{360, 200}", single angle brackets "exp<0000000000>", tildes (short pathnames), etc. But pretty much double-anything would work. Maybe double-angle brackets <<0028050C>>?
[3] In this last example I feel that comparing failures based on just "Expected ..., got ..." (which is all that would remain once both values are delimited) would be too generic, hence the reworded message.
diff --git a/testbot/lib/WineTestBot/LogUtils.pm b/testbot/lib/WineTestBot/LogUtils.pm
index 27d38567ea..842ebc8808 100644
--- a/testbot/lib/WineTestBot/LogUtils.pm
+++ b/testbot/lib/WineTestBot/LogUtils.pm
@@ -1086,6 +1086,10 @@ sub _GetLineKey($)
   my ($Line) = @_;
   return undef if (!defined $Line);
 
+  # Remove variable parts.
+  # Use a non-greedy match to ensure the ignored part does not contain '))'.
+  $Line =~ s/\(\(.+?\)\)/(())/g;
+
   # Remove the line number
   $Line =~ s/^([_a-z0-9]+\.c:)\d+:( Test (?:failed|succeeded inside todo block): )/$1$2/;
IIUC the fix also involves updating the failing tests' messages to mark some information as irrelevant to the failure itself. In that case, why not just remove that information from the message?
Because the test failure messages are used to identify individual tests and track their results over time, they should probably only include meaningful information; if some additional information is still useful for debugging, a trace right before the test could print it.
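For instance (again a made-up sketch rather than the real kernel32:comm test), the variable value would move to a trace() while the ok() message stays constant from run to run:

#include <windows.h>
#include "wine/test.h"

static void test_wait_timing(void)
{
    DWORD start, elapsed;

    start = GetTickCount();
    Sleep(1500);                       /* stand-in for the real wait */
    elapsed = GetTickCount() - start;

    /* The timing is still available for debugging, but it is no longer
     * part of the message used to identify the failure. */
    trace("WaitCommEvent used %u ms for waiting\n", (unsigned int)elapsed);
    ok(elapsed < 1000, "WaitCommEvent took longer than expected\n");
}

START_TEST(comm)
{
    test_wait_timing();
}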
I think the main problem is that it's difficult to uniquely identify a given test while keeping it easy to modify existing tests or add new ones.
Maybe using the surrounding function name and the test's line number relative to it would make a good unique identifier instead. I think it would reduce the possible false positives to the cases where tests within a function are added or reordered, which would be a good occasion to fix the failing tests.
That would require some wrapper around the test functions to track relative line numbers, but it may not be as bad as going over every test message to check and remove irrelevant information.
A quick and dirty PoC: https://gcc.godbolt.org/z/3TELVD
(I haven't thought a lot about it and there may be some obvious traps I missed.)
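For what it's worth, here is a rough standalone sketch of the idea, independent of the linked PoC (all names are made up): a check is identified by its enclosing function and its line offset from the start of that function.

#include <stdio.h>

/* Record the enclosing function's name and starting line so later checks
 * can report a position relative to it instead of an absolute line number. */
#define START_CHECKS  const char *check_func_ = __func__; \
                      const int check_base_line_ = __LINE__

#define check(cond, msg)                                                  \
    do {                                                                  \
        if (!(cond))                                                      \
            printf("%s+%d: Test failed: %s\n",                            \
                   check_func_, __LINE__ - check_base_line_, (msg));      \
    } while (0)

static void test_something(void)
{
    START_CHECKS;
    check(1 + 1 == 2, "arithmetic");
    /* Adding or removing lines elsewhere in the file does not change this
     * check's identifier; only edits within test_something() do. */
    check(0, "this one always fails");
}

int main(void)
{
    test_something();
    return 0;
}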
Cheers,