I did not get to send this patch set yet so here's a preview:
* Windows test patterns http://fgouget.free.fr/tmp/winepatterns/patterns-tb-win.html
* Wine test patterns http://fgouget.free.fr/tmp/winepatterns/patterns-tb-wine.html
So for each test unit that had failures you have:
* one line per test configuration (report tag)
* one column per WineTest build (daily commit)
Each cell has a one-character description of the test result (with more details in the tooltip) and is color-coded to help identify patterns. Each 'failure' cell also links to the corresponding report page.
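To make the layout concrete, here is a minimal sketch of how one such pattern line could be rendered. The one-letter codes, CSS class names and the render_row() helper are made up for illustration; they are not the actual build-patterns output.

# Sketch: render one pattern line (one test configuration) as an HTML table row.
# The result codes and class names are hypothetical.
from html import escape

RESULT_CODES = {"ok": ".", "failed": "F", "timeout": "T", "crash": "C", "missing": " "}

def render_row(tag, results):
    """results: one (build_id, result, failure_count, report_url) tuple per
    WineTest build (daily commit)."""
    cells = [f"<td>{escape(tag)}</td>"]
    for build, result, count, url in results:
        code = RESULT_CODES.get(result, "?")
        tooltip = f"{build}: {result}, {count} failure(s)"
        cell = f'<span class="{result}" title="{escape(tooltip)}">{code}</span>'
        if result != "ok" and url:
            # 'failure' cells link to the corresponding report page
            cell = f'<a href="{escape(url)}">{cell}</a>'
        cells.append(f"<td>{cell}</td>")
    return "<tr>" + "".join(cells) + "</tr>"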
The main goal of these pages for me is to simplify detecting new failures in the WineTest results, and not just in cases where there was no failure before, since there are so many cases with preexisting failures already. In the patterns an increase in the number of failures shows up as a color change, which makes it easy to spot.
A secondary goal is to allow comparing the before/after results when a test configuration (VM or otherwise) is modified.
A side effect is that these can also help identify the different failure modes. For instance comctl32:monthcal fails on Wednesdays, except in Korean where it fails all the time: http://fgouget.free.fr/tmp/winepatterns/patterns-tb-win.html#comctl32:monthc...
To help with these tasks the page also sorts the test units to show those that look like they have new failures first. So starting the review from the top is most likely to point to the new issues.
On 4/23/21 7:54 AM, Francois Gouget wrote:
[...]
Looks like a more sophisticated version of https://www.winehq.org/~jwhite/2deb8c2825af.html, which is definitely a nice resource when I'm trying to put effort into fixing test failures.
I guess the tests are color-coded by number of failures, modulo some constant? I like the idea. I will note though that some of those colours seem hard to tell apart, e.g. the shades of green in wine d3d9:device. Also I guess they aren't consistent across tests for some reason?
On Sat, 1 May 2021, Zebediah Figura (she/her) wrote: [...]
Looks like a more sophisticated version of https://www.winehq.org/~jwhite/2deb8c2825af.html, which is definitely a nice resource when I'm trying to put effort into fixing test failures.
Right. I should probably have mentioned this bug, which notes that Jer's page was part of the inspiration. But that page did not do what I needed so I tweaked it.
https://bugs.winehq.org/show_bug.cgi?id=48164
Oh. And now the official pages are online and getting more feature complete.
https://test.winehq.org/data/patterns-tb-win.html
https://test.winehq.org/data/patterns-tb-wine.html
I guess the tests are color-coded by number of failures, modulo some constant?
Right. Each failure type (timeout, crash, etc.) has its own color. And then I use a gradient to assign a color to each 'vanilla' failure count.
Note that what counts for allocating the colors is not the actual failure counts, but the number of different failure counts. That is, a test with 4, 5 or 6 failures will get the same colors as one with 1, 2 or 100 failures because in both cases there are only 3 different values.
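A minimal sketch of that allocation scheme, assuming an HSV gradient from blue (low counts) to red (high counts); the count_colors() helper and the exact gradient endpoints are illustrative, not the actual build-patterns code.

# Sketch: one color per *distinct* failure count, spread evenly over a gradient.
import colorsys

def count_colors(failure_counts):
    """Map each distinct failure count to an HTML color string."""
    distinct = sorted(set(failure_counts))
    colors = {}
    for i, count in enumerate(distinct):
        # The position depends only on the rank of the count, not its value,
        # so 4/5/6 failures get the same colors as 1/2/100 failures.
        pos = i / max(len(distinct) - 1, 1)
        hue = (1 - pos) * 2 / 3          # 2/3 = blue, 0 = red
        r, g, b = colorsys.hsv_to_rgb(hue, 0.5, 1.0)
        colors[count] = "#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255))
    return colors

# Both calls use only 3 colors because each list has only 3 distinct values.
print(count_colors([4, 5, 6, 5]))
print(count_colors([1, 2, 100]))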
I'll add a description of the patterns on the pages at some point.
I like the idea. I will note though that some of those colours seem hard to tell apart, e.g. the shades of green in wine d3d9:device.
Yes. When a test unit has 30 different failure counts it's hard to find enough easy-to-distinguish colors. It's probably possible to do better by tweaking the colors the gradient goes through.
https://source.winehq.org/git/tools.git/blob/HEAD:/winetest/build-patterns#l...
The cyan-green-yellow part of the gradient produces colors that are not very easy to distinguish. The colors in the yellow-red part seem easier to identify but that gradient is given the same weight as the other two. I've experimented a bit with a darker cyan but going too dark does not look very nice.
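One way to give the easier-to-read yellow-red range more room would be a piecewise gradient with per-segment weights. The anchor colors and weights below are made up; this is just a sketch of the idea, not what build-patterns currently does.

# Sketch: piecewise-linear gradient where each segment gets an explicit weight,
# so the yellow-red range can cover more of the [0, 1] scale.
SEGMENTS = [  # (start RGB, end RGB, weight) -- illustrative values only
    ((0x00, 0xc0, 0xc0), (0x00, 0xc0, 0x00), 1),  # cyan   -> green
    ((0x00, 0xc0, 0x00), (0xff, 0xff, 0x00), 1),  # green  -> yellow
    ((0xff, 0xff, 0x00), (0xff, 0x00, 0x00), 2),  # yellow -> red, double weight
]

def gradient_color(pos):
    """Map pos in [0, 1] to '#rrggbb' along the weighted gradient."""
    total = sum(weight for _, _, weight in SEGMENTS)
    for i, (start, end, weight) in enumerate(SEGMENTS):
        span = weight / total
        if pos <= span or i == len(SEGMENTS) - 1:
            t = min(pos / span, 1.0)
            return "#%02x%02x%02x" % tuple(
                round(s + (e - s) * t) for s, e in zip(start, end))
        pos -= span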
Also I guess they aren't consistent across tests for some reason?
The goal is to maximize the contrast in the colors used by each pattern. But if I used a single 'color map' for all test units, I would need to allocate a hundred different colors. Then many test units with just a few failures would end up only using very similar colors.
Allocating one color map per test unit limits this issue to just a few patterns. And the best fix would be to reduce the number of failures in these tests ;-)
On 5/2/21 7:07 PM, Francois Gouget wrote:
[...]
I'll admit I don't fully follow your logic.
I guess if it were me, I'd use a fixed colormap of a small fixed number (16? I'm guessing there) of colors that are easy to distinguish, and then universally assign colors by (n % 16). I'd also pick out those colors manually instead of trying to generate them. Yeah, you won't be able to distinguish between 1 failure and 17 failures, but hopefully that contrast won't come up very much. Plus, that way, you could even learn a mental association, I guess, for whatever that's worth.
Or you could assign 1-16 to individual colors and anything greater than 16 to another color. Of course many tests have very large numbers of failures (usually the same failure repeated).
That's kind of splitting hairs of course.
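For illustration, a minimal sketch of this fixed-colormap idea with both variants; the palette entries are placeholders, since the whole point would be to pick them by hand.

# Sketch: a small hand-picked palette shared by all test units, with failure
# counts mapped onto it modulo the palette size (placeholder colors).
PALETTE = [
    "#0000ff", "#0080ff", "#00ffff", "#00ff80", "#00ff00", "#80ff00",
    "#ffff00", "#ffc000", "#ff8000", "#ff4000", "#ff0000", "#c00000",
    "#800080", "#ff00ff", "#808080", "#404040",
]

def color_for_count(count, overflow=False):
    if count == 0:
        return "#ffffff"                              # no failures
    if overflow:
        # Variant: 1-16 get individual colors, anything greater shares one.
        return PALETTE[count - 1] if count <= len(PALETTE) else "#000000"
    return PALETTE[(count - 1) % len(PALETTE)]        # wraps: 1 and 17 look alike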
Now, another thing that occurs to me that would be very useful, and which doesn't necessarily preclude any of the above but does sort of obviate its usefulness, is to generate a list of failures by line, or even by line + failure message. I'd envision this as one per row, with "X" flags on each machine + day that displays it. Of course I'm sure you already have plenty of ideas on expanding the page; I'm just throwing out one of my own here.
On Sun, 2 May 2021, Zebediah Figura (she/her) wrote: [...]
I guess if it were me, I'd use a fixed colormap of a small fixed number (16? I'm guessing there) of colors that are easy to distinguish, and then universally assign colors by (n % 16).
16 colors is not enough, particularly not if using a single palette for all the tests.
For instance the record holder is user32:clipboard with 81 different failure counts: https://test.winehq.org/data/patterns.html#user32:clipboard
So with a 16-color palette there would be a lot of wrapping, and that would likely make the pattern unreadable.
Also note that even with the current scheme one can clearly see that the cw-rx460 machine has more failures than the other test configurations: partly because it's almost the only machine present in that pattern, and partly because it has more yellow/red, which are the colors of the higher failure counts.
In contrast the non-English w10pro64 VMs have fewer failures (blue) and all the same color (and hence the same count). This suggests they have a different cause. (For cw-rx460 it's the Radeon driver; I have not looked at w10pro64 yet.)
user32:input is another case where the current color scheme works pretty well despite the high number of different failure counts (31).
https://test.winehq.org/data/patterns.html#user32:input
(And it shows something pretty bad happened on cw-gtx560-1909 around Apr 2nd. Now I just have to figure out what)
For reference, here are the 'high scores':
81 user32:clipboard
41 user32:win
31 user32:input
27 ole32:clipboard
26 d3d11:d3d11
25 user32:msg
21 user32:sysparams
20 d3d10core:d3d10core
I'd also pick out those colors manually instead of trying to generate them.
I'm fine with someone picking the colors manually but I'm not an artist and agonising over each color is not going to be a time saver for me.
Yeah, you won't be able to distinguish between 1 failure and 17 failures, but hopefully that contrast won't come up very much.
Distinguishing between 1 and 17 failures is super important: it's the difference between catching a commit that introduces 16 new failures in the days after it's committed, and letting it slip through the cracks, only to be rediscovered months later when the author has vanished.
Plus, that way, you could even learn a mental association, I guess, for whatever that's worth.
Precisely: what is it worth? What do we gain from being able to identify at a glance that two test units have the same number of failures?
[...]
Now, another thing that occurs to me that would be very useful, and which doesn't necessarily preclude any of the above but does sort of obviate its usefulness, is to generate a list of failures by line, or even by line + failure message.
Line numbers are useless for tracking failures: they change almost every time a test is modified. Matching on the message may work better, though some messages have 'random' content (pointers, etc.). But fortunately those are relatively rare.
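A rough sketch of what matching on normalized messages could look like; the exact format of the failure lines and the substitution patterns are assumptions, not an exhaustive list of the 'random' content.

# Sketch: normalize failure lines so the same failure can be matched across
# reports despite changing line numbers or run-specific values.
import re

def normalize_failure(line):
    # Drop the file:line prefix, which changes whenever the test source is edited.
    msg = re.sub(r"^[\w.]+:\d+:\s*", "", line)
    # Replace content that varies from run to run with placeholders.
    msg = re.sub(r"0x[0-9a-fA-F]+", "<ptr>", msg)
    msg = re.sub(r"\b\d{5,}\b", "<num>", msg)
    return msg.strip()

# The same failure with a different line number and pointer value now matches.
a = normalize_failure("clipboard.c:791: Test failed: got handle 0x2a0b34")
b = normalize_failure("clipboard.c:801: Test failed: got handle 0x17f2c8")
assert a == b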
I'd envision this as one per row, with "X" flags on each machine + day that displays it.
Web pages are two-dimensional. So if the rows are failure messages, that only leaves the columns to show both the reports and the builds. That feels like one too many.
Or maybe instead of one box per test unit you meant to have one per failure message? That's likely going to be many boxes (there are already 327 test units that had failures in the past 2 months!!!).
I had a possibly related idea for tracking individual failures but I'm not entirely sure it would work in practice: https://bugs.winehq.org/show_bug.cgi?id=48166
Of course I'm sure you already have plenty of ideas on expanding the page; I'm just throwing out one of my own here.
Not that many actually.
* Adding some sort of documentation.
* Adding links to potentially related Git commits.
* Adding a global pattern based on the number of failed test units. (that one also highlights the cw-rx460 issues pretty well)
* Adding links to related bugs. But ideally that would use Bugzilla's REST API, which is only available in Bugzilla >= 5.0 (WineHQ still runs 4.4.13 and I don't know if it's worth upgrading Bugzilla just for this). A sketch of what such a query could look like follows this list.
That's about it.
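For reference, a rough sketch of such a bug lookup against a Bugzilla 5.x REST API (using the third-party requests module); the search criteria are only a guess at how related bugs would be identified, and it obviously cannot work against the current 4.4.13 instance.

# Sketch: query a Bugzilla >= 5.0 REST API for bugs whose summary mentions a
# test unit (hypothetical search criteria).
import requests

def related_bugs(bugzilla_url, test_unit):
    response = requests.get(
        f"{bugzilla_url}/rest/bug",
        params={
            "summary": test_unit,            # e.g. "comctl32:monthcal"
            "include_fields": "id,summary,status,resolution",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["bugs"]

# for bug in related_bugs("https://bugs.winehq.org", "user32:clipboard"):
#     print(bug["id"], bug["status"], bug["summary"])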
The new "failure patterns" pages are now complete:
* All results https://test.winehq.org/data/patterns.html
* TestBot's Windows VMs only https://test.winehq.org/data/patterns-tb-win.html
* TestBot's Wine VMs only https://test.winehq.org/data/patterns-tb-wine.html
The main additions are:
* The pages start with a bit of documentation.
* Before the tests there is now an 'overview' pattern showing the number of failed test units for each machine. This is useful to detect when a VM goes bad, for instance when Windows 10 decides to pop up a 'first use' configuration dialog: that line will turn red. This can also highlight machines that get worse results than their peers, and thus warrant a check. You will also notice a couple of machines that only have a handful of failures!
* For each test the pages link to related commits. For each commit they show whether it modified the test itself, a shared test resource, or the Wine module. This simplifies reviewing potential culprits when a test starts failing.
* For each test the pages link to the related bugs. This simplifies checking if someone has already analysed the bug or tried fixing it. If the bug has a regression commit id, that commit is also shown as such.
* The page also shows bugs for tests that have no failure. This simplifies verifying that the bugs have been closed when a test is fixed.
On 5/27/21 9:45 AM, Francois Gouget wrote:
[...]
Nice, thanks for setting this up!