In parallel with the MR test results, I analyzed the "new" test failures in each nightly WineTest run and updated the TestBot known failures accordingly.
A "new" failure in the WineTest results is equivalent to a "false positive" in the MR tests: many of them are not actually new but are either rare failures or failures whose message changes in every run because it contains random values (e.g. pointers, handles).
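To illustrate why random values make a failure look "new", here is a minimal Python sketch that masks hexadecimal values before comparing messages. It is not how the TestBot matches known failures; the pattern, the function names and the sample messages are all assumptions made up for this example.

```python
import re
from collections import Counter

# Assumption: run-specific values show up either as 0x-prefixed hex
# or as zero-padded 8 or 16 digit hex strings (typical pointer output).
# Note that an 8-digit decimal number would be masked too.
HEX_VALUE = re.compile(
    r"\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{8}(?:[0-9a-fA-F]{8})?\b"
)


def normalize(message):
    """Replace values that vary between runs with a fixed placeholder."""
    return HEX_VALUE.sub("<hex>", message)


def count_distinct(messages):
    """Count failures after masking run-specific values."""
    return Counter(normalize(m) for m in messages)


# Hypothetical failure lines from two different runs.
failures = [
    "window.c:123: Test failed: unexpected handle 0x1a2b3c",
    "window.c:123: Test failed: unexpected handle 0x4d5e6f",
    "file.c:45: Test failed: got pointer 00000000DEADBEEF",
]
counts = count_distinct(failures)
```

With these sample lines the two window.c messages collapse into a single entry, so a failure that differs only by its handle value is no longer counted as new.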
Analyzing the WineTest results is particularly important for the Windows VMs:
* The TestBot runs the full 32-bit test suite in Wine for every MR, which ensures good test coverage and plenty of opportunities to detect flaky tests.
* But on Windows the TestBot only runs the tests directly impacted by the MR, which provides fewer opportunities for detecting which tests are flaky on Windows.
So analyzing the nightly Windows WineTest results provides more opportunity for detecting which tests are unreliable on Windows before they cause trouble for MRs.
The results show improvements: they are now regularly below 10 new failures, whereas before 11-29 there were over 20 in each run (and almost 20 when deduplicated, see the attached spreadsheet). There's quite a bit of noise, so we'll know more in the coming weeks.
Also, the Windows 11 test configurations don't have too many new failures anymore, so it should be possible to test the MRs against them soon.
I attached the raw data in failures-winetest.txt and an updated spreadsheet (failures.xls).