Hi,
This last week-end I updated 5 of the test machines, impacting 29 test configurations.
Here's the rundown:
* w1064*
The old w1064 is now called w1064v2009, while w1064 and all the derived snapshots are now running Windows 10 21H2 (upgraded through Windows Update).
* w10pro64v2004
This snapshot used to have some specific failures most likely because Windows Update was still installing already downloaded updates when I initially took the snapshot (see bug 52560). So I let the VM quiet down and retook the snapshot. That seems to have improved things. I'll know for sure in a few days.
* w10pro64_*
w10pro64 had the same issue as w10pro64v2004, plus I needed to install some more languages, which requires un-metering the network, at which point Windows Update kicks in...
So I applied all the updates and retook this snapshot. It's now on the latest 21H1; Windows did not offer to install 21H2 for some reason. It's just as well: this way test.winehq.org can separate the results of this VM, with all its locale tests, from those of the w1064 VM, which is more about past Windows versions and other configuration options.
Then I let the TestBot recreate all the snapshots for the locale tests. Windows has a bunch of locale settings: formats, display language, system locale, keyboard layout, country. The old snapshots mostly had the display language right but sometimes the other locales were still stuck on English. The new snapshots are all consistent on that front (whenever possible; see below).
In the process I lost w10pro64_pt_PT, which the new SetWinLocale has trouble with (yet this works on my test VM so I blame the latest 21H1 updates). I also wanted to add a 'mixed locales' test configuration but the scripts failed on this one too. I'll investigate and get those online later on.
* w10pro64_hi_u8
Before Windows 10 it was impossible to set the system locale to some values. For instance it was impossible to set it to Hindi. So the w10pro64_hi still has English as its system locale, unlike the other locale test configurations.
Windows 10 lifted this restriction but this comes with two caveats:
- Setting the system locale to values like Hindi is possible but it requires setting the codepage to UTF-8.
- This is considered beta and requires checking an extra box in the GUI.
w10pro64_hi_u8 is one such configuration: Hindi through and through with UTF-8 as the codepage. And based on the test results I think they were not kidding with the 'beta' aspect: I count at least 11 tests with UTF-8-specific failures on Windows. Some of these may be because of issues in the test but I'm pretty sure some are Windows bugs we will have to work around.
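For what it's worth, a test can tell it is running on such a configuration by checking the ANSI code page: when the beta box is checked GetACP() returns 65001, i.e. CP_UTF8. A minimal sketch, not something the tests currently rely on:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        UINT acp = GetACP();
        UINT oemcp = GetOEMCP();

        /* With the 'Beta: Use Unicode UTF-8 for worldwide language support'
         * box checked, both code pages report 65001 (CP_UTF8). */
        if (acp == CP_UTF8)
            printf("UTF-8 system locale (ACP=%u, OEMCP=%u)\n", acp, oemcp);
        else
            printf("regular system locale (ACP=%u, OEMCP=%u)\n", acp, oemcp);
        return 0;
    }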
I would have liked to add another UTF-8 test configuration if only to tease out Hindi-specific issues from the UTF-8 ones. I planned to use en-EA for that because it's an all-English, UTF-8-only locale which could be nice in case there are any error messages that need reading... But that's another locale that requires a SetWinLocale tweak.
* cw-gtx560
This is one of the two not-really-TestBot machines with real graphics cards.
I added a Windows 10 21H2 snapshot. Note that again Windows did not want to update it to 21H2 so I used Windows10Upgrade9252.exe to force it. As far as I know the Nvidia driver is unchanged (391.35 iirc). That machine's other snapshots are unchanged.
* cw-rx460
I also added a Windows 10 21H2 snapshot (Windows10Upgrade9252.exe again; the other snapshots are unchanged).
Windows was suggesting I upgrade the GPU driver so I upgraded to the latest Adrenalin 22.4.1 driver. In the past I had trouble with some of the AMD drivers, which would either interfere with the clipboard (causing tests to fail), or crash (causing WineTest to fail entirely).
So consider this new driver to be on probation.
* w7u*
I moved this VM from one host to another on 2022-03-14, which is when I found issues with the new LibvirtTool and SetWinLocale scripts, so I ended up having to create most snapshots by hand :-( I restored it again from backup to test the corrected scripts and this time all went well.
The last move coincided with some new failures in mf:mf (2qxl only), user32:sysparams and user32:win (2qxl only). It's possible the 2qxl issues are because the VM is powered on for each test instead of starting from a live snapshot: I suspect Windows 7 did not correctly handle the multi-monitor setup when restored from a live snapshot. This will require some more investigation...
On 4/28/22 00:08, Francois Gouget wrote:
Hi,
This last week-end I updated 5 of the test machines, impacting 29 test configurations.
Hi Francois,
Thanks for all this work.
There is one thing I'd like to mention, which is that all these added VMs are slowing down the TestBot. For example, during my normal daytime hours, from 9AM to 6PM China Standard Time, the TestBot machines are now always busy running the batch tests. Previously they were mostly finished by the afternoon. Most of the time the wait is fine. However, there are also more and more times when I have to wait so long that I feel like my workflow is being disrupted.
Could you make them go faster? Maybe balancing the load a bit or adding more hardware?
Thanks, Zhiyi
On Thu, 28 Apr 2022, Zhiyi Zhang wrote: [...]
Could you make them go faster? Maybe balancing the load a bit or adding more hardware?
The issue is that WineTest takes time and new jobs have to wait for running tasks to complete to get their turn. But then they have priority over WineTest.
I collected some data about the WineTest tasks (see attached spreadsheet) and they take between 25 minutes on Windows and 35 minutes on Linux. The main issue here is VMs that have many test configurations which must therefore be run sequentially. The three VMs with the longest chains are:
  Time    Configs  VM
  6.7 h      11    debian11
  6.7 h      16    w1064
  6.8 h      15    w10pro64
What this means is that no amount of rebalancing can get the tests to run in less than about 7 hours.
And here are the results at the VM host level:
  Time    Configs  Host
  7.2 h      12    vm1
  1.4 h       3    vm2
  7.7 h      18    vm3
 12.1 h      25    vm4
The issue is vm2 is too slow and old to run most VMs nowadays. So moving some test configurations from vm4 to vm1 or vm3 will push those to 9 / 10 hours. So I'll restart the process of getting new hardware to replace vm2.
The other options:
* Fix the tests that get stuck: they waste 2 minutes each. But it looks like there are only two of those left, conhost.exe:tty and wscript.exe:run, so there's not much to gain.
* Speed up the slow tests, potentially by using multi-threading. What sucks is we have no way of tracking which tests are slow, which test configurations are slow, etc. It would be nice to have something like the patterns page but for runtime (and also for the tests output size).
* Getting hardware with faster single thread performance: over 90% of the tests are single-threaded. vm2 is meant to be the first step towards this.
* Splitting the VMs with many test configurations so the test load can be spread across multiple hosts. That is, instead of having a single VM with 15 test configurations that must run sequentially like w10pro64, have two VMs with 7 and 8 configurations each that can run in parallel. But that makes an extra VM to manage and requires having hosts to spread them to :-(
* Load balancing could help, assuming the TestBot is smart enough.
That is, if it starts by running the debiant and w7u tasks on vm4, then by the time the other hosts are idle all that's left to run is w10pro64's 15 test configurations that must be run sequentially anyway. So the scheduler must give priority to the VMs with the highest count of pending tasks.
Load balancing could help reduce the latency by ensuring the builds are done earlier. Here's a worst case scenario right now:
  t=0  vm2 starts a WineTest job
  t=1  A developer submits a job. First comes the build step.
  t=25 vm2 completes the WineTest job
  t=25 vm1, vm3 and vm4 each start a new WineTest job
  t=26 vm2 completes the developer's build task
  t=50 vm1, vm3 and vm4 complete their WineTest task
  t=51 vm1, vm3 and vm4 start the developer's Windows tasks
Having multiple build VMs would make it more likely that the blocking build step is completed before any other WineTest task. This is also why it's good that vm2 is not too busy.
* Reducing the number of test configurations :-(
On 5/4/22 09:11, Francois Gouget wrote:
- Speed up the slow tests, potentially by using multi-threading. What sucks is we have no way of tracking which tests are slow, which test configurations are slow, etc. It would be nice to have something like the patterns page but for runtime (and also for the tests output size).
This seems worth looking into. There's also a lot of tests that can't really be improved by multithreading *internally*, but also don't touch global state and hence could be run in parallel with anything else. We could construct a whitelist (or maybe there's even enough to construct a blacklist instead) of tests that winetest can run in parallel.
- Getting hardware with faster single thread performance: over 90% of the tests are single-threaded. vm2 is meant to be the first step towards this.
On Wed, 4 May 2022, Zebediah Figura (she/her) wrote: [...]
This seems worth looking into.
Given that almost all the CPU performance gains come from high core counts nowadays I agree that it would be nice. But...
There's also a lot of tests that can't really be improved by multithreading *internally*, but also don't touch global state and hence could be run in parallel with anything else. We could construct a whitelist (or maybe there's even enough to construct a blacklist instead) of tests that winetest can run in parallel.
I don't think there's a way to automatically detect which test units can be run in parallel or even to have a heuristic that reliably identifies a subset that are safe to parallelize. (and reciprocally for a whitelist)
That means we'd need a handcrafted whitelist or blacklist and I'm not sure how maintainable that would be:
- A blacklist has the drawback that we'd always be playing catchup to add new tests.
- But I'm not even sure a whitelist would work better: any patch to a whitelisted test may require evicting it from the whitelist.
On 5/5/22 10:37, Francois Gouget wrote:
On Wed, 4 May 2022, Zebediah Figura (she/her) wrote: [...]
This seems worth looking into.
Given that almost all the CPU performance gains come from high core counts nowadays I agree that it would be nice. But...
There's also a lot of tests that can't really be improved by multithreading *internally*, but also don't touch global state and hence could be run in parallel with anything else. We could construct a whitelist (or maybe there's even enough to construct a blacklist instead) of tests that winetest can run in parallel.
I don't think there's a way to automatically detect which test units can be run in parallel or even to have a heuristic that reliably identifies a subset that are safe to parallelize. (and reciprocally for a whitelist)
That means we'd need a handcrafted whitelist or blacklist and I'm not sure how maintainable that would be:
- A blacklist has the drawback that we'd always be playing catchup to add new tests.
- But I'm not even sure a whitelist would work better: any patch to a whitelisted test may require evicting it from the whitelist.
Obviously it'd have to be maintained manually, but my idea is that we can work towards a blacklist, and try to make it as small as possible, probably by gradually making tests parallelizable, and then by adding the requirement that new changes to tests avoid breaking that. Which I don't think is an unreasonable requirement to have or enforce.
Ultimately I don't think it'd be that bad, either.
Off the top of my head, tests I can think of that inherently can't be parallelized:
* MSI installation tests (msi:action and msi:install, although not msi:db and msi:format); Windows only allows one installer to be run at once.
* dinput and ntoskrnl tests, I think? Probably also setupapi? I'm not sure these couldn't be made independent of each other, but it's probably easier not to try.
* Tests which change display mode (some ddraw, d3d8, d3d9, dxgi, user32:sysparams). In many cases these are put into test units with other d3d tests which *are* parallelizable, but they could be split out. Although, that said:
* d3d tests in general are an odd case. We can't parallelize them if we might run out of GPU memory, although that hasn't been a concern yet and it won't be for llvmpipe. We also can't parallelize them on nouveau because of its threading problems. There are also a few tests that *shouldn't* break other tests but do because of driver bugs.
* Tests which care about the foreground window. In practice this includes some user32, d3d, dinput tests, probably others. Often it's only a couple of test functions out of the whole test. (I wonder if we could improve things by creating custom window stations or desktops in many cases? See the sketch after this list.)
* Tests which warp the cursor or depend on cursor position. This ends up being about the same set.
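To make the custom desktop idea a bit more concrete, here is a rough, untested sketch. The "winetest_private" desktop name and the run_on_private_desktop() helper are made up for illustration, and whether a separate desktop isolates enough foreground/cursor state in practice is exactly the open question above.

    #include <stdio.h>
    #include <windows.h>

    static void dummy_test(void)
    {
        /* placeholder for the foreground/cursor sensitive test code */
        printf("running on the private desktop, thread %lu\n", GetCurrentThreadId());
    }

    static void run_on_private_desktop(void (*test_func)(void))
    {
        HDESK old_desktop = GetThreadDesktop(GetCurrentThreadId());
        HDESK desktop = CreateDesktopA("winetest_private", NULL, NULL, 0,
                                       DESKTOP_CREATEWINDOW | DESKTOP_READOBJECTS |
                                       DESKTOP_WRITEOBJECTS | DESKTOP_SWITCHDESKTOP,
                                       NULL);

        if (!desktop || !SetThreadDesktop(desktop))
        {
            printf("could not switch desktop, error %lu\n", GetLastError());
            if (desktop) CloseDesktop(desktop);
            test_func(); /* fall back to the current desktop */
            return;
        }
        test_func();
        SetThreadDesktop(old_desktop);
        CloseDesktop(desktop);
    }

    int main(void)
    {
        run_on_private_desktop(dummy_test);
        return 0;
    }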
A quick skim doesn't bring up any other clear cases of tests that can't be parallelized. There are probably still a lot that need auditing and perhaps extra work to ensure that they can be parallelized, but I think that's work worth doing.
There are a decent number of tests that hardcode temporary file paths but could be made to use GetTempFileName() instead. Actually most such tests already use GetTempFileName(), I guess in order to be robust.
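For illustration, the pattern is simply to ask Windows for a unique file name instead of hardcoding one; the "wtt" prefix below is just an example:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        char path[MAX_PATH], file[MAX_PATH];

        GetTempPathA(sizeof(path), path);
        /* Passing 0 as uUnique makes Windows pick a unique name and create the
         * (empty) file, so two test processes cannot collide. */
        if (!GetTempFileNameA(path, "wtt", 0, file))
        {
            printf("GetTempFileNameA failed, error %lu\n", GetLastError());
            return 1;
        }
        printf("using temporary file %s\n", file);
        /* ... test code using the file ... */
        DeleteFileA(file);
        return 0;
    }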
There's also a lot of tests that do touch global state, e.g. write to the registry, but don't write to parts of the registry that any other tests should be reading. advapi32:registry is one such example. Such tests can't run in parallel with *themselves*, but could be run in parallel with anything else.
(Stuff like services.exe or advapi32:service might fall into the same boat. These do touch global state, but in theory other tests shouldn't care that we're e.g. creating an advapi32 test service. Definitely easier just to blacklist those, though, at least to start with...)
One other thing that occurs to me while writing this is that instead of using a blacklist in winetest.exe, we could use global (win32) mutex objects in the relevant tests. That would also allow us to separate tests which can't be paralellized only with a set group of other tests (and run e.g. msi:action and d3d8:device at the same time), as well as have finer grained control than blacklisting a whole test file (e.g. we could grab WINETEST_DISPLAY_MODE_MUTEX around only the d3d8:device tests that mess with the display mode).
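To make that more concrete, here is a minimal sketch of what grabbing such a mutex could look like; the mutex name is just the example above and nothing like this exists in winetest today:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        /* Any test process that opens the same named mutex gets serialized
         * with the others; unrelated tests keep running in parallel. */
        HANDLE mutex = CreateMutexA(NULL, FALSE, "WINETEST_DISPLAY_MODE_MUTEX");

        if (!mutex)
        {
            printf("CreateMutexA failed, error %lu\n", GetLastError());
            return 1;
        }
        WaitForSingleObject(mutex, INFINITE);

        /* ... run the tests that mess with the display mode ... */
        printf("display mode tests run one process at a time\n");

        ReleaseMutex(mutex);
        CloseHandle(mutex);
        return 0;
    }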
On Thu, 5 May 2022, Zebediah Figura (she/her) wrote: [...]
Off the top of my head, tests I can think of that inherently can't be parallelized:
[...]
- Tests which change display mode (some ddraw, d3d8, d3d9, dxgi,
user32:sysparams). In many cases these are put into test units with other d3d tests which *are* parallelizable, but they could be split out.
I would add user32:monitor.
[...]
- d3d tests in general are an odd case. We can't parallelize them if we might
run out of GPU memory, although that hasn't been a concern yet and it won't be for llvmpipe.
Do we really use that much GPU memory?
We also can't parallelize them on nouveau because of its threading problems. There are also a few tests that *shouldn't* break other tests but do because of driver bugs.
The resolution change tests always leave my monitor in a weird resolution like 320x200 when it's not 200x320 (portrait mode). It's always fun in the morning to find a terminal to issue an xrandr -s 0. But I suspect the first WineTest (win32) run may break the next WineTest run (wow64) whenever a test tries to open a window that does not fit in that weird desktop resolution. I suspect comctl32:combo, header, rebar, status and toolbar are among the impacted tests. (so I'm now trying to inject an xrandr in between runs)
All that to say that if the resolution change tests run in parallel or at a somewhat random time relative to the other tests, that may bring more variability and unexpected failures to the results.
- Tests which care about the foreground window. In practice this includes some
user32, d3d, dinput tests, probably others. Often it's only a couple of tests functions out of the whole test. (I wonder if we could improve things by creating custom window stations or desktops in many cases?)
- Tests which warp the cursor or depend on cursor position. This ends up being
about the same set.
I may be wrong but I suspect this should include most of comctl32, comdlg32, user32:edit, and probably others.
A quick skim doesn't bring up any other clear cases of tests that can't be parallelized. There are probably still a lot that need auditing and perhaps extra work to ensure that they can be parallelized, but I think that's work worth doing.
There's also all the timing issues in sound, locking (timeout aspects) and timer tests.
There are a decent number of tests that hardcode temporary file paths but could be made to use GetTempFileName() instead. Actually most such tests already use GetTempFileName(), I guess in order to be robust.
Eh, funny you should say that. I just found out that kernelbase:process and lz32:lzexpand_main forgot to do that (bug 52970). But yes, easily fixable.
But overall I'm more skeptical about the feasibility of parallelization. For instance the w10pro64v2004 and w10pro64 test configurations had a background Windows Update causing failures in msi:msi and msi:package. So far quite understandable. But that also caused reproducible failures in ieframe:webbrowser, kernel32:resource, shell32:shlfileop, urlmon:url, wininet:http and wininet:urlcache (bug 52560). That's kind of wide-ranging and unexpected.
Maybe it could work by only letting one test run in each of some very very broad categories (maybe that's similar to your CS idea):
* screen : anything that opens a window or modifies the screen (d3d*, user32*, gdi32*, etc.)
* sound : anything that plays or captures sound (dsound, winmm, mmdevapi, etc.)
* timing : anything sensitive to timing (dsound, winmm, mmdevapi, kernel32:sync, etc.)
* install : msi*, ntoskrnl*, more?
* others : anything not in any of the above categories
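To illustrate, here is a hypothetical sketch of such a category mapping; the table is only an example (and a test like dsound that shows up in both the sound and timing categories would need more than a first-match lookup):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical mapping from a test unit name to a broad category; the
     * prefixes below are only an example, not a vetted classification. */
    static const struct
    {
        const char *prefix;
        const char *category;
    }
    categories[] =
    {
        { "d3d",           "screen"  },
        { "ddraw:",        "screen"  },
        { "user32:",       "screen"  },
        { "gdi32:",        "screen"  },
        { "dsound:",       "sound"   },
        { "winmm:",        "sound"   },
        { "mmdevapi:",     "sound"   },
        { "kernel32:sync", "timing"  },
        { "msi:",          "install" },
        { "ntoskrnl",      "install" },
    };

    static const char *test_category(const char *test)
    {
        size_t i;
        for (i = 0; i < sizeof(categories) / sizeof(categories[0]); i++)
            if (!strncmp(test, categories[i].prefix, strlen(categories[i].prefix)))
                return categories[i].category;
        return "others";
    }

    int main(void)
    {
        printf("msi:msi    -> %s\n", test_category("msi:msi"));
        printf("urlmon:url -> %s\n", test_category("urlmon:url"));
        return 0;
    }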
But even such a scheme would probably allow msi:msi to run in parallel with urlmon:url and bug 52560 seems to indicate that would not be a good idea.
Also I'm not sure we'd have much parallelism left with such a scheme (i.e. too much complexity for too little gain?).
But also maybe the only way to know is to try.
On 5/7/22 11:44, Francois Gouget wrote:
On Thu, 5 May 2022, Zebediah Figura (she/her) wrote: [...]
Off the top of my head, tests I can think of that inherently can't be parallelized:
[...]
- Tests which change display mode (some ddraw, d3d8, d3d9, dxgi,
user32:sysparams). In many cases these are put into test units with other d3d tests which *are* parallelizable, but they could be split out.
I would add user32:monitor.
Yeah, sorry, that was an approximate list. I should have said that this list is incomplete...
[...]
- d3d tests in general are an odd case. We can't parallelize them if we might
run out of GPU memory, although that hasn't been a concern yet and it won't be for llvmpipe.
Do we really use that much GPU memory?
I think no. It occurred to me because we *have* run out of virtual address space, but that's much less available.
In concrete terms, with the way things currently are, tests *shouldn't* use more than 128 MiB (per thread), so I don't think it's worth worrying about.
We also can't parallelize them on nouveau because of its threading problems. There are also a few tests that *shouldn't* break other tests but do because of driver bugs.
The resolution change tests always leave my monitor in a weird resolution like 320x200 when it's not 200x320 (portrait mode). It's always fun in the morning to find a terminal to issue an xrandr -s 0. But I suspect the first WineTest (win32) run may break the next WineTest run (wow64) whenever a test tries to open a window that does not fit in that weird desktop resolution. I suspect comctl32:combo, header, rebar, status and toolbar are among the impacted tests. (so I'm now trying to inject an xrandr in between runs)
All that to say that if the resolution change tests run in parallel or at a somewhat random time relative to the other tests, that may bring more variability and unexpected failures to the results.
Yeah, to be clear, I think it's a good idea to run all of the resolution changing tests separately, and not put effort into parallelizing to the absolute limit.
There's a lot of cases here where I'm thinking about an "ideal" final state (e.g. there's no reason why shlwapi:url can't run at the same time as user32:monitor), but that's just brainstorming...
- Tests which care about the foreground window. In practice this includes some
user32, d3d, dinput tests, probably others. Often it's only a couple of tests functions out of the whole test. (I wonder if we could improve things by creating custom window stations or desktops in many cases?)
- Tests which warp the cursor or depend on cursor position. This ends up being
about the same set.
I may be wrong but I suspect this should include most of comctl32, comdlg32, user32:edit, and probably others.
I'm not too familiar with the controls tests, but that sounds plausible.
A quick skim doesn't bring up any other clear cases of tests that can't be parallelized. There are probably still a lot that need auditing and perhaps extra work to ensure that they can be parallelized, but I think that's work worth doing.
There's also all the timing issues in sound, locking (timeout aspects) and timer tests.
Hmm, I guess that means we should let anything that calls timeBeginPeriod() run by itself. In practice that only seems to be winmm:timer and mmdevapi:spatialaudio? Maybe you're referring to something else I'm missing?
But it's not obvious to me that e.g. quartz:dsoundrender can't run in parallel with dsound:*. As far as I understand there's no "exclusive access" problems, and they shouldn't mess with each other's timers? I don't claim to be that familiar with low-level audio though, so maybe there's something I'm missing.
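For reference, the timeBeginPeriod() pattern mentioned above is just bracketing the timing-sensitive code, and the call classically raises the timer resolution system-wide, which is why such a test might need to run alone. A minimal sketch (link against winmm):

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        /* Request a 1 ms timer resolution; classically this affects the whole
         * system, so it can change the timing of anything running next to it. */
        if (timeBeginPeriod(1) != TIMERR_NOERROR)
        {
            printf("could not set a 1 ms timer resolution\n");
            return 1;
        }

        Sleep(10); /* ... timing sensitive test code ... */

        timeEndPeriod(1);
        return 0;
    }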
There are a decent number of tests that hardcode temporary file paths but could be made to use GetTempFileName() instead. Actually most such tests already use GetTempFileName(), I guess in order to be robust.
Eh, funny you should say that. I just found out that kernelbase:process and lz32:lzexpand_main forgot to do that (bug 52970). But yes, easily fixable.
But overall I'm more skeptical about the feasibility of parallelization. For instance the w10pro64v2004 and w10pro64 test configurations had a background Windows Update causing failures in msi:msi and msi:package. So far quite understandable. But that also caused reproducible failures in ieframe:webbrowser, kernel32:resource, shell32:shlfileop, urlmon:url, wininet:http and wininet:urlcache (bug 52560). That's kind of wide-ranging and unexpected.
Maybe it could work by only letting one test run in each of some very very broad categories (maybe that's similar to your CS idea):
- screen : anything that opens a window or modifies the screen (d3d*, user32*, gdi32*, etc.)
- sound : anything that plays or captures sound (dsound, winmm, mmdevapi, etc.)
- timing : anything sensitive to timing (dsound, winmm, mmdevapi, kernel32:sync, etc.)
- install : msi*, ntoskrnl*, more?
- others : anything not in any of the above categories
But even such a scheme would probably allow msi:msi to run in parallel with urlmon:url and bug 52560 seems to indicate that would not be a good idea.
Hmm. Of the eight tests mentioned there, two are related to installers, and four are related to... hitting the internet data cap? Which doesn't sound related to msi per se. (Not sure what's up with kernel32:resource and shell32:shlfileop.)
Ultimately, though, we *are* going to run into spurious and non-obvious failures when we run apparently unrelated tests in parallel. That's inevitable. I'm optimistic that it won't be that many, although I have nothing to base that optimism on. But if it ends up being bad I think we can give up on it.
Also I'm not sure we'd have much parallelism left with such a scheme (i.e. too much complexity for too little gain?).
By number of test units, I think there's a lot that can be parallelized. That might not translate to time, though. Unfortunately the longest-running tests fall mostly in the exceptions listed above (msi, d3d, user32...), so maybe it won't help much.
But also maybe the only way to know is to try.
Indeed :-)
I won't promise that I'll have time to put together a proof of concept, since I can't find the time to do much of anything, but I'll at least try to make time...
On Sat, 7 May 2022, Zebediah Figura (she/her) wrote: [...]
There's also all the timing issues in sound, locking (timeout aspects) and timer tests.
Hmm, I guess that means we should let anything that calls timeBeginPeriod() run by itself. In practice that only seems to be winmm:timer and mmdevapi:spatialaudio? Maybe you're referring to something else I'm missing?
mmdevapi:capture and mmdevapi:render have timing issues too. But there are also a number of places where we wait for something for a short time and may get a timeout due to the system being busy.
[...]
Hmm. Of the eight tests mentioned there, two are related to installers, and four are related to... hitting the internet data cap?
The data cap is typically 10 GB and I set it when I'm done configuring the VM. So the tests shouldn't hit the cap.
[...]
By number of test units, I think there's a lot that can be parallelized. That might not translate to time, though. Unfortunately the longest-running tests fall mostly in the exceptions listed above (msi, d3d, user32...), so maybe it won't help much.
Based on somewhat old data, we have about 700 test units:
  160 take more than 1 second,
  110 take more than 2 seconds,
   70 take more than 5 seconds,
   45 take more than 10 seconds,
   30 take more than 20 seconds,
   10 take more than 50 seconds.
The distribution probably has not changed much since then.
Hi François!
On 5/4/22 16:11, Francois Gouget wrote:
On Thu, 28 Apr 2022, Zhiyi Zhang wrote: [...]
Could you make them go faster? Maybe balancing the load a bit or adding more hardware?
The issue is that WineTest takes time and new jobs have to wait for running tasks to complete to get their turn. But then they have priority over WineTest.
I collected some data about the WineTest tasks (see attached spreadsheet) and they take between 25 minutes on Windows and 35 minutes on Linux. The main issue here is VMs that have many test configurations which must therefore be run sequentially. The three VMs with the longest chains are:
  Time    Configs  VM
  6.7 h      11    debian11
  6.7 h      16    w1064
  6.8 h      15    w10pro64
What this means is that no amount of rebalancing can get the tests to run in less than about 7 hours.
And here are the results at the VM host level:
  Time    Configs  Host
  7.2 h      12    vm1
  1.4 h       3    vm2
  7.7 h      18    vm3
 12.1 h      25    vm4
The issue is vm2 is too slow and old to run most VMs nowadays. So moving some test configurations from vm4 to vm1 or vm3 will push those to 9 / 10 hours. So I'll restart the process of getting new hardware to replace vm2.
The other options:
Fix the tests that get stuck: they waste 2 minutes each. But it looks like there's only two of those left, conhost.exe:tty and wscript.exe:run, so there's not much to gain.
Speed up the slow tests, potentially by using multi-threading. What sucks is we have no way of tracking which tests are slow, which test configurations are slow, etc. It would be nice to have something like the patterns page but for runtime (and also for the tests output size).
Getting hardware with faster single thread performance: over 90% of the tests are single-threaded. vm2 is meant to be the first step towards this.
Splitting the VMs with many test configurations so the test load can be spread across multiple hosts. That is, instead of having a single VM with 15 test configurations that must run sequentially like w10pro64, have two VMs with 7 and 8 configurations each that can run in parallel. But that makes an extra VM to manage and requires having hosts to spread them to :-(
Load balancing could help, assuming the TestBot is smart enough.
That is, if it starts by running the debiant and w7u tasks on vm4, then by the time the other hosts are idle all that's left to run is w10pro64's 15 test configurations that must be run sequentially anyway. So the scheduler must give priority to the VMs with the highest count of pending tasks.
Load balancing could help reduce the latency by ensuring the builds are done earlier. Here's a worst case scenario right now:
  t=0  vm2 starts a WineTest job
  t=1  A developer submits a job. First comes the build step.
  t=25 vm2 completes the WineTest job
  t=25 vm1, vm3 and vm4 each start a new WineTest job
  t=26 vm2 completes the developer's build task
  t=50 vm1, vm3 and vm4 complete their WineTest task
  t=51 vm1, vm3 and vm4 start the developer's Windows tasks
Having multiple build VMs would make it more likely that the blocking build step is completed before any other WineTest task. This is also why it's good that vm2 is not too busy.
Reducing the number of test configurations :-(
Thanks indeed for the TestBot, it's been really useful, so much so that I'm sometimes a bit worried about over-using it.
There's one thing that I've been wondering about for a bit, and that I believe has an impact on the load:
When a patch is submitted that is detected as potentially touching more than a single test, all the tests for the module are queued for testing. However this isn't done through WineTest, and instead they are all queued and tested separately, at least on the Windows VMs.
Wouldn't it be better to always run the tests through WineTest, and make it run all the tests that need checking at once?
For instance https://testbot.winehq.org/JobDetails.pl?Key=113995, which had to run all the dinput tests, apparently took ~1h to complete (if I trust the elapsed time compared to the job from the patch before it, as I believe it is somehow cumulative). The total time for the individual tests is more like ~25 min, but I think the VM cleanup time adds up a lot.
I also think that it would make it easier to reproduce some of the failures that the nightly runs suffer from, and which are related to tests badly interacting with each other.
Another question, unrelated to the performance problems: could we consider adding more Desktop/WM environments to the Debian VMs? I think it could be interesting to have, to track down winex11 bugs, though it's likely to have several broken tests.
I also intend at some point, when win32u conversion will be more settled, to finish sending my nulldrv patches, and I think it'd be nice to have a testbot flavor that could be configured to use it instead of the default graphics driver.
I believe it mostly works already, except for the user32:monitor and user32:sysparams tests, but it needs some registry changes to use it. It could be interesting to have, to make the user32/win32u tests more reliable.
Cheers,
On Wed, 4 May 2022, Rémi Bernon wrote: [...]
When a patch is submitted that is detected as potentially touching more than a single test, all the tests for the module are queued for testing. However this isn't done through WineTest, and instead they are all queued and tested separately, at least on the Windows VMs.
Wouldn't it be better to always run the tests through WineTest, and make it run all the tests that need checking at once?
That would certainly be more efficient time-wise.
Network-wise there's the issue that WineTest.exe is big because it always contains all the tests so it would cause more network traffic. But the traffic issue is probably minor and it is probably possible to tweak the builds to reduce the size.
But as you mentioned, the main issue is that the tests could interfere with each other, which so far has been regarded as "polluting" the results. But we could see things differently.
Another question, unrelated to the performance problems: could we consider adding more Desktop/WM environments to the Debian VMs? I think it could be interesting to have, to track down winex11 bugs, though it's likely to have several broken tests.
So far the main goal has been to avoid failures so the desktop environment has been optimized with that in mind (so fvwm with a carefully crafted configuration).
But again things have changed since then. Most importantly the TestBot can now distinguish old failures from new ones and I'm still working towards having a way to prevent the "always new" failures from causing false positives.
With those two in place running the tests in configurations known to cause failures is less of an issue.
One way to support multiple desktop environments in the current framework would be to have one Linux VM per desktop environment. However that means compiling once per test environment, which has an impact on performance. With a fast new server (or servers) that could work though.
The alternative would be to install multiple desktop environments in the same Debian VM (easy) and have the client-side TestBot script switch from one desktop environment to another based on the configuration to test. I'm not sure how that would work though.
I also intend at some point, when win32u conversion will be more settled, to finish sending my nulldrv patches, and I think it'd be nice to have a testbot flavor that could be configured to use it instead of the default graphics driver.
It sounds like that's just a matter of configuring the test environment to use nulldrv instead of the regular graphics driver (including possibly unsetting $DISPLAY). So that would be a bit like setting the locale and could probably be done through the missions mechanism without requiring a separate test environment.
On 5/5/22 17:27, Francois Gouget wrote:
On Wed, 4 May 2022, Rémi Bernon wrote: [...]
When a patch is submitted that is detected as potentially touching more than a single test, all the tests for the module are queued for testing. However this isn't done through WineTest, and instead they are all queued and tested separately, at least on the Windows VMs.
Wouldn't it be better to always run the tests through WineTest, and make it run all the tests that need checking at once?
That would certainly be more efficient time-wise.
Network-wise there's the issue that WineTest.exe is big because it always contains all the tests so it would cause more network traffic. But the traffic issue is probably minor and it is probably possible to tweak the builds to reduce the size.
Yes, winetest.exe is large (77 MB here), but I think it can compress well. A zstd version is ~12 MB, xz is ~9 MB. Still 5-10x larger than the individual test executables, but if you count the overhead of copying these tests for every subtest to run, it may not be so much of a difference anymore.
But as you mentioned, the main issue is that the tests could interfere with each other, which so far has been regarded as "polluting" the results. But we could see things differently.
I agree that it may be problematic, but it also means that we would perhaps have fewer of these failures in the nightly builds if they get caught early.
It'd also be easier to debug, as sending a patch touching two modules' tests would be enough to run the two tests at once and debug combined issues, whereas right now I think you have to upload winetest yourself and run the right command line.
It's also only going to cause problems for some combinations of tests, and most of the time only one test is run at once, or all the tests for a single module, which should have fewer weird interactions.
Another question, unrelated to the performance problems: could we consider adding more Desktop/WM environments to the Debian VMs? I think it could be interesting to have, to track down winex11 bugs, though it's likely to have several broken tests.
So far the main goal has been to avoid failures so the desktop environment has been optimized with that in mind (so fvwm with a carefully crafted configuration).
But again things have changed since then. Most importantly the TestBot can now distinguish old failures from new ones and I'm still working towards having a way to prevent the "always new" failures from causing false positives.
With those two in place running the tests in configurations known to cause failures is less of an issue.
One way to support multiple desktop environments in the current framework would be to have one Linux VM per desktop environment. However that means compiling once per test environment, which has an impact on performance. With a fast new server (or servers) that could work though.
The alternative would be to install multiple desktop environments in the same Debian VM (easy) and have the client-side TestBot script switch from one desktop environment to another based on the configuration to test. I'm not sure how that would work though.
I think it's safer to use multiple VMs. And it would let us test desktop environments in their vanilla flavor, which is imho what we should try to make work best. I think mixing or switching desktop environments often ends up with undesired side effects.
I also intend at some point, when win32u conversion will be more settled, to finish sending my nulldrv patches, and I think it'd be nice to have a testbot flavor that could be configured to use it instead of the default graphics driver.
It sounds like that's just a matter of configuring the test environment to use nulldrv instead of the regular graphics driver (including possibly unsetting $DISPLAY). So that would be a bit like setting the locale and could probably be done through the missions mechanism without requiring a separate test environment.
Yeah I don't know how the prefix preparation is done. Right now there's no environment variable to control the driver, and unsetting DISPLAY was considered not great, as it could hide a genuine user mistake without complaining.
* Test regressions
When I identify a commit that causes new test failures I create a bug report with the regression keyword and corresponding commit id. But it can be hard to notice for developers who don't subscribe to the wine-bugs mailing list.
So I will now also add the patch's author to the bug's CC field.
* w10pro64 vs windows.media.speech:speech
The 2022-04-25 w10pro64 upgrade fixed some speech test issues but it turns out that's only because the tests are now being skipped after the initialization fails with an internal error.
To help debug the issue I restored a pre-upgrade version of w10pro64 from backup and made it available as w10pro64restored and also made the pre-upgrade w10pro64 test configuration available as w10pro64old.
See bug 52981 for more details: https://bugs.winehq.org/show_bug.cgi?id=52981
* New test signing configurations
I added w7u_tsign and w864_tsign. They are identical to w7u and w864 respectively, but have test signing turned on. With w1064_tsign that covers all major Windows versions. If needed I could add test signing configurations for older Windows versions too.
For now the wine-devel patches are not tested against these new test signing configurations in order not to add more load on the TestBot.
* debiant
I added the fonts-arphic-uming package on the Debian Testing VM. This avoids a couple of failures (mostly because the tests use the UMing font instead of 'VL Gothic'), and also avoids a skip in case one runs the tests in the Chinese locale (when you submit a job on a Wine VM you can specify the locale you want but by default patches are only tested in the Chinese locale on debian11).
While I was at it I performed an update and it looks like the worst of the 32- vs 64-bit package conflicts have been solved.
There has only been one test run since the update but so far it looks like the update caused no new failure.
I will do the same on the debian11 VM when I find a good time to take it down for maintenance.
* Rebalancing
I also moved the debiant VM to the vm3 host. According to my spreadsheet this should allow the vm4 WineTest runs to complete 2 hours earlier (12 -> 10 h) but of course vm3 will take 2 extra hours (7 -> 9 hours).
On 11/05/2022 16:22, Francois Gouget wrote:
Test regressions When I identify a commit that causes new test failures I create a bug report with the regression keyword and corresponding commit id. But it can be hard to notice for developers who don't subscribe to the wine-bugs mailing list.
So I will now also add the patch's author to the bug's CC field.
I was under the impression this was standard practice, for any regressions actually. At the very least the commit author needs to check whether it is in fact a problem with their patch or not...
Anyway I certainly hope I didn't miss some of those regressions (which are still unfixed?) for not being CC'd. :-/
On Wed, 11 May 2022, Gabriel Ivăncescu wrote: [...]
Anyway I certainly hope I didn't miss some of those regressions (which are still unfixed?) for not being CC'd. :-/
I went through the past bugs and added CCs where appropriate. There was only one for you.
One more point is that there were still some issues in the w10pro64 locale test configurations.
Specifically the UserDefaultLCID, ThreadLocale and Country settings were still wrong for most locales (i.e. still set to en-US). I thought I checked those when I recreated these locales on 2022-04-25 but I found no "Locale check" job around that date.
So I assumed the incorrect settings were caused by the SetWinLocale issue I fixed on 2022-05-02 and I forced the TestBot to recreate all the locale live snapshots using the latest SetWinLocale.
Luckily I think that now it's all good: https://testbot.winehq.org/JobDetails.pl?Key=114459
On Wed, 11 May 2022, Francois Gouget wrote: [...]
So I assumed the incorrect settings were caused by the SetWinLocale issue I fixed on 2022-05-02 and I forced the TestBot to recreate all the locale live snapshots using the latest SetWinLocale.
Luckily I think that now it's all good: https://testbot.winehq.org/JobDetails.pl?Key=114459
And it's all wrong again: https://testbot.winehq.org/JobDetails.pl?Key=114664
The exact same live snapshots that were fine before.
It looks like there is something that reverts the locales when the VM is restarted. Maybe it's a task that runs when we update the VM's time.
This is such a pain :-(
Does anyone know what's going on?