Hello folks, my name is Zebediah Figura, and I'll be taking over for Jeremy White today.
I'd like to start out by thanking everyone who has been working on fixing tests, with special notice to Alex Henrie, who has recently fixed more Windows tests than probably everyone else combined. This mail, however, is not about Windows at all, but about Wine.
Item one: it seems to me (perhaps naïvely) that we are closer than ever to having an entirely green machine (well, orange, really) on the Linux end, so much so that I'm going to run through the failures that are left. Please do take the time to at least skim, since in some cases this is an RFC: even where the cause of a failure is clear, what to do to fix it isn't. I also have a few other comments, so I'd appreciate your reading through to the end of this mail.
So, without further ado:
---
* kernel32:change is due to the asynchronous nature of inotify. How should this be fixed? inotify gives us one creation event and two or more notifications on the parent, and it's not clear how to account for this in the test (see the inotify sketch after this list). It certainly doesn't seem possible to make the server behave like Windows.
* ole32:clipboard is explained and fixed by 140671 [1].
* quartz:filtergraph covers a number of problems related to winegstreamer teardown. I have a set of patches in my local tree which should clear up most if not all of these, and I'd like to start submitting them soon.
* shell32:shelllink needs some discussion, I think. The problem is that winemenubuilder is holding the file when we check for leaks. On Windows I would imagine all menu-building would be done synchronously, but on Wine this is implausible since we need to create our own copy of the link icon possibly after the link is created. At one point I suggested that menu-building should occur on prefix shutdown, which, while not really possible to do on top of the Windows API, might nevertheless be the best solution.
* urlmon:protocol is explained and fixed by 121083 [2]. This solution is less clean than is ideal, but my further attempts to synchronize more cleanly were met with other failures. I'm willing to take another look at this, although I'd appreciate help.
* user32:msg is now down to one consistent failure. This is explained and fixed by 140924/140925 [3]/[4].
* user32:win, perhaps one of the most notorious offenders, has three consistent failures left, and possibly some more intermittent ones.
  - Lines 3364-5 (test_SetForegroundWindow) and 9940-4 (test_LockWindowUpdate) were determined by jwhite to be the fault of the hwndMain test window being minimized during the test. I investigated further and traced this down, surprisingly enough, to a bug in fvwm, which I've reported at [5]. Notably, the bug takes a lot of steps to trigger, and we can (and, I think, should) work around it by modifying the test machine configuration: either by changing the focus mode from MouseFocus, or by removing the RaiseTransient or StackTransientParent style.
  - Line 6838 (test_EnableWindow) is explained and fixed by 141063 [6].
* wininet:ftp, which affects both Linux and Windows, is a recent failure, as can be seen at [7]. But nothing was changed in the tests at that point. It seems more likely that something changed with respect to the ftp.winehq.org server itself, some time on or before 3 January. Could someone with knowledge of wininet or of the FTP server be implored to look into this?
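Regarding kernel32:change above: for readers unfamiliar with the API, here is a minimal sketch of how inotify reports events (the watched path and event mask are illustrative, not what the test actually uses):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
    char buf[4096];
    ssize_t len, i;
    int fd = inotify_init();

    inotify_add_watch(fd, "/tmp/watched", IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE);

    /* Creating a single file in the watched directory typically yields
     * several events here (e.g. IN_CREATE followed by IN_CLOSE_WRITE),
     * delivered asynchronously relative to the operation that caused them. */
    len = read(fd, buf, sizeof(buf));
    for (i = 0; i < len; )
    {
        struct inotify_event *ev = (struct inotify_event *)&buf[i];
        printf("mask %#x, name %s\n", ev->mask, ev->len ? ev->name : "(none)");
        i += sizeof(*ev) + ev->len;
    }
    return 0;
}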
Then there are the following failures. I haven't investigated any of these at all, partly because I'm as yet completely unfamiliar with the respective APIs:
* A number of Direct3D failures, all (?) of which seem to depend on the driver used. I'm not going to enumerate them all here, but I will note that cw1 only needs d3d10core:device, d3d11:d3d11, and d3d9:visual, and cw2 only needs d3d9:device. All of these are consistent.
* mmdevapi:capture, intermittent and rare.
* mmdevapi:render, intermittent but less rare.
* mshtml:htmldoc, intermittent but almost consistent.
* urlmon:misc, intermittent.
* xaudio2_7:xaudio2, intermittent.
While I'm certainly willing to fix every last failure myself if need be, just think how much easier it would be if people who were familiar with the code took a look at it!
Then there's one very weird set of failures: every D3D-related test from ddraw:ddraw2 onward (i.e. ddraw, dxdiagn, dxgi), on cw1-hd6800-wow32, and only that configuration, fails to print its summary line. This has been the case since at least early November (unfortunately I lost the exact date when this started happening, but it wasn't obviously the result of any commit).
And, finally, the 64-bit failures:

* d3dx9_36:math is slightly off on the spherical harmonics values due to the differences in FP math on 64-bit. Here's inviting someone who knows the math to check all of our numbers and make sure they're correct, or otherwise do as Alex Henrie did for spherical light [8].

All others are due to lack of typelib marshaling:

* ieframe:ie
* msxml3:domdoc
* oleaut32:tmarshal
* shell32:shelldispatch
* user32:dialog (indirectly, since msxml3:domdoc crashes and leaves the winedbg window open)
Perhaps in these cases the best thing to do is to mark those todo_wine_if (sizeof (void *) == 8), since properly implementing the 64-bit marshaller takes more work than anyone is willing to do.
---
The ultimate goal, of course, is that we finally get a way to run testbot patches on [ideally multiple] Linux machines. Nobody needs me to tell them how important this is, but of course it's also important that we start from a clean slate, since these tests are going to be run for *every* patch, not just those that modify the tests. So while Linux is in a better state than some Windows versions at this point, it seems to me that we really do want a wholly green machine.
Along these lines it's worth noting, by the way, that except for the D3D failures, there seems to be no difference between cw1 and cw2. It seems to me it would be a good thing if we started testing multiple window managers as well (and in particular we may want to test those which see common use such as KWin and Mutter).
Also along these lines: presumably setting up the testbot to run Linux tests is going to require a fair amount of work, and that work is implicitly going to be François' responsibility. I'd just like to say that I'm willing to help as much as I can, though.
Okay, so, all that said, we have another problem. I've tried running winetest on my own machine recently, in the interest of rooting out some of the rarer failures, and I've found that the reports it generates are consistently about 1.7 MB, which is of course over the 1.5 MB limit that test.winehq.org accepts. I did some sed'ing and came up with the following worst offenders:
test                 bytes   % of report
d3d11:d3d11         286861         16.27
ws2_32:sock         108672          6.17
kernel32:virtual     75963          4.29
gdi32:font           64700          3.61
user32:win           62114          3.52
shell32:shlexec      41235          2.34
user32:msg           37126          2.11
This isn't, I assert, something wrong with my setup. The D3D team's obviously admirable predisposition toward verbosity, and the need to print a large number of floating-point values, combined with a total of 2167 tests marked todo, have taken up almost one-sixth of the report. But even with d3d11 removed, the report just barely sneaks in under the limit, at 1475812 bytes. One might blame an excess of trace lines, as last time, but in reality I am inclined to think that 1.5 MB is no longer a reasonable limit for a test report. If there is an inherent limitation brought on by server storage space or processing power, well, so be it; but if not, we probably want to either raise the limit or start investing energy into getting rid of existing todos.
Alright, that's finally it from me, I promise. At least for now ;-)
[1] https://source.winehq.org/patches/data/140671
[2] https://www.winehq.org/pipermail/wine-devel/2018-January/121083.html
[3] https://source.winehq.org/patches/data/140924
[4] https://source.winehq.org/patches/data/140925
[5] https://github.com/fvwmorg/fvwm/issues/51
[6] https://source.winehq.org/patches/data/141063
[7] http://test.winehq.org/data/tests/wininet:ftp.html
[8] https://source.winehq.org/git/wine.git/commit/07f7022fccc1c51eb4ccb4b6735e76...
2018-02-01 19:10 GMT-07:00 Zebediah Figura z.figura12@gmail.com:
While I'm certainly willing to fix every last failure myself if need be, just think how much easier it would be if people who were familiar with the code took a look at it!
Oh gosh. It's like a hunger strike, only you're threatening to lose your sanity :-O
- d3dx9_36:math is slightly off on the spherical harmonics values due to
the differences in FP math on 64-bit. Here's inviting someone who knows the math to check all of our numbers and make sure they're correct, or otherwise do as Alex Henrie did for spherical light [8].
With the spherical light tests, I both corrected the numbers and relaxed the test stringency, because Windows errs a little in one direction and Wine errs a little more in the other direction. I might be able to fix the spherical harmonic tests if someone points me towards an understandable introduction to spherical harmonics in computer science.
Perhaps in these cases the best thing to do is to mark those todo_wine_if (sizeof (void *) == 8), since properly implementing the 64-bit marshaller takes more work than anyone is willing to do.
This is a very good idea. How about putting this in wine/test.h:
#ifdef _WIN64
#define todo_wine64 todo_wine
#else
#define todo_wine64
#endif
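It would be used like the existing todo_wine, e.g. (a hypothetical call site; hr is just illustrative):

todo_wine64
ok(hr == S_OK, "Got hr %#x.\n", hr);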
-Alex
On 2 February 2018 at 07:08, Alex Henrie alexhenrie24@gmail.com wrote:
With the spherical light tests, I both corrected the numbers and relaxed the test stringency, because Windows errs a little in one direction and Wine errs a little more in the other direction. I might be able to fix the spherical harmonic tests if someone points me towards an understandable introduction to spherical harmonics in computer science.
Understandable may be a relative term, but Peter-Pike Sloan wrote a paper called "Stupid Spherical Harmonics (SH) Tricks" [1] a while back. You may also find [2] interesting.
[1] http://www.ppsloan.org/publications/StupidSH36.pdf
[2] https://blogs.msdn.microsoft.com/chuckw/2012/07/28/spherical-harmonics-math/
Zebediah Figura z.figura12@gmail.com writes:
The ultimate goal, of course, is that we finally get a way to run testbot patches on [ideally multiple] Linux machines. Nobody needs me to tell them how important this is, but of course it's also important that we start from a clean slate, since these tests are going to be run for *every* patch, not just those that modify the tests. So while Linux is in a better state than some Windows versions at this point, it seems to me that we really do want a wholly green machine.
While we do want a green machine, note that this is not required for starting to run the tests on Linux. The testbot is able to filter known errors out, and we can ignore the spurious ones like we do for Windows tests.
Also along these lines: presumably setting up the testbot to run Linux tests is going to require a fair amount of work, and that work is implicitly going to be François' responsibility. I'd just like to say that I'm willing to help as much as I can, though.
It would be great if you could work on that. As far as I'm concerned this should be the top priority for the testbot, but it has not been getting any traction for years.
This isn't, I assert, something wrong with my setup. The D3D team's obviously admirable predisposition toward verbosity, and the need to print a large number of floating-point values, combined with a total of 2167 tests marked todo, have taken up almost one-sixth of the report. But even with d3d11 removed, the report just barely sneaks in under the limit, at 1475812 bytes. One might blame an excess of trace lines, as last time, but in reality I am inclined to think that 1.5 MB is no longer a reasonable limit for a test report. If there is an inherent limitation brought on by server storage space or processing power, well, so be it; but if not, we probably want to either raise the limit or start investing energy into getting rid of existing todos.
While we can bump the limit a little, we still need to invest energy into reducing verbosity. If we instead increase the limit every time some test runs over, we'll reach 100MB in a couple of years, and then we'll be in serious trouble.
On Thu, 1 Feb 2018, Zebediah Figura wrote: [...]
- user32:win, perhaps one of the most notorious offenders, has three consistent failures left, and possibly some more intermittent ones.
  - Lines 3364-5 (test_SetForegroundWindow) and 9940-4 (test_LockWindowUpdate) were determined by jwhite to be the fault of the hwndMain test window being minimized during the test. I investigated further and traced this down, surprisingly enough, to a bug in fvwm, which I've reported at [5]. Notably, the bug takes a lot of steps to trigger, and we can (and, I think, should) work around it by modifying the test machine configuration: either by changing the focus mode from MouseFocus, or by removing the RaiseTransient or StackTransientParent style.

[5] https://github.com/fvwmorg/fvwm/issues/51
I'm not sure what to do for this. Here's the fvwm configuration of the cw1 and cw2 machines:
$ cat ~/.fvwm/config
Style * ClickToFocus
DesktopSize 1x1
The rest comes from the standard Debian configuration file: /usr/share/fvwm/default-config/config
So I think that means MouseFocus should not be an issue. RaiseTransient however seems to be controlled by StackTransientParent which the Debian configuration file sets as follows:
# Transient Windows (such as open file windows)
Style * DecorateTransient, StackTransientParent
Style * !FPGrabFocusTransient, FPReleaseFocusTransient
What setting do you recommend?
Also along these lines: presumably setting up the testbot to run Linux tests is going to require a fair amount of work, and that work is implicitly going to be François' responsibility. I'd just like to say that I'm willing to help as much as I can, though.
Okay, so, all that said, we have another problem. I've tried running winetest on my own machine recently, in the interest of rooting out some of the rarer failures, and I've found that the reports it generates are consistently about 1.7 MB, which is of course over the 1.5 MB limit that test.winehq.org accepts. I did some sed'ing and came up with the following worst offenders:
test.winehq.org actually gobbles up surprising amounts of disk space.
On my test machine I have 977 reports using up 7.2GB. I estimate that test.winehq.org has 2100 reports (54 reports times 39 days), thus using around 16GB.
Then there's the old reports that are no longer online but are still lurking in old-data. I have no idea why we even keep those. On my machine they use another 7.7GB so maybe that's another 16GB on test.winehq.org.
At least these estimates are quite a bit lower than those I came up with in April. I think we have fewer reports per day now (54 instead of 60), and my archives are much smaller too (though I don't remember whether that's because I tweaked some local setting). There's more discussion of the whole test.winehq.org subject in this thread: https://www.winehq.org/pipermail/wine-devel/2017-April/117162.html
Anyway, we might have to increase the limit at some point, but keeping the log sizes reasonable is important too. So I sent a patch that marks tests that output more than 32KB of traces as failed on test.winehq.org in order to push for a review / refactoring of those tests. If the patch is accepted I can send the same one for the TestBot so patches that bloat the traces are caught early.
On 11/02/18 18:11, Francois Gouget wrote:
On Thu, 1 Feb 2018, Zebediah Figura wrote: [...]
- user32:win, perhaps one of the most notorious offenders, has three consistent failures left, and possibly some more intermittent ones.
  - Lines 3364-5 (test_SetForegroundWindow) and 9940-4 (test_LockWindowUpdate) were determined by jwhite to be the fault of the hwndMain test window being minimized during the test. I investigated further and traced this down, surprisingly enough, to a bug in fvwm, which I've reported at [5]. Notably, the bug takes a lot of steps to trigger, and we can (and, I think, should) work around it by modifying the test machine configuration: either by changing the focus mode from MouseFocus, or by removing the RaiseTransient or StackTransientParent style.

[5] https://github.com/fvwmorg/fvwm/issues/51
I'm not sure what to do for this. Here's the fvwm configuration of the cw1 and cw2 machines:
$ cat ~/.fvwm/config
Style * ClickToFocus
DesktopSize 1x1
The rest comes from the standard Debian configuration file: /usr/share/fvwm/default-config/config
So I think that means MouseFocus should not be an issue. RaiseTransient however seems to be controlled by StackTransientParent which the Debian configuration file sets as follows:
# Transient Windows (such as open file windows)
Style * DecorateTransient, StackTransientParent
Style * !FPGrabFocusTransient, FPReleaseFocusTransient
What setting do you recommend?
Hmm, I just realized I prescribed contradictory settings... I guess there's some reason other than mouse-focus mode why the user32:msg window doesn't activate. But the solution should be the same regardless.
Anyway, with regard to user32:win, I would either set a different focus mode, or add
Style * DontStackTransientParent
to override the previous setting.
Thanks!
On Sun, 11 Feb 2018, Zebediah Figura wrote: [...]
$ cat ~/.fvwm/config
Style * ClickToFocus
DesktopSize 1x1
The rest comes from the standard Debian configuration file: /usr/share/fvwm/default-config/config
First a correction: ~/.fvwm/config totally overrides the default fvwm configuration. So the above is the entirety of the fvwm configuration. Maybe that's not enough?
[...]
or add
Style * DontStackTransientParent
to override the previous setting.
Ok. I did a bunch of tests with user32:msg and DontStackTransientParent seems to help. Without it I get 23 systematic errors, and with it the test succeeds most of the time, with only 4 to 9 failures in one or two runs out of 10. So here is the new ~/.fvwm/config:
---
# The default fvwm config is pretty long and has tons of features of
# dubious relevance for a test environment:
# - It has a sidebar which may be causing the ole32:dragdrop test to fail.
# - The virtual desktops are just an annoyance when accessing the machine
#   through VNC.
# So this configuration file completely replaces the default one and is
# built from scratch with only the settings that have been deemed necessary.
# fvwm's default SloppyFocus policy does not match the Windows behavior and
# causes a lot of failures. So use ClickToFocus instead, which best matches
# the Windows behavior.
Style * ClickToFocus
# DontStackTransientParent is needed to avoid some failures in user32:msg.
# See https://www.winehq.org/pipermail/wine-devel/2018-February/122021.html
Style * DontStackTransientParent
# DecorateTransient does not seem to be needed for the tests but gives a
# proper border to transient windows like File Open dialogs. This makes
# them much more usable and better matches the Windows behavior.
Style * DecorateTransient
# Don't use virtual desktops, to avoid aggravation when the mouse goes out
# of the VNC window.
DesktopSize 1x1
---
Let me know how it goes.
Fun fact: Depending on the fvwm settings, clicking on the File menu in Gimp has a 50% chance of crashing x11vnc!
On 12/02/18 07:46, Francois Gouget wrote:
# DontStackTransientParent is needed to avoid some failures in user32:msg.
# See https://www.winehq.org/pipermail/wine-devel/2018-February/122021.html
Style * DontStackTransientParent
Excellent, many thanks. (Although, as a nitpick, it's user32:win, not user32:msg.)
Francois Gouget fgouget@codeweavers.com writes:
Then there's the old reports that are no longer online but are still lurking in old-data. I have no idea why we even keep those. On my machine they use another 7.7GB so maybe that's another 16GB on test.winehq.org.
There had been a few cases in the past where we had to dig through the old data to investigate problems, but I agree that we could get rid of it nowadays. Currently it takes up 64GB on winehq.
On 11/02/18 18:11, Francois Gouget wrote:
Anyway, we might have to increase the limit at some point, but keeping the log sizes reasonable is important too. So I sent a patch that marks tests that output more than 32KB of traces as failed on test.winehq.org in order to push for a review / refactoring of those tests. If the patch is accepted I can send the same one for the TestBot so patches that bloat the traces are caught early.
Apologies if this has been proposed before, but would it perhaps make sense to run the tests with -q, so traces and todos are filtered out?
On Tue, 13 Feb 2018, Zebediah Figura wrote:
On 11/02/18 18:11, Francois Gouget wrote:
Anyway, we might have to increase the limit at some point, but keeping the log sizes reasonable is important too. So I sent a patch that marks tests that output more than 32KB of traces as failed on test.winehq.org in order to push for a review / refactoring of those tests. If the patch is accepted I can send the same one for the TestBot so patches that bloat the traces are caught early.
Apologies if this has been proposed before, but would it perhaps make sense to run the tests with -q, so traces and todos are filtered out?
That would make the tests harder to debug, particularly for intermittent failures and non-TestBot results where it's not as easy as just re-running the test.
Maybe a compromise would be to have WineTest filter out the traces if the test succeeds. I'd be wary of relying on just re-running the test with traces on, again because of intermittent failures and because tests that fail may cause the next run to fail differently.
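For concreteness, here is a rough sketch of that compromise, with hypothetical helpers rather than WineTest's actual internals:

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: accumulate traces instead of printing them
 * immediately (not actual WineTest code). */
static char trace_buf[65536];
static size_t trace_len;

static void buffered_trace(const char *msg)
{
    size_t len = strlen(msg);
    if (trace_len + len <= sizeof(trace_buf))
    {
        memcpy(trace_buf + trace_len, msg, len);
        trace_len += len;
    }
}

/* At the end of a test unit, emit the traces only if something failed. */
static void flush_traces(int failures)
{
    if (failures) fwrite(trace_buf, 1, trace_len, stdout);
    trace_len = 0;
}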
But even so we may still lose too much, as comparing the traces of a successful test with those of a failing one can yield insight into where and how they diverge.
On 15 February 2018 at 15:08, Francois Gouget fgouget@codeweavers.com wrote:
On Tue, 13 Feb 2018, Zebediah Figura wrote:
Apologies if this has been proposed before, but would it perhaps make sense to run the tests with -q, so traces and todos are filtered out?
That would make the tests harder to debug, particularly for intermittent failures and non-TestBot results where it's not as easy as just re-running the test.
I agree for traces, but the case for todo_wine is less clear to me.
On Thu, 1 Feb 2018, Zebediah Figura wrote: [...]
Also along these lines: presumably setting up the testbot to run Linux tests is going to require a fair amount of work, and that work is implicitly going to be François' responsibility. I'd just like to say that I'm willing to help as much as I can, though.
Here is a description of how we could get a very basic implementation together: (base series)
b1. We need a new VM type so the TestBot knows which VMs to run the Wine tests on (see VMs.pm, ddl/winetestbot.sql and ddl/update*.sql). We could call it something like 'unix' or 'wine'.
b2. There are multiple types of Tasks and each one is handled by a WineRunXxx.pl script. The script being used depends on the Step->Type field so we will need new Step types (see Steps.pm and the same ddl files).
b3. CheckForWinetestUpdate will need to be updated to create a job with a Step of the right type, for instance unixreconfig, to update Wine on the Unix machine(s) when there is a Git commit in Wine.
b4. To process the tasks in the new unixreconfig step type we will need two new scripts: WineRunUnixReconfig.pl and build/UnixReconfig.pl. Note that they should *not* run WineTest: WineRunUnixReconfig.pl will have to update the VM snapshot so later compilations start from the new Wine baseline, but running WineTest could ruin the VM's test environment (crash the X server, etc.), which we would not want in the snapshot we'll use to run later tests.
Now that we have proper dependency support between Steps, we could also add the UnixReconfig Step as an extra Step in the usual "Wine Update" job and just make sure it does not depend on the classic Reconfig step, so that a failure in one does not cause the other to be skipped. The choice is a matter of taste.
b5. We will need a new script to run the tests in the Unix VMs. Maybe call it WineRunUnixTask.pl. Unlike WineRunTask.pl which runs the tests on Windows, WineRunUnixTask.pl will need to deal with both patches and executables.
b6. Modify Patches::Submit() to create tasks for the unix/wine VMs. Currently it only creates jobs for the patches that modify the tests. If a patch series contains test and non-test parts it combines them into one job, which suits our purpose just fine. So for a basic implementation we could keep that as is.
Then Patches::Submit() creates one job per dll that it needs to run tests on. So if a patch touches the tests of d3d8 and d3d9 that's 2 jobs. In fact this should be changed because it does not mesh well with https://source.winehq.org/patches/ which expects precisely one job per patch.
So assuming the above is fixed, if we get a patch that touches the device and visual unit tests in d3d8 and d3d9, we will currently get a job that looks something like this (here indentation represents the dependencies between steps):

1 Build d3d8_test.exe and d3d8_test64.exe
    2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM
    3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
    4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
    5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
6 Build d3d9_test.exe and d3d9_test64.exe
    7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM
    8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
    9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
    10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM
The simplest approach would be to add the unix/wine tests as a single extra step that does the build and runs all the test units:

1 Build d3d8_test.exe and d3d8_test64.exe
    2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM
    3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
    4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
    5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
6 Build d3d9_test.exe and d3d9_test64.exe
    7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM
    8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
    9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
    10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM
11 All test units - all bitness - 1 task per Unix VM
    `-> run d3d8:device d3d8:visual d3d9:device d3d9:visual
For the unix step the TestBot would send either the patch or the test executable, and provide the list of test units to run. Of course that means only testing the patches that modify the tests. Also, you'll notice there's no mention of the 32 bit vs WoW wineprefix distinction: we'd just run all the relevant tests, 32 bit, 32 bit WoW, and 64 bit WoW, in the same task.
As I said it's the simplest approach but probably not what we want. I'll discuss alternatives below.
b7. In addition to the WineRunUnixReconfig.pl changes, CheckForWinetestUpdate needs to be updated to create WineRunUnixTask.pl tasks to run the official WineTest executables, just like we currently do on Windows. These could get tacked onto the existing jobs or go into a separate job like the 'Other VMs' job.
b8. Last but not least, create one or more Unix VMs to run the tests on, with all the development packages and the right window manager and settings.
Note that this would be separate from the standard build VM, in part because both would need different Type fields, but also because the current build VM that generates the Windows executables for the Windows tests uses MinGW and does not need any of the native Unix development libraries. This means it rarely breaks or needs updates when the Wine build dependencies change, unlike the new unix test VMs, which will more likely need regular updates.
b9. We normally give the Windows tasks 2 minutes to run. However they only run a single test unit each, whereas the unix tasks will need to run many test units, so 2 minutes will be too short.
- The 2 minutes is in part to make sure the tests don't take too long to run (that limit matches the WineTest.exe limit), and in part to make sure nasty patches sent to wine-devel don't get to use our infrastructure for too long. So there is some value in keeping it as short as possible.
- For the full test suite we currently have a 30 minute timeout. So in theory we could simply add 2 minutes per test unit until we reach the 30 minute limit. Past 3 test units that seems overly generous though, so we could do something a bit more sophisticated, like 2 minutes for the first 3 test units and 30 seconds per unit after that, up to the time limit (see the sketch after this list).
- The exact algorithm probably does not matter much as long as we don't get spurious timeouts. It should also be easy to adjust independently of the rest of the code.
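As a minimal sketch of one such scheme (a hypothetical helper, using the constants discussed above):

/* 2 minutes each for the first 3 test units, then 30 seconds per extra
 * unit, capped at the 30 minute full-suite limit. */
static int task_timeout(int units)
{
    int seconds = units <= 3 ? units * 120 : 3 * 120 + (units - 3) * 30;
    return seconds > 30 * 60 ? 30 * 60 : seconds;
}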
Now the above is quite limited and not really what we want, so let's see what's missing and what the impact of adding it would be.
First, one of the things we want is to check that the test patches sent to wine-devel compile. This can be built on top of the above. (compilation series)
c1. The first change is to modify the code that creates the jobs, Patches::Submit(), so it does not ignore patches that don't touch the tests.
c2. When a patch touches a test it can work exactly like above, but in addition it should create jobs for the other patches. These new jobs would only contain a single unix step of the same form as the tests above but with an empty list of test units to run.
c3. When given an empty list of test units to run the UnixTask.pl script should still perform the build but skip running the tests.
c4. The WineRunUnixTask.pl script should not complain if there is only a build log and no test log. A missing log file may be hard to distinguish from a bug, so this may require putting some special marker like "Test log intentionally left blank" in the log so the script knows everything is fine.
But what we really want is to also test non-test Wine patches sent to wine-devel, so we can make sure they don't break the tests. Here's a starting point for that: (dll tests series)
d1. Whenever a patch touches a dll, rerun all the tests in that dll. This is a simple extension of the compilation series where, instead of passing an empty test unit list for non-test patches, we either pass a list of all that dll's test units (like we currently do for the non-C patches), or an empty list if there is no test for that dll.
This will make checking some patches a bit slow, particularly for dlls that have a lot of test units, like mshtml. But the TestBot should be able to handle the extra load.
d2. For patches that don't modify a specific dll (or program) we'd just pass an empty list.
However a patch that modifies a dll such as ntdll.dll could essentially break any test. So what we really, really want is to run a much more complete range of tests for these patches. (all tests series)
a1. The simplest approach would be to tweak the above code to systematically rerun every test (or WineTest) for every patch. As before this would include the 32 bit, 32 bit WoW and 64 bit WoW sets of tests.
a2. In theory this task's timeout should be 3 * ($ReconfigTimeout + $SuiteTimeout). With the default values that would be 3 hours which seems way too large.
a3. A Wine rebuild takes 3.5 minutes on average and running the full WineTest suite takes about 19 minutes (see the TestBot statistics). Even going by these averages, a job would take at least 1 hour: a rebuild plus three 19 minute test runs is already over 60 minutes. This has to be compared to the rate at which jobs arrive which, excluding every non-test patch, currently stands at about 1.3 jobs per hour.

So this is very unlikely to be sustainable.
How to make it possible is not entirely clear and below I'll investigate various options. (options)
o1. The simplest option for reducing the load would be to reduce the number of bitnesses we test. For instance we could drop the 32 bit tests in favor of the 32 bit WoW ones. Or, in a more radical move, drop everything but the basic 32 bit tests. This means we would miss some failures but if that's rare enough it may be acceptable. This would still leave us with tasks that take about 22 minutes so this may not be sufficient either.
o2. The approach that BuildBot takes is to test multiple patches together: if multiple patches arrive in a 5 minute window (or before the previous testing round is finished), it can put them all together and only start one new test round.
- We could conceivably do something like this for the TestBot but it would require pretty big changes to the way it operates. For instance the 'patches' website assumes there is one job per patch. But here this would not be the case.
- When a job fails we would not know which patch caused the failure. We could solve that by doing a 'bisect' of sorts but then this compounds the complexity so it's probably not the right approach.
- It also would not work for manually submitted jobs.
- And then there's the question of how it would mesh with the existing Windows tests: bunch them up too? Keep them separate and end up with multiple jobs per patch?
- So overall this really does not seem like an approach we could use in the TestBot.
o3. Another option would be to split the work so it can be parallelized: have one step that only does the 32 bit tests on one of the VMs (1 build + 1 WineTest run so 22 minutes expected run time), and another step that does the 32 and 64 bit WoW tests in parallel on another VM (running on another host). But that second VM would still have an expected run time of about 45 minutes so that would likely be insufficient.
o4. We could instead separate the build step from the test one(s). This would allow us to run the 32 bit, 32 bit Wow and 64 bit Wow tests in parallel in separate VMs, meaning we should be able to handle about 3 jobs per hour.
- This option can be tempting if we have multiple test environments. For instance it could allow us to build once and then run the tests in GNOME, KDE and LXDE environments (although none of these have window managers that are up to the task, so maybe a better example would be to run the same binaries in English, French and Hebrew locales).
- Also while this would work fine for running the tests multiple times on the same Linux distribution, we would probably not want to build on Debian and then run the tests on Red Hat. So this may be of limited usefulness.
- Furthermore if we have really different types of Unix systems such as Linux and FreeBSD (or Mac assuming it can fit in this framework at all), we would need a way to make sure that a binary built on Linux is then not sent to a FreeBSD test machine. One approach could be to create more VM types (have linux and freebsd instead of just unix, and handle both with the same scripts), but that seems to lead to an explosion in the number of VM types. Especially if we then go with Debian, Red Hat, Arch, etc. It's certainly possible to find a solution though (VM subtypes?).
- Another drawback is that this requires transferring the Wine and test binaries from the build VM to the TestBot server and then to each of the test VMs. This would be about 20MB compressed per bitness for every job. This adds to the time spent running each Task, to the WineHQ.org bandwidth consumption and the TestBot disk usage.
- An optimisation would be to let the 'Wine update' job bring all the test VMs up to date with the latest Wine binaries to establish a new baseline, and to then only send the binary changes.
- The simplest form of diff would be to only send the modified files (based on the modified timestamp), but there are probably a lot of binary diff tools we could also use.
- The main issue with all these 'diff' approaches is keeping the baseline of the build VM and the test VMs in sync. We run the risk of having a sequence such as:
  1 Generate a binary diff for job 1.
  2 Update the build VM.
  3 Synchronize the test VM with the new binary baseline.
  4 Try to apply the diff generated in 1 to the test VM's new binaries.
o5. Instead we could replicate the Unix VM(s): rather than having the host running the single Unix VM be the bottleneck, we could throw more hardware at the issue to increase our job processing throughput.
- This assumes that the test VM really behaves exactly the same way no matter which host it is on, otherwise the results could be pretty confusing. That should already be the case, but our hosts are not entirely identical (3 Intel processors, 1 AMD, though most VMs use a neutral 'kvm32' processor) and this has never really been thoroughly verified.
- Note that although the job throughput would be increased, the job latency would still remain at the usual level.
- See bug 39412 for (upcoming) details on how the TestBot could implement load balancing between the hosts. https://bugs.winehq.org/show_bug.cgi?id=39412
- A further benefit is that failover would likely come for free. This means that with enough VM duplication, when one host freezes, the tasks would automatically be handled by the other hosts, which would mean less (or no) downtime.
o6. A more sophisticated approach would be to analyze the dlls the tests depend on and only rerun those that can be impacted.
- For instance it looks like modifying the ws2_32 source would only impact the secur32, webservices, winhttp, ws2_32 and wsdapi tests, which means running 21 test units instead of the 500+ of the full suite. Also changing the source in a dll with no test and not used anywhere else (e.g. hal.dll) means we could skip running the tests entirely.
- This analysis could be done on the VM right after rebuilding Wine: see which binaries changed, then look for them in the Makefile IMPORTS lines (see the sketch after this list). But this would miss tests and dlls that do a LoadLibrary(), so a more sophisticated analysis may be called for, or we may need to provide the extra dependency information manually somehow (either in the Wine source or in some TestBot table).
- This would mean the unix tasks would determine which test units to run on their own rather than having the TestBot tell them as proposed in the dll tests series. But the change should be minor.
- It's hard to predict how much this would reduce processing time and thus whether it would be sufficient. This really depends on the ratio of low-level header / dll patches versus patches to high-level dlls with no tests. So the only way to see if that works is probably to try it out.
- If not sufficient on its own it can be combined with other options, particularly the load balancing one (o5). It may also be easier to implement than o5, depending on how hard the dependency analysis turns out to be.
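As a crude illustration of the Makefile-based analysis in o6 (a hypothetical helper; a real implementation would have to cope with line continuations, and with the LoadLibrary() caveat above):

#include <stdio.h>
#include <string.h>

/* Return non-zero if a test's Makefile.in mentions the given dll on an
 * IMPORTS line, i.e. if the test may be impacted by changes to that dll. */
static int makefile_imports(const char *makefile, const char *dll)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen(makefile, "r");

    if (!f) return 0;
    while (!found && fgets(line, sizeof(line), f))
        found = strstr(line, "IMPORTS") && strstr(line, dll) ? 1 : 0;
    fclose(f);
    return found;
}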
Given the above I think it makes sense to start implementing things progressively, from the base series to the compilation one, to the dll one, to the all-tests one. Each step will build on the previous one and provide us with new information about the extra TestBot load, how many jobs we really get per hour, whether the test results are reliable, etc.
Should we approach the TestBot limits at any point, we'll be able to stop expanding the tests and still have more than what we currently have while we figure out how to proceed further. I also think that no matter what happens and where we stop, the work done will be reused when proceeding further.
Further bells and whistles: (whistles)
w1. The current tasks only do one thing and produce a single log. For some approaches a single task may do a build, run the 32 bit tests, then the 64 bit tests. Having all this go into a single log may be confusing when looking at it on the web site, so it could make sense to create a separate log for each 'subtask'.
- This would require modifying the WineRunXxx.pl and associated build scripts of course, as well as the JobDetails.pl web page so it can show each log; but also the code canceling and restarting Tasks, so they clean up the old logs correctly.
- The JobDetails.pl page could have "Show full 32 bit log" and "Show full 64 bit log" links for instance.
- Note also that we already support showing the results of multiple test units from a single log for WineTest. So that aspect should not require any change.
w2. The job submission page lets developers tick a checkbox to run the 64 bit Windows tests in addition to the 32 bit ones. If we do have 3 types of tests on Unix we may want to expand that. This may also require analyzing whether the user picked a Unix VM in order to present the right options.
w3. The current Windows tests reset the test environment entirely between test units. The Unix tests would not, and it's not clear that we should: after all, WineTest needs to be able to run all the tests with no cleanup between them. Still, it may make sense to do a cleanup between bitnesses, at least resetting the WINEPREFIX. But doing a more thorough cleanup would involve at least restarting the X server between tests, and that could be bothersome.