On Thu, 1 Feb 2018, Zebediah Figura wrote: [...]
Also along these lines: presumably setting up the testbot to run linux tests is going to require a fair amount of work, and that work is going to be implied to be François' responsibility. I'd just like to say that I'm willing to help as much as I can, though.
Here is a description of how we could get a very basic implementation together: (base series)
b1. We need a new VM type so the TestBot knows which VMs to run the Wine tests on (see VMs.pm, ddl/winetestbot.sql and ddl/update*.sql). We could call it something like 'unix' or 'wine'.
b2. There are multiple types of Tasks and each one is handled by a WineRunXxx.pl script. The script being used depends on the Step->Type field so we will need new Step types (see Steps.pm and the same ddl files).
b3. CheckForWinetestUpdate will need to be updated to create a job with a Step of the right type, for instance unixreconfig, to update Wine on the Unix machine(s) when there is a Git commit in Wine.
b4. To process the tasks in the new unixreconfig step type we will need two new scripts: WineRunUnixReconfig.pl and build/UnixReconfig.pl. Note that they should *not* run WineTest: WineRunUnixReconfig.pl will have to update the VM snapshot so later compilations start from the new Wine baseline. But running WineTest could ruin the VM's test environment (crash the X server, etc.) which we would not want in the snapshot we'll use to run later tests.
Now that we have proper dependency support between Steps we could also add the UnixReconfig Step as an extra Step in the usual "Wine Update" job and just make sure it does not depend on the classic Reconfig step so that a failure in one does not cause the other to be skipped. The choice is a matter of taste.
b5. We will need a new script to run the tests in the Unix VMs. Maybe call it WineRunUnixTask.pl. Unlike WineRunTask.pl which runs the tests on Windows, WineRunUnixTask.pl will need to deal with both patches and executables.
b6. Modify Patches::Submit() to create tasks for the unix/wine VMs. Currently it only creates jobs for the patches that modify the tests. If a patch series contains test and non-test parts it combines them into one job, which suits our purpose just fine. So for a basic implementation we could keep that as is.
Then Patches::Submit() creates one job per dll that it needs to run tests on. So if a patch touches the tests of d3d8 and d3d9 that's 2 jobs. In fact this should be changed because it does not mesh well with https://source.winehq.org/patches/ which expects precisely one job per patch.
So assuming the above is fixed, if we get a patch that touches the device and visual unit tests in d3d8 and d3d9 we will currently get a job that looks something like this (here indentation represents the dependencies between steps):

1 Build d3d8_test.exe and d3d8_test64.exe
   2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM
   3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
   4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
   5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
6 Build d3d9_test.exe and d3d9_test64.exe
   7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM
   8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
   9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
   10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM
The simplest approach would be to add the unix/wine tests as a single extra step that does the build and runs all the test units:

1 Build d3d8_test.exe and d3d8_test64.exe
   2 d3d8:device - 32 bit exe - 1 task per 32/64 bit Windows VM
   3 d3d8:device - 64 bit exe - 1 task per 64 bit Windows VM
   4 d3d8:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
   5 d3d8:visual - 64 bit exe - 1 task per 64 bit Windows VM
6 Build d3d9_test.exe and d3d9_test64.exe
   7 d3d9:device - 32 bit exe - 1 task per 32/64 bit Windows VM
   8 d3d9:device - 64 bit exe - 1 task per 64 bit Windows VM
   9 d3d9:visual - 32 bit exe - 1 task per 32/64 bit Windows VM
   10 d3d9:visual - 64 bit exe - 1 task per 64 bit Windows VM
11 All test units - all bitness - 1 task per Unix VM
   `-> run d3d8:device d3d8:visual d3d9:device d3d9:visual
For the unix step the TestBot would either send the patch or the test executable, and provide the list of test units to run. Of course that means only testing the patches that modify the tests. Also you'll notice there's no mention of the 32 bit vs WoW wineprefix distinction: we'd just run all the relevant tests, 32 bit, 32 bit WoW and 64 bit WoW, in the same task.
As I said it's the simplest approach but probably not what we want. I'll discuss alternatives below.
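The step layout above can be sketched as a small data structure. This is purely illustrative: the field names and the make_job() helper are invented here, the real TestBot stores Steps and Tasks in its database tables and the scripts are Perl.

```python
def make_job(modules):
    """Sketch of the job layout: modules maps a dll to its modified test
    units, e.g. {"d3d8": ["device", "visual"]}. Each build step is followed
    by per-bitness test steps depending on it, plus one extra unix step
    covering every test unit."""
    steps, unix_units = [], []
    for dll, units in modules.items():
        build = {"type": "build", "dll": dll, "depends": None}
        steps.append(build)
        for unit in units:
            for bits in (32, 64):
                steps.append({"type": "test", "unit": f"{dll}:{unit}",
                              "bits": bits, "depends": build})
            unix_units.append(f"{dll}:{unit}")
    # The single unix step runs the build plus all the test units at once.
    steps.append({"type": "unix", "units": unix_units, "depends": None})
    return steps
```

For the d3d8 + d3d9 example above this yields the 11 steps shown, with step 11 listing all four test units.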
b7. In addition to the WineRunUnixReconfig.pl changes, CheckForWinetestUpdate needs to be updated to create WineRunUnixTask.pl tasks to run the official WineTest executables, just like we currently do on Windows. These could get tacked onto the existing jobs or go into a separate job like the 'Other VMs' one.
b8. Last but not least, create one or more Unix VMs to run the tests on, with all the development packages and the right window manager and settings.
Note that this would be separate from the standard build VM in part because both would need different Type fields. But also the current build VM that generates the Windows executables for the Windows tests uses MinGW and does not need any of the native Unix development libraries. This means it rarely breaks or needs updates when the Wine build dependencies change, unlike the new unix test VMs which will more likely need regular updates.
b9. We normally give the Windows tasks 2 minutes to run. However they only run a single test unit each, whereas the unix tests will need to run many test units, so 2 minutes will be too short.
- The 2 minutes is in part to make sure the tests don't take too long to run (that limit matches the WineTest.exe one), and in part to make sure nasty patches sent to wine-devel don't get to use our infrastructure for too long. So there is some value in keeping it as short as possible.
- For the full test suite we currently have a 30 minute timeout. So in theory we could simply add 2 minutes per test unit until we reach the 30 minute limit. Past 3 test units that seems overly generous though. So we could do something a bit more sophisticated like 2 minutes for each of the first 3 test units and 30 seconds per unit after that, up to the time limit.
- The exact algorithm probably does not matter much as long as we don't get spurious timeouts. It should also be easy to adjust independently of the rest of the code.
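The sliding timeout above could look something like this. The constants mirror the values quoted in b9 but the function name is made up for illustration; the real implementation would of course live in the Perl scripts.

```python
SINGLE_TIMEOUT = 2 * 60   # per-test-unit limit, matches the WineTest.exe one
EXTRA_TIMEOUT = 30        # per test unit past the first 3
SUITE_TIMEOUT = 30 * 60   # the full-suite cap

def task_timeout(unit_count):
    """Timeout in seconds for a task running unit_count test units:
    2 minutes each for the first 3 units, 30 seconds per extra unit,
    never exceeding the 30 minute full-suite limit."""
    if unit_count <= 3:
        return unit_count * SINGLE_TIMEOUT
    return min(3 * SINGLE_TIMEOUT + (unit_count - 3) * EXTRA_TIMEOUT,
               SUITE_TIMEOUT)
```

As noted, the exact numbers are easy to tune independently of the rest of the code.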
Now the above is quite limited and not really what we want, so let's see what's missing and what impact adding it would have.
First one of the things we want is to check that the test patches sent to wine-devel compile. This can be built on top of the above. (compilation series)
c1. The first change is to modify the code that creates the jobs, Patches::Submit(), so it does not ignore patches that don't touch the tests.
c2. When a patch touches a test it can work exactly like above, but in addition it should create jobs for the other patches. These new jobs would only contain a single unix step of the same form as the tests above but with an empty list of test units to run.
c3. When given an empty list of test units to run the UnixTask.pl script should still perform the build but skip running the tests.
c4. The WineRunUnixTask.pl script should not complain if there is only a build log and no test log. A missing log file may be hard to distinguish from a bug so this may require putting some special content like "Test log intentionally left blank" in the log so the script knows everything is fine.
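One way the marker could work, as a rough sketch: the build side writes a well-known string when no test units were requested, and the TestBot side checks for it. The helper names and marker text are invented for illustration, and the real scripts would be Perl; this is just Python-shaped pseudocode.

```python
# Hypothetical marker distinguishing "no tests requested" (c3) from
# "the tests crashed before producing a log".
EMPTY_MARKER = "Test log intentionally left blank"

def write_test_log(path, test_units, run_tests):
    """Build-VM side: run the requested test units, or write the marker
    when the list is empty (compile-only jobs)."""
    with open(path, "w") as log:
        if not test_units:
            log.write(EMPTY_MARKER + "\n")
            return
        for unit in test_units:
            log.write(run_tests(unit))

def log_is_intentionally_empty(path):
    """TestBot side: an empty-by-design log is not an error."""
    with open(path) as log:
        return log.read().strip() == EMPTY_MARKER
```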
But what we really want is to also test non-test Wine patches sent to wine-devel so we can make sure they don't break the tests. Here's a starting point for that: (dll tests series)
d1. Whenever a patch touches a dll, rerun all the tests in that dll. This is a simple extension of the compilation series where instead of passing an empty test unit list for non-test patches we either pass a list of all that dll's test units (like we currently do for the non-C patches), or an empty list if there is no test for that dll.
This will make checking some patches a bit slow, particularly for some dlls that have a lot of test units like mshtml for instance. But the TestBot should be able to handle the extra load.
d2. For patches that don't modify a specific dll (or program) we'd just pass an empty list.
However a patch that modifies a dll such as ntdll.dll for instance could essentially break any test. So what we really, really want is to run a much more complete range of tests for these patches. (all tests series)
a1. The simplest approach would be to tweak the above code to systematically rerun every test (or WineTest) for every patch. As before this would include the 32 bit, 32 bit WoW and 64 bit WoW sets of tests.
a2. In theory this task's timeout should be 3 * ($ReconfigTimeout + $SuiteTimeout). With the default values that would be 3 hours which seems way too large.
a3. A Wine rebuild takes 3.5 minutes on average and running the full WineTest suite takes about 19 minutes (see the TestBot statistics). Even with these averages a job would take at least 1 hour. This has to be compared to the rate at which jobs arrive which, excluding every non-test patch, currently stands at about 1.3 jobs per hour.
So this is very unlikely to be sustainable.
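The back-of-the-envelope numbers above can be checked directly (the figures are the averages quoted in a3):

```python
REBUILD_MIN = 3.5     # average Wine rebuild time (TestBot statistics)
WINETEST_MIN = 19.0   # average full WineTest run
COMBINATIONS = 3      # 32 bit, 32 bit WoW and 64 bit WoW

# Each job rebuilds Wine and runs the full suite once per combination.
job_minutes = COMBINATIONS * (REBUILD_MIN + WINETEST_MIN)  # 67.5 minutes
capacity_per_hour = 60 / job_minutes                       # under 0.9 job/hour
arrival_rate = 1.3                                         # jobs per hour

# A single unix VM can process fewer jobs per hour than arrive,
# so the backlog would grow without bound.
assert capacity_per_hour < arrival_rate
```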
How to make it possible is not entirely clear and below I'll investigate various options. (options)
o1. The simplest option for reducing the load would be to reduce the number of bitnesses we test. For instance we could drop the 32 bit tests in favor of the 32 bit WoW ones. Or, in a more radical move, drop everything but the basic 32 bit tests. This means we would miss some failures but if that's rare enough it may be acceptable. This would still leave us with tasks that take about 22 minutes so this may not be sufficient either.
o2. The approach that BuildBot takes is to test multiple patches together. If multiple patches arrive in a 5 minute window (or before the previous testing round is finished) it can put them all together and only start one new test round.
- We could conceivably do something like this for the TestBot but it would require pretty big changes to the way it operates. For instance the 'patches' website assumes there is one job per patch. But here this would not be the case.
- When a job fails we would not know which patch caused the failure. We could solve that by doing a 'bisect' of sorts but then this compounds the complexity so it's probably not the right approach.
- It also would not work for manually submitted jobs.
- And then there's the question of how it would mesh with the existing Windows tests: bunch them up too? Keep them separate and end up with multiple jobs per patch?
- So overall this really does not seem like an approach we could use in the TestBot.
o3. Another option would be to split the work so it can be parallelized: have one step that only does the 32 bit tests on one of the VMs (1 build + 1 WineTest run so 22 minutes expected run time), and another step that does the 32 and 64 bit WoW tests in parallel on another VM (running on another host). But that second VM would still have an expected run time of about 45 minutes so that would likely be insufficient.
o4. We could instead separate the build step from the test one(s). This would allow us to run the 32 bit, 32 bit WoW and 64 bit WoW tests in parallel in separate VMs, meaning we should be able to handle about 3 jobs per hour.
- This option can be tempting if we have multiple test environments. For instance it could allow us to build once and then run the tests in GNOME, KDE and LXDE environments (although none of these have window managers that are up to the task, so maybe a better example would be to run the same binaries in English, French and Hebrew locales).
- Also while this would work fine for running the tests multiple times on the same Linux distribution, we would probably not want to build on Debian and then run the tests on Red Hat. So this may be of limited usefulness.
- Furthermore if we have really different types of Unix systems such as Linux and FreeBSD (or Mac assuming it can fit in this framework at all), we would need a way to make sure that a binary built on Linux is then not sent to a FreeBSD test machine. One approach could be to create more VM types (have linux and freebsd instead of just unix, and handle both with the same scripts), but that seems to lead to an explosion in the number of VM types. Especially if we then go with Debian, Red Hat, Arch, etc. It's certainly possible to find a solution though (VM subtypes?).
- Another drawback is that this requires transferring the Wine and test binaries from the build VM to the TestBot server and then to each of the test VMs. This would be about 20MB compressed per bitness for every job. This adds to the time spent running each Task, to the WineHQ.org bandwidth consumption and the TestBot disk usage.
- An optimisation would be to let the 'Wine update' job catch up all the test VMs to the latest Wine binaries to establish a new baseline, and to then only send the binary changes.
- The simplest form of diff would be to only send the modified files (based on the modified timestamp), but there are probably a lot of binary diff tools we could also use.
- The main issue with all these 'diff' approaches is keeping the baseline of the build VM and the test VMs in sync. We run the risk of having a sequence such as:
  1 Generate a binary diff for job 1.
  2 Update the build VM.
  3 Synchronize the test VM with the new binary baseline.
  4 Try to apply the diff generated in 1 to the test VM's new binaries.
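A minimal sketch of the timestamp-based diff, with a baseline id (say, the Wine commit the snapshot was taken at) recorded in the diff so a test VM can refuse to apply one generated against a different baseline, which would at least detect the stale-diff sequence above. All names are illustrative, not actual TestBot code.

```python
import os
import tarfile

def make_diff(build_dir, baseline_time, baseline_id, out_path):
    """Archive every file modified since the snapshot was taken, and tag
    the result with the baseline it was generated against."""
    with tarfile.open(out_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(build_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) > baseline_time:
                    tar.add(path, arcname=os.path.relpath(path, build_dir))
    return {"baseline": baseline_id, "archive": out_path}

def apply_diff(diff, test_dir, current_baseline_id):
    """Refuse to apply a diff whose baseline does not match the test VM's."""
    if diff["baseline"] != current_baseline_id:
        raise RuntimeError("diff was generated against a different baseline")
    with tarfile.open(diff["archive"]) as tar:
        tar.extractall(test_dir)
```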
o5. Instead we could replicate the Unix VM(s). So instead of having the host running the single Unix VM be the bottleneck, we could throw more hardware at the issue to increase our job processing throughput.
- This assumes that the test VM really behaves exactly the same way no matter which host it is on, otherwise the results could be pretty confusing. That should already be the case but our hosts are not entirely identical (3 Intel processors, 1 AMD, but most VMs use a neutral 'kvm32' processor) and this has never really been thoroughly verified.
- Note that although the job throughput would be increased, the job latency would still remain at the usual level.
- See bug 39412 for (upcoming) details on how the TestBot could implement load balancing between the hosts. https://bugs.winehq.org/show_bug.cgi?id=39412
- A further benefit is that failover would likely come for free. This means that with enough VM duplication, when one host freezes the tasks would automatically be handled by the other hosts, which would mean less (or no) downtime.
o6. A more sophisticated approach would be to analyze the dlls the tests depend on and only rerun those that can be impacted.
- For instance it looks like modifying the ws2_32 source would only impact the secur32, webservices, winhttp, ws2_32 and wsdapi tests, which means running 21 test units instead of the 500+ of the full suite. Also changing the source in a dll with no test and not used anywhere else (e.g. hal.dll) means we could skip running the tests entirely.
- This analysis could be done on the VM right after rebuilding Wine: see which binaries changed, then look for them in the Makefile IMPORTS lines. But this would miss tests and dlls that do a LoadLibrary(). So a more sophisticated analysis may be called for, or we may need to provide the extra dependency information manually somehow (either in the Wine source or in some TestBot table).
- This would mean the unix tasks would determine which test units to run on their own rather than having the TestBot tell them as proposed in the dll tests series. But the change should be minor.
- It's hard to predict how much this would reduce processing time and thus whether it would be sufficient. This really depends on the ratio of low-level header / dll patches versus high-level patches to dlls with no tests. So the only way to see if that works is probably to try it out.
- If not sufficient on its own it can be combined with other options, particularly the load balancing one (o5). It may also be easier to implement than o5, depending on how hard the dependency analysis turns out to be.
Given the above I think it makes sense to start implementing things progressively, from the base series to the compilation one, then the dll tests one and finally the all tests series. Each step will be able to build on the previous one and provide us with new information about the extra TestBot load, how many jobs we really get per hour, whether the test results are reliable, etc.
Should we approach the TestBot limits at any point we'll be able to stop expanding the tests and still have more than what we currently have while we figure out how to proceed further. I also think that no matter what happens and where we stop, the work done will be reused when proceeding further.
Further bells and whistles: (whistles)
w1. The current tasks only do one thing and produce a single log. For some approaches a single task may do a build, run the 32 bit tests, then the 64 bit tests. Having all this go into a single log may be confusing when looking at it on the web site. So it could make sense to create a separate log for each 'subtask'.
- This would require modifying the WineRunXxx.pl and associated build scripts of course, as well as the JobDetails.pl web page so it can show each log; but also the code canceling and restarting Tasks so they clean up the old logs correctly.
- The JobDetails.pl page could have "Show full 32 bit log" and "Show full 64 bit log" links for instance.
- Note also that we already support showing the results of multiple test units from a single log for WineTest. So that aspect should not require any change.
w2. The job submission page lets developers cross a checkbox to run the 64 bit Windows tests in addition to the 32 bit ones. If we do have 3 types of tests on Unix we may want to expand that. This may further require analyzing whether the user picked a Unix VM in order to present the right options.
w3. The current Windows tests reset the test environment entirely between test units. The Unix tests would not. It's not clear that we should. After all WineTest needs to be able to run all the tests with no cleanup between them. Still it may make sense to do a cleanup between bitnesses, at least resetting the WINEPREFIX. But doing a more thorough cleanup would involve at least restarting the X server between tests. That could be bothersome.