On Mon, 25 Jun 2007, Stefan Dösinger wrote: [...]
One issue I see is actually interpreting the results. When is a performance drop large enough to be a problem? Sometimes a change will slightly reduce performance for some applications, but significantly improve it for others.
That's a good point, and I've been thinking about it a bit. All we can use is some heuristics. Throwing a warning on a single 0.01 fps drop is overkill because there are minor differences between runs, even with exactly the same code. On the other hand, if a number of 0.5 fps drops add up to a drop from 80 to 75 fps, that is something that should probably be investigated, so we should check against a fixed reference value rather than the previous day's result.
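For what it's worth, that fixed-reference check could look something like the Python sketch below; the reference value, the tolerance and the function name are all made up for illustration.

# Hypothetical sketch of the fixed-reference heuristic: REFERENCE_FPS and
# TOLERANCE are made-up numbers, not values from any real test setup.

REFERENCE_FPS = 80.0   # baseline recorded when the reference was last (re)set
TOLERANCE = 0.05       # ignore drops smaller than 5% as run-to-run noise

def check_against_reference(current_fps, reference_fps=REFERENCE_FPS,
                            tolerance=TOLERANCE):
    """Return True if the drop from the fixed reference is large enough
    to be worth investigating."""
    drop = reference_fps - current_fps
    return drop > reference_fps * tolerance

# A single 0.01 fps drop stays below the threshold, but small drops that
# accumulate (e.g. 80 -> 75 fps) eventually exceed it.
print(check_against_reference(79.99))  # False: within noise
print(check_against_reference(75.0))   # True: worth a look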
I don't think we can automatically flag performance drops as regressions / errors.
First, you would have to compare the performance of the current run against that of one of the previous runs. This runs counter to the norm, which is that each run is self-contained and completely independent of the other runs.
Second, there is no guarantee that all users running the nightly tests will have an otherwise idle machine. For instance, I run cxtests every night, but on some days my machine may still be finishing a backup, or recording a movie, or compiling wine with gcc 2.95, etc. Each of those could cause significant performance variations that could be mistaken for a performance bug.
Also, if you only compare one run to the next, you will only detect a performance drop on a single day, which can easily be dismissed as noise for the reasons above, and everything will look normal from then on. So you have to do a more long-term analysis.
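For instance, such an analysis could compare the latest run against a trailing window rather than against just the previous day; the window size, threshold and data layout in this Python sketch are assumptions, not part of any existing tool.

# Hypothetical sketch of a longer-term check: compare the most recent run
# against the median of a trailing window, so a single noisy day neither
# triggers nor hides a report.

from statistics import median

def drop_vs_window(history, window=14, threshold=0.05):
    """history: list of fps results, oldest first.
    Return True if the latest result is more than `threshold` below the
    median of the preceding `window` runs."""
    if len(history) <= window:
        return False            # not enough data yet
    baseline = median(history[-window - 1:-1])
    return history[-1] < baseline * (1.0 - threshold)

# Example: two weeks around 80 fps, then one run at 74 fps.
runs = [80.2, 79.8, 80.1, 79.9, 80.0, 80.3, 79.7,
        80.1, 79.9, 80.2, 80.0, 79.8, 80.1, 80.0, 74.0]
print(drop_vs_window(runs))     # True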
So I think a more promising approach is to just collect data and generate graphs. Then we can have human beings look at the graphs and investigate when something looks weird, like if the performance on 90% of the machines drops around the same date.
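To illustrate the kind of per-date summary a human could then eyeball, here is a rough Python sketch; the data layout and the 5% drop threshold are assumptions, not anything the test framework actually produces.

# Hypothetical aggregation: given per-machine fps results keyed by date,
# report for each date the fraction of machines whose result dropped
# noticeably compared to the previous date.

def drop_fraction_per_date(results, threshold=0.05):
    """results: {machine: {date: fps}}. Return {date: fraction of machines
    that dropped more than `threshold` relative to the previous date}."""
    dates = sorted({d for per_machine in results.values() for d in per_machine})
    fractions = {}
    for prev, cur in zip(dates, dates[1:]):
        machines = [m for m in results if prev in results[m] and cur in results[m]]
        if not machines:
            continue
        dropped = sum(1 for m in machines
                      if results[m][cur] < results[m][prev] * (1.0 - threshold))
        fractions[cur] = dropped / len(machines)
    return fractions

# A date where this fraction approaches 1.0 is the "90% of the machines
# drop around the same date" pattern worth investigating.
sample = {
    "box1": {"2007-06-24": 80.0, "2007-06-25": 74.0},
    "box2": {"2007-06-24": 65.0, "2007-06-25": 60.0},
    "box3": {"2007-06-24": 90.0, "2007-06-25": 89.5},
}
print(drop_fraction_per_date(sample))  # {'2007-06-25': 0.666...}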
You could also write some software to automatically analyze this data on the server, but that seems overkill.