Hi,

In the past few days I've been actively hacking on automated performance testing. I have test scripts for Half-Life 2, 3DMark2000 and 3DMark2001, plus a small z-buffer test app. I am playing around with UT2004, and I want to take a look at plain old Unreal Tournament. I have the tests running on my old laptop and my AMD64 box, and I want to extend that to my Mac too. To collect some experience I've built a quick and dirty server "app", consisting of 200 lines of PHP, to get a better idea of what I want.
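The server side really just boils down to "receive a number, append it to a file". A minimal sketch of the idea (not my actual code; the script name and the machine/test/result field names are made up for illustration):

<?php
// submit.php - sketch of a result collector. The script name and the
// machine/test/result POST fields are hypothetical, for illustration only.
$machine = basename($_POST['machine'] ?? '');  // basename() keeps paths sane
$test    = basename($_POST['test'] ?? '');
$result  = $_POST['result'] ?? '';

if ($machine === '' || $test === '' || !is_numeric($result)) {
    exit("bad request\n");
}

// Append one "date result" line per run; the graphing side plots this later.
$dir = "$machine/$test";
if (!is_dir($dir)) {
    mkdir($dir, 0755, true);
}
file_put_contents("$dir/results.dat",
                  date('Y-m-d') . " $result\n", FILE_APPEND);
echo "ok\n";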
Failed benchmark attempts are Half-Life 1 and the Codecreatures engine benchmark. Both failed because I did not find a way to write the results to a file. In HL1 this may work by selecting a bit of text with the right mouse coordinates and copying it to the clipboard. With Codecreatures we're out of luck unless we use OCR. Final Fantasy XI has the same problem, but I knew that already.
My server-side graphs can be seen here:
http://84.112.174.163/~stefan/
The amd64/ and laptop/ folders contain a folder for each test, each of which holds a result.php that calls gnuplot to generate graphs and builds a formatted table. It's all highly inflexible and not meant as a permanent solution: adding a new test or a new computer means major copy/paste and find-and-replace work.
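For the curious, such a result.php amounts to roughly the following sketch (the file names, the "date result" data format and the plot settings are assumptions, not my real script):

<?php
// result.php - sketch: plot one test's results with gnuplot, print a table.
// Assumes a results.dat with one "date result" line per run (hypothetical).
$data = 'results.dat';
$png  = 'results.png';

// Feed a throwaway plot script to gnuplot on stdin.
$plot = "set terminal png\n"
      . "set output '$png'\n"
      . "set xdata time\n"
      . "set timefmt '%Y-%m-%d'\n"
      . "plot '$data' using 1:2 with linespoints title 'result'\n";
$gp = popen('gnuplot', 'w');
fwrite($gp, $plot);
pclose($gp);

echo "<img src='$png'><table border='1'><tr><th>date</th><th>result</th></tr>";
foreach (file($data) as $line) {
    list($date, $result) = explode(' ', trim($line));
    echo "<tr><td>$date</td><td>$result</td></tr>";
}
echo "</table>";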
Some problems I've come across: recording cxtest scripts just does not work for games that need 3D in VNC, so I've written the scripts by hand. Games also largely resist control via the wait_window script. Some games create their only window when they are started, and then take another X seconds to really start up and react to clicks or keypresses. On top of that, wait_window causes a one-second X server freeze every few seconds, which invalidates benchmark results. For this reason I have put a sleep into the 3DMark2000 test script to keep cxtest quiet while the benchmark is running.
Another problem is that performance can differ quite a bit from run to run with the same Wine version. I'm not sure what causes this, but I will try running the tests with everything else shut down. Failing that, we can still run each test repeatedly and average the results.
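If it comes to that, averaging is easy to bolt on server side. A sketch, again assuming the same made-up one-"date result"-pair-per-line format as above:

<?php
// average.php - sketch: average repeated runs per day. Assumes results.dat
// holds "date result" lines, with several lines per date (hypothetical).
$sums   = [];
$counts = [];
foreach (file('results.dat') as $line) {
    list($date, $result) = explode(' ', trim($line));
    $sums[$date]   = ($sums[$date] ?? 0) + (float)$result;
    $counts[$date] = ($counts[$date] ?? 0) + 1;
}
// One averaged "date result" line per date, ready for gnuplot.
foreach ($sums as $date => $sum) {
    printf("%s %.2f\n", $date, $sum / $counts[$date]);
}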