Hi,
I spent a few hours debugging wined3d performance today. No, I found no magic fix for the slowness, just some semi-usable data.
First I wrote a hacky patch to avoid redundant FBO applications. This gave a tiny, tiny performance increase, see http://www.winehq.org/pipermail/wine-devel/2011-April/089832.html.
The main investigation concerned redundant shader applications. The aim was to find out how many of our glBindProgramARB calls re-bind the program that is already bound, and how much this costs. Depending on the game, between 20% and 90% of all BindProgram calls are redundant. I'll attach my debug hack so others can test their own apps. I used ARB shaders for testing because they can apply vertex and fragment programs separately.
This brings up two questions: (a) How much does this cost? (b) Why does this happen?
The costs: In my draw overhead tester, hacking out the redundant apply calls improved performance a lot, from about 101 fps to 157 fps. The biggest part of that is the GL calls themselves: with the GL calls removed but the remaining shader logic still in place I get 144 fps.
Unfortunately this does not translate to any performance gains in real apps. I tried to filter out the redundant apply calls in the simplest way possible: track the current value per wined3d_context and check it before calling glBindProgramARB. This gave the 144 fps in the draw overhead tester, but no measurable increase in any other app (I tested StarCraft 2, HL2, Team Fortress 2, World in Conflict and a few others).
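For reference, the filtering idea looks roughly like the sketch below. This is not the actual patch: struct sketch_context, context_bind_program and fake_gl_bind_program are made-up names, and the GL call is replaced by a counter so the snippet builds without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real wined3d structures or GL enums. */
#define PROGRAM_TARGET_VERTEX   0
#define PROGRAM_TARGET_FRAGMENT 1
#define PROGRAM_TARGET_COUNT    2

struct sketch_context
{
    unsigned int bound_program[PROGRAM_TARGET_COUNT]; /* 0 = nothing bound yet */
    unsigned int gl_bind_calls;  /* binds that actually reached the "driver" */
    unsigned int skipped_binds;  /* binds filtered out as redundant */
};

/* Stands in for glBindProgramARB(target, id); it only counts the call. */
static void fake_gl_bind_program(struct sketch_context *ctx,
        unsigned int target, unsigned int id)
{
    (void)target;
    (void)id;
    ++ctx->gl_bind_calls;
}

/* Bind a program unless the same id is already bound to that target
 * in this context. */
static void context_bind_program(struct sketch_context *ctx,
        unsigned int target, unsigned int id)
{
    assert(ctx != NULL && target < PROGRAM_TARGET_COUNT);

    if (ctx->bound_program[target] == id)
    {
        ++ctx->skipped_binds;
        return;
    }

    fake_gl_bind_program(ctx, target, id);
    ctx->bound_program[target] = id;
}
```

The cache has to live in the context because the bound program is per-GL-context state. Note that this only removes the GL call; the shader selection logic that runs before it is untouched.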
Given the number of redundant apply calls and their cost in the draw overhead tester, I would have expected at least some improvement. Certainly not a 50% performance increase (the draw overhead tester performs no shader changes at all in the draw loop), but at least a 2-3% gain. So far I have no explanation why I didn't see that.
But why do those redundant apply calls happen? It seems like the state dirtification comes all the way from the stream sources and/or vertex declaration. STREAMSRC is linked to VDECL, which is linked to VERTEXSHADER, which in turn reapplies the pixel shader. This means redundant vertex and pixel shader applications. Separating those states will be a major challenge.
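To make the chain concrete, here is a toy model of it. The enum, the linked_to table and mark_dirty are invented for illustration and are not how the wined3d state table is actually laid out; the sketch only shows that dirtying the stream source ends up dirtying everything down to the pixel shader.

```c
#include <stdbool.h>

/* Hypothetical state ids for this sketch only. */
enum sketch_state
{
    SKETCH_STREAMSRC,
    SKETCH_VDECL,
    SKETCH_VERTEXSHADER,
    SKETCH_PIXELSHADER,
    SKETCH_STATE_COUNT
};

/* linked_to[s] is the state that gets reapplied along with s;
 * -1 terminates the chain. */
static const int linked_to[SKETCH_STATE_COUNT] =
{
    [SKETCH_STREAMSRC]    = SKETCH_VDECL,
    [SKETCH_VDECL]        = SKETCH_VERTEXSHADER,
    [SKETCH_VERTEXSHADER] = SKETCH_PIXELSHADER,
    [SKETCH_PIXELSHADER]  = -1,
};

/* Mark a state and everything it is linked to as dirty.
 * Returns how many states became dirty because of this call. */
static unsigned int mark_dirty(bool dirty[SKETCH_STATE_COUNT], enum sketch_state state)
{
    unsigned int newly_dirty = 0;
    int cur = state;

    while (cur >= 0)
    {
        if (!dirty[cur])
        {
            dirty[cur] = true;
            ++newly_dirty;
        }
        cur = linked_to[cur];
    }

    return newly_dirty;
}
```

In this model a stream source change dirties four states, while a pixel shader change dirties one. Separating the states amounts to cutting entries out of that table without breaking the cases that really need them.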
The vdecl<->vshader link shouldn't be needed any more, except in rare cases where GL_ARB_vertex_array_bgra is not supported and the application switches one attribute from D3DDECL_D3DCOLOR to a non-d3dcolor attribute. If the vertex shader changes we still have to reparse the vertex declaration and reapply the stream sources because the vshader determines the stream numbers. Maybe we can reduce the number of times this happens by ordering stream usages and indices to make sure shaders with compatible input get the same stream ordering.
vdecl and streamsrc are closely related. If the vdecl is changed we have to reapply the stream sources. The other way around shouldn't cause problems though: when a stream source changes, only the changed streams need to be reapplied, and there's no need to reapply the vertex shader.
The vertex and pixel shader are linked for a few reasons: The shader backend API offers only a function to set both. Basic GLSL only offers a function to set both at once (GL_ARB_separate_shader_objects changes that). And even in ARB the pixel shader input may require some changes in the vertex shader output to get Shader Model 3.0 varyings right.
The shader backend API can be changed, but it has to be done in a way that doesn't hurt GLSL without ARB_separate_shader_objects. If we have classic GLSL we have to keep the link. With ARB we can conditionally reapply the vertex shader if the ps_input_signature is changed.
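The decision could look something like the sketch below. Everything in it is an assumption: struct sketch_ps, SKETCH_MAX_INPUTS and vs_needs_reapply are invented names, and the real ps_input_signature is not a plain byte array. It only illustrates the rule "classic GLSL always relinks, separately bindable shaders only reapply the vertex shader when the pixel shader input changes".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_INPUTS 12 /* arbitrary size for this sketch */

/* Hypothetical pixel shader record; the byte array merely stands in
 * for the input signature. */
struct sketch_ps
{
    unsigned char input_signature[SKETCH_MAX_INPUTS];
};

/* Decide whether a pixel shader change forces the vertex shader to be
 * reapplied. separate_programs is true when vertex and fragment stages
 * can be bound independently (ARB programs in this model). */
static bool vs_needs_reapply(const struct sketch_ps *old_ps,
        const struct sketch_ps *new_ps, bool separate_programs)
{
    /* Classic GLSL: both stages live in one program object, keep the link. */
    if (!separate_programs)
        return true;

    /* Same shader object: nothing changed on the pixel shader side. */
    if (old_ps == new_ps)
        return false;

    /* Going from or to "no pixel shader": be conservative. */
    if (old_ps == NULL || new_ps == NULL)
        return true;

    /* Only a changed input signature affects the vertex shader output. */
    return memcmp(old_ps->input_signature, new_ps->input_signature,
            sizeof(old_ps->input_signature)) != 0;
}
```

The conservative NULL case is my own choice for the sketch; whether switching to or from fixed-function fragment processing really needs a vertex shader reapply is a separate question.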
To complicate matters, there are additional states that affect the shaders, like fog, textures and clipping. We don't keep track of those dependencies.
So it's a lot of work to clean up these state dependencies and we don't know how much it'll gain us :-(
Stefan