Hi, Here's another update.
First, I expanded my performance tests at https://84.112.174.163/~git/perftest a bit. The old tests were renamed to streamsrc_d3d and streamsrc_gl, and I added another set of tests that just measure the draw overhead without ever changing any states: drawprim_d3d and drawprim_gl. Here are the performance results from Windows 7:
drawprim_gl: ~1154 fps
drawprim_d3d: ~1160 fps
In Wine the D3D version gets 165.67 fps. The Linux native GL version gets 1791 fps. The GL Windows version in Wine gets about 600 fps (FIXME!). Don't worry too much about the GL performance, this is mostly locking overhead. More about that later.
I ran my usual d3d performance hacks through the d3d version. The hacks are pretty much the same as with the streamsrc test, except that I don't need the redundant vertex shader apply hacks. I attached a tarball with the hacks and a file listing their performance impact.
The plan forward is still the same: Write more of those tests (especially tests that cover non-draw stuff like resource loads), improve the tests and hope that real apps profit.
The optimistic scenario is that this works out. So far we've seen slow movement in real apps with the two fixes we've made (context_validate and FBO application, the latter isn't in Wine yet). This is expected to a certain extent, because the performance is inversely proportional to the number of performance bugs we have. So we'll have to remove a lot of them before we see big movement.
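To make the "inversely proportional" argument concrete, here is a toy model. It is only a sketch under made-up assumptions: every frame costs a fixed base time plus an equal extra cost for each remaining performance bug. The function names and all numbers are hypothetical, not measurements.

```c
#include <stdio.h>

/* Hypothetical model: a frame takes base_ms of unavoidable work plus
 * bug_ms for every performance bug that is still present. */
static double model_fps(double base_ms, double bug_ms, int bugs_left)
{
    return 1000.0 / (base_ms + bug_ms * bugs_left);
}

/* Print the modelled frame rate while the bugs are fixed one by one. */
static void print_model(double base_ms, double bug_ms, int bugs)
{
    for (int left = bugs; left >= 0; left--)
        printf("%2d bugs left: %7.1f fps\n", left, model_fps(base_ms, bug_ms, left));
}
```

Calling print_model(1.0, 1.0, 9) shows the effect: fixing the first of nine bugs moves the model from 100 fps to about 111 fps, while fixing the last one moves it from 500 fps to 1000 fps. Each fix saves the same time per frame, but the fps gain only becomes large once most of the bugs are gone.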
The pessimistic scenario is that those tests have nothing in common with the performance bugs in real apps and the fixes only end up making the code more complex.
To that end I think I'll create a GitHub repo where I try to get the hacks into a somewhat usable state - not committable to Wine, but good enough that they don't break apps, so they can be tested against real-world apps. That way we can find out how much they really improve real games without clogging our codebase with changes we aren't certain will help.
Here are again some descriptions of the hacks I tested:
2) End-user business, fairly harmless. Should always be used if performance is important
3, 4) Will break stuff. Can be fixed, but would be rather ugly. Probably interesting once we run out of easier fixes
5) Could go into Wine sooner or later. Does improve real games on its own already
6) Easy to clean up, I'll send a patch today. We can skip validation if FIXMEs are off since nobody will see them.
7) I tried to find out if removing one call level helps, but it doesn't even improve this locking overhead sensitive test app. Forget about it
8) Doable, but pretty uninteresting. I doubt we'll get a noticeable improvement in a real app
9-11) Distributor / End user choice. Note that some compiler flags (especially the framepointer one) can break apps and copy protection systems.
12) Distributor / End user choice too, but harmless. Not much gain compared to WINEDEBUG=-all though
13) Doesn't improve performance a whole lot once debug msgs are compiled out.
14) We should be able to limit calls to this function to cases where the textures were changed or vertex texture fetch is used. We may be able to eliminate it entirely when we have enough samplers available
15, 16) I caution against too much optimism here. We won't be able to get rid of the locking anytime soon. Maybe the EnterCriticalSection / LeaveCriticalSection performance can be improved. A part of the problem is call overhead, but I think the biggest issue is the locked increment and decrement operations in RtlEnterCriticalSection / RtlLeaveCriticalSection.
Orig performance: 178 fps
Interlocked ops replaced with normal inc/dec: 244 fps
Lock calls removed from wined3d: 293 fps
(this is just to give you some idea where the time is spent)
17) Forget about this one until we run out of other optimizations
18) It's interesting how much this gives without all the other optimizations. My app doesn't use any textures, so this is just the call overhead and looping over the fragment samplers.
19) My app renders to a window that is too small, so swapchain render_to_fbo triggers. It's interesting that getting rid of it makes performance worse
21) Removing that and other checks in drawPrimitive() barely speeds up the test. I got a total of 7-8 fps out of the compatibility or error checks in drawPrimitive, this won't show up in any real app.
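The three fps figures quoted under 15, 16 are easier to compare as per-frame times. The sketch below does that conversion. It assumes the three runs are directly comparable and that removing the lock calls also removes the interlocked operations inside them; the helper names are made up for illustration.

```c
#include <stdio.h>

/* Convert a frame rate into the time spent on one frame, in milliseconds. */
static double frame_ms(double fps)
{
    return 1000.0 / fps;
}

/* Per-frame time saved when the frame rate goes from fps_before to fps_after. */
static double saved_ms(double fps_before, double fps_after)
{
    return frame_ms(fps_before) - frame_ms(fps_after);
}

/* Break down the locking cost using the three measurements from 15, 16. */
static void print_lock_breakdown(void)
{
    const double orig = 178.0, no_interlocked = 244.0, no_locks = 293.0;
    double total = saved_ms(orig, no_locks);             /* all locking in wined3d */
    double interlocked = saved_ms(orig, no_interlocked); /* locked inc/dec only */

    printf("frame time, original:    %.2f ms\n", frame_ms(orig));
    printf("all locking:             %.2f ms\n", total);
    printf("interlocked inc/dec:     %.2f ms (%.0f%% of locking)\n",
           interlocked, 100.0 * interlocked / total);
    printf("remaining lock overhead: %.2f ms\n", total - interlocked);
}
```

Under those assumptions the original frame takes about 5.6 ms, of which roughly 2.2 ms is locking. About 1.5 ms of that, roughly two thirds, comes from the interlocked operations, and the remaining 0.7 ms or so is call overhead and the rest of the lock code.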
Stefan