Hi,
I spent a few hours debugging wined3d performance today. No, I found no magic fix for the slowness, just some semi-usable data.
First I wrote a hacky patch to avoid redundant FBO applications. This gave a tiny, tiny performance increase, see http://www.winehq.org/pipermail/wine-devel/2011-April/089832.html.
The main investigation concerned redundant shader applications. The aim was to find out how many of our glBindProgramARB calls re-bind the program that is already bound, and how much this costs. Depending on the game, between 20% and 90% of all BindProgram calls are redundant. I'll attach my debug hack so others can test their own apps. I used ARB shaders for testing because they can apply vertex and fragment programs separately.
This brings up two questions: (a) how much does this cost, and (b) why does this happen?
The costs: In my draw overhead tester, hacking out the redundant apply calls improved performance a lot, from about 101 fps to 157 fps. The biggest part of that is the GL calls themselves: removing just the GL calls but keeping the remaining shader apply logic gets me 144 fps.
Unfortunately this does not translate to any performance gains in real apps. I tried to filter out the redundant apply calls in the simplest way possible: track the current value per wined3d_context and check it before calling glBindProgramARB. This gave the 144 fps in the draw overhead tester, but no measurable increase in any other apps (I tested StarCraft 2, HL2, Team Fortress 2, World in Conflict and a few others).
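For illustration, the filtering boils down to something like this. The struct members and the helper name are made up for this sketch, it is not the actual patch, and it assumes the GL headers and wined3d's GL_EXTCALL macro are available:

/* Sketch only: cache the last program bound on this context and skip the
 * GL call when the same program would be re-bound. The members below are
 * invented for this example. */
struct wined3d_context
{
    /* ... existing members ... */
    GLuint last_arb_vprogram;  /* last GL_VERTEX_PROGRAM_ARB program */
    GLuint last_arb_fprogram;  /* last GL_FRAGMENT_PROGRAM_ARB program */
};

static void context_bind_arb_program(struct wined3d_context *context,
        GLenum target, GLuint program)
{
    GLuint *last = target == GL_VERTEX_PROGRAM_ARB
            ? &context->last_arb_vprogram : &context->last_arb_fprogram;

    if (*last == program)
        return; /* redundant apply, skip the GL call */

    GL_EXTCALL(glBindProgramARB(target, program));
    *last = program;
}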
Given the amount of redundant apply calls and their cost in the draw overhead tester I had expected at least some improvement. Certainly not a 50% performance increase (the draw overhead tester performs no shader changes at all in the draw loop), but at least a 2-3% gain. So far I have no explanation why I didn't see that.
But why do those redundant apply calls happen? It seems like the state dirtification comes all the way from the stream sources and/or vertex declaration. STREAMSRC is linked to VDECL, which is linked to VERTEXSHADER, which in turn reapplies the pixel shader. This means redundant vertex and pixel shader applications. Separating those states will be a major challenge.
The vdecl<->vshader link shouldn't be needed any more, except in rare cases where GL_ARB_vertex_array_bgra is not supported and the application switches one attribute from D3DDECL_D3DCOLOR to a non-d3dcolor attribute. If the vertex shader changes we still have to reparse the vertex declaration and reapply the stream sources because the vshader determines the stream numbers. Maybe we can reduce the number of times this happens by ordering stream usages and indices to make sure shaders with compatible input get the same stream ordering.
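A deterministic ordering could look something like this; just a sketch with invented types, not actual wined3d code:

/* Sketch: sort the declaration elements by (usage, usage_idx) before
 * assigning attribute/stream numbers, so shaders with compatible input
 * end up with the same numbering. Types are invented for illustration. */
#include <stdlib.h>

struct decl_element
{
    unsigned int usage;           /* position, normal, texcoord, ... */
    unsigned int usage_idx;       /* e.g. which texcoord */
    unsigned int assigned_attrib; /* filled in below */
};

static int compare_elements(const void *a, const void *b)
{
    const struct decl_element *ea = a, *eb = b;

    if (ea->usage != eb->usage)
        return ea->usage < eb->usage ? -1 : 1;
    if (ea->usage_idx != eb->usage_idx)
        return ea->usage_idx < eb->usage_idx ? -1 : 1;
    return 0;
}

static void assign_attrib_numbers(struct decl_element *elements, size_t count)
{
    size_t i;

    qsort(elements, count, sizeof(*elements), compare_elements);
    for (i = 0; i < count; ++i)
        elements[i].assigned_attrib = (unsigned int)i;
}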
vdecl and streamsrc are pretty related. If the vdecl is changed we have to reapply the stream sources. The other way around shouldn't cause problems though: if a stream source changes, there's no need to reapply any stream except the changed ones, and no need to reapply the vertex shader.
The vertex and pixel shader are linked for a few reasons: The shader backend API offers only a function to set both. Basic GLSL only offers a function to set both at once (GL_ARB_separate_shader_objects changes that). And even in ARB the pixel shader input may require some changes in the vertex shader output to get Shader Model 3.0 varyings right.
The shader backend API can be changed, but it has to be done in a way that doesn't hurt GLSL without ARB_separate_shader_objects. If we have classic GLSL we have to keep the link. With ARB we can conditionally reapply the vertex shader if the ps_input_signature is changed.
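As a rough sketch of that ARB-side idea (the types and names are invented here, this is not how the state table actually looks):

/* Sketch with made-up types: only flag the vertex shader for reapplication
 * when the pixel shader's input signature actually changed. Classic GLSL
 * (no GL_ARB_separate_shader_objects) still has to relink both shaders,
 * so this would be ARB-only. */
#include <stdbool.h>
#include <string.h>

struct ps_input_signature
{
    unsigned int element_count;
    unsigned int usage[16];      /* simplified: usage per varying slot */
    unsigned int usage_idx[16];
};

struct context
{
    struct ps_input_signature applied_vs_sig; /* signature the current VS was fixed up for */
    bool vshader_dirty;
};

static bool signatures_equal(const struct ps_input_signature *a,
        const struct ps_input_signature *b)
{
    return a->element_count == b->element_count
            && !memcmp(a->usage, b->usage, sizeof(a->usage))
            && !memcmp(a->usage_idx, b->usage_idx, sizeof(a->usage_idx));
}

/* Called when a new pixel shader is applied. */
static void apply_pixel_shader(struct context *ctx,
        const struct ps_input_signature *new_sig)
{
    if (!signatures_equal(new_sig, &ctx->applied_vs_sig))
    {
        /* The VS output has to be fixed up for the new PS input, so the
         * vertex shader really does need a reapply. */
        ctx->vshader_dirty = true;
        ctx->applied_vs_sig = *new_sig;
    }
    /* Otherwise: bind only the fragment program, leave the VS alone. */
}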
To complicate matters there are additional states that affect the shaders, like fog, textures, clipping. We don't keep track of those dependencies.
So it's a lot of work to clean up these state dependencies and we don't know how much it'll gain us :-(
Stefan
Hi,
Here's another update.
First I expanded my performance tests at https://84.112.174.163/~git/perftest a bit. The old tests were renamed to streamsrc_d3d and streamsrc_gl, and I added another set of tests that just test the draw overhead without ever changing any states: drawprim_d3d and drawprim_gl. Here are the performance results from Windows 7:
drawprim_gl: ~1154 fps
drawprim_d3d: ~1160 fps
In Wine the D3D version gets 165.67 fps. The Linux native GL version gets 1791 fps. The Windows GL version running in Wine gets about 600 fps (FIXME!). Don't worry too much about the GL performance, this is mostly locking overhead. More about that later.
I ran my usual d3d performance hacks through the d3d version. The hacks are pretty much the same as with the streamsrc test, except that I don't need the redundant vertex shader apply hacks. I attached a tarball with the hacks and a file listing their performance impact.
The plan forward is still the same: write more of those tests (especially tests that cover non-draw stuff like resource loads), improve the tests, and hope that real apps profit.
The optimistic scenario is that this works out. So far we've seen slow movement in real apps with the two fixes we've made (context_validate and FBO application, the latter isn't in Wine yet). This is expected to a certain extent, because the performance is inversely proportional to the number of performance bugs we have. So we'll have to remove a lot of them before we see big movement.
The pessimistic scenario is that those tests have nothing in common with the performance bugs in real apps and the fixes only end up making the code more complex.
To that end I think I'll create a github repo where I try to get the hacks into a somewhat usable state - not committable to Wine, but good enough that they don't break apps, so they can be tested against real-world apps. That way we can find out how much they really improve real games without cluttering our codebase with changes we're not certain actually help.
Here are again some descriptions of the hacks I tested:
2) End-user business, fairly harmless. Should always be used if performance is important
3, 4) Will break stuff. Can be fixed, but would be rather ugly. Probably interesting once we run out of easier fixes
5) Could go into Wine sooner or later. Does improve real games on its own already
6) Easy to clean up, I'll send a patch today. We can skip validation if FIXMEs are off since nobody will see them.
7) I tried to find out if removing one call level helps, but it doesn't even improve this locking overhead sensitive test app. Forget about it
8) Doable, but pretty uninteresting. I doubt we'll get a noticeable improvement in a real app
9-11) Distributor / end-user choice. Note that some compiler flags (especially the frame pointer one) can break apps and copy protection systems.
12) Distributor / end-user choice too, but harmless. Not much gain compared to WINEDEBUG=-all though.
13) Doesn't improve performance a whole lot once debug msgs are compiled out.
14) We should be able to limit calls to this function to cases where the textures were changed or vertex texture fetch is used. We may be able to eliminate it entirely once we have enough samplers available.
15, 16) I caution against too much optimism here. We won't be able to get rid of the locking anytime soon. Maybe the EnterCriticalSection / LeaveCriticalSection performance can be improved. A part of the problem is call overhead, but I think the biggest issue is the locked increment and decrement operations in RtlEnterCriticalSection / RtlLeaveCriticalSection (see the sketch after this list). Just to give you some idea where the time is spent:
Original performance: 178 fps
Interlocked ops replaced with normal inc/dec: 244 fps
Lock calls removed from wined3d: 293 fps
17) Forget about this one until we run out of other optimizations
18) It's interesting how much this gives without all the other optimizations. My app doesn't use any textures, so this is just the call overhead and looping over the fragment samplers.
19) My app renders to a window that is too small, so swapchain render_to_fbo triggers. It's interesting that getting rid of it makes performance worse.
21) Removing that and other checks in drawPrimitive() barely speeds up the test. I got a total of 7-8 fps out of the compatibility and error checks in drawPrimitive; this won't show up in any real app.
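Here's the sketch I mentioned for 15, 16): a standalone toy program (GCC builtins, nothing Wine-specific, numbers are machine dependent) that just shows the gap between a locked increment and a plain one. It's only meant to illustrate where the cycles go, not to say anything about the real critical section code.

/* Toy benchmark sketch: compare a LOCK-prefixed increment (roughly what the
 * critical section code does on its lock count) against a plain increment.
 * Build with gcc -O2 (add -lrt on older glibc). */
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000L

int main(void)
{
    volatile long plain = 0;
    long locked = 0;
    struct timespec t0, t1, t2;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERATIONS; ++i)
        __sync_fetch_and_add(&locked, 1);   /* locked increment */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < ITERATIONS; ++i)
        ++plain;                            /* ordinary increment */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("locked: %.3fs, plain: %.3fs\n",
            (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
            (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9);
    return 0;
}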
Stefan
Hi Stefan,
What do you think about using inline spinlocks (in asm code maybe) to implement the locks? Clearly an optimized spinlock would mean different code for different compilers/architectures, but shouldn't it be the best solution? For reference, I once commented out the GL locks to see how StarCraft 2 performs, but it crashed straight away.
What do you reckon?
Cheers,
P.S. Keep up the fantastic work! :-)
On Saturday 30 April 2011 18:26:04 Emanuele Oriani wrote:
Hi Stefan,
What do you think about using inline spinlocks (in asm code maybe) to implement locks? Clearly an optimized spinlock would mean different code for different compilers/architectures, but shouldn't it be the best solution?
I am usually pessimistic about hand-written assembler optimizations. You can give it a try, but compilers are pretty clever these days.
I think trying to optimize the lock calls is a more promising way. We can't simply drop the ENTER_GL/LEAVE_GL calls, as you found out in SC2. We may be able to reduce the number of those calls by moving blocks of opengl calls closer together.
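To illustrate what I mean by moving the calls closer together, roughly this kind of restructuring; a pthread mutex stands in for the tsx11 lock here and do_gl_call() for arbitrary GL calls, so it only shows the shape of the change, not real wined3d code:

/* Self-contained illustration of the batching idea: take the lock once
 * around a group of calls instead of once per call. */
#include <pthread.h>

static pthread_mutex_t gl_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_gl_call(int i) { (void)i; /* placeholder for a GL call */ }

/* Before: lock taken and released around every single call. */
static void per_call_locking(void)
{
    int i;
    for (i = 0; i < 8; ++i)
    {
        pthread_mutex_lock(&gl_lock);
        do_gl_call(i);
        pthread_mutex_unlock(&gl_lock);
    }
}

/* After: the calls are grouped so the lock is taken once per block. */
static void batched_locking(void)
{
    int i;
    pthread_mutex_lock(&gl_lock);
    for (i = 0; i < 8; ++i)
        do_gl_call(i);
    pthread_mutex_unlock(&gl_lock);
}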
There's also the wined3d lock, which is somewhat like the big kernel lock. There's room for improvement there as well, if we soften the "you must call wined3d under lock" rule. However the wined3d lock is the smaller problem compared to the X11 lock.
Indeed, I've written a spinlock with a GCC extension and replaced the EnterCriticalSection in the x11 drv file. The lock has to be recursive though, so I implemented a quick (but incorrect) recursive spinlock just to be able to run SC2, and the difference was barely noticeable. The biggest issue imho is that in this case we have to call a function... it would be great to inline all that code, but again, probably the best thing is to limit the number of calls. I can try a spinlock for the BKL-like wined3d lock. I hope this doesn't have to be recursive, right? I'm asking because with a recursive lock I'm performing an extra syscall:
static volatile pid_t x11_lock = 0;
static volatile int x11_lock_cnt = 0;

/***********************************************************************
 *              wine_tsx11_lock   (X11DRV.@)
 */
void CDECL wine_tsx11_lock(void)
{
    /* This might be expensive! I don't like recursive locks for this reason! */
    pid_t th_id = syscall(SYS_gettid);

    while (th_id != __sync_val_compare_and_swap(&x11_lock, 0, th_id));
    ++x11_lock_cnt;
    asm volatile("lfence" ::: "memory");
}

/***********************************************************************
 *              wine_tsx11_unlock   (X11DRV.@)
 */
void CDECL wine_tsx11_unlock(void)
{
    if (!--x11_lock_cnt)
        x11_lock = 0;
    asm volatile("sfence" ::: "memory");
}
Please keep in mind this is test code, but apparently it's working. Again, the performance gain in SC2 isn't much... but maybe I should test more thoroughly / with other games?
Let me know, Cheers,
On Sunday 01 May 2011 14:34:53 Emanuele Oriani wrote:
Indeed, I've written a spinlock with a GCC extension and replaced the EnterCriticalSection in the x11 drv file. The lock has to be recursive though, so I implemented a quick (but incorrect) recursive spinlock just to be able to run SC2, and the difference was barely noticeable.
How much was the difference?
The biggest issue imho is that in this case we have to call a function...
I don't think so. I did some tests for the call overhead, and it is fairly small. Specifically, I tried to export the wined3d lock from wined3d and call EnterCriticalSection / LeaveCriticalSection directly from d3d9. The difference wasn't even measurable with my hyper-sensitive self-written test apps.
I can try a spinlock for the BKL-like wined3d lock. I hope this doesn't have to be recursive, right? I'm asking because with a recursive lock I'm performing an extra syscall:
The wined3d lock doesn't have to be recursive I think. But note that the chances of getting those changes committed into Wine are next to zero. It's more likely to get an optimization of EnterCriticalSection / LeaveCriticalSection itself into Wine.
Please keep in mind this is test code, but apparently it's working. Again, the performance gain in SC2 isn't much... but maybe I should test more thoroughly / with other games?
No, as I explained in my mail the individual optimizations don't magically fix all the performance woes we have. We'll probably have to collect a dozen or more such little fixes to start seeing movement.
Hi, let me agree with you... there might be one or two fixes somewhere that will improve performance, but it seems that if "early optimization is the root of all evil" was true a while ago, we're now at a point where all the small optimizations should take place. To confirm: the SC2 performance difference with the D3D lock removed and the x11 spinlock in place, compared to current Wine (1.3.19), is basically not noticeable. On top of this, be aware that the D3D lock apparently IS recursive (with the non-recursive spinlock one CPU was going at 100% and the application was stuck).
For example, one main issue with SC2 is that if you don't set the affinity of all threads to one core, the game runs about 40% slower; for this reason I think this game suffers from issues related to other Wine components (besides D3D) and/or the Linux scheduler. Another main issue with SC2 is that setting a different shader level heavily impacts the game, and I'm running on a 470 GTX: this shouldn't be the case. Should we perhaps be looking into the way we generate shaders (both GLSlang and ARB ones)? Another point: what about how we handle offscreen buffers, FBOs etc.?
If I remember correctly someone was working on a worker thread for D3D. Did we abandon this project? Given that OpenGL doesn't (didn't?) support calls from multiple threads, should we be proceeding down this route?
Cheers,
On 2 May 2011 13:20, Emanuele Oriani <emaentra@ngi.it> wrote:
Hi, let me agree with you... there might be one or two fixes somewhere that will improve performance, but it seems that if "early optimization is the root of all evil" was true a while ago, we're now at a point where all the small optimizations should take place.
There's a difference between "early" and "premature", but I guess it's subtle. I don't agree with the concept of "Anything goes as long as it improves performance for someone, somewhere."
For example, one main issue with SC2 is that if you don't set the affinity of all threads to one core, the game runs about 40% slower; for this reason I think this game suffers from issues related to other Wine components (besides D3D) and/or the Linux scheduler. Another main issue with SC2 is that setting a different shader level heavily impacts the game, and I'm running on a 470 GTX: this shouldn't be the case. Should we perhaps be looking into the way we generate shaders (both GLSlang and ARB ones)? Another point: what about how we handle offscreen buffers, FBOs etc.?
Most of it really does come down to careful debugging / benchmarking, I'm afraid. There's plenty of speculation, but at some point someone just has to do the actual work.
If I remember correctly someone was working on a worker thread for D3D. Did we abandon this project?
Not as such, but to my knowledge nobody is currently actively working on that. This is also one of those things that's a lot harder than it looks.
On Saturday 30 April 2011 17:18:54 Stefan Dösinger wrote:
9-11) Distributor / end-user choice. Note that some compiler flags (especially the frame pointer one) can break apps and copy protection systems.
Forget about -O3. I can't get any Windows game working with that. Apparently I am already lucky that I can get winecfg up.