Hi, In the past days I've been hacking on implementing my state management ideas, and I think I've come to a state where I don't have to be completely ashamed of my patches :-)
First, what the code does NOT do yet:
* Pixel Shaders, GLSL shaders: I only had my notebook with the M9 available, so I had no chance to implement them. Expect anything from broken graphics to the sudden release of Duke Nukem Forever if you try to use them.
* Stateblocks
* Register combiners: Disabled right now
* Offscreen rendering: Causes random rendering garbage
* 2D Blits: Commented out
I have described the basic ideas in earlier mails (http://www.winehq.org/pipermail/wine-devel/2006-October/051868.html), so I won't describe them here again. I pretty much followed the original plan.
Performance: One of the aims was to get better performance, since we apparently lost performance due to excessive state changes, which eat CPU time and may require CPU-GPU syncs. My patches improve performance, but not as much as I originally hoped. I mainly have performance figures on the M9, and some basic testing on a gf7600.
* Billboard dx8 sdk demo: went from 56 fps to 107 fps :-)
* Half-Life 1: Quite an improvement here too, 110->150 fps in one of my timedemos. The d3d renderer now outperforms the opengl renderer (140 fps). Both the billboard demo and hl1 hit a special rendering case (no stream source or fvf changes), which is nicely optimized by my changes. The gl renderer in hl1 uses immediate mode drawing while wined3d can use VBOs and array drawing, thus being faster on today's cards.
* Battlefield 1942: Slight improvement too, 32->37 fps on my testing scene (spawn point on a U.S. carrier at full graphics). BF1942 already exceeded the usual linux/windows driver performance ratio before, so I assume I'm pretty much at the limit of my M9 here.
* 3DMark2000: Unfortunately my driver crashes it before showing the scores, so I can only watch the in-test counter. It seems to get +5 to +10 fps in the low detail helicopter test (resolution independent). Native msvcrt.dll gets another +5 fps.
I only did a short test on my geforce7600:
* 3dmark2000: gets 11500 3dmarks, 14500 when forcing drawStridedFast. This is, I believe, the windows performance. However, the benchmark is too old to be meaningful. Before my state patches the drawStridedFast score was around 13500 if I remember correctly; I have to retest.
* 3dmark2001: Low detail tests run at 150-300 fps, too fast for a meaningful result. High detail tests are slow and partially broken due to offscreen rendering.
* Battlefield 1942: Runs at a steady 100 fps, but it already did that before.
So it seems that the state patches improve one bottleneck, but we still have many others (offscreen rendering, drawStridedSlow) left. The nvidia profiling driver may help here.
Where to go from here: The state management was also planned to make implementing other features easier:
* Multithreading: Make the dirty states list per context, and the helpers stored in the device too. Before applying the states activate the correct ctx for the thread.
* Stateblocks: Basic idea is to record a display list and call it:
    glNewList(stateblock->listname, GL_COMPILE);
    for(i = 1; i <= STATE_HIGHEST; i++) {
        States[i].func(i, stateblock);
    }
    glEndList();
To apply the stateblock:
    glCallList(stateblock->listname);
Ok, we need to split the list to apply only partial states, and the for loop can be improved to create a more efficient list. When the stateblock is altered we have to recreate the list. That's the basic idea...
* Offscreen rendering: Depends on whether we need separate contexts for pbuffers. If yes, include it with the multithreading ctx finding, then apply the states; otherwise I think we can make selecting the pbuffer/fbo a state like all others. Has interactions with the viewport (I think) and the projection matrix (render_offscreen for upside down rendering)
* sRGB textures: Dirtifies the sampler. All textures now have information about how many samplers they are bound to, and the number of one of the samplers. Phil?
* Vertex samplers: Ivan said he'd need the state management for them. My idea is to build a d3d sampler - gl sampler mapping in SetTexture, which will be needed for register combiners too. Based on that we can bind vtf samplers in gl.
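A minimal sketch of the mechanism the list above builds on: one table of state handlers, one dirty list, and a flush before drawing. This is not the wined3d code. The names sketch_device, sketch_mark_dirty and sketch_apply_dirty and the three example states are hypothetical, and the handlers only count calls instead of issuing GL calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state ids; a real table would cover render states, samplers, etc. */
enum { SKETCH_STATE_FOG = 1, SKETCH_STATE_LIGHTING, SKETCH_STATE_VIEWPORT,
       SKETCH_STATE_HIGHEST = SKETCH_STATE_VIEWPORT };

struct sketch_device {
    unsigned int dirty[SKETCH_STATE_HIGHEST];          /* ids waiting to be applied */
    size_t dirty_count;
    unsigned char is_dirty[SKETCH_STATE_HIGHEST + 1];  /* avoids duplicate entries */
    unsigned int applied[SKETCH_STATE_HIGHEST + 1];    /* how often each handler ran */
};

typedef void (*sketch_apply_fn)(unsigned int state, struct sketch_device *dev);

/* Stand-in for a handler that would issue the GL calls for one state. */
static void sketch_count_apply(unsigned int state, struct sketch_device *dev)
{
    dev->applied[state]++;
}

/* One table for all states, indexed by state id (entry 0 is unused). */
static const sketch_apply_fn sketch_states[SKETCH_STATE_HIGHEST + 1] = {
    NULL, sketch_count_apply, sketch_count_apply, sketch_count_apply,
};

/* Called from the Set* methods: remember the state, do no GL work yet. */
static void sketch_mark_dirty(struct sketch_device *dev, unsigned int state)
{
    assert(state >= 1 && state <= SKETCH_STATE_HIGHEST);
    if (dev->is_dirty[state]) return;   /* already scheduled */
    dev->is_dirty[state] = 1;
    dev->dirty[dev->dirty_count++] = state;
}

/* Called before drawing: run the handler of each dirty state exactly once.
 * Returns the number of states that were applied. */
static size_t sketch_apply_dirty(struct sketch_device *dev)
{
    size_t i, n = dev->dirty_count;
    for (i = 0; i < n; i++) {
        unsigned int state = dev->dirty[i];
        sketch_states[state](state, dev);
        dev->is_dirty[state] = 0;
    }
    dev->dirty_count = 0;
    return n;
}
```

In this sketch, setting the same state several times between two draws costs a single handler call at draw time, which is the kind of redundant state change the patches try to avoid.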
I have no clean patches right now (45 chaotic patches), so I decided to share my wined3d directory. However, even compressed this is a bit big for a mailing list, so I uploaded it to http://stud4.tuwien.ac.at/~e0526822/wined3d-statemgmt.tar.bz2
Stefan
Hi, I can't compile it against current CVS head.
drawprim.o: In function `drawPrimitive':
/usr/src/wine/dlls/wined3d/drawprim.c:1696: undefined reference to `list_move'
collect2: ld returned 1 exit status
winegcc: gcc failed.
make: *** [wined3d.dll.so] Error 2
Mirek
On Monday, 27 November 2006 12:02, you wrote:
> Hi, I can't compile it against current CVS head.
> drawprim.o: In function `drawPrimitive':
> /usr/src/wine/dlls/wined3d/drawprim.c:1696: undefined reference to `list_move'
> collect2: ld returned 1 exit status
> winegcc: gcc failed.
> make: *** [wined3d.dll.so] Error 2
Ah yeah, the list_move thing is outside of wined3d. The attached patch is needed. I think I'll send that patch to wine-patches too.
Warning: the patch causes a full wine recompile because nearly every lib uses list.h.
On 27/11/06, Stefan Dösinger <stefandoesinger@gmx.at> wrote:
> Ah yeah, the list_move thing is outside of wined3d. The attached patch is needed. I think I'll send that patch to wine-patches too.
> Warning: the patch causes a full wine recompile because nearly every lib uses list.h.
Shouldn't you just move the list items to the other list inside the loop?
On Monday, 27 November 2006 13:07, H. Verbeet wrote:
> On 27/11/06, Stefan Dösinger <stefandoesinger@gmx.at> wrote:
> > Ah yeah, the list_move thing is outside of wined3d. The attached patch is needed. I think I'll send that patch to wine-patches too.
> > Warning: the patch causes a full wine recompile because nearly every lib uses list.h.
> Shouldn't you just move the list items to the other list inside the loop?
Yeah, that would work too. The idea was just that adjusting pointers once to move a whole block of elements is faster than adjusting pointers for each element that is moved.
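To illustrate the difference, here is a minimal sketch of moving a whole list with a fixed number of pointer updates, on a circular doubly linked list with a sentinel head. The names sketch_list, sketch_list_add_tail and sketch_list_move_tail are made up for this example; this is not the attached patch and not the list.h API.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list; the head is a sentinel, not an element. */
struct sketch_list { struct sketch_list *next, *prev; };

static void sketch_list_init(struct sketch_list *head)
{
    head->next = head->prev = head;
}

static void sketch_list_add_tail(struct sketch_list *head, struct sketch_list *elem)
{
    elem->next = head;
    elem->prev = head->prev;
    head->prev->next = elem;
    head->prev = elem;
}

/* Append all elements of src to dst and leave src empty.
 * Only the two ends of the moved block are relinked, so the cost
 * does not depend on how many elements src holds. */
static void sketch_list_move_tail(struct sketch_list *dst, struct sketch_list *src)
{
    assert(dst != src);
    if (src->next == src) return;   /* nothing to move */
    src->next->prev = dst->prev;    /* first moved element follows dst's old tail */
    dst->prev->next = src->next;
    src->prev->next = dst;          /* last moved element becomes dst's new tail */
    dst->prev = src->prev;
    sketch_list_init(src);
}

/* Walks the list; used only to check the result. */
static size_t sketch_list_count(const struct sketch_list *head)
{
    size_t n = 0;
    const struct sketch_list *p;
    for (p = head->next; p != head; p = p->next) n++;
    return n;
}
```

Moving the items one at a time inside the loop gives the same final list, but relinks every element instead of only the two ends of the block.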
I had a very brief look at the code, partially because a tarred up directory isn't the most convenient way to spot what has changed and what is still the same.
A few things I noticed:
- markDirty() should probably either be a proper method of the device, or have a prefix
- "States" is a pretty generic name, probably want to prefix that as well with something
- Why are the state_* functions WINAPI?
- "apply" is probably a better name than "func" in StateEntry wrt making clear what it is supposed to do.
I'm still a bit uncertain about having all states together in a single array. And I think a construction like:
    DWORD stage = (state - STATE_TEXTURESTAGE(0, 0)) / WINED3D_HIGHEST_TEXTURE_STATE;
in tex_colorop is quite ugly.
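For readers without the tarball, a sketch of the index arithmetic in question. The macro definitions are not shown in this thread, so the layout and all constants here are assumptions (render states first, then a fixed-size slot per texture stage), and every SKETCH_ name is hypothetical.

```c
#include <assert.h>

/* Assumed layout, not taken from wined3d_private.h. */
#define SKETCH_RENDER_STATE_COUNT   200u
#define SKETCH_TEXTURE_STATE_COUNT   32u   /* states per texture stage */
#define SKETCH_MAX_STAGES             8u

#define SKETCH_STATE_RENDER(num)  (num)
#define SKETCH_STATE_TEXTURESTAGE(stage, num) \
    (SKETCH_RENDER_STATE_COUNT + (stage) * SKETCH_TEXTURE_STATE_COUNT + (num))

/* Recover the stage from a flat state id, as a handler like tex_colorop
 * has to do when all states share one array. */
static unsigned int sketch_stage_of(unsigned int state)
{
    assert(state >= SKETCH_STATE_TEXTURESTAGE(0u, 0u));
    assert(state <  SKETCH_STATE_TEXTURESTAGE(SKETCH_MAX_STAGES, 0u));
    return (state - SKETCH_STATE_TEXTURESTAGE(0u, 0u)) / SKETCH_TEXTURE_STATE_COUNT;
}

/* Recover the per-stage state number from the same id. */
static unsigned int sketch_texstate_of(unsigned int state)
{
    return (state - SKETCH_STATE_TEXTURESTAGE(0u, 0u)) % SKETCH_TEXTURE_STATE_COUNT;
}
```

The division only yields the right stage as long as the per-stage state number stays below the slot size, which is part of what makes the construction fragile.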
On Monday, 27 November 2006 13:04, H. Verbeet wrote:
> I had a very brief look at the code, partially because a tarred up directory isn't the most convenient way to spot what has changed and what is still the same.
I can of course send you my 45 patches too, but they are pretty messy because I changed things regularly and committed stuff that shouldn't be committed / forgot to commit, ... To send them in I have to break them up into smaller patches.
(I have uploaded the tarred patches to http://stud4.tuwien.ac.at/~e0526822/statepatches.tar.bz2) The 0001 patch doesn't really belong there though.
> A few things I noticed:
> - markDirty() should probably either be a proper method of the device, or have a prefix
> - "States" is a pretty generic name, probably want to prefix that as well with something
Agreed
> - Why are the state_* functions WINAPI?
Doh - everything is winapi, but this is changeable of course. Any pros/cons?
> - "apply" is probably a better name than "func" in StateEntry wrt making clear what it is supposed to do.
Yeah. Those things are easily changeable.
> I'm still a bit uncertain about having all states together in a single array. And I think a construction like:
>     DWORD stage = (state - STATE_TEXTURESTAGE(0, 0)) / WINED3D_HIGHEST_TEXTURE_STATE;
> in tex_colorop is quite ugly.
Well, as I said, the idea is to be able to group different kinds of states (e.g. LIGHTING and vertex declaration and vertex shaders). That isn't as strong as I thought it would be at first; a lot of grouping is done in a more "soft" way. For example, misc_vdecl calls misc_streamsrc, and misc_streamsrc checks if STATE_VDECL is scheduled for updating before doing anything. The idea behind this is that the vertex declaration affects the resulting stream sources, but the stream sources don't affect the vdecl.
Beauty is in the eye of the beholder :-) In the end it's up to AJ to decide which way to go. I personally think a single state array and a single dirty list look nicer :-)
Well, what I am a bit concerned about regarding AJ are the STATE_* macros in wined3d_private.h :-|
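A sketch of the "soft" grouping between the vertex declaration and the stream sources described above. It assumes that the stream source handler returns early while the declaration is still dirty, because the declaration handler will load the streams anyway. All sketch_ names are hypothetical, and the stream setup only increments a counter.

```c
#include <assert.h>

enum { SKETCH_VDECL, SKETCH_STREAMSRC, SKETCH_STATE_COUNT };

struct sketch_ctx {
    unsigned char dirty[SKETCH_STATE_COUNT];  /* scheduled for updating */
    unsigned int streamsrc_loads;             /* times the stream setup really ran */
};

/* The expensive part: would set up the vertex arrays from the stream sources.
 * It must only run once the declaration is current. */
static void sketch_load_streams(struct sketch_ctx *ctx)
{
    assert(!ctx->dirty[SKETCH_VDECL]);
    ctx->streamsrc_loads++;
}

/* Stream source handler: does nothing while the vdecl is pending. */
static void sketch_streamsrc(struct sketch_ctx *ctx)
{
    if (ctx->dirty[SKETCH_VDECL]) return;
    sketch_load_streams(ctx);
}

/* Vertex declaration handler: a new declaration changes how the streams
 * are interpreted, so it always reloads them. */
static void sketch_vdecl(struct sketch_ctx *ctx)
{
    ctx->dirty[SKETCH_VDECL] = 0;
    sketch_streamsrc(ctx);
}

/* Apply the pending states of this small group. */
static void sketch_flush(struct sketch_ctx *ctx)
{
    if (ctx->dirty[SKETCH_STREAMSRC]) {
        sketch_streamsrc(ctx);   /* no-op while the vdecl is still dirty */
        ctx->dirty[SKETCH_STREAMSRC] = 0;
    }
    if (ctx->dirty[SKETCH_VDECL])
        sketch_vdecl(ctx);
}
```

With both states dirty, the streams are loaded once instead of twice; the dependency runs in one direction only, from the declaration to the stream sources.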
On 27/11/06, Stefan Dösinger <stefandoesinger@gmx.at> wrote:
> On Monday, 27 November 2006 13:04, H. Verbeet wrote:
> > I had a very brief look at the code, partially because a tarred up directory isn't the most convenient way to spot what has changed and what is still the same.
> I can of course send you my 45 patches too, but they are pretty messy because I changed things regularly and committed stuff that shouldn't be committed / forgot to commit, ... To send them in I have to break them up into smaller patches.
Sure. I do understand the reasons, just noting that I probably didn't spot everything that has changed.
> > - Why are the state_* functions WINAPI?
> Doh - everything is winapi, but this is changeable of course. Any pros/cons?
Just that they're not actually part of any public Windows API, and I see no particular reason to change the default calling convention.
> > I'm still a bit uncertain about having all states together in a single array. And I think a construction like:
> >     DWORD stage = (state - STATE_TEXTURESTAGE(0, 0)) / WINED3D_HIGHEST_TEXTURE_STATE;
> > in tex_colorop is quite ugly.
> Well, as I said, the idea is to be able to group different kinds of states (e.g. LIGHTING and vertex declaration and vertex shaders). That isn't as strong as I thought it would be at first; a lot of grouping is done in a more "soft" way. For example, misc_vdecl calls misc_streamsrc, and misc_streamsrc checks if STATE_VDECL is scheduled for updating before doing anything. The idea behind this is that the vertex declaration affects the resulting stream sources, but the stream sources don't affect the vdecl.
> Beauty is in the eye of the beholder :-) In the end it's up to AJ to decide which way to go. I personally think a single state array and a single dirty list look nicer :-)
> Well, what I am a bit concerned about regarding AJ are the STATE_* macros in wined3d_private.h :-|
I'm not directly opposed to the current setup, it's more that I'm wondering if going for separate tables for related states would perhaps result in cleaner / more flexible code. It would of course allow you to get rid of those macros :-)
On Monday, 27 November 2006 16:20, you wrote:
> I'm not directly opposed to the current setup, it's more that I'm wondering if going for separate tables for related states would perhaps result in cleaner / more flexible code. It would of course allow you to get rid of those macros :-)
Not so sure about that. Now there is one function for dirtifying states (markDirty). With different lists we'd need a number of similar, yet different functions (markRenderStateDirty, markSamplerDirty, ...). In the same way drawPrim would have to check / work down a number of lists instead of one.
All the ugly things about the current setup are in the macros and in wasting some memory for table entries which are not supported by the hardware (like up to 128 samplers in dx10).
If no one vetoes it, I'd go for the single list and see what AJ says. If I have to fall back to separate lists, well ok, that's not much of an issue either.
Stefan Dösinger wrote:
> - sRGB textures: Dirtifies the sampler. All textures now have information about how many samplers they are bound to, and the number of one of the samplers. Phil?
> Stefan
I still have my sRGB patch but haven't been working on it mainly just because I've been busy with work. Thanks for reminding me though, I'll be on IRC this week if I find some time ;-)
Phil