Hi, This is intended mostly for the other d3d developers, but since we have quite a number of them now so individual CCs are a lot of work :-)
I attached the patches I currently have in my tree to give an update on what I've been working on recently. The main aim of those patches is to reduce draw overhead a bit, thus improving game performance. The patches need some cleanup, but for that I first need a patch Matteo is working on.
Feedback is welcome. I'm also interested in test results, e.g. if the changes break a game, or the performance impact. If those patches cause a 5% performance increase I am happy.
Patches 1-3: Mostly unrelated. I haven't sent them yet because patch 3 breaks Unigine Heaven, and patches 1 and 2 make little sense without 3.
Patch 4: This removes a hack for a driver bug workaround. I have to do more testing on my old machines to find out if the bug is really fixed in newer nvidia drivers.
Patches 5, 6: They keep track of changes to the framebuffer setup so we don't have to run through the code that figures out which FBO to bind every draw. Patch 5 gets rid of the ordering assumption. Patch 6 applies the FBO only when needed.
They aren't ready yet. In patch 6 the FBO may have to be reapplied when the pixelshader changes. To implement that I need some draw buffer tracking infrastructure Matteo is working on. Also clears can be integrated. fbo- clear.diff is a half-baked attempt to do this. I dropped it when I realized I was duplicating Matteos work. After that I have to double-check that I took care of all situations where the FBO may have to be updated.
Furthermore, Matteo says that not calling context_apply_draw_buffers every time framebuffer() is run is a noticeable performance improvement too. Matteo, did you test this with just patch 0005, or both 0005 and 0006?
Patch 0007: Sampler map optimization, it has a lengthy description in the patch file
Patch 0008: A tiny fix, it results in a pretty small improvement on OSX. On Linux+Nvidia it is not noticeable.
Patch 0009: At first I tried to skip the render target dirtification entirely via a flag in the d3ddevice, but it was pretty tricky and ugly. Just making it cheaper gets us ~2/3rd of the way too. (Draw overhead tester performance without this: 259 fps. Complete disabling of the dirtification calls via a hack: 275. With this patch: 269)
0010: An unrelated cleanup
Patches 11, 12: Preparation for including clears in the fbo dirtification patches. See fbo-clear.diff.
More work on performance is obviously required, for example
*) Separate vertex declaration, vertex shader and pixel shader states *) Speed up sampler preloading. This will be easier once we have a tree-like state structure. *) Write more tests for other common operations, like clears, blits, shader changes, texture changes, vertex buffer changes, dynamic resource loading *) Test our shader's GPU-side execution performance *) See if we can do something about locking *) Isolate bottlenecks in the GPU drivers and get them fixed.
2011/6/6 Stefan Dösinger stefandoesinger@gmx.at: ...
Furthermore, Matteo says that not calling context_apply_draw_buffers every time framebuffer() is run is a noticeable performance improvement too. Matteo, did you test this with just patch 0005, or both 0005 and 0006?
Actually that was with your FBO dirty patches from your previous wined3d performance wine-devel email, so it should more or less match what happens with your current 0005 + 0006 patches (I didn't recheck with the newer ones though). I got a small but measurable (1 - 1.5%) overall improvement in d3d9 performances, so it looked to me like it is something worth optimizing.
On Monday 06 June 2011 17:55:43 Stefan Dösinger wrote:
Feedback is welcome. I'm also interested in test results, e.g. if the changes break a game, or the performance impact. If those patches cause a 5% performance increase I am happy.
Here are some results from my Radeon test machine: http://bit.ly/mi0i5J . It's about a 1% increase in the real Windows apps, much less than I hoped for. I'll test on Nvidia tomorrow, and manually test a few more (newer) apps for which I don't have PTS tests yet.
For the curious, these are the URLs to my nightly trackers: Radeon 5750: http://openbenchmarking.org/result/1106056-GR-WINE1321145&track_view=20 Geforce 7600: http://openbenchmarking.org/result/1106064-GR-WINE1321175&track_view=20
Watch out for the result ordering. Basically the results are ordered from newest to oldest(left to right), but the Radeon test is a bit more confusing. Openbenchmarking.org orders it in the order I uploaded the results. The Nvidia box boots automatically, the Radeon box is booted manually. So I forgot about the radeon box a few days and tested the older Wine versions a few days later. This is something Michael Larabel and I still have to work on.
Also the uptick in the Radeon tests is the test I ran with my performance patches applied. I forgot to change the result identifier, so OB.org picks it up like a normal nightly test.
Note that the Unigine Heaven tests are using the opengl backend right now. The other tests are D3D-based.
The source code for the d3d_clear, d3d_streamsrc and d3d_drawprim tests is available at https://84.112.174.163/~git/perftest (This is a git repo). I hope the server is stable finally.
Stefan
Hi Stefan/Matteo,
Thanks very much for the effort. I'd like to contribute/help but I don't have a lot of time at the moment to patch and recompile. If you have the wined3d.dll.so saved somewhere I could try it with Starcraft II and let you know (at the moment I have wine-1.3.21 installed).
Cheers, Emanuele
On 07/06/11 00:26, Stefan Dösinger wrote:
On Monday 06 June 2011 17:55:43 Stefan Dösinger wrote:
Feedback is welcome. I'm also interested in test results, e.g. if the changes break a game, or the performance impact. If those patches cause a 5% performance increase I am happy.
Here are some results from my Radeon test machine: http://bit.ly/mi0i5J . It's about a 1% increase in the real Windows apps, much less than I hoped for. I'll test on Nvidia tomorrow, and manually test a few more (newer) apps for which I don't have PTS tests yet.
For the curious, these are the URLs to my nightly trackers: Radeon 5750: http://openbenchmarking.org/result/1106056-GR-WINE1321145&track_view=20 Geforce 7600: http://openbenchmarking.org/result/1106064-GR-WINE1321175&track_view=20
Watch out for the result ordering. Basically the results are ordered from newest to oldest(left to right), but the Radeon test is a bit more confusing. Openbenchmarking.org orders it in the order I uploaded the results. The Nvidia box boots automatically, the Radeon box is booted manually. So I forgot about the radeon box a few days and tested the older Wine versions a few days later. This is something Michael Larabel and I still have to work on.
Also the uptick in the Radeon tests is the test I ran with my performance patches applied. I forgot to change the result identifier, so OB.org picks it up like a normal nightly test.
Note that the Unigine Heaven tests are using the opengl backend right now. The other tests are D3D-based.
The source code for the d3d_clear, d3d_streamsrc and d3d_drawprim tests is available at https://84.112.174.163/~git/perftest (This is a git repo). I hope the server is stable finally.
Stefan
I only looked over about half of this, doesn't look too crazy. You do have trailing spaces in a couple of places though. Here are some comments:
Subject: [PATCH 04/14] wined3d: Don't set FBO attachment filtering to GL_NEAREST
As far as I'm concerned you can just submit this. I was going to do this myself, looks like you got there first.
Subject: [PATCH 05/14] wined3d: Move FBO application into a state handler
This needs an update to debug_d3dstate() as well.
+static BOOL surface_is_framebuffer(struct wined3d_surface *surface) +{
- struct wined3d_device *device = surface->resource.device;
- unsigned int i;
- const struct wined3d_gl_info *gl_info = &device->adapter->gl_info;
- if (surface == device->fb.depth_stencil) return TRUE;
- if (!device->fb.render_targets) return FALSE;
- for (i = 0; i < gl_info->limits.buffers; i++)
- {
if (surface == device->fb.render_targets[i]) return TRUE;
- }
- return FALSE;
+}
I don't know how much this actually hurts, but we could introduce something along the lines of "SFLAG_ACTIVE_RT" to keep track of this. We may have to worry about surfaces being bound to more than one RT attachment point, in which case we'd need a count instead of a flag though.
@@ -1913,6 +1928,10 @@ void surface_set_texture_name(struct wined3d_surface *surface, GLuint new_name, · *name = new_name; surface_force_reload(surface);
- if (surface_is_framebuffer(surface))
- {
IWineD3DDeviceImpl_MarkStateDirty(surface->resource.device, STATE_FRAMEBUFFER);
- }
} · void surface_set_texture_target(struct wined3d_surface *surface, GLenum target) @@ -2317,6 +2336,10 @@ static void surface_allocate_surface(struct wined3d_surface *surface, const stru checkGLcall("glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE)"); } LEAVE_GL();
- if (surface_is_framebuffer(surface))
- {
IWineD3DDeviceImpl_MarkStateDirty(surface->resource.device, STATE_FRAMEBUFFER);
- }
}
What are these for?
You may also have to handle an active RT getting unloaded, though I'm not entirely sure if that's allowed or not.
- BOOL all_samplers_mapped = TRUE;
...
all_samplers_mapped = FALSE;
This looks unused.
- for(i = 0; i < swapchain->presentParms.BackBufferCount; i++)
Style.
Subject: [PATCH 09/14] wined3d: Make render target dirtification in draws cheaper
I'm not sure about this one. My first impression is that it looks fragile. The rt_location_flags setup makes assumptions about load_location() internals, which will likely change at some point. In particular, now that most of the location management goes through the appropriate functions instead of messing with the flags themselves, I've been thinking about getting rid of the texture == drawable stuff for FBOs, and instead just making texture <-> drawable loads a nop in load_location(). At the very least this would allow the check to always be against SFLAG_INDRAWABLE instead of surface->rt_location_flags, which IMHO looks more sane. I wonder if the speedup is mostly for load_location(), modify_location() or both though? Maybe we can improve those functions themselves.
Hi,
Thanks for the comments, I've some replies inlined below. The bigger concern I have is that the patches don't improvement by a lot yet, about 1% on AMD GPUs and 1.5% to 2.5% on Nvidia GPUs(in real apps, my benchmarks are a different issue). Because of that my plan is to do more testing and experiment with things like separating the vdecl-streamsrc-vshader states et cetera and see how much impact this has.
All in all I am not yet convinced that my approach with lots of very targeted tests is the way to salvation, and I want to wait a bit before I start screwing up the code. Unfortunately it is still the best way I have.
I'm sending in some of the patches that make sense even without performance improvements. The MINFILTER stuff is a case like that, but I want to test it on pre-dx10 GPUs first.
On Tuesday 14 June 2011 14:20:12 Henri Verbeet wrote:
I only looked over about half of this, doesn't look too crazy. You do have trailing spaces in a couple of places though. Here are some
Ya, I spotted them myself a while later, and didn't look for style issues before sending them to wine-devel. Though maybe I should look for better tools to watch out for slopiness like this.
As far as I'm concerned you can just submit this. I was going to do this myself, looks like you got there first.
Still didn't get around to test this on geforce 7 GPUs. It's possible that the bug this was supposed to fix is still around.
Besides, it is probably not necessary for the other patches. The consideration was that we'd have to verify the filter each draw, but I don't think setting a texture as sampler and render target simultaneously is allowed in d3d.
@@ -1913,6 +1928,10 @@ void surface_set_texture_name(struct ...
- if (surface_is_framebuffer(surface))
- {
IWineD3DDeviceImpl_MarkStateDirty(surface->resource.device,
STATE_FRAMEBUFFER); + } ...
What are these for?
The texture name one is not needed, I've removed that from the patch already. The allocate_surface check is needed in case a ddraw app changes the pixelformat via SetSurfaceDesc.
You may also have to handle an active RT getting unloaded, though I'm not entirely sure if that's allowed or not.
It shouldn't be. RTs must be in the default pool, which can't be unloaded. Of course that doesn't stop our broken location management from screwing up, but that would be a separate issue.
Subject: [PATCH 09/14] wined3d: Make render target dirtification in draws cheaper
I've been thinking about getting rid of the texture == drawable stuff for FBOs, and instead just making texture <-> drawable loads a nop in load_location(). At the very least this would allow the check to always be against SFLAG_INDRAWABLE instead of surface->rt_location_flags, which IMHO looks more sane.
Yes, this sounds much better.
I wonder if the speedup is mostly for load_location(), modify_location() or both though? Maybe we can improve those functions themselves.
It's caused by both, I think it's plain call overhead. I'll double check that though. It may also be hyper-sensitivity of the draw overhead test. 260->270 fps isn't a lot when you consider that native gets ~1100 fps. But right now I have to take what I can get.
On 14 June 2011 15:26, Stefan Dösinger stefandoesinger@gmx.at wrote:
As far as I'm concerned you can just submit this. I was going to do this myself, looks like you got there first.
Still didn't get around to test this on geforce 7 GPUs. It's possible that the bug this was supposed to fix is still around.
Yes, but I think that by now GF7 GPUs are marginal enough that it's not worth keeping the code around for. The Steam HW survey for example reports over 90% D3D10+ cards. Even if it does regress something, I think it makes more sense to tell people to either file a bug with NVIDIA for that or help improve the nouveau driver for that card.
Besides, it is probably not necessary for the other patches. The consideration was that we'd have to verify the filter each draw, but I don't think setting a texture as sampler and render target simultaneously is allowed in d3d.
I'm not so sure. E.g. the docs for the INTZ format say you can have an INTZ texture bound as both depth buffer and texture as long as depth writes are disabled. (This makes some sense, since in that case there aren't any read/write conflicts.)
@@ -1913,6 +1928,10 @@ void surface_set_texture_name(struct ...
- if (surface_is_framebuffer(surface))
- {
- IWineD3DDeviceImpl_MarkStateDirty(surface->resource.device,
STATE_FRAMEBUFFER); + } ...
What are these for?
The texture name one is not needed, I've removed that from the patch already. The allocate_surface check is needed in case a ddraw app changes the pixelformat via SetSurfaceDesc.
I'm not sure that can actually happen. wined3d_surface_set_format() insists the format must be WINED3DFMT_UNKNOWN, so it can't be part of a working FBO entry before the format is changed. That probably also means clearing the allocation flags there is a bit silly.
You may also have to handle an active RT getting unloaded, though I'm not entirely sure if that's allowed or not.
It shouldn't be. RTs must be in the default pool, which can't be unloaded.
Even in ddraw?
I wonder if the speedup is mostly for load_location(), modify_location() or both though? Maybe we can improve those functions themselves.
It's caused by both, I think it's plain call overhead. I'll double check that though. It may also be hyper-sensitivity of the draw overhead test. 260->270 fps isn't a lot when you consider that native gets ~1100 fps. But right now I have to take what I can get.
Maybe surface_load_location() could do an initial location check a bit earlier. (And some of the code before the current check could also be removed when we get rid of texture == drawable.) For surface_modify_location(), the overlay code probably doesn't belong in there, we could do a similar early check if the flags already match, and maybe we should split it in two functions.
On Tuesday 14 June 2011 16:15:35 Henri Verbeet wrote:
Yes, but I think that by now GF7 GPUs are marginal enough that it's not worth keeping the code around for. The Steam HW survey for example reports over 90% D3D10+ cards. Even if it does regress something, I think it makes more sense to tell people to either file a bug with NVIDIA for that or help improve the nouveau driver for that card.
Fair enough. Note that Matteo found a similar bug with GL_TEXTURE_BASE_LEVEL on newer cards, but I encouraged him to report it to Nvidia rather than adding a hack to this function.
I'm not so sure. E.g. the docs for the INTZ format say you can have an INTZ texture bound as both depth buffer and texture as long as depth writes are disabled. (This makes some sense, since in that case there aren't any read/write conflicts.)
Yeah, although Aras Pranckevičius d3d hacks page claims that it doesn't work well(slow). Still, there are corner cases where the sampler states of a current render target may be changed on the d3d side and applied to the GL texture.
I'm not sure that can actually happen. wined3d_surface_set_format() insists the format must be WINED3DFMT_UNKNOWN, so it can't be part of a working FBO entry before the format is changed. That probably also means clearing the allocation flags there is a bit silly.
Hmm, I think I should reinvestigate those ddraw apps that call SetSurfaceDesc with strange parameters. Unfortunately I haven't documented which apps do that. I remember Windows Media Player 9 which sets the lpSurface field and expects subsequent Lock() calls to return this pointer :-O
It shouldn't be. RTs must be in the default pool, which can't be unloaded.
Even in ddraw?
We need more tests for the DDSCAPS* mess, but the Direct3D7 docs say DDSCAPS2_TEXTUREMANAGE is incompatible with DDSCAPS_VIDEOMEMORY. I believe that DDSCAPS_3DDEVICE mandates DDSCAPS_VIDEOMEMORY, but atm I can't find the proper place in the docs to back this up.
The d3d7 docs also say that TEXTUREMANAGE surfaces can be blitted to, and suggests that this is done in software and the result is then uploaded to the video memory copy.