This fixes a regression introduced by 15d53761a5fbfc12fc5f9974c029dace00eab33d.
From: Zebediah Figura <zfigura@codeweavers.com>
This improves performance in Prince of Persia 3D.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=44863

```diff
---
 dlls/ddraw/ddraw_private.h | 1 +
 dlls/ddraw/device.c        | 4 ++--
 dlls/ddraw/vertexbuffer.c  | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/dlls/ddraw/ddraw_private.h b/dlls/ddraw/ddraw_private.h
index e3fac63a38f..c65d502a1dc 100644
--- a/dlls/ddraw/ddraw_private.h
+++ b/dlls/ddraw/ddraw_private.h
@@ -598,6 +598,7 @@ struct d3d_vertex_buffer
     DWORD size;
     BOOL dynamic;
     bool discarded;
+    bool sysmem;
 };
 
 HRESULT d3d_vertex_buffer_create(struct d3d_vertex_buffer **buffer, struct ddraw *ddraw,
diff --git a/dlls/ddraw/device.c b/dlls/ddraw/device.c
index 1cfef5007d5..28b9c77c0d2 100644
--- a/dlls/ddraw/device.c
+++ b/dlls/ddraw/device.c
@@ -4119,7 +4119,7 @@ static HRESULT d3d_device7_DrawPrimitiveVB(IDirect3DDevice7 *iface, D3DPRIMITIVE
 
     stride = get_flexible_vertex_size(vb_impl->fvf);
 
-    if (vb_impl->Caps & D3DVBCAPS_SYSTEMMEMORY)
+    if (vb_impl->sysmem)
     {
         TRACE("Drawing from D3DVBCAPS_SYSTEMMEMORY vertex buffer, forwarding to DrawPrimitive().\n");
         wined3d_mutex_lock();
@@ -4237,7 +4237,7 @@ static HRESULT d3d_device7_DrawIndexedPrimitiveVB(IDirect3DDevice7 *iface,
 
     vb_impl->discarded = false;
 
-    if (vb_impl->Caps & D3DVBCAPS_SYSTEMMEMORY)
+    if (vb_impl->sysmem)
     {
         TRACE("Drawing from D3DVBCAPS_SYSTEMMEMORY vertex buffer, forwarding to DrawIndexedPrimitive().\n");
         wined3d_mutex_lock();
diff --git a/dlls/ddraw/vertexbuffer.c b/dlls/ddraw/vertexbuffer.c
index 56a4c3e93e4..a75a3a15682 100644
--- a/dlls/ddraw/vertexbuffer.c
+++ b/dlls/ddraw/vertexbuffer.c
@@ -115,7 +115,7 @@ static HRESULT d3d_vertex_buffer_create_wined3d_buffer(struct d3d_vertex_buffer
     if (dynamic)
        desc.usage |= WINED3DUSAGE_DYNAMIC;
     desc.bind_flags = WINED3D_BIND_VERTEX_BUFFER;
-    if (buffer->Caps & D3DVBCAPS_SYSTEMMEMORY)
+    if (buffer->sysmem)
         desc.access = WINED3D_RESOURCE_ACCESS_CPU | WINED3D_RESOURCE_ACCESS_MAP_R | WINED3D_RESOURCE_ACCESS_MAP_W;
     else
         desc.access = WINED3D_RESOURCE_ACCESS_GPU | WINED3D_RESOURCE_ACCESS_MAP_R | WINED3D_RESOURCE_ACCESS_MAP_W;
@@ -460,6 +460,7 @@ HRESULT d3d_vertex_buffer_create(struct d3d_vertex_buffer **vertex_buf,
     buffer->Caps = desc->dwCaps;
     buffer->fvf = desc->dwFVF;
     buffer->size = get_flexible_vertex_size(desc->dwFVF) * desc->dwNumVertices;
+    buffer->sysmem = ((buffer->Caps & D3DVBCAPS_SYSTEMMEMORY) || buffer->version < 7);
 
     wined3d_mutex_lock();
```
From: Zebediah Figura <zfigura@codeweavers.com>
This improves performance in Prince of Persia 3D.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=44863

```diff
---
 dlls/ddraw/device.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/dlls/ddraw/device.c b/dlls/ddraw/device.c
index 28b9c77c0d2..785afe8c5f6 100644
--- a/dlls/ddraw/device.c
+++ b/dlls/ddraw/device.c
@@ -3573,9 +3573,11 @@ static HRESULT d3d_device7_DrawIndexedPrimitive(IDirect3DDevice7 *iface,
         WORD *indices, DWORD index_count, DWORD flags)
 {
     struct d3d_device *device = impl_from_IDirect3DDevice7(iface);
+    unsigned int idx_size = index_count * sizeof(*indices);
+    unsigned short min_index = USHRT_MAX, max_index = 0;
+    unsigned int i;
     HRESULT hr;
     UINT stride = get_flexible_vertex_size(fvf);
-    UINT vtx_size = stride * vertex_count, idx_size = index_count * sizeof(*indices);
     UINT vb_pos, ib_pos;
 
     TRACE("iface %p, primitive_type %#x, fvf %#lx, vertices %p, vertex_count %lu, "
@@ -3588,11 +3590,17 @@ static HRESULT d3d_device7_DrawIndexedPrimitive(IDirect3DDevice7 *iface,
         return D3D_OK;
     }
 
+    for (i = 0; i < index_count; ++i)
+    {
+        min_index = min(min_index, indices[i]);
+        max_index = max(max_index, indices[i]);
+    }
+
     /* Set the D3DDevice's FVF */
     wined3d_mutex_lock();
 
-    if (FAILED(hr = wined3d_streaming_buffer_upload(device->wined3d_device,
-            &device->vertex_buffer, vertices, vtx_size, stride, &vb_pos)))
+    if (FAILED(hr = wined3d_streaming_buffer_upload(device->wined3d_device, &device->vertex_buffer,
+            (char *)vertices + (min_index * stride), (max_index + 1 - min_index) * stride, stride, &vb_pos)))
         goto done;
 
     if (FAILED(hr = wined3d_streaming_buffer_upload(device->wined3d_device,
@@ -3609,7 +3617,7 @@ static HRESULT d3d_device7_DrawIndexedPrimitive(IDirect3DDevice7 *iface,
             wined3d_primitive_type_from_ddraw(primitive_type), 0);
     wined3d_device_apply_stateblock(device->wined3d_device, device->state);
     d3d_device_sync_surfaces(device);
-    wined3d_device_context_draw_indexed(device->immediate_context, vb_pos / stride,
+    wined3d_device_context_draw_indexed(device->immediate_context, (int)(vb_pos / stride) - min_index,
             ib_pos / sizeof(*indices), index_count, 0, 0);
 
 done:
```
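To make the base-vertex arithmetic in the last hunk concrete, here is a worked example with made-up numbers:

```c
/* Suppose indices = {10, 11, 12}, stride = 32, and the streaming buffer
 * hands back vb_pos = 6400.  Only vertices 10..12 are uploaded, so vertex
 * 10 lands at vertex slot 6400 / 32 = 200.  The indices are uploaded
 * unmodified, so index 10 must resolve to slot 200, which is what
 *     base_vertex = (int)(vb_pos / stride) - min_index = 200 - 10 = 190
 * achieves: 190 + 10 = 200, 190 + 11 = 201, 190 + 12 = 202. */
```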
Is the first patch specifically a Prince of Persia hack, or is there a more general reason to do this? I think either way a small comment would be in order, although it might just be because I'm not familiar with ddraw.
> Is the first patch specifically a Prince of Persia hack, or is there a more general reason to do this? I think either way a small comment would be in order, although it might just be because I'm not familiar with ddraw.
I don't really have evidence that native ddraw works this way, but I suspect this is something we want in general. My logic is that version 4 doesn't respect the "dynamic" flags (DDLOCK_DISCARDCONTENTS and DDLOCK_NOOVERWRITE), and in light of that I don't think GPU | MAP_R | MAP_W is going to perform better than making it a CPU buffer instead. Because the dynamic flags don't work I can't really think of a way to test whether this is the case on Windows either.
The counterargument is that D3DVBCAPS_SYSTEMMEMORY does exist, and presumably that implies it does something. But at a certain point, behaviour that makes sense for GL or Vulkan probably outweighs behaviour that matches native.
I'll try to formulate the above into a comment.
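For reference, this is roughly how that comment might read next to the new check; the wording below is an illustrative sketch, not the final committed text:

```c
/* Sketch only: illustrative comment wording for the check added by 1/2.
 * d3d versions < 7 ignore the "dynamic" lock flags (DDLOCK_DISCARDCONTENTS
 * and DDLOCK_NOOVERWRITE), so a GPU buffer would be fully synchronized on
 * every map.  Keep these buffers in system memory instead. */
buffer->sysmem = ((buffer->Caps & D3DVBCAPS_SYSTEMMEMORY) || buffer->version < 7);
```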
I've no heartfelt objections to this patch.
Re the dynamic flags in ddraw4: the ddraw4 test_map_synchronisation() shows that DDLOCK_NOOVERWRITE is ignored. Other than fancy heuristics, I don't see how we can make GPU buffers fast with the usual map-draw-map-draw pattern (sketched below).
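For concreteness, the pattern in question looks roughly like this with the d3d7 interfaces (`vb`, `device`, `vertices`, `size` and `count` are assumed from context; this is a sketch, not a test):

```c
for (;;)
{
    void *data;

    /* Without working DISCARD/NOOVERWRITE semantics, each Lock() must wait
     * for the previous draw that reads the buffer, serializing CPU and GPU. */
    IDirect3DVertexBuffer7_Lock(vb, 0, &data, NULL);
    memcpy(data, vertices, size);
    IDirect3DVertexBuffer7_Unlock(vb);
    IDirect3DDevice7_DrawPrimitiveVB(device, D3DPT_TRIANGLELIST, vb, 0, count, 0);
}
```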
I guess we could test if D3DVBCAPS_WRITEONLY gives us write-combined memory, but if not that doesn't prove much.
We could also move a non-D3DVBCAPS_SYSTEMMEMORY buffer to sysmem the 5th time it is mapped, or promote it to vidmem the 5th time it is used for draws without being mapped. That would break map address stability, though. Also, it'd help to know a game that profits from such shenanigans before implementing them.
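Something like the following, as a purely hypothetical sketch; the `map_count` and `draws_since_map` fields and the `d3d_vertex_buffer_move()` helper are made up, and nothing like them exists in the tree:

```c
static void vb_note_map(struct d3d_vertex_buffer *buffer)
{
    buffer->draws_since_map = 0;
    /* Demote a nominally vidmem buffer to sysmem on its 5th map. */
    if (!buffer->sysmem && ++buffer->map_count >= 5)
        d3d_vertex_buffer_move(buffer, TRUE);
}

static void vb_note_draw(struct d3d_vertex_buffer *buffer)
{
    /* Promote back to vidmem after 5 draws without an intervening map. */
    if (buffer->sysmem && ++buffer->draws_since_map >= 5)
        d3d_vertex_buffer_move(buffer, FALSE);
}
```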
On Thu Dec 21 17:02:38 2023 +0000, Zebediah Figura wrote:
> > Is the first patch specifically a Prince of Persia hack, or is there a more general reason to do this? I think either way a small comment would be in order, although it might just be because I'm not familiar with ddraw.
>
> I don't really have evidence that native ddraw works this way, but I suspect this is something we want in general. My logic is that version 4 doesn't respect the "dynamic" flags (DDLOCK_DISCARDCONTENTS and DDLOCK_NOOVERWRITE), and in light of that I don't think GPU | MAP_R | MAP_W is going to perform better than making it a CPU buffer instead. Because the dynamic flags don't work I can't really think of a way to test whether this is the case on Windows either.
>
> The counterargument is that D3DVBCAPS_SYSTEMMEMORY does exist, and presumably that implies it does something. But at a certain point, behaviour that makes sense for GL or Vulkan probably outweighs behaviour that matches native.
>
> I'll try to formulate the above into a comment.
Actually, I suspect that 1/2 isn't really a hack.
Notice that Hardware T&L didn't exist before d3d7, so a video memory vertex buffer containing untransformed data doesn't make much sense (while it could still be useful for pretransformed data, I guess). Interestingly, ddraw's CreateVertexBuffer does in fact require the specification of the vertex format; I'd just replace the `buffer->version < 7` check with a `desc->dwFVF & D3DFVF_XYZ` one and call it a day :slight_smile:
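Concretely, that suggestion would amount to something like the following in d3d_vertex_buffer_create() (a sketch of the proposed substitution, not a tested change):

```c
buffer->sysmem = ((buffer->Caps & D3DVBCAPS_SYSTEMMEMORY) || (buffer->fvf & D3DFVF_XYZ));
```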
On Thu Dec 21 17:02:38 2023 +0000, Matteo Bruni wrote:
> Actually, I suspect that 1/2 isn't really a hack.
>
> Notice that Hardware T&L didn't exist before d3d7, so a video memory vertex buffer containing untransformed data doesn't make much sense (while it could still be useful for pretransformed data, I guess). Interestingly, ddraw's CreateVertexBuffer does in fact require the specification of the vertex format; I'd just replace the `buffer->version < 7` check with a `desc->dwFVF & D3DFVF_XYZ` one and call it a day :slight_smile:
A couple more details I forgot earlier... According to the docs, including D3DVBCAPS_SYSTEMMEMORY forces system memory allocation but the implementation is allowed to allocate the buffer in system memory even without it. Also the flag can be queried after creation (via ::GetVertexBufferDesc()) to know if the buffer was in fact allocated in system memory.
I guess we want to try and write a test, it might even give some answers.
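A minimal sketch of what such a test might look like in the ddraw7 tests, assuming the usual test helpers for device creation; whether any driver actually flips the flag is exactly what it would tell us:

```c
static void test_vb_desc_caps(IDirect3D7 *d3d)
{
    D3DVERTEXBUFFERDESC desc = {sizeof(desc), 0, D3DFVF_XYZ, 16};
    IDirect3DVertexBuffer7 *vb;
    HRESULT hr;

    /* Create a buffer without D3DVBCAPS_SYSTEMMEMORY... */
    hr = IDirect3D7_CreateVertexBuffer(d3d, &desc, &vb, 0);
    ok(hr == D3D_OK, "Got hr %#lx.\n", hr);

    /* ...and check whether the implementation reports having silently
     * allocated it in system memory anyway. */
    memset(&desc, 0, sizeof(desc));
    desc.dwSize = sizeof(desc);
    hr = IDirect3DVertexBuffer7_GetVertexBufferDesc(vb, &desc);
    ok(hr == D3D_OK, "Got hr %#lx.\n", hr);
    trace("Got caps %#lx.\n", desc.dwCaps);

    IDirect3DVertexBuffer7_Release(vb);
}
```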
So I now have access to a Windows 98 machine and some contemporaneous hardware, and results are... weird.
The only working card I have with hardware TCL is an NV17. It passes ddraw7's test_map_synchronisation() just fine.
It fails ddraw4's test_map_synchronisation() because it... seems to draw nothing at all if asked to draw more than 24 vertices. We don't actually check the output colour there, which should be fixed, but the real problem is that all of the draws then complete almost immediately, which throws off the "how many primitives can we draw in 100 ms" calculation.
The same thing seems to happen if, on either ddraw4 or ddraw7, I create the VB with D3DVBCAPS_SYSTEMMEMORY. This does, at least, strongly suggest that in ddraw4 (and probably in ddraw7 too if you don't create a hardware TnL device), a non-RHW buffer is always in system memory.
On Stefan's suggestion I tried hacking ddraw4's test_map_synchronisation() to draw pretransformed vertices. That worked strangely. The NOOVERWRITE tests seem to consistently get 0xffff00, suggesting they are unsynchronized after all. The non-NOOVERWRITE tests randomly get either 0x000000 (note that the clear colour is 0x0000ff), 0xffff00 (as if unsynchronized), or 0xff00ff.
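("Pretransformed" here means D3DFVF_XYZRHW vertices, i.e. positions already in screen space so no TnL is involved; roughly of this shape, as a sketch rather than the actual test hack:)

```c
static const struct
{
    float x, y, z, rhw;  /* screen-space position, already transformed */
    DWORD diffuse;
}
quad[] =
{
    {  0.0f,   0.0f, 0.0f, 1.0f, 0xffff0000},
    {640.0f,   0.0f, 0.0f, 1.0f, 0xffff0000},
    {  0.0f, 480.0f, 0.0f, 1.0f, 0xffff0000},
    {640.0f, 480.0f, 0.0f, 1.0f, 0xffff0000},
};
```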
In no case did I see GetVertexBufferDesc() returning anything other than the exact flags that it was passed.
ddraw7 with IID_IDirect3DHALDevice behaves mostly like ddraw4, including the same random results with pretransformed vertices. However it at least successfully draws more than 24 vertices, and always returns the "synchronized" colour of 0x007f80.
> Interestingly, ddraw's CreateVertexBuffer does in fact require the specification of the vertex format; I'd just replace the `buffer->version < 7` check with a `desc->dwFVF & D3DFVF_XYZ` one and call it a day :slight_smile:
Unfortunately, the offending vertex buffers in Prince of Persia 3D are in fact XYZRHW. (And it does use the HAL device.)
On Mon Apr 8 06:24:40 2024 +0000, Elizabeth Figura wrote:
> > Interestingly, ddraw's CreateVertexBuffer does in fact require the specification of the vertex format; I'd just replace the `buffer->version < 7` check with a `desc->dwFVF & D3DFVF_XYZ` one and call it a day :slight_smile:
>
> Unfortunately, the offending vertex buffers in Prince of Persia 3D are in fact XYZRHW. (And it does use the HAL device.)
Is that with the HAL or the TnLHAL device (or, does that make any difference)?
On Mon Apr 8 06:24:40 2024 +0000, Matteo Bruni wrote:
> Is that with the HAL or the TnLHAL device (or, does that make any difference)?
Scratch that, the game obviously uses d3d < 7 or the current patch wouldn't help. And I guess your tests from a couple of comments above answer my question.
Okay, so I did some more performance testing on the 98 machine. I'm attaching a diff that shows the general form of the tests I did, and the results (sorry they're a bit of a mess). Cards tested were a Geforce 4 MX (NV17), which supports hardware TCL, and a Rage 128, which doesn't.
The results are weird, but there's three consistent patterns that emerge from this and previous tests:
(1) The NVidia card shows a difference between D3DVBCAPS_SYSMEM and no SYSMEM, even when vertex processing is done in software. The ATI card doesn't. The explanation here may be that creating and locking a vertex buffer doesn't actually depend on the device IID.
(2) RHW buffers on a non-TnLHal device broadly act more like XYZ buffers on a TnLHal device than like XYZ buffers on a non-TnLHal device: e.g. NOOVERWRITE seems to be unsynchronized, and D3DVBCAPS_SYSMEM is slower than no D3DVBCAPS_SYSMEM. This is not really surprising.
(3) DISCARD and NOOVERWRITE flags do nothing on ddraw4. Oddly, they aren't rejected either (and I tested that other flags are rejected), but that may just be a consequence of having a runtime that supports ddraw7.
It's also worth mentioning that the test suggests that Prince of Persia should perform *terribly* on the NVidia card, but decently on the Rage 128, and in fact the demo does exactly this. I didn't test the full game, but if it performs better it's not unlikely that it does so via driver hacks.
(1) and (2) do suggest that we should be honouring the SYSMEM flag, and the tests suggest that it matters even for ddraw4. At the same time, I haven't seen any evidence that a modern GPU will *ever* perform better with a vertex buffer not in sysmem unless it's using the streaming pattern, and (3) means that it never will on ddraw4. Maybe things were different for a contemporaneous GPU, but I don't think that matters when deciding how to optimize (even before we recently dropped support for those GPUs).
The upshot of this is that I think 1/2 of this patch series, as is, really does do the right thing. I'll rebase it, and add a comment that explains this better, as requested.
> Okay, so I did some more performance testing on the 98 machine. I'm attaching a diff that shows the general form of the tests I did, and the results (sorry they're a bit of a mess).
Actually attaching.
[scratch.diff](/uploads/cdff8dc281582be387f3af2e2fa7d113/scratch.diff)
[scratch](/uploads/3da06285dd4253e9d9966d94e6b41036/scratch)