Running CHRONO TRIGGER with this commit saves at least 3 ms per frame in the sample copier on my computer.
Signed-off-by: Giovanni Mascellani gmascellani@codeweavers.com --- dlls/mfplat/buffer.c | 42 +++++++++++++++++++++++++++++++----------- 1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/dlls/mfplat/buffer.c b/dlls/mfplat/buffer.c index f161bb29d80..e20961169f8 100644 --- a/dlls/mfplat/buffer.c +++ b/dlls/mfplat/buffer.c @@ -57,6 +57,7 @@ struct buffer int pitch; unsigned int locks; p_copy_image_func copy_image; + BOOL copy; } _2d; struct { @@ -380,17 +381,28 @@ static HRESULT WINAPI d3d9_surface_buffer_Lock(IMFMediaBuffer *iface, BYTE **dat { D3DLOCKED_RECT rect;
- if (!(buffer->_2d.linear_buffer = malloc(ALIGN_SIZE(buffer->_2d.plane_size, MF_64_BYTE_ALIGNMENT)))) - hr = E_OUTOFMEMORY; + hr = IDirect3DSurface9_LockRect(buffer->d3d9_surface.surface, &rect, NULL, 0);
if (SUCCEEDED(hr)) { - hr = IDirect3DSurface9_LockRect(buffer->d3d9_surface.surface, &rect, NULL, 0); - if (SUCCEEDED(hr)) + if (rect.Pitch == buffer->_2d.width) { - copy_image(buffer, buffer->_2d.linear_buffer, buffer->_2d.width, rect.pBits, rect.Pitch, - buffer->_2d.width, buffer->_2d.height); - IDirect3DSurface9_UnlockRect(buffer->d3d9_surface.surface); + buffer->_2d.linear_buffer = rect.pBits; + buffer->_2d.copy = FALSE; + } + else + { + if ((buffer->_2d.linear_buffer = malloc(ALIGN_SIZE(buffer->_2d.plane_size, MF_64_BYTE_ALIGNMENT)))) + { + copy_image(buffer, buffer->_2d.linear_buffer, buffer->_2d.width, rect.pBits, rect.Pitch, + buffer->_2d.width, buffer->_2d.height); + IDirect3DSurface9_UnlockRect(buffer->d3d9_surface.surface); + buffer->_2d.copy = TRUE; + } + else + { + hr = E_OUTOFMEMORY; + } } } } @@ -425,14 +437,22 @@ static HRESULT WINAPI d3d9_surface_buffer_Unlock(IMFMediaBuffer *iface) { D3DLOCKED_RECT rect;
- if (SUCCEEDED(hr = IDirect3DSurface9_LockRect(buffer->d3d9_surface.surface, &rect, NULL, 0))) + if (buffer->_2d.copy) + { + if (SUCCEEDED(hr = IDirect3DSurface9_LockRect(buffer->d3d9_surface.surface, &rect, NULL, 0))) + { + copy_image(buffer, rect.pBits, rect.Pitch, buffer->_2d.linear_buffer, buffer->_2d.width, + buffer->_2d.width, buffer->_2d.height); + IDirect3DSurface9_UnlockRect(buffer->d3d9_surface.surface); + } + + free(buffer->_2d.linear_buffer); + } + else { - copy_image(buffer, rect.pBits, rect.Pitch, buffer->_2d.linear_buffer, buffer->_2d.width, - buffer->_2d.width, buffer->_2d.height); IDirect3DSurface9_UnlockRect(buffer->d3d9_surface.surface); }
- free(buffer->_2d.linear_buffer); buffer->_2d.linear_buffer = NULL; }
I don't think this works. It should be possible to LockRect() after buffer was locked, if you keep surface locked this won't work. I don't think we have tests for that, but that's what quick testing on Windows shows.
We could have a shortcut in MFCopyImage() first, to have a single copy call when strides match, instead of calling per row. Next step could be to have some SIMD variants, with non-temporal copy like docs suggest. No idea how much this improves performance, but for large enough copies it's meant to bypass cache at least, I think.
Hi,
Il 11/02/22 09:17, Nikolay Sivov ha scritto:
I don't think this works. It should be possible to LockRect() after buffer was locked, if you keep surface locked this won't work. I don't think we have tests for that, but that's what quick testing on Windows shows.
Ouch. So Microsoft really does two more copies per Lock()? Bad!
Maybe the thing we could do anyway is to make call the first LockRect() with D3DLOCK_READONLY and the second one with D3DLOCK_DISCARD, though I doubt it will save anything near 3 ms per frame.
It would be nice if the Lock() interface had something similar to DISCARD, but alas it doesn't.
We could have a shortcut in MFCopyImage() first, to have a single copy call when strides match, instead of calling per row. Next step could be to have some SIMD variants, with non-temporal copy like docs suggest. No idea how much this improves performance, but for large enough copies it's meant to bypass cache at least, I think.
I tried the single copy thing, but I couldn't see any significant change, so I didn't even bother submitting. As for the non-temporal copy, I don't know much about it either, but something makes me feel it won't change that much either.
Thanks, Giovanni.
On 2/11/22 11:29, Giovanni Mascellani wrote:
Hi,
Il 11/02/22 09:17, Nikolay Sivov ha scritto:
I don't think this works. It should be possible to LockRect() after buffer was locked, if you keep surface locked this won't work. I don't think we have tests for that, but that's what quick testing on Windows shows.
Ouch. So Microsoft really does two more copies per Lock()? Bad!
Maybe the thing we could do anyway is to make call the first LockRect() with D3DLOCK_READONLY and the second one with D3DLOCK_DISCARD, though I doubt it will save anything near 3 ms per frame.
It would be nice if the Lock() interface had something similar to DISCARD, but alas it doesn't.
Lock2DSize() has some flags, I don't know if they map to anything in d3d or are ignored. It's probably worth looking at LockRect() flags, like you described.
Regarding amount of added latency, it's not accurate to compare to the game running on Windows, where by default it might use a pipeline with hw decoding, that only has one copy from system memory - to input decoder surface. Instead manually configured test is what should give an idea of how fast worst case is there, not to say that we should settle for that of course.
We could have a shortcut in MFCopyImage() first, to have a single copy call when strides match, instead of calling per row. Next step could be to have some SIMD variants, with non-temporal copy like docs suggest. No idea how much this improves performance, but for large enough copies it's meant to bypass cache at least, I think.
I tried the single copy thing, but I couldn't see any significant change, so I didn't even bother submitting. As for the non-temporal copy, I don't know much about it either, but something makes me feel it won't change that much either.
Depending on buffer sizes it might have a positive impact, it's impossible to say for me, unless we try. It's also possible that copier is smarter about this.
Thanks, Giovanni.