Using IMF2DBuffer2_Lock2DSize with LockFlags_Write dramatically improves performance over IMFMediaBuffer_Lock when using a DXGI buffer.
IMFMediaBuffer_Lock does not know that this buffer will not be read and therefore performs an unnecessary transfer of the texture from GPU to CPU before it is overwritten.
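To make the cost difference concrete, here is a toy model in portable C. It is not Media Foundation code: `toy_gpu_buffer`, `toy_lock`, `toy_lock_with_mode` and the other names are hypothetical stand-ins, and the "GPU" is just a second array. It only sketches the idea described above: a generic lock has to copy the current contents to the CPU because it cannot know the caller's intent, while a lock that is told "write only" can skip that copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SIZE 16

/* Hypothetical lock modes, loosely mirroring read / write / read-write. */
enum toy_lock_mode { TOY_LOCK_READ, TOY_LOCK_WRITE, TOY_LOCK_READWRITE };

/* Toy stand-in for a GPU-backed buffer (assumption, not a real MF type). */
struct toy_gpu_buffer
{
    unsigned char gpu[TOY_SIZE];     /* pretend texture memory */
    unsigned char staging[TOY_SIZE]; /* CPU-visible copy handed out by a lock */
    int readbacks;                   /* number of GPU -> CPU copies so far */
};

/* Generic lock: intent unknown, so the contents are always read back. */
static unsigned char *toy_lock(struct toy_gpu_buffer *buf)
{
    assert(buf != NULL);
    memcpy(buf->staging, buf->gpu, TOY_SIZE);
    buf->readbacks++;
    return buf->staging;
}

/* Lock with a stated intent: a write-only lock skips the readback. */
static unsigned char *toy_lock_with_mode(struct toy_gpu_buffer *buf, enum toy_lock_mode mode)
{
    assert(buf != NULL);
    if (mode != TOY_LOCK_WRITE)
    {
        memcpy(buf->staging, buf->gpu, TOY_SIZE);
        buf->readbacks++;
    }
    return buf->staging;
}

/* Unlock after writing: push the staging copy back to the "GPU". */
static void toy_unlock_dirty(struct toy_gpu_buffer *buf)
{
    memcpy(buf->gpu, buf->staging, TOY_SIZE);
}

/* Overwrite the whole buffer; returns how many readbacks this cost. */
static int toy_overwrite(struct toy_gpu_buffer *buf, unsigned char value, int write_only)
{
    int before = buf->readbacks;
    unsigned char *data = write_only ? toy_lock_with_mode(buf, TOY_LOCK_WRITE) : toy_lock(buf);

    memset(data, value, TOY_SIZE);
    toy_unlock_dirty(buf);
    return buf->readbacks - before;
}
```

In this model, `toy_overwrite(&buf, value, 0)` costs one readback per frame and `toy_overwrite(&buf, value, 1)` costs none, while the data that ends up in `gpu` is the same. How large the saving is on real hardware depends on the driver and the texture size.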
-- v2: winegstreamer: Optimise copy to DXGI buffer.
From: Brendan McGrath <bmcgrath@codeweavers.com>
Using IMF2DBuffer2_Lock2DSize with LockFlags_Write dramatically improves performance over IMFMediaBuffer_Lock when using a DXGI buffer.
IMFMediaBuffer_Lock does not know that this buffer will not be read and therefore performs an unnecessary transfer of the texture from GPU to CPU before it is overwritten.
---
 dlls/winegstreamer/gst_private.h |  2 +-
 dlls/winegstreamer/media_sink.c  |  2 +-
 dlls/winegstreamer/wg_sample.c   | 31 +++++++++++++++++++++++++------
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/dlls/winegstreamer/gst_private.h b/dlls/winegstreamer/gst_private.h
index 0f7d945ba37..0fa1aa0dac5 100644
--- a/dlls/winegstreamer/gst_private.h
+++ b/dlls/winegstreamer/gst_private.h
@@ -156,7 +156,7 @@ extern HRESULT mfplat_DllRegisterServer(void);
 IMFMediaType *mf_media_type_from_wg_format(const struct wg_format *format);
 void mf_media_type_to_wg_format(IMFMediaType *type, struct wg_format *format);
 
-HRESULT wg_sample_create_mf(IMFSample *sample, struct wg_sample **out);
+HRESULT wg_sample_create_mf(IMFSample *sample, struct wg_sample **out, MF2DBuffer_LockFlags lockFlags);
 HRESULT wg_sample_create_quartz(IMediaSample *sample, struct wg_sample **out);
 HRESULT wg_sample_create_dmo(IMediaBuffer *media_buffer, struct wg_sample **out);
 void wg_sample_release(struct wg_sample *wg_sample);
diff --git a/dlls/winegstreamer/media_sink.c b/dlls/winegstreamer/media_sink.c
index 5ee2c44dc70..9eb64756ec6 100644
--- a/dlls/winegstreamer/media_sink.c
+++ b/dlls/winegstreamer/media_sink.c
@@ -615,7 +615,7 @@ static HRESULT media_sink_process(struct media_sink *media_sink, IMFSample *samp
     if (FAILED(hr = media_sink_write_stream(media_sink)))
         WARN("Failed to write output samples to stream, hr %#lx.\n", hr);
 
-    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample)))
+    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample, MF2DBuffer_LockFlags_Read)))
         return hr;
 
     if (SUCCEEDED(IMFSample_GetSampleTime(sample, &time)))
diff --git a/dlls/winegstreamer/wg_sample.c b/dlls/winegstreamer/wg_sample.c
index 116dbb1f3ec..03dc7c9de5b 100644
--- a/dlls/winegstreamer/wg_sample.c
+++ b/dlls/winegstreamer/wg_sample.c
@@ -52,6 +52,7 @@ struct sample
         {
             IMFSample *sample;
             IMFMediaBuffer *buffer;
+            IMF2DBuffer2 *buffer2d2;
         } mf;
         struct
         {
@@ -79,7 +80,15 @@ static void mf_sample_destroy(struct wg_sample *wg_sample)
 
     TRACE_(mfplat)("wg_sample %p.\n", wg_sample);
 
-    IMFMediaBuffer_Unlock(sample->u.mf.buffer);
+    if (sample->u.mf.buffer2d2)
+    {
+        IMF2DBuffer2_Unlock2D(sample->u.mf.buffer2d2);
+        IMF2DBuffer2_Release(sample->u.mf.buffer2d2);
+    }
+    else
+    {
+        IMFMediaBuffer_Unlock(sample->u.mf.buffer);
+    }
     IMFMediaBuffer_Release(sample->u.mf.buffer);
     IMFSample_Release(sample->u.mf.sample);
 }
@@ -89,18 +98,28 @@ static const struct wg_sample_ops mf_sample_ops =
     mf_sample_destroy,
 };
 
-HRESULT wg_sample_create_mf(IMFSample *mf_sample, struct wg_sample **out)
+HRESULT wg_sample_create_mf(IMFSample *mf_sample, struct wg_sample **out, MF2DBuffer_LockFlags lockFlags)
 {
     DWORD current_length, max_length;
+    LONG pitch;
     struct sample *sample;
-    BYTE *buffer;
+    BYTE *buffer, *scanline;
     HRESULT hr;
 
     if (!(sample = calloc(1, sizeof(*sample))))
         return E_OUTOFMEMORY;
     if (FAILED(hr = IMFSample_ConvertToContiguousBuffer(mf_sample, &sample->u.mf.buffer)))
         goto fail;
-    if (FAILED(hr = IMFMediaBuffer_Lock(sample->u.mf.buffer, &buffer, &max_length, &current_length)))
+    if (SUCCEEDED(hr = IMFMediaBuffer_QueryInterface(sample->u.mf.buffer, &IID_IMF2DBuffer2, (void**)&sample->u.mf.buffer2d2)) &&
+        FAILED(hr = IMF2DBuffer2_Lock2DSize(sample->u.mf.buffer2d2, lockFlags, &scanline, &pitch, &buffer, &max_length)))
+    {
+        IMF2DBuffer2_Release(sample->u.mf.buffer2d2);
+        sample->u.mf.buffer2d2 = NULL;
+    }
+
+    if (SUCCEEDED(hr))
+        current_length = max_length;
+    else if (FAILED(hr = IMFMediaBuffer_Lock(sample->u.mf.buffer, &buffer, &max_length, &current_length)))
         goto fail;
 
     IMFSample_AddRef((sample->u.mf.sample = mf_sample));
@@ -313,7 +332,7 @@ HRESULT wg_transform_push_mf(wg_transform_t transform, IMFSample *sample,
 
     TRACE_(mfplat)("transform %#I64x, sample %p, queue %p.\n", transform, sample, queue);
 
-    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample)))
+    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample, MF2DBuffer_LockFlags_Read)))
         return hr;
 
     if (SUCCEEDED(IMFSample_GetSampleTime(sample, &time)))
@@ -347,7 +366,7 @@ HRESULT wg_transform_read_mf(wg_transform_t transform, IMFSample *sample,
 
     TRACE_(mfplat)("transform %#I64x, sample %p, flags %p.\n", transform, sample, flags);
 
-    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample)))
+    if (FAILED(hr = wg_sample_create_mf(sample, &wg_sample, MF2DBuffer_LockFlags_Write)))
         return hr;
 
     wg_sample->size = 0;
On Thu Jul 4 22:56:45 2024 +0000, Brendan McGrath wrote:
changed this line in [version 2 of the diff](/wine/wine/-/merge_requests/5978/diffs?diff_id=120655&start_sha=128dc290b03fe1bb5e307780f826b46f2dcdac0b#03c4efe3c0041a988fac0ea9eb3a9e8e49f116f3_114_114)
It turns out the issue wasn't related to the padding; I had simply introduced a bug. Namely I hadn't catered for the use-case where we read from the DXGI buffer rather than write. I've fixed that now, and I now seem to get the same results with and without this MR.
I only hit this issue once I started using an isolated video processor transform and passed it DXGI buffers. Prior to that, I was testing with `IMFSourceReader`, which was passing the transform buffers from wg_parser, which appear not to be DXGI buffers.
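As a rough illustration of why the lock mode has to follow the direction of the transfer, here is a small self-contained C model. Everything in it is hypothetical (`toy_gpu_buffer`, `toy_lock`, `toy_mode_for`, the two direction names); it is not the winegstreamer code, and it assumes that a write-only lock hands out CPU memory that was never refreshed from the GPU. Under that assumption, a caller that wants to read input through a write-only lock sees stale bytes, which is the kind of failure described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SIZE 8

enum toy_lock_mode { TOY_LOCK_READ, TOY_LOCK_WRITE };

/* Hypothetical call-site directions: feeding input in vs. collecting output. */
enum toy_direction { TOY_PUSH_INPUT, TOY_READ_OUTPUT };

/* Toy stand-in for a GPU-backed buffer (assumption, not a real MF type). */
struct toy_gpu_buffer
{
    unsigned char gpu[TOY_SIZE];     /* pretend texture memory */
    unsigned char staging[TOY_SIZE]; /* CPU-visible copy handed out by a lock */
};

/* Only a read lock refreshes the CPU copy from the "GPU". */
static const unsigned char *toy_lock(struct toy_gpu_buffer *buf, enum toy_lock_mode mode)
{
    assert(buf != NULL);
    if (mode == TOY_LOCK_READ)
        memcpy(buf->staging, buf->gpu, TOY_SIZE);
    return buf->staging;
}

/* Pick the lock mode from the direction: input is read, output is overwritten. */
static enum toy_lock_mode toy_mode_for(enum toy_direction direction)
{
    return direction == TOY_PUSH_INPUT ? TOY_LOCK_READ : TOY_LOCK_WRITE;
}

/* Returns 1 if the locked view matches what the "GPU" currently holds. */
static int toy_view_is_current(struct toy_gpu_buffer *buf, enum toy_lock_mode mode)
{
    const unsigned char *view = toy_lock(buf, mode);
    return memcmp(view, buf->gpu, TOY_SIZE) == 0;
}
```

With `gpu` holding a frame and `staging` holding leftover bytes, `toy_view_is_current(&buf, TOY_LOCK_WRITE)` reports a mismatch, while the mode returned by `toy_mode_for(TOY_PUSH_INPUT)` gives a current view. Choosing the mode per call site keeps the cheaper write-only lock for the output path only.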
On Thu Jul 4 23:01:39 2024 +0000, Brendan McGrath wrote:
> It turns out the issue wasn't related to the padding; I had simply introduced a bug. Namely I hadn't catered for the use-case where we read from the DXGI buffer rather than write. I've fixed that now, and I now seem to get the same results with and without this MR. I only hit this issue once I started using an isolated video processor transform and passed it DXGI buffers. Prior to that, I was testing with `IMFSourceReader`, which was passing the transform buffers from wg_parser, which appear not to be DXGI buffers.
Yeah, D3D buffers with software video conversion will be extremely inefficient. The proper solution here is to use the ID3D11VideoProcessor which I intend to use in Proton to do everything on the GPU. It's not yet implemented in WineD3D but I believe Yuxuan is looking into it. DXVK implements it already.
> D3D buffers with software video conversion will be extremely inefficient
That is true. But I'm glad I tried, as it highlighted the bug I'd introduced.
> The proper solution here is to use the ID3D11VideoProcessor
I saw you mention that in an email, so I experimented with that as well, but I found it was still bottlenecked by the inefficient GPU copies. With the two GPU copies optimised (via this MR and MR!5979), I can play 4K video at 60 fps. Adding `ID3D11VideoProcessor` on top of that halved CPU usage. That was with a 4-second video, so the saving could be larger with a longer video.
Marking this as draft for now, as I'm currently looking into test cases that pass on Windows and fail on Wine.
This merge request was closed by Brendan McGrath.
Closing this, as preliminary testing shows it doesn't improve the playback performance of the 4K logo in Pixelia.