On Tue May 24 18:12:09 2022 +0000, **** wrote:
Zebediah Figura replied on the mailing list:
On 5/24/22 01:07, Rémi Bernon (@rbernon) wrote: > >>> To >>> get zero-copy you'll have to add even more entry points, to provide >>> the output buffer beforehand, and release it after the processing. >> >> I'm confused about this, because this doesn't match my (somewhat dim) >> awareness of what would be necessary for zero-copy. So maybe it would >> help to hash that out now completely. >> >> My understanding is: >> >> * in some cases we can and should transform data in-place (perhaps >> because the PE side expects it, not that I'm aware of any such cases). >> In that case we should feed a writable GstBuffer into the pipeline, and >> if the output buffer was not the same as the input buffer, blit the >> latter into the former. >> >> * if we cannot transform data in-place (either because GStreamer doesn't >> support it or because it'd confuse the PE side, not that I'm aware of >> any such cases there either), we should try to let GStreamer elements >> fill buffers allocated by the PE side. In order to achieve this we >> should set up a GstBufferPool ahead of time containing a fixed number of >> buffers. (I think we can guarantee we won't run out?) If the output >> buffer comes from this pool, we can retrieve the corresponding input >> buffer (the original intent of struct wg_sample) and return that pointer >> to user space. If not, we should call gst_buffer_pool_acquire_buffer() >> ourselves and blit the output buffer into it. >> >> In both cases, though, it doesn't make sense to force a specific output >> buffer unless it's the same as the input buffer, which means there's no >> need for an extra API. Is there some extra consideration I'm missing here? >> >> The other consequence that falls out from this is that the act of >> pushing data will inherently cause data to be written; i.e. there is no >> concept of "peek" quite like pipes. On the other hand, we can still >> "peek" at the last output buffer (and, more importantly, its associated >> metadata) without actually dequeuing it (and hence without changing the >> wg_transform state at all), which is kind of what I was trying to >> propose in the first place. It's not obvious to me why this is >> problematic; am I missing something here as well? > > There's no such thing as transforming in-place. Input buffers (as in > provided by the app to ProcessInput) and output buffers (as in provided > by the app to ProcessOutput) will never be the same. They just cannot > in general simply because of a difference of size, but also because it's > not how it's supposed to work: input buffers and kept referenced until > they have been fully processed, and until one or more output buffers > have been filled from it. I'm looking at MFT_INPUT_STREAM_PROCESSES_IN_PLACE. I haven't checked whether anything interesting exposes it, though; maybe nothing does? (I don't think any of the GStreamer elements we care about are capable of in-place transform either, but theoretically some of them could be.) > > What zero-copy means however, is avoiding any unnecessary copy when > passing data back and forth GStreamer and unix-side in general. > > The only way this can be done, for ProcessOutput, like I described > previously already, is by: > > 1) providing the application provided output buffer to GStreamer > *before* processing any input, so that any future buffer allocation > can use it to write the decoded data. > > 2) process the input buffers that were queued, checking for a format > change event, or whether the output buffer has been used. > > 3) blit the output data to the output buffer if it wasn't picked up. > > 4) remove the output buffer from any GStreamer buffer thay may still be > referencing it (This last step also possibly but hopefully doesn't > involve copying the data back from the output buffer to GStreamer > memory if the data isn't supposed to be discarded, this can happen > in some case, at least when format change event is returned so we > do not lose the data but not only). > Okay, the thing I was missing was that in the case of Media Foundation transforms (or at least some of them), we can't actually choose the sample buffer, but rather it gets provided to us by the application on a per-sample basis. Same for DMOs I guess. (On the other hand, DirectShow can pool, and so can all three parser frontends. And Media Foundation transforms are capable of pooling, although neither of the ones we currently implement actually do.) This is ugly to deal with. I guess you can provide a specific buffer by (ab)using GstBufferPool's acquire_buffer() method to block until ProcessOutput() is called and we have a buffer; I don't know if you had a better plan than that. On the other hand, it's not clear to me how we're going to deal with MF_E_TRANSFORM_STREAM_CHANGE or MFT_OUTPUT_STATUS_SAMPLE_READY while maintaining zero-copy. What is your plan for those?
I'm looking at MFT_INPUT_STREAM_PROCESSES_IN_PLACE. I haven't checked whether anything interesting exposes it, though; maybe nothing does?
(I don't think any of the GStreamer elements we care about are capable of in-place transform either, but theoretically some of them could be.)
As far as the tests go, no transform is currently exposing this (and we start to have quite a few in the tests).
There's also the MFT_OUTPUT_STREAM_(CAN_)PROVIDES_SAMPLES flags that could have been useful, but there too, it doesn't seem to be used by native transforms, and I think games aren't expecting to see it.
Thus the implementation has all been around the worst case, which is when output buffer is app provided and cannot be swapped behind the scenes.
This is ugly to deal with. I guess you can provide a specific buffer by (ab)using GstBufferPool's acquire_buffer() method to block until ProcessOutput() is called and we have a buffer; I don't know if you had a better plan than that.
A bit like this, but it's actually more tricky because some decoders aren't playing ball.
They are happy to allocate buffers from a downstream pool, but they won't release the buffers so easily. I understand that they keep the buffers for previous frame reference in their decoding process, and that needs to outlive the app provided output buffer.
Because they hold a ref on the buffer, it isn't writable and we cannot detach its memory either.
On the other hand, it's not clear to me how we're going to deal with MF_E_TRANSFORM_STREAM_CHANGE or MFT_OUTPUT_STATUS_SAMPLE_READY while maintaining zero-copy. What is your plan for those?
So my plan, which will also cover the case above, is to use instead a custom allocator (going to be generally useful imho if we want zero-copy in wg_parser too), which will allocate memory normally with an optional app provided memory overlay.
When allocation needs to be done without an app provided buffer, the allocator will work as the default allocator, allocate unix side memory, and return a pointer to it on map. This is for instance going to happen if more than one buffer gets decoded. In that case, on the next ProcessOutput call, the data will have to be copied.
When we have an app provided buffer, and in the good cases, it would instead overlayed the unix memory with it and return the app memory on map so the decoder writes directly to it. When the app buffer needs to be released, we remove the overlay and the unix memory will later be used if there's any re-use.
In the bad cases, when either the output buffer is still referenced by the decoder, or when we actually should keep the output sample pending, and when we have incorrectly overlayed app buffer, we will copy back the data to the unix memory before detaching the overlay. In any case we still get only one buffer copy, so not worst than the normal copy.
In practice the bad cases shouldn't happen too often: when decoder hold on the buffers, we cannot remove memory from them. This also means the pool will not discard the buffers on release, and after a few allocs the decoder will have enough buffers to use on them and the allocator will not receive any more allocation requests. We will have to copy the data anyway, but in the right direction this time.
In the good cases, releasing buffer memory will make sure to taint the buffers and the pool will free them on release. The allocator will receive a request every time a buffer is needed.