## Context
This is a problem discovered when running "Max: The Curse of Brotherhood" under Proton. This game writes 4MB buffers into `wg_transform`, each containing about 4 seconds of compressed video. `wg_transform` calls `gst_pad_push` in `transform_ProcessOutput`, which takes ~300ms (~20 frames at 60fps) to decode such a buffer. The end result is that the game hitches every 4 seconds while playing cutscene videos.
Proton currently has a special case for this particular game that breaks the buffer up into small chunks to avoid long blocking pushes. This MR adopts that approach and applies it generally.
One concern raised by @redmcg is:
> The only issue I can think of is if there are any decoders which don't have a parser; then they might not know how to deal with an arbitrary 4096 byte buffer (the parser is generally responsible for taking arbitrary buffers and producing something the decoder can work with, for example: a full frame).
So this MR only enables this strategy when there is a parser element.
```diff
From: Yuxuan Shui <yshui@codeweavers.com>

---
 dlls/winegstreamer/wg_transform.c | 44 ++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/dlls/winegstreamer/wg_transform.c b/dlls/winegstreamer/wg_transform.c
index 558f6572a1b..50c1958013b 100644
--- a/dlls/winegstreamer/wg_transform.c
+++ b/dlls/winegstreamer/wg_transform.c
@@ -53,16 +53,18 @@ struct wg_transform
     GstQuery *drain_query;
 
     GstAtomicQueue *input_queue;
+    GstBuffer *current_buffer;
     MFVideoInfo input_info;
     MFVideoInfo output_info;
 
     GstAtomicQueue *output_queue;
     GstSample *output_sample;
-    bool output_caps_changed;
     GstCaps *desired_caps;
     GstCaps *output_caps;
     GstCaps *input_caps;
+    bool output_caps_changed;
+    bool has_parser;
     bool draining;
 };
@@ -492,8 +494,12 @@ static bool transform_create_decoder_elements(struct wg_transform *transform,
     if ((element = find_element(GST_ELEMENT_FACTORY_TYPE_PARSER, transform->input_caps, parsed_caps))
             && !append_element(transform->container, element, first, last))
+    {
+        transform->has_parser = true;
         goto done;
-    else if (!element)
+    }
+
+    if (!element)
     {
         gst_caps_unref(parsed_caps);
         parsed_caps = gst_caps_ref(transform->input_caps);
@@ -1035,18 +1041,42 @@ error:
 static bool get_transform_output(struct wg_transform *transform, struct wg_sample *sample)
 {
-    GstBuffer *input_buffer;
     GstFlowReturn ret;
 
     wg_allocator_provide_sample(transform->allocator, sample);
 
-    while (!(transform->output_sample = gst_atomic_queue_pop(transform->output_queue))
-            && (input_buffer = gst_atomic_queue_pop(transform->input_queue)))
+    while (!(transform->output_sample = gst_atomic_queue_pop(transform->output_queue)))
     {
-        if ((ret = gst_pad_push(transform->my_src, input_buffer)))
+        /* A big buffer could take the transform a long time (longer than a vblank) to
+         * process, so if we write the entire input buffer in one go, we could block and
+         * cause the application to hitch. So we break the buffer up, write just enough
+         * to get something back, and stop. However, doing so is only safe if we are
+         * writing into a parser, as it knows how to put things back together properly. */
+
+        gsize target_push_size = 4096, size, maxsize, offset;
+        GstBuffer *tmp_buffer = NULL;
+        if (!transform->current_buffer)
+            transform->current_buffer = gst_atomic_queue_pop(transform->input_queue);
+        if (!transform->current_buffer)
+            break;
+
+        size = gst_buffer_get_sizes(transform->current_buffer, &offset, &maxsize);
+        if (size < target_push_size || !transform->has_parser)
+            target_push_size = size;
+
+        tmp_buffer = gst_buffer_copy_region(transform->current_buffer,
+                GST_BUFFER_COPY_METADATA | GST_BUFFER_COPY_MEMORY, 0, target_push_size);
+        if ((ret = gst_pad_push(transform->my_src, tmp_buffer)))
             GST_WARNING("Failed to push transform input, error %d", ret);
-        complete_drain(transform);
+        if (target_push_size < size)
+            gst_buffer_resize(transform->current_buffer, target_push_size, -1);
+        else
+        {
+            gst_buffer_unref(transform->current_buffer);
+            transform->current_buffer = NULL;
+            complete_drain(transform);
+        }
     }
 
     /* Remove the sample so the allocator cannot use it */
```
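For readers unfamiliar with the GStreamer buffer API, here is the slicing pattern from the hunk above as a standalone sketch. The helper name and the fixed chunk size are illustrative only; the MR itself only slices when a parser is present:

```c
/* Illustrative sketch (not part of the MR): peel fixed-size pieces off the
 * front of a GstBuffer. gst_buffer_copy_region with GST_BUFFER_COPY_MEMORY
 * shares the underlying GstMemory, so the payload is not copied. */
#include <gst/gst.h>

#define CHUNK_SIZE 4096

static void push_in_chunks(GstPad *src, GstBuffer *buffer)
{
    gsize offset, maxsize;
    gsize size = gst_buffer_get_sizes(buffer, &offset, &maxsize);

    while (size)
    {
        gsize push_size = MIN(size, (gsize)CHUNK_SIZE);
        GstBuffer *piece = gst_buffer_copy_region(buffer,
                GST_BUFFER_COPY_METADATA | GST_BUFFER_COPY_MEMORY, 0, push_size);

        /* gst_pad_push takes ownership of the piece. */
        if (gst_pad_push(src, piece) != GST_FLOW_OK)
            break;

        /* Drop the bytes just pushed from the front of the remaining buffer. */
        gst_buffer_resize(buffer, push_size, -1);
        size -= push_size;
    }
    gst_buffer_unref(buffer);
}
```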
What does native do?
I'm not sure we can tell what native does exactly here, but IMO, and especially as we're doing this only if we have a parser, this should instead use the parser in a standalone way to split the data into individual packets with the correct boundaries. This could probably be done in push_data, then every parsed buffer would be queued instead of the input data itself.
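For illustration, driving a parser standalone as suggested could look roughly like the sketch below, using `GstHarness` from gst-check; the `h264parse` element and the caps string are assumptions for the example, not something the MR does:

```c
/* Hedged sketch of the suggestion: run a parser standalone and queue each
 * parsed (correctly bounded) buffer instead of the raw input data. */
#include <gst/check/gstharness.h>

static void split_with_parser(GstAtomicQueue *queue, GstBuffer *input)
{
    GstHarness *h = gst_harness_new("h264parse"); /* illustrative parser */
    GstBuffer *parsed;

    gst_harness_set_src_caps_str(h, "video/x-h264, stream-format=byte-stream");
    gst_harness_push(h, input); /* the harness takes ownership of the input */

    /* Queue every buffer the parser has produced so far; a real version
     * would also handle data the parser is still accumulating. */
    while ((parsed = gst_harness_try_pull(h)))
        gst_atomic_queue_push(queue, parsed);

    gst_harness_teardown(h);
}
```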
On Tue Feb 11 17:12:53 2025 +0000, Rémi Bernon wrote:
> I'm not sure we can tell what native does exactly here, but IMO, and especially as we're doing this only if we have a parser, this should instead use the parser in a standalone way to split the data into individual packets with the correct boundaries. This could probably be done in push_data, then every parsed buffer would be queued instead of the input data itself.
oh, you mean we take the parser element out and drive it manually ourselves, instead of letting gstreamer do it? i can look into whether this is doable. but it feels like we'd be using gstreamer in a way it doesn't want to be used. i heard we might be moving past gstreamer eventually?
Eh, sorry, my question was based on a misreading—these aren't coming from our demuxer.
> I'm not sure we can tell what native does exactly here, but IMO, and especially as we're doing this only if we have a parser, this should instead use the parser in a standalone way to split the data into individual packets with the correct boundaries. This could probably be done in push_data, then every parsed buffer would be queued instead of the input data itself.
Why is that better?
On Wed Feb 12 00:47:29 2025 +0000, Elizabeth Figura wrote:
> Eh, sorry, my question was based on a misreading—these aren't coming from our demuxer.
>
> > I'm not sure we can tell what native does exactly here, but IMO, and especially as we're doing this only if we have a parser, this should instead use the parser in a standalone way to split the data into individual packets with the correct boundaries. This could probably be done in push_data, then every parsed buffer would be queued instead of the input data itself.
>
> Why is that better?
i think by doing that, we don't need to break the buffer up at an arbitrary size, e.g. the 4096 bytes chosen by this MR?

but i guess then there's the question of whether the parser will always break the buffer up reasonably (e.g. one chunk per frame)? what if the parser also spits out enormous buffers?
i feel the only "clean" solution would be running the decoder in a separate thread, which would allow decoding to happen asynchronously.
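For what the separate-thread idea might look like, here is a rough sketch; it reuses the `wg_transform` fields from the diff above but omits draining, shutdown, and waking the worker when the queue refills:

```c
/* Hypothetical sketch only: a worker thread feeds the pipeline so that
 * gst_pad_push never blocks the application thread. A real implementation
 * would wait on a condition variable instead of exiting when the queue is
 * empty, and would coordinate with draining and destruction. */
static gpointer transform_input_worker(gpointer data)
{
    struct wg_transform *transform = data;
    GstBuffer *buffer;

    while ((buffer = gst_atomic_queue_pop(transform->input_queue)))
        gst_pad_push(transform->my_src, buffer); /* may take ~300ms without hitching the app */

    return NULL;
}

/* started once when the transform is created:
 * g_thread_new("wg_transform_input", transform_input_worker, transform); */
```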
> but i guess then there's the question of whether the parser will always break the buffer up reasonably (e.g. one chunk per frame)? what if the parser also spits out enormous buffers?
In practice parsers should output individual framed buffers; that's part of the point of the parser.
On Wed Feb 12 00:55:33 2025 +0000, Elizabeth Figura wrote:
> > but i guess then there's the question of whether the parser will always break the buffer up reasonably (e.g. one chunk per frame)? what if the parser also spits out enormous buffers?
>
> In practice parsers should output individual framed buffers; that's part of the point of the parser.
so i have no idea how h264 works so i had to do some research. but it looks like h264parse splits things into what are called NALUs, not frames. in this case, NALUs are actually sub-frame (at most a frame), so we are good for this particular encoder.

but in general it feels dangerous to assume a parser will always break the buffer into frames or sub-frames. but maybe that's just me.
On Wed Feb 12 13:35:29 2025 +0000, Yuxuan Shui wrote:
> so i have no idea how h264 works so i had to do some research. but it looks like h264parse splits things into what are called NALUs, not frames. in this case, NALUs are actually sub-frame (at most a frame), so we are good for this particular encoder.
>
> but in general it feels dangerous to assume a parser will always break the buffer into frames or sub-frames. but maybe that's just me.
Regardless of how the parser decides to emit output buffers, it'll in the best case be smaller than this arbitrary split and in the worst case give identical results.
On Wed Feb 12 13:40:43 2025 +0000, Rémi Bernon wrote:
> Regardless of how the parser decides to emit output buffers, it'll in the best case be smaller than this arbitrary split and in the worst case give identical results.
Not necessarily: the parser can accumulate multiple buffers and only split at the boundaries it finds suitable.
On Wed Feb 12 14:06:37 2025 +0000, Yuxuan Shui wrote:
> Not necessarily: the parser can accumulate multiple buffers and only split at the boundaries it finds suitable.
Which it would also do with this arbitrary split. If this works, then parsers don't accumulate more than necessary, which is likely frame or sub-frame granularity.

Using the parser standalone for splitting is also going to help zero-copy: we can only write to the client buffers one frame at a time. Pushing data that decodes to more than one frame will require these frames to be decoded into unix-side buffers, with copies into the client buffers done later.
On Wed Feb 12 14:57:44 2025 +0000, Rémi Bernon wrote:
> Which it would also do with this arbitrary split. If this works, then parsers don't accumulate more than necessary, which is likely frame or sub-frame granularity.
>
> Using the parser standalone for splitting is also going to help zero-copy: we can only write to the client buffers one frame at a time. Pushing data that decodes to more than one frame will require these frames to be decoded into unix-side buffers, with copies into the client buffers done later.
re: how `h264parse` splits; if you run `gst-inspect-1.0 h264parse`, you'll see one of the caps is `alignment`, which has two options:

1. `au`; and
2. `nal`

So it can split the buffers in one of two ways:

1. at the `au` boundary, which stands for 'Access Unit' and represents a frame (or a field, for interlaced video); or
2. at the `nal` boundary. NALs are smaller than access units and represent a logical bundle of data (which can be metadata or image data). An AU may have several NALs, but a NAL will never overlap two AUs.
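For reference, selecting one of these alignments from code would be a matter of constraining the parser's output caps, e.g. with a capsfilter; a sketch (the pipeline wiring around it is assumed):

```c
/* Sketch: force h264parse to emit one buffer per access unit
 * (use alignment=nal for NAL-sized buffers instead). */
GstElement *filter = gst_element_factory_make("capsfilter", NULL);
GstCaps *caps = gst_caps_from_string(
        "video/x-h264, stream-format=(string)byte-stream, alignment=(string)au");

g_object_set(filter, "caps", caps, NULL);
gst_caps_unref(caps);
/* ... then link: h264parse -> filter -> decoder ... */
```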
On Wed Feb 12 21:24:38 2025 +0000, Brendan McGrath wrote:
> re: how `h264parse` splits; if you run `gst-inspect-1.0 h264parse`, you'll see one of the caps is `alignment`, which has two options:
>
> 1. `au`; and
> 2. `nal`
>
> So it can split the buffers in one of two ways:
>
> 1. at the `au` boundary, which stands for 'Access Unit' and represents a frame (or a field, for interlaced video); or
> 2. at the `nal` boundary. NALs are smaller than access units and represent a logical bundle of data (which can be metadata or image data). An AU may have several NALs, but a NAL will never overlap two AUs.
so, it turns out pushing a large buffer through just the parser is enough to cause stutters. it's way better than having it go through the whole decoding pipeline, but it's still noticeable.
On Tue Feb 25 15:42:19 2025 +0000, Yuxuan Shui wrote:
> so, it turns out pushing a large buffer through just the parser is enough to cause stutters. it's way better than having it go through the whole decoding pipeline, but it's still noticeable. (Edit: i should try again without gst logging.)
yeah, the stutter is gone after logging is disabled, but it does still make me worry, as it indicates that the headroom is pretty small