On my Steam Deck, this radically reduces the time it takes to initialize the wg_parser (by around 1 second). This helps in the game WILD HEARTS, which doesn't present frames while loading help videos during gameplay, causing large stutters.
Because GStreamer can randomly access the file with no pattern known on our side, I opted to implement this by buffering 4 chunks, so that interleaved reads to different areas of the file don't force us to discard and reload cached data more often than necessary.
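The eviction policy amounts to a small most-recently-used list: on a hit, the chunk is promoted to the front, so the least recently used chunk drifts to the last slot and is the one reused on a miss. A minimal standalone sketch of that bookkeeping (names and types here are illustrative, not the patch's; the real code stores GstBuffer-backed chunks):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_CHUNKS 4
#define CHUNK_SIZE (512 << 10) /* 512 KiB, matching the patch */

struct chunk
{
    unsigned long long offset; /* chunk-aligned file offset */
    unsigned char *data;       /* NULL while the slot is unused */
};

static struct chunk cache[NUM_CHUNKS];

/* Find the chunk covering `offset`; on a hit, shift the entries above it
 * down one slot and move the hit to the front (most recently used). */
static struct chunk *cache_lookup(unsigned long long offset)
{
    for (int i = 0; i < NUM_CHUNKS; i++)
    {
        struct chunk hit = cache[i];
        /* unsigned arithmetic: offsets below hit.offset wrap and fail the test */
        if (hit.data && offset - hit.offset < CHUNK_SIZE)
        {
            memmove(&cache[1], &cache[0], i * sizeof(struct chunk));
            cache[0] = hit;
            return &cache[0];
        }
    }
    return NULL; /* miss: caller evicts cache[NUM_CHUNKS - 1] and reloads it */
}
```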
-- v2: winegstreamer: Cache wg_parser input data.
From: Derek Lesho <dlesho@codeweavers.com>
This reduces wg_parser initialization time by skipping round-trips to the PE thread.
Signed-off-by: Derek Lesho <dlesho@codeweavers.com>
---
 dlls/winegstreamer/wg_parser.c | 117 +++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)
diff --git a/dlls/winegstreamer/wg_parser.c b/dlls/winegstreamer/wg_parser.c
index 5bb824f4399..4a49c43f7f2 100644
--- a/dlls/winegstreamer/wg_parser.c
+++ b/dlls/winegstreamer/wg_parser.c
@@ -59,6 +59,12 @@ GST_DEBUG_CATEGORY(wine);
typedef BOOL (*init_gst_cb)(struct wg_parser *parser);
+struct input_cache_chunk
+{
+    guint64 offset;
+    uint8_t *data;
+};
+
 struct wg_parser
 {
     init_gst_cb init_gst;
@@ -96,7 +102,10 @@ struct wg_parser
     bool unlimited_buffering;
     gchar *sink_caps;
+
+    struct input_cache_chunk input_cache_chunks[4];
 };
+static unsigned int wg_parser_chunk_size = 512 << 10;
 struct wg_parser_stream
 {
@@ -967,10 +976,87 @@ static void pad_removed_cb(GstElement *element, GstPad *pad, gpointer user)
     g_free(name);
 }
+static GstFlowReturn read_input_cache(struct wg_parser *parser, guint64 offset, guint size, GstBuffer *buffer)
+{
+    unsigned int read_size, bytes_read = 0;
+    struct input_cache_chunk chunk;
+    GstBuffer *chunk_buffer;
+    GstFlowReturn ret;
+    int i;
+
+    while (size)
+    {
+        for (i = 0; i < ARRAY_SIZE(parser->input_cache_chunks); i++)
+        {
+            chunk = parser->input_cache_chunks[i];
+
+            if (chunk.data && offset - chunk.offset < wg_parser_chunk_size)
+            {
+                read_size = min(size, wg_parser_chunk_size - (offset - chunk.offset));
+                gst_buffer_fill(buffer, bytes_read, chunk.data + (offset - chunk.offset), read_size);
+
+                if (i != 0)
+                {
+                    memmove(&parser->input_cache_chunks[1], &parser->input_cache_chunks[0], i * sizeof(chunk));
+                    parser->input_cache_chunks[0] = chunk;
+                }
+
+                size -= read_size;
+                offset += read_size;
+                bytes_read += read_size;
+
+                break;
+            }
+        }
+
+        if (i != ARRAY_SIZE(parser->input_cache_chunks))
+            continue;
+
+        chunk = parser->input_cache_chunks[ARRAY_SIZE(parser->input_cache_chunks) - 1];
+
+        chunk.offset = offset - (offset % wg_parser_chunk_size);
+        if (!chunk.data)
+            chunk.data = malloc(wg_parser_chunk_size);
+
+        chunk_buffer = gst_buffer_new_wrapped_full(0, chunk.data, wg_parser_chunk_size, 0, wg_parser_chunk_size, NULL, NULL);
+
+        pthread_mutex_lock(&parser->mutex);
+
+        assert(!parser->read_request.size);
+        parser->read_request.buffer = chunk_buffer;
+        parser->read_request.offset = chunk.offset;
+        parser->read_request.size = wg_parser_chunk_size;
+        parser->read_request.done = false;
+        pthread_cond_signal(&parser->read_cond);
+
+        while (!parser->read_request.done)
+            pthread_cond_wait(&parser->read_done_cond, &parser->mutex);
+
+        ret = parser->read_request.ret;
+
+        pthread_mutex_unlock(&parser->mutex);
+
+        gst_buffer_unref(chunk_buffer);
+
+        if (ret != GST_FLOW_OK)
+        {
+            if (!parser->input_cache_chunks[ARRAY_SIZE(parser->input_cache_chunks) - 1].data)
+                free(chunk.data);
+            return ret;
+        }
+
+        memmove(&parser->input_cache_chunks[1], &parser->input_cache_chunks[0],
+                (ARRAY_SIZE(parser->input_cache_chunks) - 1) * sizeof(chunk));
+        parser->input_cache_chunks[0] = chunk;
+    }
+
+    return GST_FLOW_OK;
+}
+
 static GstFlowReturn src_getrange_cb(GstPad *pad, GstObject *parent,
         guint64 offset, guint size, GstBuffer **buffer)
 {
     struct wg_parser *parser = gst_pad_get_element_private(pad);
+    GstBuffer *working_buffer = *buffer;
     GstFlowReturn ret;
     GST_LOG("pad %p, offset %" G_GINT64_MODIFIER "u, size %u, buffer %p.", pad, offset, size, *buffer);
@@ -991,6 +1077,28 @@ static GstFlowReturn src_getrange_cb(GstPad *pad, GstObject *parent,
         return GST_FLOW_OK;
     }
+    if (size < wg_parser_chunk_size)
+    {
+        if (offset >= parser->file_size)
+            return GST_FLOW_EOS;
+
+        if ((offset + size) >= parser->file_size)
+            size = parser->file_size - offset;
+
+        if (!working_buffer)
+            working_buffer = gst_buffer_new_and_alloc(size);
+
+        if ((ret = read_input_cache(parser, offset, size, working_buffer)) != GST_FLOW_OK)
+        {
+            if (!*buffer)
+                gst_buffer_unref(working_buffer);
+            return ret;
+        }
+
+        *buffer = working_buffer;
+        return GST_FLOW_OK;
+    }
+
     pthread_mutex_lock(&parser->mutex);
     assert(!parser->read_request.size);
@@ -1562,6 +1670,15 @@ static NTSTATUS wg_parser_disconnect(void *args)
     g_free(parser->sink_caps);
     parser->sink_caps = NULL;
+    for (i = 0; i < ARRAY_SIZE(parser->input_cache_chunks); i++)
+    {
+        if (parser->input_cache_chunks[i].data)
+        {
+            free(parser->input_cache_chunks[i].data);
+            parser->input_cache_chunks[i].data = NULL;
+        }
+    }
+
     return S_OK;
 }
This will take up to 32 MiB per wg_parser.
I looked up online what typical SSD read chunk sizes are, and found the number 512 KiB. Setting the chunk to this size actually seems to *improve* performance, presumably because we're spending less time reading/copying data we don't need at the moment. Do you think this is small enough (512 KiB per chunk × 4 chunks = 2 MiB per parser)?
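For reference, the per-request bookkeeping this implies is just chunk alignment arithmetic. A standalone sketch (helper names are mine, not the patch's):

```c
#include <assert.h>

#define CHUNK_SIZE (512 << 10) /* 512 KiB per chunk; 4 chunks = 2 MiB per parser */

/* Round a file offset down to the start of its containing chunk,
 * as the patch does with offset - (offset % wg_parser_chunk_size). */
static unsigned long long chunk_start(unsigned long long offset)
{
    return offset - (offset % CHUNK_SIZE);
}

/* How many bytes of a `size`-byte request can be served from the chunk
 * containing `offset` before the read crosses into the next chunk. */
static unsigned int bytes_from_chunk(unsigned long long offset, unsigned int size)
{
    unsigned int avail = CHUNK_SIZE - (unsigned int)(offset % CHUNK_SIZE);
    return size < avail ? size : avail;
}
```

A request that straddles a chunk boundary is satisfied in two iterations: the tail of one chunk, then the head of the next.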
I don't know if there's any size that's not going to make me nervous, frankly. Ideally the cache should be as small as possible without losing the performance gain.
And along similar lines, is it even worth falling back to the old path for larger read requests?
The advantage to falling back that I can think of is that it's one less copy: the data is first copied into the cache, only to be read out into the buffer afterwards. How important this is to performance probably depends on how often we hit the read_size > chunk_size path. In my test case, I don't think it would hurt too much, but in the case where GStreamer reads the whole file at once, maybe this would make things worse. If it's okay, I'll let you make the judgement call.
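To make the copy-count tradeoff concrete, here is a toy illustration (hypothetical helpers, nothing from the wg_parser API): the cached path touches each byte twice (source into the chunk, then chunk into the destination buffer), while the direct path touches it once.

```c
#include <assert.h>
#include <string.h>

static unsigned int copies; /* bytes copied, to compare the two paths */

static void counted_copy(unsigned char *dst, const unsigned char *src, unsigned int n)
{
    memcpy(dst, src, n);
    copies += n;
}

/* Direct path: one copy, source straight into the caller's buffer. */
static unsigned int read_direct(unsigned char *dst, const unsigned char *src, unsigned int n)
{
    copies = 0;
    counted_copy(dst, src, n);
    return copies;
}

/* Cached path: stage the data in a chunk, then serve the request from it. */
static unsigned int read_cached(unsigned char *dst, const unsigned char *src,
                                unsigned char *chunk, unsigned int n)
{
    copies = 0;
    counted_copy(chunk, src, n); /* fill the cache chunk */
    counted_copy(dst, chunk, n); /* copy out to the caller */
    return copies;
}
```

The cached path's extra copy only pays off when later requests hit the staged chunk, which is why the question hinges on how often large reads recur.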
I'd honestly be surprised if the read size is ever going to be that large in practice. Despite that I don't have a strong opinion, but if you need one I'd say to just always use the cache. It'd be easy to add a direct path later anyway.
If not, we should factor out a helper to actually perform the read pseudo-callback.
I'm a bit confused here: if it's not worth falling back to the old path, then we'd only have one place performing the read pseudo-callback, the function reading/loading from the cache into the buffer, right? If you meant the other way around, where we keep both paths, then yeah, I can make that helper.
Yes, I meant the other way around, sorry.
The parser mutex doesn't seem to be taken around everything that it should be.
I'm pretty sure getrange doesn't need to be implemented as thread-safe, as GStreamer takes a lock to the pad anyway before invoking the callback:
https://github.com/GStreamer/gstreamer/blob/2db3ddaa9d8e9e0a1659d7e48623c512...
Yes, you're right, sorry.
I've done some performance testing, and here are my results (milliseconds between the two presents when a video starts playing in WILD HEARTS):
keeping fallback path for large reads:
512 KiB chunk size: 176ms, 176ms, 182ms, 176ms, 186ms
256 KiB chunk size: 180ms, 193ms, 189ms, 213ms, 204ms
128 KiB chunk size: 304ms, 311ms, 302ms, 307ms, 314ms
everything goes through cache:
256 KiB chunk size: 210ms, 199ms, 220ms, 195ms, 211ms
I also looked a bit at a 32 KiB chunk size: with the fallback path for larger reads we were in the range of 400-500ms, but without it, a full 7 seconds.
Taking all that into consideration, I'll keep the fallback path out of caution, keep the chunk size at 512 KiB, and disable the path entirely for 32-bit, unless there are any objections.