It is possible that a stream is destroyed while another thread is waiting on event_empty_cond or event_cond. The waiting thread should have the opportunity to be rescheduled and exit the critical section before the condition variables and the mutex are destroyed.
-- v2: winegstreamer: Do not destroy condition variables when a thread is waiting on them.
From: Giovanni Mascellani <gmascellani@codeweavers.com>
It is possible that a stream is destroyed while another thread is waiting on event_empty_cond or event_cond. The waiting thread should have the opportunity to be rescheduled and exit the critical section before the condition variables and the mutex are destroyed.
---
 dlls/winegstreamer/wg_parser.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/dlls/winegstreamer/wg_parser.c b/dlls/winegstreamer/wg_parser.c
index 0573c99858b..feead069391 100644
--- a/dlls/winegstreamer/wg_parser.c
+++ b/dlls/winegstreamer/wg_parser.c
@@ -102,13 +102,15 @@ struct wg_parser_stream
     GstSegment segment;

     struct wg_format preferred_format, current_format;

-    pthread_cond_t event_cond, event_empty_cond;
+    pthread_cond_t event_cond, event_empty_cond, free_cond;
     GstBuffer *buffer;
     GstMapInfo map_info;

-    bool flushing, eos, enabled, has_caps;
+    bool flushing, eos, enabled, has_caps, freeing;

     uint64_t duration;
+
+    unsigned int waiters;
 };

 static NTSTATUS wg_parser_get_stream_count(void *args)
@@ -270,8 +272,15 @@ static NTSTATUS wg_parser_stream_get_buffer(void *args)

     pthread_mutex_lock(&parser->mutex);

-    while (!stream->eos && !stream->buffer)
+    while (!stream->eos && !stream->buffer && !stream->freeing)
+    {
+        stream->waiters++;
         pthread_cond_wait(&stream->event_cond, &parser->mutex);
+        stream->waiters--;
+    }
+
+    if (stream->freeing && stream->waiters == 0)
+        pthread_cond_signal(&stream->free_cond);

     /* Note that we can both have a buffer and stream->eos, in which case we
      * must return the buffer. */
@@ -547,7 +556,14 @@ static GstFlowReturn sink_chain_cb(GstPad *pad, GstObject *parent, GstBuffer *bu
      * implementing a queue object here. */
     while (stream->enabled && !stream->flushing && stream->buffer)
+    {
+        stream->waiters++;
         pthread_cond_wait(&stream->event_empty_cond, &parser->mutex);
+        stream->waiters--;
+    }
+
+    if (stream->freeing && stream->waiters == 0)
+        pthread_cond_signal(&stream->free_cond);

     if (!stream->enabled)
     {
@@ -694,6 +710,7 @@ static struct wg_parser_stream *create_stream(struct wg_parser *parser)
     stream->current_format.major_type = WG_MAJOR_TYPE_UNKNOWN;
     pthread_cond_init(&stream->event_cond, NULL);
     pthread_cond_init(&stream->event_empty_cond, NULL);
+    pthread_cond_init(&stream->free_cond, NULL);

     sprintf(pad_name, "qz_sink_%u", parser->stream_count);
     stream->my_sink = gst_pad_new(pad_name, GST_PAD_SINK);
@@ -708,6 +725,13 @@ static struct wg_parser_stream *create_stream(struct wg_parser *parser)

 static void free_stream(struct wg_parser_stream *stream)
 {
+    pthread_mutex_lock(&stream->parser->mutex);
+    stream->freeing = true;
+    pthread_cond_signal(&stream->event_cond);
+    while (stream->waiters != 0)
+        pthread_cond_wait(&stream->free_cond, &stream->parser->mutex);
+    pthread_mutex_unlock(&stream->parser->mutex);
+
     if (stream->their_src)
     {
         if (stream->post_sink)
@@ -722,6 +746,7 @@ static void free_stream(struct wg_parser_stream *stream)

     pthread_cond_destroy(&stream->event_cond);
     pthread_cond_destroy(&stream->event_empty_cond);
+    pthread_cond_destroy(&stream->free_cond);

     free(stream);
 }
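(Not part of the patch, for readers less familiar with the pattern: below is a minimal standalone sketch of the same waiter-count handshake using plain POSIX threads. The names `waiters`, `freeing`, `free_cond` and `event_cond` mirror the diff above; the `struct stream` layout, `reader_thread` and the demo `main` are purely illustrative.)

```c
/* Minimal sketch of the waiter-count handshake (illustrative only). */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct stream
{
    pthread_mutex_t mutex;
    pthread_cond_t event_cond;   /* "a buffer arrived" or "we are freeing" */
    pthread_cond_t free_cond;    /* "the waiter count changed" */
    bool have_buffer, freeing;
    unsigned int waiters;
};

/* Reader: wait for a buffer, but bail out if the stream is being freed. */
static void *reader_thread(void *arg)
{
    struct stream *s = arg;

    pthread_mutex_lock(&s->mutex);
    s->waiters++;
    pthread_cond_signal(&s->free_cond);
    while (!s->have_buffer && !s->freeing)
        pthread_cond_wait(&s->event_cond, &s->mutex);
    s->waiters--;
    /* Tell the freeing thread it is now safe to tear the stream down. */
    pthread_cond_signal(&s->free_cond);
    pthread_mutex_unlock(&s->mutex);
    return NULL;
}

/* Destroyer: flag the stream, wake the waiters, and destroy the condition
 * variables only once every waiter has left the critical section. */
static void free_stream(struct stream *s)
{
    pthread_mutex_lock(&s->mutex);
    s->freeing = true;
    pthread_cond_broadcast(&s->event_cond);
    while (s->waiters)
        pthread_cond_wait(&s->free_cond, &s->mutex);
    pthread_mutex_unlock(&s->mutex);

    pthread_cond_destroy(&s->event_cond);
    pthread_cond_destroy(&s->free_cond);
}

int main(void)
{
    struct stream s = { .mutex = PTHREAD_MUTEX_INITIALIZER,
                        .event_cond = PTHREAD_COND_INITIALIZER,
                        .free_cond = PTHREAD_COND_INITIALIZER };
    pthread_t reader;

    pthread_create(&reader, NULL, reader_thread, &s);

    /* Wait until the reader is actually registered, so free_stream() really
     * does race with a thread inside the wait loop. */
    pthread_mutex_lock(&s.mutex);
    while (!s.waiters)
        pthread_cond_wait(&s.free_cond, &s.mutex);
    pthread_mutex_unlock(&s.mutex);

    free_stream(&s);              /* returns only after the reader has left */
    pthread_join(reader, NULL);
    pthread_mutex_destroy(&s.mutex);
    printf("stream freed with no thread left inside the critical section\n");
    return 0;
}
```

Build with `cc -pthread`; the point is that free_stream() cannot return, and hence cannot call pthread_cond_destroy(), while any reader is still between pthread_cond_wait() and pthread_mutex_unlock().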
Hi,
It looks like your patch introduced the new failures shown below. Please investigate and fix them before resubmitting your patch. If they are not new, fixing them anyway would help a lot. Otherwise please ask for the known failures list to be updated.
The tests also ran into some preexisting test failures. If you know how to fix them that would be helpful. See the TestBot job for the details:
The full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=125850
Your paranoid android.
=== debian11 (32 bit report) ===
wmvcore:
wmvcore.c:1663: Test failed: Stream 0: Format 3: Got hr 0xc00d0041.
wmvcore.c:1680: Test failed: Stream 0: Format 3: Media types didn't match.
Ouch, I think this triggered unrelated review requests because I initially pushed some unrelated development commits. Sorry!
I don't think we should add more condition variables to this code, there's already way too many.
FWIW I've got some local changes to the wg_parser design, described at length in https://gitlab.winehq.org/wine/wine/-/merge_requests/1048 and waiting for that MR to be accepted, which would solve the issue described here as well.
It would probably make more sense just to move the actual cleanup to wg_parser_destroy(), although in order to do that we'll also need to call wg_parser_destroy() on the quartz parser earlier.
> It would probably make more sense just to move the actual cleanup to wg_parser_destroy(), although in order to do that we'll also need to call wg_parser_destroy() on the quartz parser earlier.
I don't think that would change the issue substantially. The problem is that currently there is no synchronization between when the reader thread calling `wg_parser_stream_get_buffer()` is done using the stream resources (particularly the condition variables) and when these resources are destroyed. Moving destruction to `wg_parser_destroy()`, in itself, wouldn't change much, because the client could call `RequestSample()` and then immediately `Shutdown()` and `Release()`.
On Tue Nov 8 12:34:59 2022 +0000, Giovanni Mascellani wrote:
> > It would probably make more sense just to move the actual cleanup to
> > wg_parser_destroy(), although in order to do that we'll also need to call
> > wg_parser_destroy() on the quartz parser earlier.
>
> I don't think that would change the issue substantially. The problem is that currently there is no synchronization between when the reader thread calling `wg_parser_stream_get_buffer()` is done using the stream resources (particularly the condition variables) and when these resources are destroyed. Moving destruction to `wg_parser_destroy()`, in itself, wouldn't change much, because the client could call `RequestSample()` and then immediately `Shutdown()` and `Release()`.
Right, I forgot how the other frontends than quartz worked.
On Tue Nov 8 18:37:39 2022 +0000, Zebediah Figura wrote:
> Right, I forgot how the other frontends than quartz worked.
Actually, that implies a deeper synchronization problem in mfplat. If we can't clean up the condition variable, then we can't free the filter either. Do we know that Shutdown() should unblock RequestSample()? It seems quite plausible that it shouldn't (after all, I don't think RequestSample() should ever deadlock—it just might take a few ms to complete), and in that case it's possible that we should be doing some sort of locking on the mfplat side.
Alternatively, mfplat could move the whole wg_parser destruction to Release, on the idea that any concurrent thread calling RequestSample() should have a reference anyway. (I know that in theory mfplat should clean up everything in Shutdown, but as I understand that's just to break circular refcounts within mfplat, so it wouldn't apply here.)
On Tue Nov 8 18:45:05 2022 +0000, Zebediah Figura wrote:
> Actually, that implies a deeper synchronization problem in mfplat. If we can't clean up the condition variable, then we can't free the filter either. Do we know that Shutdown() should unblock RequestSample()? It seems quite plausible that it shouldn't (after all, I don't think RequestSample() should ever deadlock—it just might take a few ms to complete), and in that case it's possible that we should be doing some sort of locking on the mfplat side.
>
> Alternatively, mfplat could move the whole wg_parser destruction to Release, on the idea that any concurrent thread calling RequestSample() should have a reference anyway. (I know that in theory mfplat should clean up everything in Shutdown, but as I understand that's just to break circular refcounts within mfplat, so it wouldn't apply here.)
I don't know whether `Shutdown()` should unblock `RequestSample()` or not, and it doesn't appear to be easy to check. I can theoretically think of a test for that (by creating a media source over a byte stream specifically crafted so that it starts blocking at some point), but it seems like a lot of work for a detail that is probably not particularly important to reproduce correctly.
But yeah, the idea is that one of the three things you mention should happen:

1. `Shutdown()` unblocks `RequestSample()`, which presumably generates an error event;
2. `Shutdown()` waits for `RequestSample()` to return, hoping that it does so quickly (though this is not really obvious: the reason this bug became apparent to me was that in some cases GStreamer is not able to produce a sample for a stream unless all the other streams are actively pumping out samples, I guess for buffering reasons);
3. `Shutdown()` only does the cleanup needed to break refcount cycles and defers the rest to the destructor; implicitly, this still means that destruction has to wait for `RequestSample()` to finish.
Solutions 2 and 3 have a problem when `RequestSample()` is never going to finish, which probably should never happen, but in practice it does (and that's precisely how I became aware of this bug). Specifically, if a file contains a lot of audio samples and no video samples for a while, and the application requests a video sample without receiving the audio samples, it is possible that GStreamer never produces the video sample (for reasons that I presume are related to not having enough look-ahead in the file), so everything is deadlocked.
I guess this should be filed as a bug in itself, but solving it might require some more architectural work. Solution 1 doesn't depend on that being solved, which is why I ended up coding solution 1.
> Actually, that implies a deeper synchronization problem in mfplat. If we can't clean up the condition variable, then we can't free the filter either.
Mmh, I think you're right. For example, it might be that `Shutdown()` is called after a `RequestSample()`, but before the `SOURCE_ASYNC_REQUEST_SAMPLE` command is serviced. Thus `Shutdown()` would happily clean up everything, and then `wait_on_sample()` would try to use an object that has already been destroyed. So the solution must also do something at the media source level, which incidentally looks a bit under-synchronized (for example, `struct media_source::state` is concurrently accessed in many places without any mutex). I'll try to update the MR accordingly.
> Solutions 2 and 3 have a problem when `RequestSample()` is never going to finish, which probably should never happen, but in practice it does (and that's precisely how I became aware of this bug). Specifically, if a file contains a lot of audio samples and no video samples for a while, and the application requests a video sample without receiving the audio samples, it is possible that GStreamer never produces the video sample (for reasons that I presume are related to not having enough look-ahead in the file), so everything is deadlocked.
This bothers me, actually, because in theory gstreamer's multiqueue should prevent this. It's designed to try to buffer streams evenly, but if one stream is getting starved it *should* allow the buffering limits on the others to grow to prevent a deadlock. I think we should investigate why this isn't working correctly.
I know we have the "unlimited buffering" mechanism in place, but in theory that case should be handled by the same multiqueue mechanism, so the workaround probably shouldn't be needed either.
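(For reference, the "unlimited buffering" tweak mentioned here amounts to roughly the following; this is a simplified sketch, not the exact wg_parser code, and the callback and connection names are illustrative.)

```c
/* Simplified sketch (not the exact wg_parser code): when decodebin adds its
 * internal multiqueue, lift the buffering limits so that one starved stream
 * cannot block the others forever. */
#include <gst/gst.h>
#include <string.h>

static void deep_element_added_cb(GstBin *bin, GstBin *sub_bin,
        GstElement *element, gpointer user)
{
    GstElementFactory *factory = gst_element_get_factory(element);

    if (factory && !strcmp(GST_OBJECT_NAME(factory), "multiqueue"))
        g_object_set(element,
                "max-size-buffers", 0u,        /* 0 means "unlimited" */
                "max-size-bytes", 0u,
                "max-size-time", (guint64)0,
                NULL);
}

/* Connected once on the top-level bin, e.g.:
 * g_signal_connect(bin, "deep-element-added",
 *         G_CALLBACK(deep_element_added_cb), NULL); */
```

In other words, the workaround simply turns off the limits that multiqueue's own starvation handling is supposed to manage.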
> This bothers me, actually, because in theory gstreamer's multiqueue should prevent this. It's designed to try to buffer streams evenly, but if one stream is getting starved it *should* allow the buffering limits on the others to grow to prevent a deadlock. I think we should investigate why this isn't working correctly.
Maybe the answer is trivial: soon after `Shutdown()` is called, `gst_element_set_state(GST_STATE_NULL)` is called, and I guess it's totally reasonable that after that GStreamer stops sending samples. So either we go with solution 1 or we have to rearchitect the shutdown logic so that, when `Shutdown()` is called, the WG parser first checks that at least one sample is available for each enabled stream and only then stops GStreamer. And right now I fail to see what advantages that would bring (until we have proof that native does solution 2 or 3).
> > This bothers me, actually, because in theory gstreamer's multiqueue should prevent this. It's designed to try to buffer streams evenly, but if one stream is getting starved it *should* allow the buffering limits on the others to grow to prevent a deadlock. I think we should investigate why this isn't working correctly.
>
> Maybe the answer is trivial: soon after `Shutdown()` is called, `gst_element_set_state(GST_STATE_NULL)` is called, and I guess it's totally reasonable that after that GStreamer stops sending samples.
I don't understand, this seems orthogonal to the problem you're describing, where refusing to read from one stream can starve the other?
> I don't understand, this seems orthogonal to the problem you're describing, where refusing to read from one stream can starve the other?
What I am saying is that there is probably no problem with one stream starving another one, as it should be. At some point I thought that was the problem, because I noticed that a stream wasn't producing samples any more, but that probably happened because `Shutdown()` had been called in the meantime.
> What I am saying is that there is probably no problem with one stream starving another one, as it should be. At some point I thought that was the problem, because I noticed that a stream wasn't producing samples any more, but that probably happened because `Shutdown()` had been called in the meantime.
Ah, thanks for clarifying.
I think that if the correct solution is (2), then we probably want to deal with it on the mfplat side. Actually, in theory it should be as simple as putting a mutex around Shutdown() and RequestSample(); I'm not sure why there isn't one already...
I assume there *is* still a problem with starvation, given that we have that "unlimited buffering" thing going on anyway, but it's probably orthogonal. Fortunately we have a record of which application to retest :-)
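(A closing illustration of the mfplat-side locking suggested above, i.e. "putting a mutex around Shutdown() and RequestSample()". This is a rough sketch under assumed names: `struct media_source`, `cs`, `state` and `SOURCE_SHUTDOWN` are illustrative, not the actual winegstreamer media source code.)

```c
/* Rough sketch of serializing Shutdown() and RequestSample() with one lock.
 * All names here are illustrative, not the real media source implementation. */
#include <windows.h>
#include <mferror.h>

enum source_state { SOURCE_RUNNING, SOURCE_SHUTDOWN };

struct media_source
{
    CRITICAL_SECTION cs;      /* initialized when the source is created */
    enum source_state state;
    /* ... wg_parser handle, stream descriptors, event queue, ... */
};

static HRESULT media_source_RequestSample(struct media_source *source)
{
    EnterCriticalSection(&source->cs);

    if (source->state == SOURCE_SHUTDOWN)
    {
        LeaveCriticalSection(&source->cs);
        return MF_E_SHUTDOWN;
    }

    /* Safe to touch the wg_parser stream here: Shutdown() takes the same
     * lock, so it cannot destroy the stream while we are using it. */

    LeaveCriticalSection(&source->cs);
    return S_OK;
}

static HRESULT media_source_Shutdown(struct media_source *source)
{
    EnterCriticalSection(&source->cs);

    source->state = SOURCE_SHUTDOWN;
    /* Tear down the wg_parser only while holding the lock, so that any
     * in-flight RequestSample() has either finished or will see
     * SOURCE_SHUTDOWN before touching it. */

    LeaveCriticalSection(&source->cs);
    return S_OK;
}
```

Note that this is effectively solution 2 from the discussion above: Shutdown() waits for any in-flight RequestSample() to drop the lock, so it inherits the caveat that the wait can be long (or unbounded) if GStreamer never delivers the requested sample.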