On 3/27/20 4:26 PM, Derek Lesho wrote:
On 3/27/20 3:11 PM, Zebediah Figura wrote:
On 3/27/20 1:08 PM, Derek Lesho wrote:
On 3/27/20 11:32 AM, Zebediah Figura wrote:
On 3/27/20 10:05 AM, Derek Lesho wrote:
On 3/26/20 4:56 PM, Zebediah Figura wrote:
There's another broad question I have with this approach, actually, which is fundamental enough that I have to assume it's had some thought put into it, but it would be nice if that discussion happened in a more public place, and was justified in the patches sent.
Essentially, the question is: what if we were to use decodebin directly?
As I understand it (and admittedly Media Foundation is far more complex than I could hope to understand), an application which calls IMFSourceResolver methods just needs to get back a working IMFMediaSource, and we could wrap decodebin with one of those, similar to the quartz wrapper.
The most basic applications (games) seem to either use a source reader or a simple sample grabber media session to get their raw samples. If you want to add a hack for using decodebin, you can easily add a special source type, and for the media source of that type, just make a decodebin element instead of searching for a demuxer. In this case, the source reader wouldn't search for a decoder since the output type set by the application would be natively supported by the source. Then, as part of the hack, just always yield that source type in the source resolver. This is completely incorrect and probably shouldn't make its way into mainline, IMO. Also, I have reason to believe it may break Unity3D, as they do look at the native media types supported by the source, and getting around this would require adding some hackery in the source reader.
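For reference, the pattern I'm describing looks roughly like this (just a sketch, not lifted from any particular game; MFStartup(), error handling and releases omitted):

    #define COBJMACROS
    #include <mfapi.h>
    #include <mfidl.h>
    #include <mfreadwrite.h>

    /* A game that just wants raw frames asks the source reader for an
     * uncompressed type; the reader, not the source, is then responsible
     * for finding and inserting a decoder. */
    IMFSourceReader *reader;
    IMFMediaType *type;
    IMFSample *sample;
    DWORD stream, flags;
    LONGLONG timestamp;

    MFCreateSourceReaderFromURL(L"movie.mp4", NULL, &reader);

    MFCreateMediaType(&type);
    IMFMediaType_SetGUID(type, &MF_MT_MAJOR_TYPE, &MFMediaType_Video);
    IMFMediaType_SetGUID(type, &MF_MT_SUBTYPE, &MFVideoFormat_NV12);
    IMFSourceReader_SetCurrentMediaType(reader,
            MF_SOURCE_READER_FIRST_VIDEO_STREAM, NULL, type);

    /* Then it just pulls decoded samples. */
    IMFSourceReader_ReadSample(reader, MF_SOURCE_READER_FIRST_VIDEO_STREAM,
            0, &stream, &flags, &timestamp, &sample);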
My assertion is this isn't really a "hack".
I think that if you have to modify Media Foundation code to work around shortcuts in winegstreamer, it can be classified as a hack. It is probable that most games will work with it, but I think it makes more sense as a staging enhancement.
There's nothing we have to modify in core Media Foundation code (though modifying bytestream_get_url_hint() would help). I disagree with your assertion that it's a hack, though. Or, more saliently, that differing from Windows' implementation details inherently means it's bad and wrong.
I'd also point out that "enhancements that aren't suitable for upstream" isn't a purpose of Staging. "Patches that aren't good enough for upstream yet" is.
This is something that's reasonable to do, and that fits within the design of Media Foundation.
I have a hard time subscribing to the idea that this is within the design of media foundation. I took a look on github, and a good number of applications find desired streams using the subtype from the source reader's GetNativeMediaType. If we were to output uncompressed types, this would break. To work around this, we'd either have to expose incorrect media types on our streams and add an exception to the decoder-finding behavior in the source reader and topology loader, or expose some private interface for getting the true native types. And in either case, we'd still have to do caps conversion for compressed media types.
So, I went to see what exactly these programs were doing with GetNativeMediaType(). I figured I'd check the first ten unique ones, skipping anything that looks like a binding or wrapper, and here's what I came up with:
https://github.com/KennethEvans/VS-Audio uses it in one place to test whether a stream is present, in another place just to dump the type to stdout, and in a third place to get the type from a capture device.
https://github.com/Csineneo/Vivaldi uses it in one place just to retrieve the major type, and in another place to get the type from a capture device.
https://github.com/Hanumanthu2020/HanuWork uses it for capture devices.
https://github.com/clarkezone/audiovisualizer uses it in one place to test the major type and subtype; it checks if the subtype is mp3 but doesn't do anything with that information. It uses it in another place passed through to its own API.
https://github.com/nickluo/camaro-sdk uses it for video capture.
https://github.com/mrojkov/Citrus uses it to get the major type, width, and height.
https://github.com/ms-iot/ros_win_camera uses it for image/video capture.
https://github.com/daramkun/SamplePlay uses it to get the frame rate, width, and height of a video stream, and the number of channels, sample rate, and bit depth of an audio stream. It outputs to uncompressed samples, and in the case of the latter it uses those parameters to determine a PCM type. (Even though not all audio types have a "bit depth"...)
https://github.com/vipoo/SuperMFLib uses it in one place to check the major type. It uses it in another place to get the frame rate, PAR, width, and height of a video stream, and the number of channels and sample rate of an audio stream. It outputs to uncompressed samples.
https://github.com/Brhsoftco/PlexDL-MetroSet_UI uses it for video capture.
The conclusion I draw from this is:
- most applications which call GetNativeMediaType() are doing so on capture sources [which, it goes without saying, are outside the scope of gstreamer],
- the rest only care about details that wouldn't change from decoding: major type, frame rate, width, height, PAR, channel count, sample rate,
- none of the applications concerned with decoding audio actually set the media type to be the native media type.
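Concretely, the usage I found above almost always boils down to something like this (a sketch; the exact attribute set varies per application, releases omitted):

    /* Given an IMFSourceReader *reader: query a few details from the
     * native type, but never set it back as the current type. */
    IMFMediaType *native;
    GUID major;
    UINT32 width, height, channels, rate;

    IMFSourceReader_GetNativeMediaType(reader,
            MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &native);
    IMFMediaType_GetMajorType(native, &major);
    MFGetAttributeSize((IMFAttributes *)native, &MF_MT_FRAME_SIZE,
            &width, &height);

    IMFSourceReader_GetNativeMediaType(reader,
            MF_SOURCE_READER_FIRST_AUDIO_STREAM, 0, &native);
    IMFMediaType_GetUINT32(native, &MF_MT_AUDIO_NUM_CHANNELS, &channels);
    IMFMediaType_GetUINT32(native, &MF_MT_AUDIO_SAMPLES_PER_SECOND, &rate);

None of those details change just because the stream is reported as already decoded.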
The conclusion I draw from this is that incorrect behavior is always one attribute retrieval away, with no easy/straightforward fix.
The point is that if we're looking at what applications actually do in practice, the evidence supports that they rarely if ever depend on the native media type.
In the case that they do, I don't think the fix is as difficult as all that. Moreover, the work has already been done, and would only need to be adapted.
It's changing the implementation details, not the API contract. We have the freedom to do that.
First of all, this is something I think we want to do anyway. Microsoft has no demuxer for, say, Vorbis (at least, there's not one registered on my Windows 10 machine), but I think that we want to be able to play back Vorbis files anyway (in, say, a Win32 media player application).
I'm pretty sure our goal is not to extend windows functionality.
Actually, I'd assert the opposite. Host integration has always been a feature of Wine, not a bug. That goes beyond just mapping program launcher entries to .desktop files; it includes things like:
- mapping host devices to DOS drives,
- allowing unix paths to be used in file system functions,
- exposing the unix file system as a shell folder,
- making winebrowser the default browser (instead of explorer),
- exposing public Wine-specific exports from ntdll (those not prefixed
with a double underscore),
- making use of host credentials in advapi32 (on Mac, anyway),
- exposing host GStreamer and QuickTime codecs in DirectShow.
We extend host functionality to integrate with the system, and to make using Wine easier. Using host codecs from mfplat does both.
I'm unsure why anyone would want to use a windows media player over something like VLC.
I'm unsure as well, but that's not really our place to judge. We just make the software work where we can. If I had to guess, though, I'd say that a Windows media player offers a UI that the user prefers, includes some feature that host players don't, or is more familiar to the user (who may have recently migrated from Windows)...
But as I mentioned earlier, it is possible to add a hack using decodebin with minimal effort, and we could possibly only use this hack as a fallback if the container doesn't have a registered byte stream handler. I think we would get the best of both worlds with this solution.
In a sense I'm kind of proposing exactly that, except that we rely on decodebin first, and only add other sources if it turns out that decodebin doesn't work for something.
Call me crazy, but I think the accurate solution should be the default, not the fallback :P
In a vacuum, that kind of maxim would be true. But there's a lot more to consider here: how inaccurate our implementation details actually are, how likely an application is to care, how much simpler or clearer it makes our code. The answers, to my mind, strongly support using decodebin as the default. That there's precedent in quartz also helps, I think.
Instead
of writing yet another source for vorbis,
You don't "write another source", you just expose a new source object and link it with a new source_desc structure, which specifies the mime type of the container format: https://github.com/Guy1524/wine/blob/mfplat_rebase/dlls/winegstreamer/media_...
and for each other obscure
format, we just write one generic decodebin wrapper.
Not to mention, you'd have to perform this step with a decodebin wrapper anyway.
The amount of abstraction, and the amount of actual code you have to add, is beside the point, but it's also not quite as simple as you make out there:
- First and foremost, we also need to add caps conversion functions, since vorbisparse doesn't output raw audio, and we need to be able to feed it through vorbisdec afterwards.
You need that anyway: Chromium manually creates H.264 encoder and decoder instances and uses them without anything from the control layer. Because of this, we will at least need to keep the mediatype->caps conversion function for compressed types.
It creates the h264 decoder transform manually, and doesn't use the rest of mfplat, or do I misunderstand you?
Yep, exactly, see https://github.com/chromium/chromium/blob/master/media/gpu/windows/dxva_vide... for the decoding code.
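The relevant part boils down to roughly this (heavily simplified sketch, not the actual Chromium code, which goes through DXVA; CLSID from wmcodecdsp.h):

    /* Create the H.264 decoder MFT directly -- no source, session, or
     * source reader involved -- and push compressed samples through it. */
    IMFTransform *decoder;
    IMFMediaType *input_type;

    CoCreateInstance(&CLSID_CMSH264DecoderMFT, NULL, CLSCTX_INPROC_SERVER,
            &IID_IMFTransform, (void **)&decoder);

    MFCreateMediaType(&input_type);
    IMFMediaType_SetGUID(input_type, &MF_MT_MAJOR_TYPE, &MFMediaType_Video);
    IMFMediaType_SetGUID(input_type, &MF_MT_SUBTYPE, &MFVideoFormat_H264);
    IMFTransform_SetInputType(decoder, 0, input_type, 0);

    /* ...followed by IMFTransform_ProcessInput() / ProcessOutput() calls
     * on buffers the application manages itself. */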
Okay, thanks. So yes, clearly we will need the transform anyway. Of course, we don't necessarily need anything other than the h264 transform.
- Also, I'm guessing you haven't dealt with "always" pads yet;
vorbisparse doesn't send "no-more-pads".
That would be even easier to support.
- In the case that elements get added, removed, or changed from upstream
GStreamer, we have to reflect that here.
Elaborate?
If GStreamer supports a new media type or removes support for one, we have to reflect that. If caps details change upstream, that's something we should pay attention to as well; it could affect our conversion.
Are caps details even allowed to change like that? I find this very unlikely.
Sure. As I understand it, caps are only meant to connect together elements that know what those caps mean, which is partly why they're not always documented. I would presume that they'll try to preserve backwards-compatibility, but we also want to keep ahead of other changes that they make.
By contrast, the amount of code we have to add to deal with a new format when using decodebin is *exactly zero*. We don't actually have to write "audio/x-vorbis" anywhere in our code.
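Schematically, the whole decodebin path is just this (a sketch; pad_added_cb() stands in for the pad-linking callback winegstreamer already has):

    #include <gst/gst.h>

    /* One decodebin covers every container and codec GStreamer knows
     * about; there's no per-format table on our side. */
    GstElement *pipeline = gst_pipeline_new(NULL);
    GstElement *src = gst_element_factory_make("appsrc", NULL);
    GstElement *decode = gst_element_factory_make("decodebin", NULL);

    gst_bin_add_many(GST_BIN(pipeline), src, decode, NULL);
    gst_element_link(src, decode);

    /* decodebin exposes one pad per raw audio/video stream; we never have
     * to spell out "audio/x-vorbis" or any other caps ourselves. */
    g_signal_connect(decode, "pad-added", G_CALLBACK(pad_added_cb), NULL);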
Okay, adding that path as a fallback makes a lot of sense then, since we still have full ability to fix compatibility issues with types that are natively supported in windows.
After all, we don't write it anywhere in quartz, and yet Vorbis still works. (If an application were to ask what the stream type is—and I doubt any do—we report it as MEDIATYPE_Stream, MEDIASUBTYPE_Gstreamer).
Second of all, the most obvious benefit, at least while looking at these patches, is that you now don't need to write caps <-> IMFMediaType conversion for every type on the planet.
I don't see this as a problem; most games I've seen will use either H.264 or WMV, and adding new formats isn't that difficult. You look at the caps exposed by the gstreamer demuxer, find the equivalent attributes in media foundation, and fill in the gaps. In return you get correct behavior, and a source that can be paired with a correctly written MFT from outside of the wine source.
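To give a sense of what that looks like for one format, roughly (a simplified sketch; the function name is made up and the real conversion covers more attributes):

    static IMFMediaType *media_type_from_h264_caps(const GstCaps *caps)
    {
        GstStructure *structure = gst_caps_get_structure(caps, 0);
        IMFMediaType *type;
        gint width = 0, height = 0, fps_n = 0, fps_d = 1;

        MFCreateMediaType(&type);
        IMFMediaType_SetGUID(type, &MF_MT_MAJOR_TYPE, &MFMediaType_Video);
        IMFMediaType_SetGUID(type, &MF_MT_SUBTYPE, &MFVideoFormat_H264);

        /* Map the caps fields onto the equivalent MF attributes. */
        if (gst_structure_get_int(structure, "width", &width)
                && gst_structure_get_int(structure, "height", &height))
            MFSetAttributeSize((IMFAttributes *)type, &MF_MT_FRAME_SIZE,
                    width, height);
        if (gst_structure_get_fraction(structure, "framerate", &fps_n, &fps_d))
            MFSetAttributeRatio((IMFAttributes *)type, &MF_MT_FRAME_RATE,
                    fps_n, fps_d);

        return type;
    }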
This is basically true until it isn't. And it already isn't true if we want to support host codecs. An "add it when we need it" approach is going to be hell on media players.
I also think you're kind of underestimating the cost here. I don't like making LoC arguments, but your code to deal with those caps is something like 370 LoC, maybe 350 LoC with some deduplication.
As mentioned earlier in the email, the IMFMediaType->caps path will always be necessary, to support the decoder transforms, which real applications do use by themselves.
Sure. But as I understand, we'd only need to do the conversion one way (i.e. Media Foundation -> GStreamer), and we'd only need to bother with it for transforms that are explicitly created.
If you know how to convert a media foundation type into caps, you've already figured out everything you need to know about the other way around.
Well, mostly everything, because the conversions are never actually bijective, but regardless we don't actually have to write that code.
There's also the developer cost of looking up what GStreamer caps values mean (which usually requires looking at the source), looking up the Media Foundation attributes, testing them to ensure that the conversion is correct, figuring out how to deal with caps that either GStreamer or Media Foundation can't handle...
Another benefit is that you let
all of the decoding happen within a single GStreamer pipeline, which is probably better for performance.
I have applications working right now with completely acceptable performance, and we are still copying every uncompressed sample an extra time, which we may be able to optimize away. Copying compressed samples, on the other hand, is not that big of a deal at all.
I don't doubt it works regardless. DirectShow did too, back before I got rid of the transforms. It's also not the main reason I'm proposing this.
On the other hand, decreasing CPU usage is also nice.
How would this reduce CPU usage?
It's only an armchair hypothesis, so feel free to just ignore, but it probably means less buffer copies.
True, but only compressed buffer copies, which shouldn't have any noticeable impact.
Another thing that occurred to me is, letting everything happen in one GStreamer pipeline is nice for debugging.
I disagree, decodebin adds complexity to the pipeline that isn't otherwise necessary, like typefind.
I mostly meant along the lines of keeping all of the decoders in the same pipeline as the demuxer, which in my experience debugging GStreamer is easier to read than when they were bouncing through quartz.
typefind is pretty much necessary, unless we reimplement it ourselves (which, I understand, you've taken as granted that you'll have to do, but I'm not so sure).
Yes, even with your solution, the source resolver, if we want to be at all correct, will find media sources by searching the registry for the entry matching the mime type or file extension.
Sure, but if we use decodebin for "anything" or "anything else", we don't actually need to add such entries. Unfortunately it's not clear to me that mfplat allows that to be done through registry entries (unlike quartz), but adding code in resolver_get_bytestream_handler() seems unobtrusive enough to me.
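I'm thinking of something along these lines (illustrative only; the signature and helpers here are made up, not the actual mf code):

    /* Sketch of the fallback: try the registered byte stream handlers
     * first, and only hand the stream to a generic GStreamer-backed
     * handler if nothing claims it. */
    static HRESULT resolver_get_bytestream_handler(IMFByteStream *stream,
            const WCHAR *url, IMFByteStreamHandler **handler)
    {
        if (SUCCEEDED(find_registered_handler(stream, url, handler)))
            return S_OK;

        /* No handler registered for this container: let decodebin sort it
         * out instead of failing the resolution. */
        return create_gstreamer_handler(handler);
    }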
I don't see how it's not nice for debugging either—sure, it takes up a lot of lines in the log figuring out the type, but in my experience I can always skip over that by searching for winegstreamer callbacks, or no-more-pads, or whatever it is I'm trying to debug.
True, it probably doesn't make much of a difference. Either way debugging gstreamer isn't very hard IMO, since their logging system is spectacular.
You also can simplify your
postprocessing step to adding a single videoconvert and audioconvert, instead of having to manually (or semi-manually) add e.g. an h264 parser element.
It isn't manual; we find a parser which corrects the caps. And as I mentioned in an earlier email, we could also use caps negotiation for this; all the setup is in place.
Hence "semi-manually". You still have to manually fix the caps so that the element will be added.
As mentioned, we will need this regardless.
These are some of the benefits I had in mind when removing the
GStreamer quartz transforms.
Even in the case where the application manually creates e.g. an MPEG-4 source, my understanding is it's still the source's job to automatically append transforms to match the requested type.
It's not the source's job at all. On windows, where sources are purpose-built, they apply no transformations to the types they get; their goal is only to get raw sample data from a container / stream. It's the job of the media session, or source reader, to apply transforms when needed.
I see, I confused the media source with the source reader. I guess that argument isn't valid, but I don't think it really affects my conclusion.
We'd just be moving that
from the mfplat level to the gstreamer level—i.e. let decodebin select the 'transforms' needed to convert to raw video and audio.
The media session and source reader shouldn't be affected by winegstreamer details. If a user/an application decides to install a third party decoder, we still need the infrastructure in place for this to function.
It obviously wouldn't match native structure, but it's not clear to me that it would fail to match native in a way that would cause problems. Judging from my experience with quartz, most applications aren't going to care how their media is decoded as long as they get raw samples out of it.
Most games, or most applications? Chromium uses media foundation in a much more granular way.
Yes, most applications.
What does Chromium do?
As mentioned earlier, it uses decoders and encoders manually, so we'll have to fix up/parse the data we get anyway.
Only a select few build the graph manually because they don't
realize that they can autoplug, or make assumptions about which filters will be present once autoplugging is done, and some of those even fall back to autoplugging if their preferred method fails. Maybe the situation is different with mfplat, but given that there is a way to let mfplat figure out which sources and transforms to use, I'm gonna be really surprised if most applications aren't using it.
If you do come across an application that requires we mimic native's specific arrangement of sources and transforms, it seems to me it wouldn't require that much effort to swap a different parser in for decodebin, and to implement the necessary bits in the media type conversion functions. Ultimately I suspect it'd be less work to have a decodebin wrapper + specific sources for applications that require them, than to manually implement every source and transform.
The current solution isn't very manual, and, as I mentioned earlier in this email, you also can construct a decodebin wrapper source using the infrastructure which is available. And in general terms, I think it's more work to maintain a solution that doesn't match up to windows, as we now have to think of all these edge cases and how to work around them.
What edge cases do you mean?
Cases where applications expect compressed streams from the source.
The way I see it, you're essentially thinking of those "edge cases" now, except that you're not considering them edge cases.
I am considering them edge cases, but I do think it's important, so I'm implementing the source accurately; it's not like my implementation is somehow less desirable for the common case.
Well, my point is that it kind of is, from the perspective of code quality and simplicity.
If we use decodebin, they become more obviously edge cases. But it doesn't take a lot of thought, from my view. We just need to ask, "what happens if an application depends on getting compressed samples?" and answer, "well, then we create a new media source, probably reusing most of the same infrastructure
There is no need to create a new media source implementation, just a new configuration of the current one which uses a different tool for demuxing.
Sure, most things can be shared. I just mean it's another (COM) object.
, that utilises the parts of gstreamer that output compressed samples." We don't actually have to do that work until we find such an application.
The work is already there 🐸.
Sure, but it doesn't have to be reviewed or committed to the tree.
On 3/26/20 8:07 PM, Zebediah Figura wrote:
While I await your more complete response, I figure I might as well clarify some things.
I don't think that "doing the incorrect thing", i.e. failing to exactly emulate Windows, should necessarily be considered bad in itself, or at least not nearly as bad as all that.
My view, and my understanding of the Wine project's view in general as informed by its maintainers, is that emulating Windows is desirable for public documented behaviour (obviously), for undocumented behaviour that applications rely on (also obviously), for undocumented or semi-documented behaviour where there's no difference otherwise and where the native thing to do is obvious (e.g. the name of an internal registry key).
In my view, when completely incorrect behavior is only a few function calls away, that's not acceptable. The media source is a well documented public interface, and doing something different instead is just asking for trouble.
The media source is a documented public interface, but *which* media source is returned from IMFSourceResolver is not documented or guaranteed, and which transforms are returned from the source reader is also not guaranteed.
Using decodebin is not "completely incorrect", and emulating Windows' specific arrangement of sources and transforms is not "a few function calls away".
Finding out the media type of a source is one function call away.
I don't understand what you mean. Which function call?
GetNativeMediaType
It's several hundred lines of code to do caps conversion, the entire transform object (which, to be sure, we might need *anyway*,
We will.
but also might not), and it means more work every time we have to deal with a new codec.
Unless we implement the decodebin solution as a fallback for unknown types. Taking the fallback approach means we will only have to go through this process for every type natively supported by windows.
But there's not really a reason to emulate Windows otherwise. And in a case like this, where there's a significant benefit to not emulating Windows exactly, the only reason I see is "an application we don't know yet *might* depend on it". When faced with such a risk, I weigh the probability of that happening—and on the evidence of DirectShow applications, I see that as low—with the cost of having to change design—which also seems low to me; I can say from experience (c.f. 5de712b5d) that swapping out a specific demuxer for decodebin isn't very difficult.
The converse of this is also true: if you want to quickly experiment with some gstreamer codec that we don't support yet, you just perform the hack I mentioned earlier, and then after you get it working you make it correct by adding the necessary gstreamer caps. Another hack we could use is to serialize the compressed caps, throw them in an MF_MT_USER_DATA attribute, and hope that an application never looks.
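That second hack is about this much code (sketch):

    /* Stash the real GStreamer caps on the media type so a decoder on our
     * side could recover them later, and hope no application inspects the
     * blob. */
    gchar *caps_str = gst_caps_to_string(caps);

    IMFMediaType_SetBlob(type, &MF_MT_USER_DATA,
            (const UINT8 *)caps_str, strlen(caps_str) + 1);
    g_free(caps_str);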
Sure. But I'm willing to assert that one of these things is more likely than the other. I'm prepared to eat my words if proven wrong.
What do you mean, that in most cases applications won't care how they get their samples? That may be true, but I still think the edge cases are big enough to warrant the accurate approach. Unity3D, a pretty important user of this work, gets the native media types of the source, for instance. What they use them for, I'm not sure, but I wouldn't take any chances.
I mean it's more likely that an application wants uncompressed samples than that it wants compressed samples. As I see it, the latter case is (1) still hypothetical, (2) wouldn't be very difficult to implement either.
Yes, of course it's the case that most applications want uncompressed samples, which is why they use the source reader or a session. And no, the latter is not very difficult; I've already done it. The point is, it matches windows, and it's easy to make a fallback path for any such case where it doesn't suffice. You just add a new source_desc that somehow specifies it is the hack source, and instead of searching for a demuxer, we just use decodebin as the demuxer. Then, you either register this source with whichever container types you want to support, or add a hack in the source resolver which creates an instance of this source if it can't find a byte stream handler.
If we are going to have two paths anyway, the one which diverges from windows should be the one which takes the back seat, at least in terms of code presence.
Based on my survey of GitHub above, I have to wonder what aspects of the native media type Unity3D actually cares about. What attributes does it ask for? Does it actually set the decoder to use a compressed media type? Even if the answer is yes, does it break if we return an uncompressed media type?
Yeah, I haven't tested that, but it just makes me feel very nervous about this.
But as I mentioned earlier, I don't think the amount of work required for adding a new media type is excessive. Microsoft only ships a limited number of sources and decoders; they fit on a single page: https://docs.microsoft.com/en-us/windows/win32/medfound/supported-media-form... , so it's not like we'll be adding new types for years to come.
That's seven demuxers and sixteen transforms, which is still kind of a lot. It also, unsurprisingly, isn't every format that Windows supports; just looking at my Windows 7 VM I see also NSC and LPCM, and a much longer list of transforms.
And it doesn't take into account host codecs.
Insert fallback argument here :P
Not to mention that what we're doing is barely "incorrect". Media Foundation is an API that's specifically meant to be extended in this way.
I don't think Microsoft ever meant for an application to make a media source that decodes compressed content, the source reader and media session exist for a reason.
I don't think they specifically meant for an application *not* to do that. It fits within the design of Media Foundation. The reason that transforms exist—in any media API—is because different containers can hold the same video or audio codec. GStreamer can already deal with that.
For that matter, some application could easily register its own
codec libraries on Windows with a higher priority than the native ones (this happened with DirectShow); that's essentially no different than what I'm suggesting.
Yes, but even in that case, I assume they will still follow the basic concept of what a source is and is not.
I wouldn't necessarily assert that. A codec library—like GStreamer—might have its own set of transforms and autoplugging code. Easier to reuse that internally than to try to integrate it with every new decoding API that Microsoft releases.
That could potentially break other applications though, and I don't think codec libraries are comparable to gstreamer, they usually just handle a specific task and plug into the relevant part of the media API, whether it be dshow, media foundation, or gstreamer.
GStreamer *is* a codec library. That's exactly what it is.
"GStreamer is a pipeline-based multimedia framework that links together a wide variety of media processing systems to complete complex workflows."
I think I would consider something like ffmpeg a codec library, but either way, I think anyone adding transforms / sources is doing it because the functionality doesn't exist natively. And to maximize cohesion, they would probably just use an external library to perform the desired action, and the rest of the code would be for hooking up to the media framework it is operating within. A good example of this would be the libav gstreamer plugins. And I think there's a reason the opposite of this, libfluffgst, isn't very well known.
I think you're splitting hairs, but the point remains, libav has its own autoplugging mechanism. Any codec library that wants to be used directly is going to.
We don't yet know that any other applications would be broken; that's still hypothetical. It's not unheard of for applications to mess with Windows internals in ways that break other applications, to be sure. But it's also not a good idea.
I think the linked commit misses the point somewhat. That's partially because I don't think it makes sense to measure simplicity as an absolute metric simply using line count,
It's not just line count; the code itself is very simple. All we are doing is registering the supported input and output types of the decoder, setting the mime type of the container format for the source, and registering both objects.
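For the decoder half, that registration is just the standard MFTRegister() data (sketch; the CLSID here is a placeholder for our own decoder object):

    MFT_REGISTER_TYPE_INFO h264_input = { MFMediaType_Video, MFVideoFormat_H264 };
    MFT_REGISTER_TYPE_INFO nv12_output = { MFMediaType_Video, MFVideoFormat_NV12 };

    /* Declare which compressed input and raw output types our decoder
     * supports, so the source reader / topology loader can find it. */
    MFTRegister(CLSID_wg_h264_decoder, MFT_CATEGORY_VIDEO_DECODER,
            (WCHAR *)L"Wine H.264 decoder", MFT_ENUM_FLAG_SYNCMFT,
            1, &h264_input, 1, &nv12_output, NULL);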
and partially because it's
missing the cost of adding other media types to the conversion functions
You can use the MF_MT_USER_DATA serialization hack if you're worried about that.
Unless you're proposing we use that in Wine, that doesn't affect anything.
You're right, the decodebin fallback is a much cleaner solution than that.
(which is one of the reasons, though not the only reason, I thought to write this mail). But it's mostly because the cost of using decodebin, where it works, is essentially zero:
Except in the cases where an application does something unexpected.
In which case the cost is still no more than the cost of not using decodebin.
we write one media source, and it
works for everything; no extension for ASF required.
There already is only one real implementation of the media source, the only "extension" is adding the mime type instead of using typefind. We will register the necessary byte stream handlers no matter which path we take.
Well, ideally we'd do what quartz does, and register a handler that catches every file, and returns a subtype that essentially identifies GStreamer.
If it never becomes
necessary to write a source that outputs compressed samples, then we also don't have the cost of abstraction (which is always worth taking seriously!), and if it does, we come out even—we can still use your generic media source, or something like it.
Ultimately, I think that a decodebin wrapper is something we want to have anyway, for the sake of host codecs like Theora,
Where would we use support for Theora, if no windows applications are able to use it?
Anything which wants to be able to play back an arbitrary media file, i.e. generic media players, mostly. I see all sorts of bug reports for these with Quartz, so people are definitely using them.
Heh.
and once we have
it, I see zero cost in using it wherever else we can.