That seems unnecessarily defeatist...? The behaviour of those elements is documented. If they're failing to respect parts of the GStreamer API, that's a bug in the element and it should be fixed there. If they're doing something less performantly than they should, that's also something that should be fixed in the element.
This has nothing to do with documentation, and the behavior of components when combined together isn't documented anyway. They can behave however they want; it all depends on the caps negotiation that happens between them and on their respective capabilities. The videoflip element has specific video format requirements, and it can very well end up causing suboptimal negotiation in the pipeline.
"Library code will run performantly" isn't the kind of thing that's really documented, no, but it tends to be a general truth regardless. If we find that a bit of library code isn't matching our needs, our first reaction should not be to assume we can't do anything about the library and abandon it. This especially in the case of GStreamer, who when I've worked with them have been very open to making changes to help consumers.
More concretely, if videoflip has poor performance because it doesn't support flipping e.g. RGB16, then we can add support for that to videoflip; it wouldn't even be hard. I don't know what the specific issue is here either, but I'd be very surprised if it's unsolvable.
Why is this any more the "right" way than using videoflip?
We are not doing any kind of frame flipping; instead we are implementing buffers with negative strides. Providing stride information to GStreamer is the right way to describe a buffer's stride. Using a videoflip element is an equivalent but convoluted way to do it.
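To make that concrete, here is a minimal sketch (not the code from this MR) of describing a bottom-up frame to GStreamer through GstVideoMeta; the helper name, format and single-plane layout are assumptions for illustration:

```c
#include <gst/gst.h>
#include <gst/video/video.h>

/* Hypothetical helper: attach a GstVideoMeta describing a bottom-up frame.
 * The plane offset points at the start of the last row and the stride is
 * negative, so elements that honour the meta read the rows from bottom to
 * top without needing a separate videoflip element. */
static void add_bottom_up_meta(GstBuffer *buffer, guint width, guint height, guint row_bytes)
{
    gsize offset[GST_VIDEO_MAX_PLANES] = { (gsize)row_bytes * (height - 1) };
    gint stride[GST_VIDEO_MAX_PLANES] = { -(gint)row_bytes };

    gst_buffer_add_video_meta_full(buffer, GST_VIDEO_FRAME_FLAG_NONE,
            GST_VIDEO_FORMAT_BGRx, width, height, 1, offset, stride);
}
```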
Okay, I can see how DirectShow's negative stride convention maps conceptually to GStreamer's similar convention. At the same time, I have a hard time seeing a manual videoflip as wrong.
I also don't understand the bit about complexity. From a somewhat abstract level, code gets more complex and harder to work with when you add multiple _different_ interacting components. Having more of the _same_ component—in this case, adding more beads to a string of postprocessing elements—doesn't make anything harder to work with.
Of course it does: it increases the number of possible failures. It doesn't matter whether the components are the same; the more you add, the more complex it gets. And the worst part is that these aren't components whose source we have directly at hand, they're GStreamer components which are most often pre-built from the system distribution.
Take debugging the current video processor pipeline, for instance: we have three elements when we could have only one. The two videoconvert elements and videoflip talk back and forth to negotiate their caps. Deciphering the GStreamer trace to understand what is actually going on, and figuring out in the end that, somewhere in the middle of all these verbose traces, videoflip has decided to drop the pool provided by the downstream element and use its own, _is_ way more complicated than it could be.
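Even narrowing the trace to the negotiation category and the elements involved, something along the lines of the sketch below (category names are from memory and may need adjusting), still leaves three elements' decisions to correlate by hand instead of one:

```c
#include <gst/gst.h>

/* Raise only the negotiation and element debug categories instead of the
 * global level; the string uses the same syntax as the GST_DEBUG variable. */
static void enable_negotiation_trace(void)
{
    gst_debug_set_threshold_from_string(
            "GST_NEGOTIATION:6,videoconvert:6,videoflip:6", FALSE);
}
```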
I'm sorry that you had some difficulty there, but I remain unconvinced that a string of three postprocessing elements is more complex than a single postprocessing element interacting with a custom buffer pool. Speaking from experience, I would find it more difficult to maintain and debug those two interacting components than a string of postprocessing elements.
Even with this said, I'm not conceptually opposed to the idea of using stride, but the hoops we have to jump through to do it seem excessive. I'm not convinced it's better than videoflip.
And I must be missing something, because I don't see this situation tested in test_video_processor()? I only see tests where both the input and output have the same aperture.
There's one test which is fixed with this MR (well, now that I've split the last patch, there's another one which is broken and then fixed), and it is about using an aperture on input and no aperture on output. It fails before this MR and passes after.
That test has the same actual aperture on input and output, though, i.e. the same content size, or am I misreading somehow?
And I still don't understand why the format that's passed to wg_transform_set_output_format isn't the format we store as transform->output_format.