The videoflip removal itself only adds 40 LoC, and gets rid of any unknowns related to the videoconv-videoflip-videoconv sequence.
Note that it's not the only purpose of this change, as it also introduces metadata on input buffers, symmetric with what we already do on output buffers. This is useful, if not necessary, for the changes after it, which use that metadata to describe the input/output buffer padding.
I believe it might also help simplify the encoder transform implementations if we ever need to make them accept RGB input. Having the input buffer stride correctly described would be enough for bottom-up buffers to be correctly read.
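
To illustrate the stride point, here is a minimal sketch of how a reader could handle both top-down and bottom-up buffers once the stride is described. It assumes a bottom-up buffer is expressed as a negative stride with the first-row pointer at the last row in memory. `struct plane_desc`, `describe_plane` and `pixel_at` are hypothetical names for illustration only, not code from this change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plane description, not taken from the actual change. */
struct plane_desc
{
    const uint8_t *first_row; /* address of the top visual row */
    ptrdiff_t stride;         /* bytes between visual rows, negative when stored bottom-up */
    size_t width, height;     /* visible size in pixels */
    size_t bytes_per_pixel;
};

/* Build the description from a raw buffer. `pitch` is the row size in bytes,
 * including any padding after the visible pixels. */
static struct plane_desc describe_plane(const uint8_t *data, size_t width, size_t height,
                                        size_t bytes_per_pixel, size_t pitch, int bottom_up)
{
    struct plane_desc desc = { data, (ptrdiff_t)pitch, width, height, bytes_per_pixel };

    assert(pitch >= width * bytes_per_pixel);

    if (bottom_up && height)
    {
        /* The top visual row is the last row in memory; walk backwards from it. */
        desc.first_row = data + (height - 1) * pitch;
        desc.stride = -(ptrdiff_t)pitch;
    }
    return desc;
}

/* The reader is identical for both orientations: it only follows the stride. */
static const uint8_t *pixel_at(const struct plane_desc *desc, size_t x, size_t y)
{
    assert(x < desc->width && y < desc->height);
    return desc->first_row + (ptrdiff_t)y * desc->stride + x * desc->bytes_per_pixel;
}
```

With this kind of description the consumer needs no flip step, since orientation and row padding are both carried by the stride and the starting pointer.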