I should've known you'd have an app where it matters.
I suspect that generating an intermediate RGBA image in the process of drawing should be avoided when source or destination bitmaps have no alpha.
I think we can get away with that if the source has no alpha, which isn't the case for indexed color, and then only if there's nothing that can add an alpha channel to it (such as a color transform, or interpolation when part of the source area pulls from the outside of the image).
In that case, we'd need a masked copy operation, which is equivalent to alpha blend when we know alpha is either 0 or 255. We can't just copy a rectangle because we can draw an image to an arbitrary parallelogram (unless we add to our list of conditions for an optimized codepath that the destination is a rectangle). There are many places where we could use such a thing, but I don't know how we would implement it so that it is faster than the fastest we can reasonably do alpha blending (which is definitely NOT the way we are doing it now).
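To make the equivalence concrete, here is a rough sketch in C. It is not gdiplus code: the function names are made up for illustration, and it assumes 32-bit pixels laid out as 0xAARRGGBB with non-premultiplied source alpha, with the destination's top byte left untouched. It only shows that the two operations give the same result when alpha is 0 or 255; it says nothing about which one is faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one 8-bit channel: rounded (s*a + d*(255-a)) / 255. */
static uint32_t blend_channel(uint32_t s, uint32_t d, uint32_t a)
{
    return (s * a + d * (255 - a) + 127) / 255;
}

/* Hypothetical general blend of a non-premultiplied ARGB source pixel
 * onto a destination pixel. The destination's top byte is preserved. */
static uint32_t blend_pixel(uint32_t dst, uint32_t src)
{
    uint32_t a = src >> 24;
    uint32_t r = blend_channel((src >> 16) & 0xff, (dst >> 16) & 0xff, a);
    uint32_t g = blend_channel((src >> 8) & 0xff, (dst >> 8) & 0xff, a);
    uint32_t b = blend_channel(src & 0xff, dst & 0xff, a);

    return (dst & 0xff000000u) | (r << 16) | (g << 8) | b;
}

/* Full alpha blend of a row of n pixels. */
static void blend_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
        dst[i] = blend_pixel(dst[i], src[i]);
}

/* Masked copy of a row of n pixels. Precondition (an assumption of this
 * sketch): every source alpha is 0 or 255. Opaque pixels are copied,
 * everything else leaves the destination alone. */
static void masked_copy_row(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
    {
        if ((src[i] >> 24) == 255)
            dst[i] = (dst[i] & 0xff000000u) | (src[i] & 0x00ffffffu);
    }
}
```

With alpha at 255 the blend formula collapses to the source channel, and with alpha at 0 it collapses to the destination channel, so both functions write the same pixels for such input.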
It does not matter whether the destination has alpha.
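Put together, the conditions above for skipping the intermediate RGBA image could be written as a predicate along these lines. This is only a sketch: the struct and its field names are hypothetical and do not correspond to anything in gdiplus. Note that the destination format is deliberately not an input.

```c
#include <stdbool.h>

/* Hypothetical description of one image draw; none of these names
 * exist in gdiplus, they only label the conditions discussed above. */
struct draw_params
{
    bool src_has_alpha;       /* source pixel format has an alpha channel */
    bool src_is_indexed;      /* indexed color is treated as having alpha */
    bool has_color_transform; /* a color transform could introduce alpha */
    bool samples_outside_src; /* interpolation pulls from outside the image */
};

/* Returns true only when nothing can put non-opaque pixels into the
 * result, so the intermediate RGBA image would not be needed. */
static bool can_skip_rgba_intermediate(const struct draw_params *p)
{
    return !p->src_has_alpha
        && !p->src_is_indexed
        && !p->has_color_transform
        && !p->samples_outside_src;
}
```

If the rectangle-only restriction were added to the list of conditions, it would be one more field checked here.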
In general, I would prefer to see optimizations to the things we are already doing, before we start adding special cases.
Also, the way gdiplus currently scales the source is far from optimal; using GDI instead is much faster and produces similar results.
Well, GDI is using a DIB engine so we should be able to match whatever it's doing.