Creating wg_transform takes ~10ms here in a test (and ~1ms to destroy). I doubt that is negligible, so I don't think we want to recreate wg_transform when that can easily be avoided; but if you think it might be negligible, it may also be worth testing in an actual game where there is some side load as well. I am attaching a patch on top of this patchset which recreates wg_transform instead of what this patch does, and prints those times.
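For illustration, the measurement is along these lines (sketch only: `struct decoder` and the `wg_transform_create()` / `wg_transform_destroy()` signatures below are placeholders and may not match the tree exactly):

```c
/* Sketch only: struct decoder, wg_transform_destroy() and wg_transform_create()
 * are placeholders; the real signatures in the tree may differ. */
static void recreate_transform_timed(struct decoder *decoder)
{
    LARGE_INTEGER freq, t0, t1, t2;

    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    wg_transform_destroy(decoder->wg_transform);                    /* placeholder call */
    QueryPerformanceCounter(&t1);
    decoder->wg_transform = wg_transform_create(&decoder->input_format,
            &decoder->output_format);                               /* placeholder call */
    QueryPerformanceCounter(&t2);

    TRACE("destroy %I64d us, create %I64d us\n",
            (t1.QuadPart - t0.QuadPart) * 1000000 / freq.QuadPart,
            (t2.QuadPart - t1.QuadPart) * 1000000 / freq.QuadPart);
}
```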
We currently discard intermediate buffers on an (actual) output format change (IIRC this is wrong per the tests; to match Windows on a format change we'd need to convert those buffers somehow, but that is unrelated here). Here we don't have to discard anything for internal reasons, because the format doesn't actually change: the frame size is essentially ignored, the only effect of trying to set it is getting the stream change message, and all the already decoded samples are in the correct format and size.
If we actually need to discard the buffers, it looks trivial to drain the transform when setting the input type (I am not sure we need to; that would need more precise tests, but maybe that is a separate aspect / change anyway).
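Roughly, draining on input type change could look like this (sketch only: `wg_transform_drain()`, `wg_transform_read_mf()`, `struct decoder` and the MF_E_TRANSFORM_NEED_MORE_INPUT convention here are assumptions that would need to match what the tree actually exposes):

```c
/* Sketch only, called from SetInputType before reconfiguring: pull already
 * decoded samples out of the transform so nothing in the old format is lost.
 * All names/signatures below are placeholders. */
static HRESULT drain_transform(struct decoder *decoder)
{
    DWORD flags = 0;
    HRESULT hr;

    if (FAILED(hr = wg_transform_drain(decoder->wg_transform)))    /* placeholder */
        return hr;

    /* read whatever the transform still has buffered */
    while ((hr = wg_transform_read_mf(decoder->wg_transform, decoder->temp_sample,
            decoder->output_info.cbSize, &flags)) == S_OK)
    {
        /* queue decoder->temp_sample for a later ProcessOutput, or discard it */
    }

    return hr == MF_E_TRANSFORM_NEED_MORE_INPUT ? S_OK : hr;
}
```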
[test.patch](/uploads/cfb829919120ea4f7a920ea736ccd2d5/test.patch)