On Sat Nov 19 01:27:25 2022 +0000, Zebediah Figura wrote:
Giovanni actually sent a very similar patch to 1/4 [1] several months ago. It wasn't accepted, and I'm not sure what discussion there was about it if any (it may have been off-list). I think that this is, broadly speaking, not a *wrong* solution—and if we care about getting texel offsets working I don't mind accepting it—but I don't think it's quite the right one either. I think there are two other passes that we can do instead that would do everything this pass does and more: (1) generic vectorization of stores (2) propagation of load+store sequences into multiple store instructions Consider specifically the following shader:
uniform float f; float4 main() : sv_target { return float4(f, 1, 2, 3); }
which generates:
2: float | f 3: uint | 0 4: | = (<constructor-2>[@3].x @2) 5: float | 1.00000000e+00 6: uint | 1 7: | = (<constructor-2>[@6].x @5) 8: float | 2.00000000e+00 9: uint | 2 10: | = (<constructor-2>[@9].x @8) 11: float | 3.00000000e+00 12: uint | 3 13: | = (<constructor-2>[@12].x @11) 14: float4 | <constructor-2> 15: | return 16: | = (<output-sv_target0> @14)
Copyprop-with-constant-loads won't help this. However, with the first pass I describe, we can simplify that into:
2: float | f 3: uint | 0 4: | = (<constructor-2>[@3].x @2) 5: float3 | {1.0 2.0 3.0} 6: uint | 1 7: | = (<constructor-2>[@6].xyz @5) 14: float4 | <constructor-2> 15: | return 16: | = (<output-sv_target0> @14)
The second pass would recognize when the rhs of a store instruction is just a load, and rewrite it as multiple store instructions. (These passes could be applied in either order, fwiw):
2: float | f 3: uint | 0 4: | = (<output-sv_target0>[@3].x @2) 5: float3 | {1.0 2.0 3.0} 6: uint | 1 7: | = (<output-sv_target0>[@6].xyz @5)
(Ignoring the return statement for now. Also, I've automatically applied DCE to these examples.) Note that this second pass doesn't necessarily have to be limited to whole-variable stores either. (Note also that we can't do the same thing with copy-prop because a load instruction *has* to return the whole vector type; it can't be split.) This doesn't fully replace copy-prop as it is, to be sure, but I think it replaces everything that patch 1/4 does. Copy-prop is designed to replace X-store-load sequences with X, to oversimplify. If vectorization can turn X into a single instruction, then copy-prop can get rid of the store/load and probably the whole temp variable. If vectorization *can't* do that, then we needed that temp anyway. [1] https://www.winehq.org/pipermail/wine-devel/2022-May/215773.html
I think I described this second pass poorly. It's actually probably easier to think of it as an *extension* of copy-prop.
The idea is, if you have something like that float4(f, 1, 2, 3) sequence, you can't copy-prop loads from `<constructor-2>` with the current pass because they don't all have the same instruction as a source. So what you do instead is, any time that `<constructor-2>` is itself used as an rhs to a store, you split up that store into multiple stores. You could say one per unique source but you could also do it more simply by saying one per component (and then letting vectorization clean that up later). That way you're guaranteed to be able to copy-prop them. You increase the number of stores but (hopefully) reduce the number of intermediate variables.
Actually, now that I describe it this way, I realize it doesn't even have to be limited to stores, it can be anything that's done per-component, so you could do a similar thing with some HLSL_IR_EXPRs.