We can choose between vectorizing at the vector level and vectorizing at the register level. I think the difference only becomes apparent when more than one vector/scalar shares the same register.
On one hand, vectorizing register-wise is more general and is a better optimization. On the other, it is more complicated: we would either have to consider register offsets at the IR level when running these passes (identifying in the struct whether two components share the same register), or defer the pass until after register allocation.
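For instance, in a hypothetical shader like the following (assuming the allocator happens to pack both float2 temporaries into one register, say "a" in r0.xy and "b" in r0.zw):

```hlsl
// Hypothetical sketch: assume the register allocator packs both float2
// temporaries into a single temp register r0 ("a" in r0.xy, "b" in r0.zw).
float4 main() : sv_target
{
    float2 a, b;

    a.x = 1.0;
    a.y = 2.0;
    b.x = 3.0;
    b.y = 4.0;

    // Vector-wise vectorization can at best merge these into two
    // 2-component stores (one per variable); register-wise vectorization
    // could merge all four into a single mov to r0.xyzw.
    return float4(a, b);
}
```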
This is a difficult question, actually. I think the options are:
(i) vectorize as much as possible, most notably past 4 components, and let translation into smX split up the ops again (see the sketch after this list). This isn't *bad* conceptually, but the current codegen is really built to not do this.
(ii) vectorize up to 4 components and along aligned boundaries. The problem is that this is leaking backend details out of the backends again.
(iii) don't vectorize at all and make the backends do it instead. The disadvantage is that we may then need to code it multiple times.
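To make (i) a bit more concrete, a hypothetical sketch (assuming a row-major matrix whose eight components end up in two consecutive registers):

```hlsl
// Hypothetical sketch for option (i): the eight scalar stores below could,
// in principle, be fused into one 8-component store at the IR level, which
// the sm1/sm4 translation would then have to split back into two
// 4-component movs.
float4 main() : sv_target
{
    float2x4 m;

    m[0][0] = 1.0;
    m[0][1] = 2.0;
    m[0][2] = 3.0;
    m[0][3] = 4.0;
    m[1][0] = 5.0;
    m[1][1] = 6.0;
    m[1][2] = 7.0;
    m[1][3] = 8.0;

    return m[0] + m[1];
}
```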
There are some passes we'll need to do per-backend *anyway*, e.g. getting rid of the redundant movs from HLSL_IR_LOAD instructions. I'm not sure if this is such a case?
So far, I don't think we need (c) for any reason other than optimizing the output code and making the IR more compact, but it is worth thinking about.
I'm inclined to agree, yeah. If nothing else it hasn't been *necessary* for anything; it was just a vague idea of something we could extend the pass to eventually.
Actually, (b) is not exempt from this dilemma either, since the rhs could also be vectorized vector-wise or register-wise (maybe we should call them (b.1) and (b.2)).
(b) is about vectorizing constants, though? Probably we'd only vectorize them up to 4 components. Granted, you get the same issues as above, and vectorizing constants for sm1 also involves some tradeoffs if we leave that in common code.
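For reference, a minimal hypothetical sketch of the kind of stores (b) would merge:

```hlsl
float4 main() : sv_target
{
    float4 c;

    // Four scalar stores of constants...
    c.x = 1.0;
    c.y = 2.0;
    c.z = 3.0;
    c.w = 4.0;

    // ...which pass (b) could merge into a single 4-component store of the
    // vector constant float4(1.0, 2.0, 3.0, 4.0).
    return c;
}
```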
- (1) Do (b) and (d) indeed cover all the cases that (a) does?
- (2) What cases do (b) and (d) cover that (a) doesn't?
Surprisingly, now I think that the answer to question (1) is no. Consider the following example, which compiles with the native compiler and with pass (a):
Hmm, true. The alternate way to solve that is to pull that variable into the loop, but that requires knowing when the variable should be pulled into the loop (in general we probably want to go the other direction), and I can't easily think of another reason we'd want to do that (to alleviate register pressure, maybe?)
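A hypothetical sketch of what "pulling the variable into the loop" would mean (not the example above):

```hlsl
// Hypothetical sketch of "pulling the variable into the loop": "t" is only
// ever written and read within a single iteration, so its declaration could
// be moved inside the loop body, which would also narrow its live range
// (possibly helping register pressure).
float4 main() : sv_target
{
    float4 acc = float4(0.0, 0.0, 0.0, 0.0);
    float2 t;          // declared outside the loop...

    for (int i = 0; i < 4; ++i)
    {
        t.x = i;       // ...but only used inside it,
        t.y = i + 1;   // so it could be declared here instead.
        acc += float4(t, 0.0, 0.0);
    }
    return acc;
}
```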
So (a) probably does make the most sense after all.
There is also the alternative of using hlsl_derefs instead of mere nodes as texel_offsets, but IIRC Zeb has good reasons to not go in that direction.
I don't remember this ever being proposed? I don't see any benefit to it, either.
That we use hlsl_deref instead of hlsl_src for resource/sampler arguments to hlsl_ir_resource_load is... odd. Odd because every other argument expression is hlsl_src, and there's no obvious reason that these would be different. It works, though, because object types can always be resolved to a deref (i.e. if they were hlsl_src, they'd be guaranteed to come from HLSL_IR_LOAD), and it's necessary because we can't translate HLSL_IR_LOAD to an IR instruction.