On Thu Nov 24 00:07:23 2022 +0000, Francisco Casas wrote:
### Regarding (b) and (c), i.e. vectorization:

To achieve vectorization, in particular (c), for, say:
```hlsl
a.x = b.x;
// other instructions
a.y = b.y;
```
If we merge the two operations down, we have to make sure that:
- `a.x` is not read or written by the other instructions in between.
- `b.x` is not written in between.
If we merge the two operations up, we have to make sure that:
- `a.y` is not read or written by the other instructions in between.
- `b.y` is not written in between, and its hlsl_ir_load is before the `a.x =` instruction, or can be moved there.

Here we also have to consider the possibility of non-constant paths accessing these values.

(b) would be easier to achieve than (c) since, if we replace `b.x` and `b.y` with constants in this example, we know that these values will never be written to. Also, (c) is more complex since we also have to keep an eye on the location of the `b.x` and `b.y` hlsl_ir_loads in the IR.

Since this pass wouldn't be helping copy-prop, it probably makes sense to run it after the `do ... while(progress)` loop that includes copy-prop and friends.

To implement this, we can write a function that checks whether it is possible for a value to be read or written within a list of instructions, given a starting point and an end point, and use it to check whether these conditions are met for each pair of compatible instructions (both are stores from `b` to `a`, to different components); see the sketch below.

However, to have the same reach as copy-prop, we would also have to consider how to handle control flow with this pass. There may be ifs or loops in between, or one of the instructions may be inside a block where the other is not. Since I don't have clear answers on how to implement the latter cases, I assume (b) and (c) would only operate on pairs of instructions in the same block.
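To make the shape of that helper concrete, here is a minimal sketch in C. The types and names (`ir_instr`, `accessed_between`, the masks) are hypothetical, not the actual vkd3d-shader structures; it only illustrates the kind of scan described above, for a pair of stores within a single block:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified IR node: not the real vkd3d-shader types. */
enum ir_op { IR_LOAD, IR_STORE, IR_OTHER };

struct ir_instr
{
    enum ir_op op;
    const char *var;        /* variable touched by a load/store, NULL otherwise */
    unsigned int writemask; /* components written by a store */
    unsigned int readmask;  /* components read by a load */
    struct ir_instr *next;  /* next instruction in the block */
};

enum access_type { ACCESS_READ = 0x1, ACCESS_WRITE = 0x2 };

/* Returns true if any instruction strictly between 'begin' and 'end' performs
 * one of the requested accesses on the given components of 'var'. */
static bool accessed_between(const struct ir_instr *begin, const struct ir_instr *end,
        const char *var, unsigned int mask, unsigned int access)
{
    const struct ir_instr *instr;

    for (instr = begin->next; instr && instr != end; instr = instr->next)
    {
        if (!instr->var || strcmp(instr->var, var))
            continue;
        if ((access & ACCESS_WRITE) && instr->op == IR_STORE && (instr->writemask & mask))
            return true;
        if ((access & ACCESS_READ) && instr->op == IR_LOAD && (instr->readmask & mask))
            return true;
    }
    return false;
}
```

With something like this, merging `a.x = b.x` down into `a.y = b.y` would only be allowed when both `accessed_between(first, second, "a", X_MASK, ACCESS_READ | ACCESS_WRITE)` and `accessed_between(first, second, "b", X_MASK, ACCESS_WRITE)` return false (with `X_MASK` standing for the .x component), mirroring the two conditions listed above.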
You probably want to split this, at least at the conceptual level, into two different steps: first, what optimization you can do when the relevant instructions are already consecutive (i.e., your `other instructions` block is empty); second, how to commute instructions in order to make "interesting" instructions consecutive. You're probably going to reuse the second step for many different passes (again, at the conceptual level; it's not obvious that the code will be easy to reuse).
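For the second step, one possible shape, again just a sketch over made-up list nodes rather than existing vkd3d code, is a generic helper that keeps swapping the later instruction with its predecessor while a pass-supplied commute check allows it, until the two candidates become adjacent and the pass-specific merge can run:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical doubly linked IR node; the real IR uses its own list type. */
struct instr { struct instr *prev, *next; };

typedef bool (*commute_fn)(const struct instr *earlier, const struct instr *later);
typedef bool (*merge_fn)(struct instr *first, struct instr *second);

/* Move 'second' upwards past commuting instructions until it immediately
 * follows 'first' (which must precede it in the same block), then try the
 * pass-specific merge on the now-adjacent pair. */
static bool make_adjacent_and_merge(struct instr *first, struct instr *second,
        commute_fn can_commute, merge_fn try_merge)
{
    while (second->prev != first)
    {
        struct instr *prev = second->prev;

        if (!can_commute(prev, second))
            return false;

        /* Swap 'second' with the instruction just before it; 'prev->prev'
         * exists because 'first' still precedes 'prev'. */
        prev->next = second->next;
        if (second->next)
            second->next->prev = prev;
        second->prev = prev->prev;
        second->next = prev;
        prev->prev->next = second;
        prev->prev = second;
    }
    return try_merge(first, second);
}
```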
So, when can you commute two IR instructions? And when can you move an IR instruction outside or inside a block? Clearly you always have to preserve data dependencies, so you can't move x after y if y uses the result of x. Then things become a bit more complicated and I'm doing little more than brainstorming, so take this with a grain of salt. Maybe two or three grains.

* Constants, expressions and swizzles commute happily with more or less everything else, and can go inside or outside code blocks.
* Loads and stores only touch thread-local stuff (I am not sure that "thread" is the right word in the GPU world, but I guess you get what I mean), so they commute among themselves (provided they don't touch the same registers) and with resource loads and stores. They don't commute (in general! They do in some cases) with blocks and jumps, nor can they go inside or outside code blocks.
* Resource loads and stores are more interesting, because they interact with other threads. The way they commute is dictated by the memory model. Searching the web for "hlsl memory model" or "sm4 memory model" gives basically no useful result, but I guess we can assume that it will be rather weak, so resource loads and stores can commute rather easily, except when they touch the same address. Or maybe even in that case, who knows? The more I discover about GPUs the more I am baffled. At some point we'll bring in barriers to prevent them from commuting too much, but for the moment we don't support them. If in doubt, it's more appropriate to err towards a stronger memory model (allowing fewer optimizations, but avoiding introducing miscompilations).
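To put the brainstorming into a slightly more concrete form, here is one way the commute check could look. The node kinds are made up, standing in for the real hlsl_ir_* classes, and the rules deliberately err on the conservative side (blocks, jumps and anything unknown never commute):

```c
#include <stdbool.h>

/* Hypothetical node kinds standing in for the real hlsl_ir_* node types. */
enum node_kind
{
    NODE_CONSTANT, NODE_EXPR, NODE_SWIZZLE,
    NODE_VAR_LOAD, NODE_VAR_STORE,
    NODE_RESOURCE_LOAD, NODE_RESOURCE_STORE,
    NODE_BLOCK, NODE_JUMP,
};

struct node
{
    enum node_kind kind;
    const void *var;      /* variable, for variable loads/stores */
    unsigned int mask;    /* components touched, for variable loads/stores */
    const void *resource; /* resource, for resource loads/stores */
};

static bool is_pure(enum node_kind k)
{
    return k == NODE_CONSTANT || k == NODE_EXPR || k == NODE_SWIZZLE;
}

static bool is_var_access(enum node_kind k)
{
    return k == NODE_VAR_LOAD || k == NODE_VAR_STORE;
}

static bool is_resource_access(enum node_kind k)
{
    return k == NODE_RESOURCE_LOAD || k == NODE_RESOURCE_STORE;
}

/* Can adjacent instructions 'a' (earlier) and 'b' (later) swap places?
 * Assumes the caller has already checked that 'b' does not use the result
 * of 'a'; data dependencies always forbid the swap. */
static bool can_commute(const struct node *a, const struct node *b)
{
    /* Constants, expressions and swizzles have no side effects. */
    if (is_pure(a->kind) || is_pure(b->kind))
        return true;

    /* Variable accesses commute unless they may touch the same components
     * of the same variable, except that two loads always commute. */
    if (is_var_access(a->kind) && is_var_access(b->kind))
        return a->var != b->var || !(a->mask & b->mask)
                || (a->kind == NODE_VAR_LOAD && b->kind == NODE_VAR_LOAD);

    /* Variable accesses are thread-local, so they commute with resource accesses. */
    if ((is_var_access(a->kind) && is_resource_access(b->kind))
            || (is_resource_access(a->kind) && is_var_access(b->kind)))
        return true;

    /* With a weak memory model, accesses to distinct resources commute;
     * same-resource accesses are kept ordered to stay on the stronger,
     * safer side. */
    if (is_resource_access(a->kind) && is_resource_access(b->kind))
        return a->resource != b->resource;

    /* Blocks, jumps, and anything else: be conservative. */
    return false;
}
```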
As usual, lots of fun to be had!