On 4/28/20 6:38 AM, Matteo Bruni wrote:
On Sun, Apr 19, 2020 at 11:06 PM Zebediah Figura <zfigura@codeweavers.com> wrote:
On 4/17/20 5:43 PM, Matteo Bruni wrote:
I see your point. I guess a possibility opened by the flat derefs is that you could share intermediate derefs, although that might well make things more complicated for no benefit.
I mean, it could potentially be done. I guess the easiest way that occurs to me is to set a flag that just states whether an HLSL_IR_DEREF is read from...
I think I understand why flattening is desirable in general, but it's not clear to me how that would apply to lvalues (or even to rvalue deref chains, honestly), since at codegen time we essentially resolve arbitrarily complex deref chains into one register (well, except for non-constant array indexing...)
True, but the same can be said in principle for any other instruction.
What I meant to say is that an expression like "(a + b) * c" is necessarily two instructions, whereas "a.b.c.<...>.e = f" is always one, regardless of the length of the deref chain.
Not if that's e.g. a matrix assignment, which needs to become multiple mov instructions at some point. Stretching it a bit, the former can potentially be just one instruction (MAD).
Sure, I guess I'm assuming that we've already gone through splitting passes so that's not a concern.
Also if you always "go flat" you have only one way to follow sources, instead of having to mix tree traversal inside a single instruction with instruction traversal. Relatedly, if we start doing more invasive IR transformations, having simple, "atomic" instructions might make things easier. E.g. with structure splitting, I think you'd end up replacing one instruction (as in one instruction list entry) without changing the rest of the IR, while in the other case you have to replace a piece of a larger instruction which might be buried arbitrarily deep in the tree. Admittedly this is all very abstract and I might not be seeing a sensible overall picture.
I guess the way I look at it is, emitting an assignment instruction means we have to reach up the entire chain at codegen time anyway, i.e. I'm not sure how you can emit an instruction for each individual lvalue deref node, whereas you can emit (at least) one instruction for each node currently.
My point is that, during compilation, we probably have to traverse the instruction list to look up sources for non-deref nodes anyway. At that point, if derefs are also full instructions / nodes, you get those for free. If, on the other hand, derefs are stored in an arbitrarily deep tree inside other instructions, we need a way to traverse derefs in addition to the "non-deref" instructions. That's just me guessing though.
I don't particularly understand this assertion (despite our conversation in IRC about it). Maybe you could describe a specific optimization that would be helped by this, if you have any in mind?
Fundamentally I can understand that things have to recursively reach through deref nodes (well, unless we collapse them down to a single offset, like your suggestion below), but it's not clear to me that this is necessary for anything but struct or array derefs (and in that case, I don't see much of a problem with treating them as a special case).
Not that this is hugely important, though, if I pursue the offsetting approach and it works.
To be clear, I don't think it's awful to emit nodes for lvalue derefs, I'm just not convinced it's better—but I'll defer to the maintainer in any case.
Well, sure, but what I'm doing is mostly picturing things in my head, which doesn't always work right.
It's also not obvious to me that it's prettier to flatten it, given that, on the other hand, we're also now inserting nodes into the IR list that would presumably not participate in RA or generate code.
Right, but that's a bit like doing an optimization right while parsing. I guess we want lvalues and rvalues / derefs to either be both flat or both tree-like. Let's focus on rvalues for a moment. Imagine you have a complicated deref chain, eventually picking a single float out of something larger and complex. If you have a very naive compiler you can imagine that each step of the deref is implemented as a copy of whatever the result of that deref step is into a temporary variable. This might mean e.g. temporarily copying a full matrix before getting at the individual component. Incredibly inefficient, but simple and consistent. Obviously this isn't very interesting and can be simplified right away... except when it can't, like the non-constant array indexing case you mention, or copying a whole input structure to an output variable. For lvalues you can imagine the mirror image, where the naive compiler stores results into a temporary variable and then copies that into the actual destination.
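To make this concrete, the fully naive lowering of e.g. reading "s.m[i].x" could look something like this, in made-up notation, with every step materializing a temporary:

    1: s             (load the whole struct)
    2: @1.m          (copy the entire matrix out of the struct)
    3: i
    4: @2[@3]        (copy one full row)
    5: @4.x          (finally pick out the float)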
It seems to me that it would be nicer to have a simple and flat IR structure where each entry can do very little, or even nothing, in practice, and handle simplifications and IR restructuring over a number of passes, rather than the opposite. Somewhat related, you can already get instructions that eventually do nothing of consequence (e.g. dead code, swizzling into the exact same thing, ...) and I don't think that necessarily has to be all that different from the compiler's point of view.
So I'm not entirely sure from this if you have something specific in mind, but I'll at least run through my thoughts on the matter.
The compiler that exists in at least two of my local trees does emit a copy for every HLSL_IR_DEREF, i.e. from a variable into a temp node. This is awful, inasmuch as it matters, but the bottom line is that, by design, HLSL_IR_DEREF is essentially completely redundant, as you say.
One way to get around this—and the approach implied if we do make each deref an explicit node—is to reach through all the derefs at codegen time and grab the relevant register. We'd skip actually emitting code for any deref node. Hence an expression like "func().b.c + 1" would generate IR like
    2: func()
    3: @2.b
    4: @3.c
    5: 1
    6: @4 + @5
Inlining aside, we'd only allocate registers for 2, 5, 6, and the source to node 6 would offset the register allocated to node 2 by the offset of "b.c".
Note that liveness analysis would also have to recursively reach through deref nodes.
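In code, that codegen-time walk could look roughly like this. This is only a sketch: the field names and helpers (deref_from_node(), constant_index(), and so on) are placeholders, and non-constant array indices aren't handled at all.

    /* Walk a chain of rvalue deref nodes down to the node whose register we
     * actually read, accumulating a constant register offset on the way.
     * Liveness analysis would need to do the same walk.  All field names and
     * helpers here are hypothetical. */
    static struct hlsl_ir_node *skip_derefs(struct hlsl_ir_node *node, unsigned int *offset)
    {
        *offset = 0;

        while (node->type == HLSL_IR_DEREF)
        {
            struct hlsl_ir_deref *deref = deref_from_node(node);

            if (deref->type == HLSL_IR_DEREF_RECORD)
            {
                /* ".b", ".c", ...: add the field's offset within the record. */
                *offset += field_reg_offset(deref->field);
                node = deref->child;
            }
            else if (deref->type == HLSL_IR_DEREF_ARRAY && index_is_constant(deref))
            {
                /* Constant array index: add index * element size. */
                *offset += constant_index(deref->index) * type_reg_size(node->data_type);
                node = deref->child;
            }
            else
            {
                /* Anything else (e.g. a non-constant index, which would need
                 * relative addressing) ends the walk here. */
                break;
            }
        }
        return node;
    }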
The advantage of this, I guess, is that if we really do want to make lvalues separate nodes, then rvalues should probably be separate nodes too.
Right. One way to see it is that this is essentially a specialized form of copy propagation that only applies to derefs (which is admittedly the most important case, but not the only one). If we have actual copy (or maybe even constant) propagation we become able to 'skip' register allocation for other node types too. I already mentioned swizzling as something we can do without a separate instruction and register, but it's not limited to that.
The second alternative that I was looking at was to make this a bit more explicit, and essentially let the hlsl_src structure contain a union of hlsl_ir_node and hlsl_deref (note that hlsl_deref.array/record would then have to be a pointer to hlsl_src instead of the flat structure as patch 0005 has it). Then we'd essentially generate
    2: func()
    3: 1
    4: @2.b.c + 3
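Roughly, the source structure might then look something like this (a sketch only; the names and exact layout are just for illustration, not what patch 0005 actually has):

    /* A source operand is either a plain node reference or an inline deref,
     * so rvalue derefs never need their own entry in the instruction list. */
    struct hlsl_src
    {
        enum hlsl_src_type
        {
            HLSL_SRC_NODE,
            HLSL_SRC_DEREF,
        } type;
        union
        {
            struct hlsl_ir_node *node;  /* ordinary instruction result */
            struct hlsl_deref deref;    /* var / array / record chain, whose
                                           array/record member would point to
                                           another hlsl_src rather than embed one */
        } u;
    };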
I mildly prefer this, since I like the IR language to be explicit about what's allowed and to resemble the code we produce, and it's not difficult to form this even at parse time. We can't easily get rid of HLSL_IR_DEREF entirely here, but we can guarantee it doesn't occur except at parse time (plus maybe a DCE run?).
I guess the downside of this approach is that derefs are a special case to some degree. Which isn't necessarily bad but it feels a little limiting, as in my comment above.
A third alternative is to adopt my patches 010* from my 6 April mail, i.e. to make both lvalue and rvalue derefs explicitly chains of derefs, and synthesize variables when necessary. That doesn't preclude folding hlsl_deref into hlsl_src, of course.
Right. I feel this ends up conflating a few things, though. Representing lvalues and rvalues as chains of (only) derefs is an IR policy; synthesizing temporary variables is the mechanism; the relation between hlsl_deref and hlsl_src is the "enforcement". Those are all good, but they are separate pieces of the puzzle. E.g. as I mentioned earlier, synthesizing variables is a mechanism that will probably come in handy for one reason or another even if we decide to take a different route for derefs.
I have another proposal, as a way to sidestep the whole deref story. As usual, with the caveat that this is all armchair planning and it is known that plans rarely survive contact with the enemy. Anyway...
With the current IR we end up having arbitrarily complex deref chains to represent the composition of multiple levels of array and struct "indexing", which are necessarily cumbersome to handle. What if we transform all of that into offsetting instead? An arbitrary variable dereference would be represented as (variable, offset), with the offset being an arbitrary expression with an integer scalar value. That requires that we know the layout of complex data types early on, which I think we can now do. The offsets could be in scalar or vec4 units; I can see benefits to either choice.

Ideally we would generate offsets directly during parsing and skip derefs altogether, although that might not be practical. An obvious downside of this is that constant folding becomes pretty much required, in order to compute a register index at codegen time. It does have the advantage that non-constant offsets are no different from constant ones as far as the IR is concerned. Obviously we can skip supporting those initially, regardless of all this.
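Data-structure-wise I'm picturing something as simple as this (again only a sketch, with placeholder names):

    /* A dereference is just a variable plus an offset expression. */
    struct hlsl_deref
    {
        struct hlsl_var *var;
        struct hlsl_src offset;   /* integer scalar expression, in whatever unit
                                     we pick; a constant node in the common case,
                                     so it folds down to a register index at
                                     codegen time */
    };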
I like this inasmuch as it's basically "hlsl_deref chains, but simpler"; I guess it requires no recursion anywhere, and so kind of sidesteps the constraints that led us here in the first place.
What do you think? I have a complication right away, which is the fact that in SM3 struct variables can be allocated into multiple register sets (see http://www.winehq.org/pipermail/wine-devel/2013-August/100705.html). I don't think it's a blocker but it is annoying...
The way I'd solve that is just to allow for potentially multiple hlsl_reg structures for any given uniform. That could be as simple as including three separate hlsl_reg structures inside hlsl_var. sm2-3 apparently allocates structs by limiting the size to the last used element (while earlier elements still take up registers even if unused). That'd just have to be done per register set.
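Something like this, say (sketch only; the set names are made up, and whether samplers need a fourth entry is a detail to figure out):

    /* One register allocation per sm2-3 register set, instead of a single
     * hlsl_reg per variable. */
    enum hlsl_sm3_register_set
    {
        HLSL_REGSET_FLOAT,  /* c# */
        HLSL_REGSET_INT,    /* i# */
        HLSL_REGSET_BOOL,   /* b# */
        HLSL_REGSET_COUNT,
    };

    struct hlsl_var
    {
        /* ...existing fields... */
        struct hlsl_reg regs[HLSL_REGSET_COUNT];
    };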
I'm admittedly not sure how changing our representation of derefs affects this specific problem, but maybe I'm missing something.
To my simultaneous delight and chagrin I can't come up with anything else that this approach would break.