On 4/28/20 6:38 AM, Matteo Bruni wrote:
On Sun, Apr 19, 2020 at 11:06 PM Zebediah Figura <zfigura@codeweavers.com> wrote:
On 4/17/20 5:43 PM, Matteo Bruni wrote:
I see your point. I guess a possibility opened by the flat derefs is that you could share intermediate derefs, although that might well make things more complicated for no benefit.
I mean, it could potentially be done. I guess the easiest way that occurs to me is to set a flag that just states whether an HLSL_IR_DEREF is read from...
I think I understand why flattening is desirable in general, but it's not clear to me how that would apply to lvalues (or even to rvalue deref chains, honestly), since at codegen time we essentially resolve arbitrarily complex deref chains into one register (well, except for non-constant array indexing...)
True, but the same can be said in principle for any other instruction.
What I meant to say is that an expression like "(a + b) * c" is necessarily two instructions, whereas "a.b.c.<...>.e = f" is always one, regardless of the length of the deref chain.
Not if that's e.g. a matrix assignment, which needs to become multiple mov instructions at some point. Stretching it a bit, the former can potentially be just one instruction (MAD).
Sure, I guess I'm assuming that we've already gone through splitting passes so that's not a concern.
Also if you always "go flat" you have only one way to follow sources, instead of having to mix tree traversal inside a single instruction with instruction traversal. Relatedly, if we start doing more invasive IR transformations, having simple, "atomic" instructions might make things easier. E.g. with structure splitting, I think you'd end up replacing one instruction (as in one instruction list entry) without changing the rest of the IR, while in the other case you have to replace a piece of a larger instruction which might be buried arbitrarily deep in the tree. Admittedly this is all very abstract and I might not be seeing a sensible overall picture.
I guess the way I look at it is, emitting an assignment instruction means we have to reach up the entire chain at codegen time anyway, i.e. I'm not sure how you can emit an instruction for each individual lvalue deref node, whereas you can emit (at least) one instruction for each node currently.
My point is that, during compilation, we probably have to traverse the instruction list to look up sources for non-deref nodes anyway. At that point, if derefs are also full instructions / nodes, you get those for free. If, on the other hand, derefs are stored in an arbitrarily deep tree inside other instructions, we need a way to traverse derefs in addition to the "non-deref" instructions. That's just me guessing though.
I don't particularly understand this assertion (despite our conversation in IRC about it). Maybe you could describe a specific optimization that would be helped by this, if you have any in mind?
Fundamentally I can understand that things have to recursively reach through deref nodes (well, unless we collapse them down to a single offset, like your suggestion below), but it's not clear to me that this is necessary for anything but struct or array derefs (and in that case, I don't see much of a problem with treating them as a special case).
Not that this is hugely important, though, if I pursue the offsetting approach and it works.
To be clear, I don't think it's awful to emit nodes for lvalue derefs, I'm just not convinced it's better—but I'll defer to the maintainer in any case.
Well, sure, but what I'm doing is mostly picturing things in my head, which doesn't always work right.
It's also not obvious to me that it's prettier to flatten it, given that, on the other hand, we're also now inserting nodes into the IR list that would presumably not participate in RA or generate code.
Right, but that's a bit like doing an optimization right while parsing. I guess we want lvalues and rvalues / derefs to either be both flat or both tree-like. Let's focus on rvalues for a moment. Imagine you have a complicated deref chain, eventually picking a single float out of something larger and complex. If you have a very naive compiler you can imagine that each step of the deref is implemented as a copy of whatever the result of that deref step is into a temporary variable. This might mean e.g. temporarily copying a full matrix before getting at the individual component. Incredibly inefficient, but simple and consistent. Obviously this isn't very interesting and can be simplified right away... except when it can't, like the non-constant array indexing case you mention, or copying a whole input structure to an output variable. For lvalues you can imagine the mirror image, where the naive compiler stores results into a temporary variable and then copies that into the actual destination.
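To make this concrete, the fully naive lowering of e.g. reading "s.m[i].x" could look something like this, in made-up notation, with every step materializing a temporary:

    1: s             (load the whole struct)
    2: @1.m          (copy the entire matrix out of the struct)
    3: i
    4: @2[@3]        (copy one full row)
    5: @4.x          (finally pick out the float)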
It seems to me that it would be nicer to have a simple and flat IR structure where each entry can do very little, or even nothing, in practice, and handle simplifications and IR restructuring over a number of passes, rather than the opposite. Somewhat related, you can already get instructions that eventually do nothing of consequence (e.g. dead code, swizzling into the exact same thing, ...) and I don't think that necessarily has to be all that different from the compiler's point of view.
So I'm not entirely sure from this if you have something specific in mind, but I'll at least run through my thoughts on the matter.
The compiler that exists in at least two of my local trees does emit a copy for every HLSL_IR_DEREF, i.e. from a variable into a temp node. This is awful, inasmuch as it matters, but the bottom line is that, by design, HLSL_IR_DEREF is essentially completely redundant, as you say.
One way to get around this—and the approach implied if we do make each deref an explicit node—is to reach through all the derefs at codegen time and grab the relevant register. We'd skip actually emitting code for any deref node. Hence an expression like "func().b.c + 1" would generate IR like
    2: func()
    3: @2.b
    4: @3.c
    5: 1
    6: @4 + @5
Inlining aside, we'd only allocate registers for 2, 5, 6, and the source to node 6 would offset the register allocated to node 2 by the offset of "b.c".
Note that liveness analysis would also have to recursively reach through deref nodes.
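In code, that codegen-time walk could look roughly like this. This is only a sketch: the field names and helpers (deref_from_node(), constant_index(), and so on) are placeholders, and non-constant array indices aren't handled at all.

    /* Walk a chain of rvalue deref nodes down to the node whose register we
     * actually read, accumulating a constant register offset on the way.
     * Liveness analysis would need to do the same walk.  All field names and
     * helpers here are hypothetical. */
    static struct hlsl_ir_node *skip_derefs(struct hlsl_ir_node *node, unsigned int *offset)
    {
        *offset = 0;

        while (node->type == HLSL_IR_DEREF)
        {
            struct hlsl_ir_deref *deref = deref_from_node(node);

            if (deref->type == HLSL_IR_DEREF_RECORD)
            {
                /* ".b", ".c", ...: add the field's offset within the record. */
                *offset += field_reg_offset(deref->field);
                node = deref->child;
            }
            else if (deref->type == HLSL_IR_DEREF_ARRAY && index_is_constant(deref))
            {
                /* Constant array index: add index * element size. */
                *offset += constant_index(deref->index) * type_reg_size(node->data_type);
                node = deref->child;
            }
            else
            {
                /* Anything else (e.g. a non-constant index, which would need
                 * relative addressing) ends the walk here. */
                break;
            }
        }
        return node;
    }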
The advantage of this, I guess, is that if we really do want to make lvalues separate nodes, then rvalues should probably be separate nodes too.
Right. One way to see it is that this is essentially a specialized form of copy propagation that only applies to derefs (which is admittedly the most important case, but not the only one). If we have actual copy (or maybe even constant) propagation we become able to 'skip' register allocation for other node types too. I already mentioned swizzling as something we can do without a separate instruction and register, but it's not limited to that.
The second alternative that I was looking at was to make this a bit more explicit, and essentially let the hlsl_src structure contain a union of hlsl_ir_node and hlsl_deref (note that hlsl_deref.array/record would then have to be a pointer to hlsl_src instead of the flat structure as patch 0005 has it). Then we'd essentially generate
    2: func()
    3: 1
    4: @2.b.c + 3
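Roughly, the source structure might then look something like this (a sketch only; the names and exact layout are just for illustration, not what patch 0005 actually has):

    /* A source operand is either a plain node reference or an inline deref,
     * so rvalue derefs never need their own entry in the instruction list. */
    struct hlsl_src
    {
        enum hlsl_src_type
        {
            HLSL_SRC_NODE,
            HLSL_SRC_DEREF,
        } type;
        union
        {
            struct hlsl_ir_node *node;  /* ordinary instruction result */
            struct hlsl_deref deref;    /* var / array / record chain, whose
                                           array/record member would point to
                                           another hlsl_src rather than embed one */
        } u;
    };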
I mildly prefer this, since I like the IR language to be explicit about what's allowed and to resemble the code we produce, and it's not difficult to form this even at parse time. We can't easily get rid of HLSL_IR_DEREF entirely here, but we can guarantee it doesn't occur except at parse time (plus maybe a DCE run?).
I guess the downside of this approach is that derefs are a special case to some degree. Which isn't necessarily bad but it feels a little limiting, as in my comment above.
A third alternative is to adopt my patches 010* from my 6 April mail, i.e. to make both lvalue and rvalue derefs explicitly chains of derefs, and synthesize variables when necessary. That doesn't preclude folding hlsl_deref into hlsl_src, of course.
Right. I feel this ends up conflating a few things, though. Representing lvalues and rvalues as chains of (only) derefs is an IR policy; synthesizing temporary variables is the mechanism; the relation between hlsl_deref and hlsl_src is the "enforcement". Those are all good, but they are separate pieces of the puzzle. E.g. as I mentioned earlier, synthesizing variables is a mechanism that will probably come in handy for one reason or another even if we decide to take a different route for derefs.
I have another proposal, as a way to sidestep the whole deref story. As usual, with the caveat that this is all armchair planning and it is known that plans rarely survive contact with the enemy. Anyway...
With the current IR we end up having arbitrarily complex deref chains to represent the composition of multiple levels of array and struct "indexing", which are necessarily cumbersome to handle. What if we transform all of that into offsetting instead? An arbitrary variable dereference would be represented as (variable, offset), with the offset being an arbitrary expression with an integer scalar value. That requires that we know the layout of complex data types early on, which I think we can now do. The offsets could be in scalar or vec4 units; I can see benefits to either choice.

Ideally we would generate offsets directly during parsing and skip derefs altogether, although that might not be practical. An obvious downside of this is that constant folding becomes pretty much required, in order to compute a register index at codegen time. It does have the advantage that non-constant offsets are no different from constant ones as far as the IR is concerned. Obviously we can skip supporting those initially, regardless of all this.
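Data-structure-wise I'm picturing something as simple as this (again only a sketch, with placeholder names):

    /* A dereference is just a variable plus an offset expression. */
    struct hlsl_deref
    {
        struct hlsl_var *var;
        struct hlsl_src offset;   /* integer scalar expression, in whatever unit
                                     we pick; a constant node in the common case,
                                     so it folds down to a register index at
                                     codegen time */
    };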
I like this inasmuch as it's basically "hlsl_deref chains, but simpler"; I guess it requires no recursion anywhere, and so kind of sidesteps the constraints that led us here in the first place.
What do you think? I have a complication right away, which is the fact that in SM3 struct variables can be allocated into multiple register sets (see http://www.winehq.org/pipermail/wine-devel/2013-August/100705.html). I don't think it's a blocker but it is annoying...
The way I'd solve that is just to allow for potentially multiple hlsl_reg structures for any given uniform. That could be as simple as including three separate hlsl_reg structures inside hlsl_var. sm2-3 apparently allocates structs by limiting the size to the last used element (while earlier elements still take up registers even if unused). That'd just have to be done per register set.
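Something like this, say (sketch only; the set names are made up, and whether samplers need a fourth entry is a detail to figure out):

    /* One register allocation per sm2-3 register set, instead of a single
     * hlsl_reg per variable. */
    enum hlsl_sm3_register_set
    {
        HLSL_REGSET_FLOAT,  /* c# */
        HLSL_REGSET_INT,    /* i# */
        HLSL_REGSET_BOOL,   /* b# */
        HLSL_REGSET_COUNT,
    };

    struct hlsl_var
    {
        /* ...existing fields... */
        struct hlsl_reg regs[HLSL_REGSET_COUNT];
    };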
I'm admittedly not sure how changing our representation of derefs affects this specific problem, but maybe I'm missing something.
To my simultaneous delight and chagrin I can't come up with anything else that this approach would break.