On 6/9/22 04:04, Matteo Bruni wrote:
The ugliness that we've run into is: how do we emit IR for the following variable load?
struct apple
{
    int a;
    struct
    {
        Texture2D b;
        int c;
    } s;
} a;

/* in some expression */
func(a.s);
Unlike the SM1 example above, the register numbers don't match up. Separately, it's kind of ugly that backend-specific details regarding register size and alignment are leaking into the frontend so much.
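To make that concrete, here's roughly how the fields from the example might end up allocated for SM4; the numbers are illustrative only, and the exact packing depends on the buffer layout:

/* Illustrative SM4-style allocation for the declaration above, assuming
 * 'a' is a uniform: the resource field goes into the texture register set,
 * while the numeric fields get packed into a constant buffer, e.g.
 *
 *     a.a   -> cb0[0].x   (numeric register set)
 *     a.s.b -> t0         (texture register set)
 *     a.s.c -> cb0[0].y   (numeric register set)
 *
 * so there is no single register offset that describes "a.s" as a whole. */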
I think most of that can be hidden or contained with some proper abstraction. And generous handwaving. But basically, that probably could be represented in the IR as copying around individual fields of the structure separately, rather than a single "struct deref". Clearly it can become more complex depending on the type of the variable but I think it should be doable.
Yeah, it could. Like I said it's not prohibitive. I'm just not sure it's the best option at this point.
It's worth pointing out that, at parse time, we want and need load instructions (and therefore probably also store instructions) to have larger-than-vector types; that is, load instructions can produce structs, and store instructions can consume them. But we don't want that for SMxIR, and I believe we don't want it for the "final form" of HLSL IR either. That's the way the code is currently arranged, and I see no reason not to keep it that way.
Similarly, the amount of code that has to deal with matrix majority is unfortunate.
Personally, that one seems more annoying. Although it's not clear to me that handling matrix majority at a later stage would necessarily be any better.
The main idea is that we could handle it something closer to once (well, once per backend), at HLSL -> SMx translation.
That doesn't necessarily mean requiring that all matrix loads and stores are done on a single scalar—after all, we could translate a single vector load to multiple MOV instructions if it can't actually be represented by one.
It does potentially mean doing vectorization passes on SMxIR, though. Hard to tell this far in advance, and it's also hard to tell if that's something we're going to need anyway.
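As a made-up illustration (not from the thread) of why a single HLSL-level vector load may not map to one MOV:

/* With column_major packing, each column of a matrix lands in its own
 * register (or cbuffer element), roughly:
 *
 *     column_major float4x4 m;    // column j -> register c[j]
 *     float4 v = m[1];            // row 1 = (c0.y, c1.y, c2.y, c3.y)
 *
 * The row load touches one component of four different registers, so it
 * can't be expressed as a single vector MOV and has to be emitted as
 * several one-component MOVs instead. */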
The former problem can potentially be solved by embedding multiple register offsets into hlsl_deref (one per register type). Neither this nor the latter problem is prohibitive, and I was at one point in favour of continuing to use register offsets everywhere, but at this point my feeling has changed, and I think using register offsets is looking uglier than the alternatives. I get the impression that Francisco disagrees, though, which is why we should probably hash this out now.
As I mention below, I currently see two options as the most appealing. This one (multiple register offsets) sits somewhat in the middle and it feels like it would be best to go to one of the extremes instead. It's also possible that this middle ground solution would end up being nicer in practice. At any rate, I certainly wouldn't flat out discount it.
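For concreteness, a minimal sketch of what "multiple register offsets" could look like; the names below are hypothetical stand-ins, not the actual vkd3d-shader definitions:

/* Hypothetical sketch only.  One (possibly non-constant) offset per
 * register set, so that e.g. a resource field and a numeric field of the
 * same struct can each be located in their own register space. */
struct hlsl_ir_node;
struct hlsl_ir_var;

enum hlsl_register_set
{
    HLSL_REGSET_NUMERIC,
    HLSL_REGSET_TEXTURES,
    HLSL_REGSET_SAMPLERS,
    HLSL_REGSET_COUNT,
};

struct hlsl_src
{
    struct hlsl_ir_node *node;  /* source of a dynamic offset, if any */
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    struct hlsl_src offset[HLSL_REGSET_COUNT];  /* one offset per register set */
};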
Nor do I think we should use both register offsets and component offsets (either in the same node type, or in different node types). That just makes the IR way more complicated. Rather, I think we should be doing everything in *just* component offsets until translation from HLSL IR to SMx IR.
I touched on this earlier and I agree that the additional complexity is unlikely to be worth it. Admittedly we're in a limbo right now where SMxIR isn't quite there yet, which makes reasoning on some of these details a bit fuzzy.
In order to deal with the problem of translating dynamic offsets from components to registers, I see three options:
(a) emit code at runtime, or do some sophisticated lowering,
(b) use special offsetof and sizeof nodes,
(c) introduce a structured deref type, much like [1]. Francisco was actually proposing something like this, although with an array instead of a recursive structure, which strikes me as an improvement.
My guess is that (a) is very hard. I haven't really tried to reason it out, though.
Given a choice between (b) and (c), I'm more inclined to pick (c). It makes the IR structure more restrictive, and those restrictions fundamentally match the structured nature of the language we're working with; both of those are things I tend to like.
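For illustration, a minimal sketch of the array variant of (c); again, hypothetical names rather than the real definitions:

/* Hypothetical sketch only.  Instead of a flat register offset, the deref
 * stores a path: one index per nesting level (a field index for structs,
 * an element index for arrays), each of which may be a constant node or a
 * dynamically computed one. */
struct hlsl_ir_var;
struct hlsl_src;

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int path_len;
    struct hlsl_src *path;  /* path[i] selects the field/element at level i */
};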
After giving it some thought I think that's certainly fine *for the higher level IR*. At the same time it seems to me that, if we go that route, eventually we also want to have real SMxIR with register offsets, and make sure that we can optimize constant offsets (thus expressions) at that level.
As I see it (as of current time and date, can't guarantee that I won't change my mind again...) we either push the backend-specific info up (register offsets all the way) or down (component offsets with structured deref / type info in the generic IR, transformation into register offsets in the SMxIR). I think either option works and it's mostly a matter of preference and which one fits / feels better with the rest of the compiler.
Yeah, that general approach makes sense to me. And yes, of course the SMxIR should deal entirely in register offsets.
My current vision of SMxIR is that it should be a one-to-one representation of actual instructions, writable without any lowering passes (and hence any passes done on it should be optimization-only, with the *possible* exception of RA, i.e. register allocation). In a sense, it's what we already have with sm4_instruction and such, except that we'd be storing it and doing passes on it rather than just writing it out directly.
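Something shaped roughly like this, purely hypothetical, just to illustrate the "store it and run passes over it" part:

/* Hypothetical shape only: essentially the existing sm4_instruction, but
 * kept in an array so that optimization passes can run over the whole
 * program before the bytecode is written out, instead of each instruction
 * being written immediately. */
struct sm4_instruction;

struct sm4_program
{
    struct sm4_instruction *instructions;
    unsigned int count, capacity;
};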
Between those two extremes: well, what we currently have basically *is* the first extreme, with register offsets pushed all the way up to parse time. It's just causing enough friction to make me think the latter extreme is probably going to be prettier.
Note that either way we're going to need specialized functions to resolve deref offsets in one step. I also think that should depend on the domain—e.g. for copy-prop we'll actually want to do everything in component counts, but when translating to SMxIR we'll evaluate given the register alignment constraints of the shader model. In the case of (b) it's not going to be as simple as running the existing constant folding pass, because we can't actually fold the sizeof/offsetof constants (unless we dup the node list, evaluate, and then fold, which seems very hairy and more work than the alternative).
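To illustrate the "resolve in one step, depending on the domain" idea, here's a sketch under the structured-deref assumption; the helpers and types are made up for the example, not real vkd3d-shader functions:

/* Hypothetical sketch only.  The same constant path can be flattened either
 * into a component offset (what copy-prop would want) or into a register
 * offset honouring the shader model's alignment rules (what SMxIR
 * translation would want); only the per-step helper differs. */
#include <stdbool.h>

struct hlsl_type;

/* Made-up helpers: offset of child 'idx' within 'type', in each domain,
 * and the type of that child. */
unsigned int type_component_offset(const struct hlsl_type *type, unsigned int idx);
unsigned int type_register_offset(const struct hlsl_type *type, unsigned int idx);
const struct hlsl_type *type_get_child(const struct hlsl_type *type, unsigned int idx);

static unsigned int path_to_offset(const struct hlsl_type *type,
        const unsigned int *path, unsigned int path_len, bool use_registers)
{
    unsigned int i, offset = 0;

    for (i = 0; i < path_len; ++i)
    {
        offset += use_registers ? type_register_offset(type, path[i])
                : type_component_offset(type, path[i]);
        type = type_get_child(type, path[i]);
    }
    return offset;
}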
Right, each option will have different tradeoffs WRT optimization passes. But e.g. copy-prop should be doable even with register offsets, we "just" need to make sure to always map the component offsets to their respective register offsets.
Quite, in fact we're already doing it that way. But it's probably better to work with components, since we (a) don't waste space tracking padding [not very important], and (b) don't have to deal with multiple register sets [more important].
I invite thoughts—especially from Matteo, since we discussed this sort of problem ages ago.
Yep, hope that my comments make sense. I want to hear from the others too.
ἔρρωσθε, Zeb
[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html