The following thread is based partly on, and makes reference to, private conversation, but for the sake of openness I've elected to post it to wine-devel.
A long time ago, HLSL_IR_LOAD—then called HLSL_IR_DEREF—was this:
enum hlsl_ir_deref_type
{
    HLSL_IR_DEREF_VAR,
    HLSL_IR_DEREF_ARRAY,
    HLSL_IR_DEREF_RECORD,
};

struct hlsl_deref
{
    enum hlsl_ir_deref_type type;
    union
    {
        struct hlsl_ir_var *var;
        struct
        {
            struct hlsl_ir_node *array;
            struct hlsl_ir_node *index;
        } array;
        struct
        {
            struct hlsl_ir_node *record;
            struct hlsl_struct_field *field;
        } record;
    } v;
};

struct hlsl_ir_deref
{
    struct hlsl_ir_node node;
    struct hlsl_deref src;
};
Now, one problem with this is that it was kind of mean to RA and liveness analysis. For example, a line of HLSL like
var.a.b = 2.0;
produced the following IR:
2: 2.0
3: deref(var)
4: @3.b
5: @4.c = @2
This is annoying because:
* to discover that "var" is written, @5 needs to reach upwards through a deref chain;
* reaching through the deref chain requires lots of assert() statements;
* @3 implies that "var" is read, which it isn't (and, if we reach upwards through the deref chain, @4 implies the same thing).
I proposed that instead of using generic node pointers, we could have arbitrarily long deref chains encoded in the hlsl_deref structure itself. [1]
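For reference, the rough shape of that proposal (reconstructed from memory rather than copied from [1], so the field names here shouldn't be taken as exact):

---
/* Sketch only; see [1] for the actual patch. */
struct hlsl_deref_step
{
    struct hlsl_struct_field *field;    /* non-NULL for a record access */
    struct hlsl_ir_node *index;         /* non-NULL for an array index */
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int step_count;
    struct hlsl_deref_step *steps;
};
---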
There was some discussion on that—which is mostly concentrated in that thread, and also IRC. Most of the concern is about being nicer to liveness analysis and RA.
What ultimately ended up happening is that Matteo proposed numeric (register) offsets calculated at parse time, which is fundamentally similar to my idea except that it's a lot simpler to work with.
Interestingly, the problem of multiple register sets was brought up [2]:
From my testing it essentially does, yes, i.e. if you have
struct
{
    int unused;
    float f;
    bool b;
} s;

float4 main(float4 pos : POSITION) : POSITION
{
    if (s.b)
        return s.f;
    return pos;
}
then "s" gets allocated to registers b0-b2 and c0-c1, but only b2 and c1 are ever used.
So yeah, it makes things pretty simple. I can see how it would have been a lot uglier otherwise.
I guess we've finally run into that ugliness now :-(
The ultimate conclusions to draw from this historical exercise are:
- what I said about "we used to have derefs handled like that" is mostly correct, although not quite. We did use to have richer type information, and we did decide that offsets calculated at parse time were preferable to that type information, although I thought we at one point had something like [1] in the tree, which we didn't. Anyway, the decision to use offsets calculated at parse time seems to have been motivated only by simplicity. To be fair, at the time, it *was* simpler.
- [1] and the later patch that replaced it were mostly motivated by RA. We will probably end up doing RA after SMxIR translation, but we may very likely do RA *before* it as well (tracking e.g. SMx instructions with register numbers instead of having def-use chains.) A more salient concern is that I still don't like the idea of having instructions in the tree that aren't actually translated (or translatable) to SMxIR, which means that we shouldn't have instructions that yield e.g. structs.
The ugliness that we've run into is: how do we emit IR for the following variable load?
struct apple
{
    int a;
    struct
    {
        Texture2D b;
        int c;
    } s;
} a;

/* in some expression */
func(a.s);
Unlike the SM1 example above, the register numbers don't match up. Separately, it's kind of ugly that backend-specific details regarding register size and alignment are leaking into the frontend so much. Similarly, the amount of code that has to deal with matrix majority is unfortunate.
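To spell out the mismatch (illustrative only, not verified against the native compiler):

---
/* a.a    -> numeric (constant-buffer) register set
 * a.s.b  -> object (texture) register set
 * a.s.c  -> numeric (constant-buffer) register set
 *
 * A deref of "a.s" therefore needs an offset in each register set, and
 * neither offset lines up with a flat component index into the struct. */
---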
The former problem can potentially be solved by embedding multiple register offsets into hlsl_deref (one per register type). Neither this nor the latter problem are prohibitive, and I was at one point in favour of continuing to use register offsets everywhere, but at this point my feeling has changed, and I think using register offsets is looking more ugly than the alternatives. I get the impression that Francisco disagrees, though, which is why we should probably hash this out now.
Nor do I think we should use both register offsets and component offsets (either in the same node type, or in different node types). That just makes the IR way more complicated. Rather, I think we should be doing everything in *just* component offsets until translation from HLSL IR to SMx IR.
In order to deal with the problem of translating dynamic offsets from components to registers, I see three options:
(a) emit code at runtime, or do some sophisticated lowering,
(b) use special offsetof and sizeof nodes,
(c) introduce a structured deref type, much like [1]. Francisco was actually proposing something like this, although with an array instead of a recursive structure, which strikes me as an improvement.
My guess is that (a) is very hard. I haven't really tried to reason it out, though.
Given a choice between (b) and (c), I'm more inclined to pick (c). It makes the IR structure more restrictive, and those restrictions fundamentally match the structured nature of the language we're working with, both things I tend to like.
Note that either way we're going to need specialized functions to resolve deref offsets in one step. I also think that should depend on the domain—e.g. for copy-prop we'll actually want to do everything in component counts, but when translating to SMxIR we'll evaluate given the register alignment constraints of the shader model. In the case of (b) it's not going to be as simple as running the existing constant folding pass, because we can't actually fold the sizeof/offsetof constants (unless we dup the node list, evaluate, and then fold, which seems very hairy and more work than the alternative).
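In other words, I'm picturing something like the following pair of helpers (names entirely made up; the point is just that there's one resolver per domain rather than a single one-size-fits-all function):

---
/* Hypothetical prototypes, for illustration only. */

/* Component domain, e.g. for copy-prop: every scalar counts as one. */
bool hlsl_component_offset_from_deref(struct hlsl_ctx *ctx,
        const struct hlsl_deref *deref, unsigned int *offset);

/* Register domain, used at HLSL IR -> SMxIR translation time: applies the
 * target shader model's register size and alignment rules. */
bool sm4_register_offset_from_deref(struct hlsl_ctx *ctx,
        const struct hlsl_deref *deref, unsigned int *offset);
---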
I invite thoughts—especially from Matteo, since we discussed this sort of problem ages ago.
ἔρρωσθε, Zeb
[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
First of all, thanks for starting this thread: very nice overview of the problem and what we tried and figured out over time.
On Thu, Jun 9, 2022 at 3:33 AM Zebediah Figura zfigura@codeweavers.com wrote:
The following thread is based partly on, and makes reference to, private conversation, but for the sake of openness I've elected to post it to wine-devel.
A long time ago, HLSL_IR_LOAD—then called HLSL_IR_DEREF—was this:
enum hlsl_ir_deref_type
{
    HLSL_IR_DEREF_VAR,
    HLSL_IR_DEREF_ARRAY,
    HLSL_IR_DEREF_RECORD,
};

struct hlsl_deref
{
    enum hlsl_ir_deref_type type;
    union
    {
        struct hlsl_ir_var *var;
        struct
        {
            struct hlsl_ir_node *array;
            struct hlsl_ir_node *index;
        } array;
        struct
        {
            struct hlsl_ir_node *record;
            struct hlsl_struct_field *field;
        } record;
    } v;
};

struct hlsl_ir_deref
{
    struct hlsl_ir_node node;
    struct hlsl_deref src;
};
Now, one problem with this is that it was kind of mean to RA and liveness analysis. For example, a line of HLSL like
var.a.b = 2.0;
produced the following IR:
2: 2.0
3: deref(var)
4: @3.b
5: @4.c = @2
This is annoying because:
- to discover that "var" is written, @5 needs to reach upwards through a deref chain;
- reaching through the deref chain requires lots of assert() statements;
- @3 implies that "var" is read, which it isn't (and, if we reach upwards through the deref chain, @4 implies the same thing).
I proposed that instead of using generic node pointers, we could have arbitrarily long deref chains encoded in the hlsl_deref structure itself. [1]
There was some discussion on that—which is mostly concentrated in that thread, and also IRC. Most of the concern is about being nicer to liveness analysis and RA.
What ultimately ended up happening is that Matteo proposed numeric (register) offsets calculated at parse time, which is fundamentally similar to my idea except that it's a lot simpler to work with.
Interestingly, the problem of multiple register sets was brought up [2]:
From my testing it essentially does, yes, i.e. if you have

struct
{
    int unused;
    float f;
    bool b;
} s;

float4 main(float4 pos : POSITION) : POSITION
{
    if (s.b)
        return s.f;
    return pos;
}

then "s" gets allocated to registers b0-b2 and c0-c1, but only b2 and c1 are ever used.

So yeah, it makes things pretty simple. I can see how it would have been a lot uglier otherwise.
I guess we've finally run into that ugliness now :-(
The ultimate conclusions to draw from this historical exercise are:
- what I said about "we used to have derefs handled like that" is mostly
correct, although not quite. We did used to have more rich type information, and we did decide that offsets calculated at parse time were preferable to that type information, although I thought we at one point had something like [1] in the tree, which we didn't. Anyway the decision to use offsets calculated at parse time seems to have been motivated only by simplicity. To be fair, at the time, it *was* simpler.
It still is simpler, baseline. Unfortunately it's now clear that, in general, it comes with a cost, i.e. register offsets (a.k.a. SM-dependent details) leaking through the higher-level IR. I don't think that should automatically make the current solution null and void, though.
I'll also note that we could potentially support two different ways to do "derefs", the current one and something similar to the old way (or one of the replacements mentioned later). Probably not worth the complexity but it is a possibility.
- [1] and the later patch that replaced it were mostly motivated by RA.
We will probably end up doing RA after SMxIR translation, but we may very likely do RA *before* it as well (tracking e.g. SMx instructions with register numbers instead of having def-use chains.)
Yes. We should at the very least keep the door open to that.
A more salient concern is that I still don't like the idea of having instructions in the tree that aren't actually translated (or translatable) to SMxIR, which means that we shouldn't have instructions that yield e.g. structs.
It seems like a fair design choice.
The ugliness that we've run into is: how do we emit IR for the following variable load?
struct apple
{
    int a;
    struct
    {
        Texture2D b;
        int c;
    } s;
} a;

/* in some expression */
func(a.s);
Unlike the SM1 example above, the register numbers don't match up. Separately, it's kind of ugly that backend-specific details regarding register size and alignment are leaking into the frontend so much.
I think most of that can be hidden or contained with some proper abstraction. And generous handwaving. But basically, that probably could be represented in the IR as copying around individual fields of the structure separately, rather than a single "struct deref". Clearly it can become more complex depending on the type of the variable but I think it should be doable.
Similarly, the amount of code that has to deal with matrix majority is unfortunate.
That personally seems more annoying. Although it's not clear to me that handling matrix majority at a later stage is necessarily any better.
The former problem can potentially be solved by embedding multiple register offsets into hlsl_deref (one per register type). Neither this nor the latter problem are prohibitive, and I was at one point in favour of continuing to use register offsets everywhere, but at this point my feeling has changed, and I think using register offsets is looking more ugly than the alternatives. I get the impression that Francisco disagrees, though, which is why we should probably hash this out now.
As I mention below, I currently see two options as the most appealing. This one (multiple register offsets) sits somewhat in the middle and it feels like it would be best to go to one of the extremes instead. It's also possible that this middle ground solution would end up being nicer in practice. At any rate, I certainly wouldn't flat out discount it.
Nor do I think we should use both register offsets and component offsets (either in the same node type, or in different node types). That just makes the IR way more complicated. Rather, I think we should be doing everything in *just* component offsets until translation from HLSL IR to SMx IR.
I touched on this earlier and I agree that the additional complexity is unlikely to be worth it. Admittedly we're in a limbo right now where SMxIR isn't quite there yet, which makes reasoning on some of these details a bit fuzzy.
In order to deal with the problem of translating dynamic offsets from components to registers, I see three options:
(a) emit code at runtime, or do some sophisticated lowering,
(b) use special offsetof and sizeof nodes,
(c) introduce a structured deref type, much like [1]. Francisco was actually proposing something like this, although with an array instead of a recursive structure, which strikes me as an improvement.
My guess is that (a) is very hard. I haven't really tried to reason it out, though.
Given a choice between (b) and (c), I'm more inclined to pick (c). It makes the IR structure more restrictive, and those restrictions fundamentally match the structured nature of the language we're working with, both things I tend to like.
After giving it some thought I think that's certainly fine *for the higher level IR*. At the same time it seems to me that, if we go that route, eventually we also want to have real SMxIR with register offsets, and make sure that we can optimize constant offsets (thus expressions) at that level.
As I see it (as of current time and date, can't guarantee that I won't change my mind again...) we either push the backend-specific info up (register offsets all the way) or down (component offsets with structured deref / type info in the generic IR, transformation into register offsets in the SMxIR). I think either option works and it's mostly a matter of preference and which one fits / feels better with the rest of the compiler.
Note that either way we're going to need specialized functions to resolve deref offsets in one step. I also think that should depend on the domain—e.g. for copy-prop we'll actually want to do everything in component counts, but when translating to SMxIR we'll evaluate given the register alignment constraints of the shader model. In the case of (b) it's not going to be as simple as running the existing constant folding pass, because we can't actually fold the sizeof/offsetof constants (unless we dup the node list, evaluate, and then fold, which seems very hairy and more work than the alternative).
Right, each option will have different tradeoffs WRT optimization passes. But e.g. copy-prop should be doable even with register offsets, we "just" need to make sure to always map the component offsets to their respective register offsets.
I invite thoughts—especially from Matteo, since we discussed this sort of problem ages ago.
Yep, hope that my comments make sense. I want to hear from the others too.
ἔρρωσθε, Zeb
[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
On 6/9/22 04:04, Matteo Bruni wrote:
The ugliness that we've run into is: how do we emit IR for the following variable load?
struct apple
{
    int a;
    struct
    {
        Texture2D b;
        int c;
    } s;
} a;

/* in some expression */
func(a.s);
Unlike the SM1 example above, the register numbers don't match up. Separately, it's kind of ugly that backend-specific details regarding register size and alignment are leaking into the frontend so much.
I think most of that can be hidden or contained with some proper abstraction. And generous handwaving. But basically, that probably could be represented in the IR as copying around individual fields of the structure separately, rather than a single "struct deref". Clearly it can become more complex depending on the type of the variable but I think it should be doable.
Yeah, it could. Like I said it's not prohibitive. I'm just not sure it's the best option at this point.
It's worth pointing out that, at parse time, we want and need load instructions (and therefore probably also store instructions) to have larger-than-vector types—that is, load instructions can produce structs, and store instructions can consume them. But we don't want that for SMxIR, and I believe we don't want that for the "final form" of HLSL IR either. That's the way the code is currently arranged and I see no reason not to keep it that way.
Similarly, the amount of code that has to deal with matrix majority is unfortunate.
That personally seems more annoying. Although it's not clear to me that handling matrix majority at a later stage is necessarily any better.
The main idea is that we could handle it something closer to once (well, once per backend), at HLSL -> SMx translation.
That doesn't necessarily mean requiring that all matrix loads and stores are done on a single scalar—after all, we could translate a single vector load to multiple MOV instructions if it can't actually be represented by one.
It does potentially mean doing vectorization passes on SMxIR, though. Hard to tell this far in advance, and it's also hard to tell if that's something we're going to need anyway.
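To give a concrete (if hypothetical) example of the kind of translation I mean:

---
column_major float4x4 m;

float4 f(void)
{
    /* With column-major packing each column of m occupies one register, so
     * reading the row m[1] has to gather one component from each of four
     * registers; a single HLSL-level vector load becomes several scalar
     * MOVs (or equivalent) at HLSL -> SMx translation time.  With
     * row_major packing the same load is a single register read. */
    return m[1];
}
---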
The former problem can potentially be solved by embedding multiple register offsets into hlsl_deref (one per register type). Neither this nor the latter problem are prohibitive, and I was at one point in favour of continuing to use register offsets everywhere, but at this point my feeling has changed, and I think using register offsets is looking more ugly than the alternatives. I get the impression that Francisco disagrees, though, which is why we should probably hash this out now.
As I mention below, I currently see two options as the most appealing. This one (multiple register offsets) sits somewhat in the middle and it feels like it would be best to go to one of the extremes instead. It's also possible that this middle ground solution would end up being nicer in practice. At any rate, I certainly wouldn't flat out discount it.
Nor do I think we should use both register offsets and component offsets (either in the same node type, or in different node types). That just makes the IR way more complicated. Rather, I think we should be doing everything in *just* component offsets until translation from HLSL IR to SMx IR.
I touched on this earlier and I agree that the additional complexity is unlikely to be worth it. Admittedly we're in a limbo right now where SMxIR isn't quite there yet, which makes reasoning on some of these details a bit fuzzy.
In order to deal with the problem of translating dynamic offsets from components to registers, I see three options:
(a) emit code at runtime, or do some sophisticated lowering,
(b) use special offsetof and sizeof nodes,
(c) introduce a structured deref type, much like [1]. Francisco was actually proposing something like this, although with an array instead of a recursive structure, which strikes me as an improvement.
My guess is that (a) is very hard. I haven't really tried to reason it out, though.
Given a choice between (b) and (c), I'm more inclined to pick (c). It makes the IR structure more restrictive, and those restrictions fundamentally match the structured nature of the language we're working with, both things I tend to like.
After giving it some thought I think that's certainly fine *for the higher level IR*. At the same time it seems to me that, if we go that route, eventually we also want to have real SMxIR with register offsets, and make sure that we can optimize constant offsets (thus expressions) at that level.
As I see it (as of current time and date, can't guarantee that I won't change my mind again...) we either push the backend-specific info up (register offsets all the way) or down (component offsets with structured deref / type info in the generic IR, transformation into register offsets in the SMxIR). I think either option works and it's mostly a matter of preference and which one fits / feels better with the rest of the compiler.
Yeah, that general approach makes sense to me. And yes, of course the SMxIR should deal entirely in register offsets.
My current vision of SMxIR is that it should be a one-to-one representation of actual instructions, writable without any lowering passes (and hence any passes that are done on it should be optimization only, with the *possible* exception of RA.) In a sense, it's what we have already with sm4_instruction and such, except that we'd be storing it and doing passes on it rather than just writing it directly.
Between those two extremes—well, what we currently have basically *is* the first extreme, with register offsets pushed all the way up to parse time. It's just causing some friction that makes me think the latter extreme is probably going to be pretty.
Note that either way we're going to need specialized functions to resolve deref offsets in one step. I also think that should depend on the domain—e.g. for copy-prop we'll actually want to do everything in component counts, but when translating to SMxIR we'll evaluate given the register alignment constraints of the shader model. In the case of (b) it's not going to be as simple as running the existing constant folding pass, because we can't actually fold the sizeof/offsetof constants (unless we dup the node list, evaluate, and then fold, which seems very hairy and more work than the alternative).
Right, each option will have different tradeoffs WRT optimization passes. But e.g. copy-prop should be doable even with register offsets, we "just" need to make sure to always map the component offsets to their respective register offsets.
Quite, in fact we're already doing it that way. But it's probably better to work with components, since we (a) don't waste space tracking padding [not very important], and (b) don't have to deal with multiple register sets [more important].
I invite thoughts—especially from Matteo, since we discussed this sort of problem ages ago.
Yep, hope that my comments make sense. I want to hear from the others too.
ἔρρωσθε, Zeb
[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
On Thu, Jun 9, 2022 at 9:17 PM Zebediah Figura zfigura@codeweavers.com wrote:
On 6/9/22 04:04, Matteo Bruni wrote:
The ugliness that we've run into is: how do we emit IR for the following variable load?
struct apple
{
    int a;
    struct
    {
        Texture2D b;
        int c;
    } s;
} a;

/* in some expression */
func(a.s);
Unlike the SM1 example above, the register numbers don't match up. Separately, it's kind of ugly that backend-specific details regarding register size and alignment are leaking into the frontend so much.
I think most of that can be hidden or contained with some proper abstraction. And generous handwaving. But basically, that probably could be represented in the IR as copying around individual fields of the structure separately, rather than a single "struct deref". Clearly it can become more complex depending on the type of the variable but I think it should be doable.
Yeah, it could. Like I said it's not prohibitive. I'm just not sure it's the best option at this point.
It's worth pointing out that, at parse time, we want and need for load instructions (and therefore probably also store instructions) to have larger-than-vector types—that is, load instructions can produce structs, and store instructions can consume them. But we don't want that for SMxIR, and I believe we don't want that for the "final form" of HLSL IR either. That's the way the code is currently arranged and I see no reason not to keep it that way.
Right, and agreed, but that's currently taken care of by split_struct_copies() and I don't think that's a (large) problem for the register offset approach.
Similarly, the amount of code that has to deal with matrix majority is unfortunate.
That personally seems more annoying. Although it's not clear to me that handling matrix majority at a later stage is necessarily any better.
The main idea is that we could handle it something closer to once (well, once per backend), at HLSL -> SMx translation.
That doesn't necessarily mean requiring that all matrix loads and stores are done on a single scalar—after all, we could translate a single vector load to multiple MOV instructions if it can't actually be represented by one.
It does potentially mean doing vectorization passes on SMxIR, though. Hard to tell this far in advance, and it's also hard to tell if that's something we're going to need anyway.
Yeah, that's the sort of possible snag that I was thinking about WRT moving the matrix majority handling "down".
The former problem can potentially be solved by embedding multiple register offsets into hlsl_deref (one per register type). Neither this nor the latter problem are prohibitive, and I was at one point in favour of continuing to use register offsets everywhere, but at this point my feeling has changed, and I think using register offsets is looking more ugly than the alternatives. I get the impression that Francisco disagrees, though, which is why we should probably hash this out now.
As I mention below, I currently see two options as the most appealing. This one (multiple register offsets) sits somewhat in the middle and it feels like it would be best to go to one of the extremes instead. It's also possible that this middle ground solution would end up being nicer in practice. At any rate, I certainly wouldn't flat out discount it.
Nor do I think we should use both register offsets and component offsets (either in the same node type, or in different node types). That just makes the IR way more complicated. Rather, I think we should be doing everything in *just* component offsets until translation from HLSL IR to SMx IR.
I touched on this earlier and I agree that the additional complexity is unlikely to be worth it. Admittedly we're in a limbo right now where SMxIR isn't quite there yet, which makes reasoning on some of these details a bit fuzzy.
In order to deal with the problem of translating dynamic offsets from components to registers, I see three options:
(a) emit code at runtime, or do some sophisticated lowering,
(b) use special offsetof and sizeof nodes,
(c) introduce a structured deref type, much like [1]. Francisco was actually proposing something like this, although with an array instead of a recursive structure, which strikes me as an improvement.
My guess is that (a) is very hard. I haven't really tried to reason it out, though.
Given a choice between (b) and (c), I'm more inclined to pick (c). It makes the IR structure more restrictive, and those restrictions fundamentally match the structured nature of the language we're working with, both things I tend to like.
After giving it some thought I think that's certainly fine *for the higher level IR*. At the same time it seems to me that, if we go that route, eventually we also want to have real SMxIR with register offsets, and make sure that we can optimize constant offsets (thus expressions) at that level.
As I see it (as of current time and date, can't guarantee that I won't change my mind again...) we either push the backend-specific info up (register offsets all the way) or down (component offsets with structured deref / type info in the generic IR, transformation into register offsets in the SMxIR). I think either option works and it's mostly a matter of preference and which one fits / feels better with the rest of the compiler.
Yeah, that general approach makes sense to me. And yes, of course the SMxIR should deal entirely in register offsets.
My current vision of SMxIR is that it should be a one-to-one representation of actual instructions, writable without any lowering passes (and hence any passes that are done on it should be optimization only, with the *possible* exception of RA.) In a sense, it's what we have already with sm4_instruction and such, except that we'd be storing it and doing passes on it rather than just writing it directly.
Right, I agree with the general idea. In practice it might turn out to be useful to relax the 1:1 requirements a bit and introduce some "extra" instructions (that would be quickly lowered to real ones) if that makes the HIR->SMxIR conversion easier.
Something related we already discussed: SM1IR and SM4IR are going to be different, obviously, but we want to try and architect them so that the optimization passes machinery and most of the actual passes can be shared between the two.
Between those two extremes—well, what we currently have basically *is* the first extreme, with register offsets pushed all the way up to parse time. It's just causing some friction that makes me think the latter extreme is probably going to be pretty.
It is, but it sort of happened by accident. My guess is that an explicit effort to "clean up" the compiler around register offsets from the top might lessen some of that friction.
It's fair to say that, at this point, we probably have more visibility into the latter, which arguably makes that option more compelling.
Note that either way we're going to need specialized functions to resolve deref offsets in one step. I also think that should depend on the domain—e.g. for copy-prop we'll actually want to do everything in component counts, but when translating to SMxIR we'll evaluate given the register alignment constraints of the shader model. In the case of (b) it's not going to be as simple as running the existing constant folding pass, because we can't actually fold the sizeof/offsetof constants (unless we dup the node list, evaluate, and then fold, which seems very hairy and more work than the alternative).
Right, each option will have different tradeoffs WRT optimization passes. But e.g. copy-prop should be doable even with register offsets, we "just" need to make sure to always map the component offsets to their respective register offsets.
Quite, in fact we're already doing it that way. But it's probably better to work with components, since we (a) don't waste space tracking padding [not very important], and (b) don't have to deal with multiple register sets [more important].
Yeah. I guess my point would have been better served by pointing out passes that become easier or more powerful with register offsets, like probably vectorization.
I invite thoughts—especially from Matteo, since we discussed this sort of problem ages ago.
Yep, hope that my comments make sense. I want to hear from the others too.
ἔρρωσθε, Zeb
[1] https://www.winehq.org/pipermail/wine-devel/2020-April/164399.html
[2] https://www.winehq.org/pipermail/wine-devel/2020-April/165493.html
On 6/9/22 17:38, Matteo Bruni wrote:
My current vision of SMxIR is that it should be a one-to-one representation of actual instructions, writable without any lowering passes (and hence any passes that are done on it should be optimization only, with the *possible* exception of RA.) In a sense, it's what we have already with sm4_instruction and such, except that we'd be storing it and doing passes on it rather than just writing it directly.
Right, I agree with the general idea. In practice it might turn out to be useful to relax the 1:1 requirements a bit and introduce some "extra" instructions (that would be quickly lowered to real ones) if that makes the HIR->SMxIR conversion easier.
Something related we already discussed: SM1IR and SM4IR are going to be different, obviously, but we want to try and architect them so that the optimization passes machinery and most of the actual passes can be shared between the two.
Hmm, I don't remember that discussion. My take is that if there's only two backends it probably makes more sense not to (over-)engineer any common framework, but rather just make them look similar enough that we can copy-paste things from one to the other. (It's an unpopular opinion, perhaps, but I think that code duplication is an important and often the best method of code deduplication.)
Of course, I don't know what the sm6 backend will look like, when and if we add it.
On Fri, Jun 10, 2022 at 1:23 AM Zebediah Figura zfigura@codeweavers.com wrote:
On 6/9/22 17:38, Matteo Bruni wrote:
My current vision of SMxIR is that it should be a one-to-one representation of actual instructions, writable without any lowering passes (and hence any passes that are done on it should be optimization only, with the *possible* exception of RA.) In a sense, it's what we have already with sm4_instruction and such, except that we'd be storing it and doing passes on it rather than just writing it directly.
Right, I agree with the general idea. In practice it might turn out to be useful to relax the 1:1 requirements a bit and introduce some "extra" instructions (that would be quickly lowered to real ones) if that makes the HIR->SMxIR conversion easier.
Something related we already discussed: SM1IR and SM4IR are going to be different, obviously, but we want to try and architect them so that the optimization passes machinery and most of the actual passes can be shared between the two.
Hmm, I don't remember that discussion. My take is that if there's only two backends it probably makes more sense not to (over-)engineer any common framework, but rather just make them look similar enough that we can copy-paste things from one to the other. (It's an unpopular opinion, perhaps, but I think that code duplication is an important and often the best method of code deduplication.)
Actually I think we had this same conversation before and I totally misremembered your side.
In general, sure, deduplicating stuff a posteriori is the safest route. Also, I think I imagined struct sm1_instruction and struct sm4_instruction being more similar to each other than they actually are. So maybe keeping the current separation is a smart idea. That said, I don't think it's a terrible move to try and keep them close-ish where it doesn't get in the way. E.g. the sm1 dest register modifiers are effectively the same thing as the sm4 instruction modifiers (there is only one dest register in the SM1 instruction encoding, and arguably the fact that these modifiers are part of the dest register is more of an encoding artifact than anything else), so it should be no trouble to call those "instruction modifiers" in both cases.
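Just to illustrate what I mean (the names here are invented, not anything we have in the tree):

---
/* Hypothetical: both IRs could expose e.g. saturate as an instruction-level
 * modifier, even though SM1 happens to encode it on the destination
 * register and SM4 encodes it on the instruction token. */
enum smx_instruction_modifier
{
    SMX_MODIFIER_NONE     = 0x0,
    SMX_MODIFIER_SATURATE = 0x1,
};
---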
So, as Matteo summarized, we are between 2 main options:
a) Multiple register offsets.
b) Component offsets with structured dereference info.
How I see it, (a) changes hlsl_deref to:
---
enum register_set
{
    HLSL_REGSET_OBJ,
    HLSL_REGSET_NUM,
    /* ... add more as needed, to cover for all SMs. */

    HLSL_REGSET_COUNT,
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    struct hlsl_src offset[HLSL_REGSET_COUNT];
};
---
Also, the types' reg_size becomes reg_size[HLSL_REGSET_COUNT], and so do field offsets. Many functions have to receive an additional register_set argument or an array of offsets instead of a single offset.
On the other hand, the version of (b) I imagine changes hlsl_deref to:

---
struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int route_len;
    struct hlsl_src *route;
};
---

Where route is intended to be a variable-size array of component offsets. It would make sense to remove reg_size from the types, and also the field offsets. Functions that cannot receive structs or arrays may receive a "flattened" component offset that can then be translated into a route; other functions would require the route as an array.
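To make it concrete, this is how I picture the route for the earlier example (just my reading, the exact encoding is up for discussion):

---
/* struct apple { int a; struct { Texture2D b; int c; } s; } a;
 *
 * A load of "a.s.c" would carry var = a and a route of two constant nodes,
 * one per level of the type:
 *   route[0] = 1    (field "s" starts after the 1 component of "a")
 *   route[1] = 1    (field "c" starts after the 1 component of "b")
 * An array access like arr[i] instead contributes the (possibly
 * non-constant) node "i", which gets scaled by the element's component
 * count only when the route is flattened to a single component offset. */
---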
At this point I can see the benefits of (b) over (a), but also several complications that may arise (you have pointed out most of them):

- We will have to translate many things that are already in terms of register offsets into component offsets.
- We will have to move all optimization passes (like vectorization) that require register offsets to their specific SMxIR, and RA too.
- We will have to do the proper translation to register offsets, at each SMxIR level, and probably some sort of constant folding for them.
- Once we start supporting non-constant offsets, we may also want to introduce a common subexpression elimination pass for the register offset expressions that will arise (currently, the creation of common expressions is mainly avoided by the recursive structure of the split passes).
Solely because I have spent a considerable amount of time implementing option (a) (and some time giving up on implementing component offsets as a single hlsl_src instead of a path), I am rushing to finish (a) and see how the patch turns out in the end, before trying (b).
I do think that (a) can be cleaned up one step at a time. Even if the register sizes and offsets depend on the SM, we can write an interface to abstract the rest of the code away from using them directly, and gradually migrate the code that does to use this interface instead.
But so far, yeah, I am becoming convinced that (b) is better, albeit more difficult. If we do (b), I suggest we try to do it in separate steps, with some "scaffolding code":
- Add the component offset route to hlsl_deref without deleting the register offset.
- Create a simple pass that initializes the register offset field using the route in all the hlsl_derefs.
- Translate all the parse code first to work with the routes, and apply the pass just after parsing.
- Translate some more compilation passes and move the translation pass forward.
- Repeat until all the passes that can be written in terms of component offsets are.
- Write the SMxIRs and the SMxIR translations.
- Only then, remove the register offset from hlsl_deref and the translation pass from the codebase.
Otherwise we may end up writing a very big patch that may take too long to complete (!).
Best regards, Francisco.
On 6/10/22 14:13, Francisco Casas wrote:
So, as Matteo summarized, we are between 2 main options:
a) Multiple register offsets.
b) Component offsets with structured dereference info.
How I see it, (a) changes hlsl_deref to:
enum register_set
{
    HLSL_REGSET_OBJ,
    HLSL_REGSET_NUM,
    /* ... add more as needed, to cover for all SMs. */

    HLSL_REGSET_COUNT,
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    struct hlsl_src offset[HLSL_REGSET_COUNT];
};
Also, the types' reg_size becomes reg_size[HLSL_REGSET_COUNT], and so do field offsets. Many functions have to receive an additional register_set argument or an array of offsets instead of a single offset.
On the other hand, the version of (b) I imagine changes hlsl_deref to:
struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int route_len;
    struct hlsl_src *route;
};
Where route is intended to be a variable size array of component offsets.
I was envisioning something more explicit, but this is simpler, so my guess is that this is what we want.
It would make sense to remove reg_size from the types and also field offsets.
Yes, absolutely.
Functions that cannot receive structs or arrays may receive a "flattened" component offset that can then be translated into a route, other functions would require the route as an array.
Not immediately sure what functions you're thinking of, but I imagine things like hlsl_compute_component_offset() would now have to translate the offset into an array.
At this point I can see the benefits of (b) over (a), but also, several complications that may arise (you have pointed most of them):
- We will have to translate many things that are already in terms of
register offsets into component offsets.
How many things, though? As far as I can see it's:
- copy-prop
- copy splitting
- hlsl_offset_from_deref() [which is part of the point of the whole exercise]
That's pretty much it.
- We will have to move all optimization passes (like vectorization) that
require register offsets to their specific SMxIR, RA too.
I think we want to do RA per backend anyway. (It kind of already is per-backend, in a very awkward way.)
Vectorization is the main downside, but it's possible that we'd want that to be per-backend too.
- We will have to do the proper translation to register offsets, on each
SMxIR level, and probably some sort of constant folding for them.
Sure, but in a sense this is the point. The translation is backend-specific (what with alignment and register sets and all) and we should structure things accordingly.
hlsl_offset_from_deref() will need constant folding in a sense, but it'll be a pretty restricted form thereof.
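For what it's worth, the constant case I have in mind is roughly the following (entirely hypothetical code with made-up types, assuming SM4-ish vec4 packing and a path whose steps are all constant):

---
enum step_kind {STEP_ARRAY_INDEX, STEP_STRUCT_FIELD, STEP_COMPONENT};

struct path_step
{
    enum step_kind kind;
    unsigned int index;            /* constant index / field / component */
    unsigned int elem_reg_size;    /* register size of the array element type */
    unsigned int field_reg_offset; /* register offset of the struct field */
};

static unsigned int sm4_offset_from_constant_path(const struct path_step *steps,
        unsigned int count)
{
    unsigned int i, offset = 0;

    for (i = 0; i < count; ++i)
    {
        switch (steps[i].kind)
        {
            case STEP_ARRAY_INDEX:
                /* Array elements start on a vec4 boundary. */
                offset += steps[i].index * ((steps[i].elem_reg_size + 3) & ~3u);
                break;

            case STEP_STRUCT_FIELD:
                offset += steps[i].field_reg_offset;
                break;

            case STEP_COMPONENT:
                offset += steps[i].index;
                break;
        }
    }
    return offset;
}
---

Non-constant steps are exactly where this stops being a pure fold and we have to emit actual arithmetic instead.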
- Once we start supporting non-constant offsets, we may also want to
introduce a common subexpression elimination pass for the registers offsets expressions that will arise (currently, the creation of common expressions is mainly avoided by the recursive structure of the split passes).
What cases are you thinking of that would want CSE?
Solely because I have spent a considerable amount of time implementing option (a) (and some time giving up on implementing component offsets as a single hlsl_src, instead of a path) I am rushing (a) to see how the patch turns out in the end, before trying (b).
I so think that (a) can be cleansed one step at the time. Even if the register sizes and offsets depend on the SM, we can write an interface to abstract the rest of the code of using them directly, and gradually migrate the code that does to use this interface instead.
But so far, yeah, I am being convinced that (b) is better, however more difficult. If we do (b), I suggest we try to do it in separate steps, with some "scaffolding code":
- Add the component offset route to hlsl_deref without deleting the
register offset.
- Create a simple pass that initializes the register offset field using
the route in all the hlsl_derefs.
- Translate all the parse code first to work with the routes, and apply
the pass just after parsing.
- Translate some more compilation passes and move the translation pass
forward.
- Repeat until all the passes that can be written in terms of component
offsets are.
- Write the SMxIR·s and the SMxIR translations.
- Only then, remove the the register offset from hlsl_deref and the
translation pass from the codebase.
Otherwise we may end up writing a very big patch that may take too long to complete (!).
Ech, I think that trying to do two things at once is going to be more confusing than the alternative. I also don't think there's *that* much code that we'd have to change, i.e. a monolithic patch wouldn't be that bad.
For that matter, I'd be happy to try writing those patches myself :-)
ἔρρωσθε, Zeb
Nothing to say about the last two replies (i.e. I generally agree), just a couple of side-comments.
On Fri, Jun 10, 2022 at 11:13 PM Zebediah Figura zfigura@codeweavers.com wrote:
On 6/10/22 14:13, Francisco Casas wrote:
So, as Matteo summarized, we are between 2 main options:
a) Multiple register offsets.
b) Component offsets with structured dereference info.
How I see it, (a) changes hlsl_deref to:
enum register_set
{
    HLSL_REGSET_OBJ,
    HLSL_REGSET_NUM,
    /* ... add more as needed, to cover for all SMs. */

    HLSL_REGSET_COUNT,
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    struct hlsl_src offset[HLSL_REGSET_COUNT];
};
Also, the types' reg_size becomes reg_size[HLSL_REGSET_COUNT], and so do field offsets. Many functions have to receive an additional register_set argument or an array of offsets instead of a single offset.
On the other hand, the version of (b) I imagine changes hlsl_deref to:
struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int route_len;
    struct hlsl_src *route;
};
Where route is intended to be a variable size array of component offsets.
I was envisioning something more explicit, but this is simpler, so my guess is that this is what we want.
This is roughly what I was thinking too, although we probably want to find a better name than "route" :D
It would make sense to remove reg_size from the types and also field offsets.
Yes, absolutely.
Functions that cannot receive structs or arrays may receive a "flattened" component offset that can then be translated into a route, other functions would require the route as an array.
Not immediately sure what functions you're thinking of, but I imagine things like hlsl_compute_component_offset() would now have to translate the offset into an array.
At this point I can see the benefits of (b) over (a), but also, several complications that may arise (you have pointed most of them):
- We will have to translate many things that are already in terms of
register offsets into component offsets.
How many things, though? As far as I can see it's:
- copy-prop
- copy splitting
- hlsl_offset_from_deref() [which is part of the point of the whole exercise]
That's pretty much it.
- We will have to move all optimization passes (like vectorization) that
require register offsets to their specific SMxIR, RA too.
I think we want to do RA per backend anyway. (It kind of already is per-backend, in a very awkward way.)
Yes indeed, it really is. That's one of the places where, I think, we could potentially gain in clarity by making the per-backend stuff more explicit. If we were going for register offsets all the way.
Vectorization is the main downside but it's possible that we wanted that to be per-backend too.
And this is where it might turn out to be helpful to share the same fundamental structure between backends / SMxIR. Again, sharing stuff bottom-up, or after the fact, is certainly fine to me.
- We will have to do the proper translation to register offsets, on each
SMxIR level, and probably some sort of constant folding for them.
Sure, but in a sense this is the point. The translation is backend-specific (what with alignment and register sets and all) and we should structure things accordingly.
hlsl_offset_from_deref() will need constant folding in a sense, but it'll be a pretty restricted form thereof.
- Once we start supporting non-constant offsets, we may also want to
introduce a common subexpression elimination pass for the registers offsets expressions that will arise (currently, the creation of common expressions is mainly avoided by the recursive structure of the split passes).
What cases are you thinking of that would want CSE?
The most obvious case I can think of is when some non-trivial variable is dynamically accessed multiple times inside a loop at the same, or similar, offset (e.g. accessing multiple fields of the same element of an array of structures).
Solely because I have spent a considerable amount of time implementing option (a) (and some time giving up on implementing component offsets as a single hlsl_src, instead of a path) I am rushing (a) to see how the patch turns out in the end, before trying (b).
I so think that (a) can be cleansed one step at the time. Even if the register sizes and offsets depend on the SM, we can write an interface to abstract the rest of the code of using them directly, and gradually migrate the code that does to use this interface instead.
But so far, yeah, I am being convinced that (b) is better, however more difficult. If we do (b), I suggest we try to do it in separate steps, with some "scaffolding code":
- Add the component offset route to hlsl_deref without deleting the
register offset.
- Create a simple pass that initializes the register offset field using
the route in all the hlsl_derefs.
- Translate all the parse code first to work with the routes, and apply
the pass just after parsing.
- Translate some more compilation passes and move the translation pass
forward.
- Repeat until all the passes that can be written in terms of component
offsets are.
- Write the SMxIR·s and the SMxIR translations.
- Only then, remove the the register offset from hlsl_deref and the
translation pass from the codebase.
Otherwise we may end up writing a very big patch that may take too long to complete (!).
Ech, I think that trying to do two things at once is going to be more confusing than the alternative. I also don't think there's *that* much code that we'd have to change, i.e. a monolithic patch wouldn't be that bad.
For that matter, I'd be happy to try writing those patches myself :-)
ἔρρωσθε, Zeb
Hello,
Functions that cannot receive structs or arrays may receive a "flattened" component offset that can then be translated into a route, other functions would require the route as an array.
Not immediately sure what functions you're thinking of, but I imagine things like hlsl_compute_component_offset() would now have to translate the offset into an array.
For instance, hlsl_new_load(), hlsl_new_resource_load(), and hlsl_new_store() will have to receive routes instead of single nodes as offsets now.
At this point I can see the benefits of (b) over (a), but also, several complications that may arise (you have pointed most of them):
- We will have to translate many things that are already in terms of
register offsets into component offsets.
How many things, though? As far as I can see it's:
- copy-prop
- copy splitting
- hlsl_offset_from_deref() [which is part of the point of the whole exercise]
That's pretty much it.
Well, maybe it is not too much, but we also have to add initializers to the list, as well as everywhere else the previous 3 functions are called.
- Once we start supporting non-constant offsets, we may also want to
introduce a common subexpression elimination pass for the registers offsets expressions that will arise (currently, the creation of common expressions is mainly avoided by the recursive structure of the split passes).
What cases are you thinking of that would want CSE?
Matteo pointed it out. Basically, if we have several copies that come from a deeply nested type, let's say a 4-dimensional array:
float4 arr[10][10][10][10];
Consider the following 2 loads:
arr[i][j][k][0]
arr[i][j][k][1]
The idea is that the common part of the SMx-specific register offsets (4000 * i + 400 * j + 40 * k) is shared among both and not computed twice.
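In other words (sticking with the offsets above, whatever the exact per-SM scale ends up being):

---
/* Without CSE, each load recomputes the full expression:
 *   offset0 = 4000 * i + 400 * j + 40 * k + 0
 *   offset1 = 4000 * i + 400 * j + 40 * k + 1
 * With CSE:
 *   t = 4000 * i + 400 * j + 40 * k
 *   offset0 = t + 0
 *   offset1 = t + 1 */
---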
Solely because I have spent a considerable amount of time implementing option (a) (and some time giving up on implementing component offsets as a single hlsl_src, instead of a path) I am rushing (a) to see how the patch turns out in the end, before trying (b).
I so think that (a) can be cleansed one step at the time. Even if the register sizes and offsets depend on the SM, we can write an interface to abstract the rest of the code of using them directly, and gradually migrate the code that does to use this interface instead.
But so far, yeah, I am being convinced that (b) is better, however more difficult. If we do (b), I suggest we try to do it in separate steps, with some "scaffolding code":
- Add the component offset route to hlsl_deref without deleting the
register offset.
- Create a simple pass that initializes the register offset field
using the route in all the hlsl_derefs.
- Translate all the parse code first to work with the routes, and
apply the pass just after parsing.
- Translate some more compilation passes and move the translation pass
forward.
- Repeat until all the passes that can be written in terms of
component offsets are.
- Write the SMxIR·s and the SMxIR translations.
- Only then, remove the the register offset from hlsl_deref and the
translation pass from the codebase.
Otherwise we may end up writing a very big patch that may take too long to complete (!).
Ech, I think that trying to do two things at once is going to be more confusing than the alternative. I also don't think there's *that* much code that we'd have to change, i.e. a monolithic patch wouldn't be that bad.
Okay, I stopped working on multiple register offsets and now I am working on this approach.
I still think that adding a pass to translate these "route"s to the original register "offset"s, so that we can implement this change gradually, is good. We can move the pass forward as we change more passes to work with component offsets, and we can more easily debug the things that we don't translate correctly.
I am aiming to implement the new approach for everything up to right after split_matrix_copies, and put the translation pass afterwards. So far it seems to be going nicely.
For that matter, I'd be happy to try writing those patches myself :-)
Give me a couple of days to finish a patch with the previous idea first. Even if we don't want to introduce the ugliness of this transitional state upstream, it may be a good starting point for the big patch.
ἔρρωσθε, Zeb
On 6/13/22 10:59, Francisco Casas wrote:
Hello,
Functions that cannot receive structs or arrays may receive a "flattened" component offset that can then be translated into a route, other functions would require the route as an array.
Not immediately sure what functions you're thinking of, but I imagine things like hlsl_compute_component_offset() would now have to translate the offset into an array.
For instance, hlsl_new_load(), hlsl_new_resource_load(), and hlsl_new_store() will have to receive routes instead of single nodes as offsets now.
They'll have to be adjusted, although there are probably simpler ways than passing arrays to them directly. It'll probably take writing the patch to find out.
At this point I can see the benefits of (b) over (a), but also, several complications that may arise (you have pointed most of them):
- We will have to translate many things that are already in terms of
register offsets into component offsets.
How many things, though? As far as I can see it's:
copy-prop
copy splitting
hlsl_offset_from_deref() [which is part of the point of the whole
exercise]
That's pretty much it.
Well, maybe it is not too much, but we also have to add initializers to the list and everywhere else the previous 3 functions are called.
- Once we start supporting non-constant offsets, we may also want to
introduce a common subexpression elimination pass for the registers offsets expressions that will arise (currently, the creation of common expressions is mainly avoided by the recursive structure of the split passes).
What cases are you thinking of that would want CSE?
Matteo pointed it out. Basically if we have several copies that come from a deep struct, let's say 4-dimension array:
float4 arr[10][10][10][10];
Consider the following 2 loads:
arr[i][j][k][0]
arr[i][j][k][1]
The idea is that the common part of the SMx-specific register offsets (4000 * i + 400 * j + 40 * k) is shared among both and not computed twice.
Hmm, interesting.
That would be avoidable if we had e.g. HLSL_IR_LOAD instructions, but that still brings us back to the problem where we have unused loads we can't DCE.
My inclination is that yeah, if we end up caring about this we should just do a late CSE pass.
Hello,
On 13-06-22 11:59, Francisco Casas wrote:
Okay, I stopped working on multiple register offsets and now I am working on this approach.
I still think that adding a pass to translate these "route"s to the original register "offset"s, so that we can implement this change gradually, is good. We can move the pass forward as we change more passes to work with component offsets. And we can debug more easily the things that we don't translate correctly.
I am aiming to implement the new approach until right after split_matrix_copies, and put the translation afterwards. So far it seems to be going nicely.
I attach a patch that achieves this, in case there are any opinions.
The "paths"[1] of component indexes seem like the best way to solve the problem after all. I think that keeping them as an arrays of nodes in the hlsl_deref allows to write more uniform code than using a more complex data structure, as Giovanni mentioned.
I think this patch manages matrices nicely too, but there is the problem that it is based on top of

---
bb49bdba gmascellani vkd3d-shader/hlsl: Allow majority modifiers on function declarations.
---

which is a little behind master.
So I still have to rebase it on top of current master, and in particular on top of the recent patches pertaining to matrices. So I will probably have to make some design decisions while resolving the conflicts.
I guess that the main objective is that we don't want to deal with matrices at all after splitting matrix copies.
Best regards, Francisco.
---
[1] I hope that this is a better name than "route".
On 6/14/22 22:05, Francisco Casas wrote:
Hello,
On 13-06-22 11:59, Francisco Casas wrote:
Okay, I stopped working on multiple register offsets and now I am working on this approach.
I still think that adding a pass to translate these "route"s to the original register "offset"s, so that we can implement this change gradually, is good. We can move the pass forward as we change more passes to work with component offsets. And we can debug more easily the things that we don't translate correctly.
I am aiming to implement the new approach until right after split_matrix_copies, and put the translation afterwards. So far it seems to be going nicely.
I attach a patch that achieves this, in case there are any opinions.
The "paths"[1] of component indexes seem like the best way to solve the problem after all. I think that keeping them as an arrays of nodes in the hlsl_deref allows to write more uniform code than using a more complex data structure, as Giovanni mentioned.
IMO, I think this patch manages matrices nicely too, but there is the problem that it is based on top of
bb49bdba gmascellani vkd3d-shader/hlsl: Allow majority modifiers on function declarations.
which is a little behind master.
So I still have to rebase it on top of current master, and in particular on top of the recent patches pertaining to matrices. So I will probably have to make some design decisions while solving the conflicts.
I guess that the main objective is that we don't want to deal with matrices at all after splitting matrix copies.
From skimming the patch, here are some immediate thoughts:
* I have a patch which has been around for a while which replaces the field linked list with an array. I think we wanted that anyway, but we especially want it given list_get().
* All current (upstream) uses of hlsl_new_store() and hlsl_new_load() with a nonzero offset fall into one of two categories:
- retrieving the element at a specific offset (initializer and matrix parsing)
- adding another offset to an existing hlsl_deref (splitting passes, initial parsing of field/element derefs)
I might have missed something, but assuming I've got that right, I think that the hlsl_new_load() and hlsl_new_store() should take arguments to reflect that. (I.e. split each into two functions, one taking an unsigned int and the other taking a const struct hlsl_deref *). In that case struct path_arg would be limited to hlsl.c, if it even needs to exist.
Note that this step could also be done independently of anything else, which would help make the "big" patch more reviewable; a rough sketch of what the split could look like follows below this list.
* Similarly I think hlsl_new_resource_load() and friends should just take a "const struct hlsl_deref *" or two. I'm not immediately sure what to do with prepend_input_copy() and friends—maybe those should be converted to iterative passes instead of recursive ones.
* Stuff like free_hlsl_deref() can be made into a separate patch.
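For illustration, the split being suggested above could look roughly like the prototypes below. The names follow the ones that come up later in the thread (hlsl_new_component_load() / hlsl_new_load_from_deref()), but the parameter lists are invented for the example and are not the actual vkd3d-shader API:

/* Hypothetical prototypes, for illustration only. */

/* Variant taking an existing deref plus an extra offset node. */
struct hlsl_ir_load *hlsl_new_load_from_deref(struct hlsl_ctx *ctx,
        const struct hlsl_deref *deref, struct hlsl_ir_node *offset,
        struct hlsl_type *type, struct vkd3d_shader_location loc);

/* Variant taking a variable and a constant component index. */
struct hlsl_ir_load *hlsl_new_component_load(struct hlsl_ctx *ctx,
        struct hlsl_ir_var *var, unsigned int comp,
        struct vkd3d_shader_location loc);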
Hello, thanks for commenting on the patch!
On 15-06-22 19:20, Zebediah Figura wrote:
From skimming the patch, here are some immediate thoughts:
- I have a patch which has been around for a while which replaces the
field linked list with an array. I think we wanted that anyway, but we especially want it given list_get().
This seems like a good idea, I will see if I can add it before this patch.
- All current (upstream) uses of hlsl_new_store() and hlsl_new_load()
with a nonzero offset fall into one of two categories:
- retrieving the element at a specific offset (initializer and matrix parsing)
- adding another offset to an existing hlsl_deref (splitting passes, initial parsing of field/element derefs)
I might have missed something, but assuming I've got that right, I think that the hlsl_new_load() and hlsl_new_store() should take arguments to reflect that. (I.e. split each into two functions, one taking an unsigned int and the other taking a const struct hlsl_deref *). In that case struct path_arg would be limited to hlsl.c, if it even needs to exist.
Good observation; I can see some problems, though:
a) The version of the function that receives the component index can generate the path of constant nodes alright, but it also needs to insert these nodes in the corresponding instruction list.
The instruction list can be received in this case and the new constant nodes can be added at the head, or we can allow the "steps" in the path to be either an hlsl_src or an unsigned int, and then add a pass that adds the constant nodes at the beginning of the program:

struct hlsl_step
{
    struct hlsl_src *src;
    unsigned int c; /* used if src is NULL */
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;

    unsigned int path_len;
    struct hlsl_step *path;
};

I don't like this idea very much, but it also covers the field accesses that Giovanni mentioned.
b) In my patch, accessing matrix components requires paths of length 2 (which I think is the right thing to do), so the version of the functions that receive an hlsl_deref should also receive 2 optional nodes instead of 1. So for instance, hlsl_new_store would be:

struct hlsl_ir_store *hlsl_new_store(struct hlsl_ctx *ctx, struct hlsl_ir_var *var,
        const struct hlsl_deref *deref, struct hlsl_ir_node *node1,
        struct hlsl_ir_node *node2, struct hlsl_ir_node *rhs,
        unsigned int writemask, struct vkd3d_shader_location loc);

this adds a little clutter.
c) We also have to change other functions, for instance, add_load() in upstream adds an arbitrary register offset to an hlsl_deref offset. In the patch, this generality is achieved by concatenating paths. Given the current uses of add_load() we would also need to create two versions of this function, the "hlsl_deref + 2 optional nodes" one and the "component index" one, and solve this particular case of problem (a) too.
I am not sure that losing the generality of using path_arg is a good thing; it can simplify some calls to loads and stores but that's the only benefit I see (unless I am missing something). I can switch to this approach though, if we agree on a way of solving (a).
Note that this step could also be done independently of anything else, which would help make the "big" patch more reviewable.
You mean something like adding a "Replace path_arg with specific load and store functions." patch afterwards?
- Similarly I think hlsl_new_resource_load() and friends should just
take a "const struct hlsl_deref *" or two.
Yes, this seems consistent if we do stores and loads this way.
I'm not immediately sure what to do with prepend_input_copy() and friends—maybe those should be converted to iterative passes instead of recursive ones.
I don't see any problem keeping them as recursive passes.
- Stuff like free_hlsl_deref() can be made into a separate patch.
Hmm, sorry, I am not sure I see a clear division here. What other things could be in this separate patch?
On 6/16/22 11:48, Francisco Casas wrote:
Hello, thanks for commenting on the patch!
On 15-06-22 19:20, Zebediah Figura wrote:
From skimming the patch, here are some immediate thoughts:
- I have a patch which has been around for a while which replaces the
field linked list with an array. I think we wanted that anyway, but we especially want it given list_get().
This seems like a good idea, I will see if I can add it before this patch.
- All current (upstream) uses of hlsl_new_store() and hlsl_new_load()
with a nonzero offset fall into one of two categories:
- retrieving the element at a specific offset (initializer and matrix parsing)
- adding another offset to an existing hlsl_deref (splitting passes, initial parsing of field/element derefs)
I might have missed something, but assuming I've got that right, I think that the hlsl_new_load() and hlsl_new_store() should take arguments to reflect that. (I.e. split each into two functions, one taking an unsigned int and the other taking a const struct hlsl_deref *). In that case struct path_arg would be limited to hlsl.c, if it even needs to exist.
Good observation; I can see some problems, though:
a) The version of the function that receives the component index can generate the path of constant nodes alright, but it also needs to insert these nodes in the corresponding instruction list.
The instruction list can be received in this case and the new constant nodes can be added at the head, or we can allow the "steps" in the path to be either an hlsl_src or an unsigned int, and then add a pass that adds the constant nodes at the beginning of the program:

struct hlsl_step
{
    struct hlsl_src *src;
    unsigned int c; /* used if src is NULL */
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;

    unsigned int path_len;
    struct hlsl_step *path;
};

I don't like this idea very much, but it also covers the field accesses that Giovanni mentioned.
b) In my patch, accessing matrix components requires paths of length 2 (which I think is the right thing to do), so the version of the functions that receive an hlsl_deref should also receive 2 optional nodes instead of 1. So for instance, hlsl_new_store would be:

struct hlsl_ir_store *hlsl_new_store(struct hlsl_ctx *ctx, struct hlsl_ir_var *var,
        const struct hlsl_deref *deref, struct hlsl_ir_node *node1,
        struct hlsl_ir_node *node2, struct hlsl_ir_node *rhs,
        unsigned int writemask, struct vkd3d_shader_location loc);

this adds a little clutter.
c) We also have to change other functions, for instance, add_load() in upstream adds an arbitrary register offset to an hlsl_deref offset. In the patch, this generality is achieved by concatenating paths. Given the current uses of add_load() we would also need to create two versions of this function, the "hlsl_deref + 2 optional nodes" one and the "component index" one, and solve this particular case of problem (a) too.
Sure, we'd have to split add_load() into two helpers too. I'm inclined to say that's worthwhile, though.
I am not sure that losing the generality of using path_arg is a good thing; it can simplify some calls to loads and stores but that's the only benefit I see (unless I am missing something). I can switch to this approach though, if we agree on a way of solving (a).
The problem with path_arg is that it's another layer of abstraction that has to be understood. It'd be nice if we could avoid that, or at least avoid letting it leak out of hlsl.c.
One obvious way to solve (a) and (b) is to replace hlsl_new_load() with a hlsl_add_load() that explicitly takes an instruction list and adds to the end. I think we had even discussed doing something similar to this previously, to save a bit of effort in the common new + insert case. This should cover most cases, but I'm not sure it covers all of them; there are places in hlsl_codegen.c where we use list_add_before(), list_add_after(), and list_add_head().
What we could do instead, to make sure all cases are covered, is something like
bool hlsl_new_xyzzy(struct hlsl_block *block, unsigned int c, ...)
{
    struct hlsl_ir_constant *constant;
    struct hlsl_ir_xyzzy *xyzzy;

    list_init(&block->instrs);

    constant = hlsl_new_uint_constant(c);
    list_add_tail(&block->instrs, &constant->node.entry);

    xyzzy = hlsl_alloc(...);
    init_node(&xyzzy->node, ...);
    list_add_tail(&block->instrs, &xyzzy->node.entry);

    return true; /* or xyzzy for convenience? */
}
and then
{
    struct hlsl_block block;

    hlsl_new_xyzzy(&block, 123, ...);
    list_move_{head,tail,before,after}(..., &block.instrs);
}
the idea basically being to return a list of instructions in a caller-allocated container.
I think there's been at least one other time I've wanted to do something like this, although I don't know what that was now.
Note that this step could also be done independently of anything else, which would help make the "big" patch more reviewable.
You mean something like adding a "Replace path_arg with specific load and store functions." patch afterwards?
I mean you can have one patch (or multiple patches) that change the hlsl_new_load() and hlsl_new_store() functions like I describe above, and then another patch (i.e. the one attached to your email, mostly) that changes the internal representation of hlsl_derefs from a register offset to a component chain.
- Similarly I think hlsl_new_resource_load() and friends should just
take a "const struct hlsl_deref *" or two.
Yes, this seems consistent if we do stores and loads this way.
I'm not immediately sure what to do with prepend_input_copy() and friends—maybe those should be converted to iterative passes instead of recursive ones.
I don't see any problem keeping them as recursive passes.
Well, if we follow my above suggestion regarding the interface for hlsl_new_load() and hlsl_new_store(), there's no way to pass a whole list of offsets at once. I think this is probably for the better, but it does mean we probably need to split up I/O copies one level at a time.
- Stuff like free_hlsl_deref() can be made into a separate patch.
Hmm, sorry, I am not sure I see a clear division here. What other things could be in this separate patch?
I just mean that you can invent a patch that adds
static void free_hlsl_deref(struct hlsl_deref *deref)
{
    hlsl_src_remove(&deref->offset);
}
and then in "this" patch you only need to change the body of free_hlsl_deref(), i.e. one hunk instead of several.
Hello,
I attach a second version of the previous patch, now rebased onto the master branch.
I will start working on including the suggestions that Zebediah made in the other e-mails.
Besides those, I will also check if we can remove the "type" argument from hlsl_new_load(); I think it can be deduced from the path and the variable.
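A rough sketch of what that deduction could look like is below. The helpers and field names here are invented for illustration, since the actual struct hlsl_type layout isn't part of this thread; the point is just that each path step peels one level off the variable's type:

/* Illustrative pseudocode only; the type_* helpers are hypothetical. */
static struct hlsl_type *deref_data_type(const struct hlsl_deref *deref)
{
    struct hlsl_type *type = deref->var->data_type;
    unsigned int i;

    for (i = 0; i < deref->path_len; ++i)
    {
        if (type_is_struct(type)) /* hypothetical: the step is a constant field index */
            type = type_get_field_type(type, constant_uint_value(deref->path[i].node));
        else /* array, matrix or vector: the step selects an element/row/component */
            type = type_get_element_type(type);
    }
    return type;
}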
Best regards, Francisco.
Hello,
I attach the third version of the previous patch, applying the suggestions made by Zeb, and some other changes that arose from them.
The patch is based on the master branch, but I guess I will have to update it if the recent patches from Giovanni are accepted, before sending it for review.
Best regards, Francisco.
Aaaand, I forgot to attach
02224576 zfigura vkd3d-shader/hlsl: Store the struct fields as an array.
before that one. Sorry.
On 6/24/22 23:46, Francisco Casas wrote:
Hello,
I attach the third version of the previous patch, applying the suggestions made by Zeb, and some other changes that arose from them.
The patch is based on the master branch, but I guess I will have to update it if the recent patches from Giovanni are accepted, before sending it for review.
Is it not possible to split up this patch, as I had asked earlier?
There's also a couple things I notice from skimming:
* I think add_load() should also be split into deref / component versions. Most of its users only need to count components.
* Actually, should we be passing a deref instead of just a var to hlsl_new_component_load()? See e.g. add_cast(). That would make it a simple wrapper around hlsl_new_load_from_deref() [and the names would also need to be changed...] I'm not sure why I thought it made sense like this, frankly.
* Do we need to always split up matrices into scalars? Should we make non-contiguous loads a problem for parse time? As long as we're trying to keep backend-specific details out of the HLSL IR, that seems potentially reasonable; it'd just require emitting multiple copy instructions for some cases. That would allow us to simplify split_copy() and add_matrix_scalar_load(), and I believe would thus also get rid of the need to pass multiple offsets to hlsl_new_load_from_deref().
Hello,
On 25-06-22 13:29, Zebediah Figura wrote:
On 6/24/22 23:46, Francisco Casas wrote:
Hello,
I attach the third version of the previous patch, applying the suggestions made by Zeb, and some other changes that arose from them.
The patch is based on the master branch, but I guess I will have to update it if the recent patches from Giovanni are accepted, before sending it for review.
Is it not possible to split up this patch, as I had asked earlier?
I think I can separate some small parts, like hlsl_free_deref() as you suggested, but we would still have a large patch that's hard to split.
There's also a couple things I notice from skimming:
- I think add_load() should also be split into deref / component
versions. Most of its users only need to count components.
add_record_load() and add_array_load() will require indexes, but yes, add_matrix_scalar_load(), intrinsic_mul(), add_expr(), and add_cast(), in particular the parts where they deal with matrices, could be simplified if we use component offsets.
I will split the function then.
- Actually, should we be passing a deref instead of just a var to
hlsl_new_component_load()? See e.g. add_cast(). That would make it a simple wrapper around hlsl_new_load_from_deref() [and the names would also need to be changed...] I'm not sure why I thought it made sense like this, frankly.
Allowing a deref, instead of just a var, to be passed to hlsl_new_component_load() seems like a good idea. It may save us from creating synthetic variables e.g. in initialize_var_components().
I don't think it can just be made a simple wrapper around hlsl_new_load_from_deref() though, since a single component within a struct may require a path of length greater than 2 to be accessed. It requires a slightly more complex implementation, but it can be done.
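To illustrate the point about path length (the type and indices here are made up for the example):

/* A single scalar component can need a path longer than 2 steps:
 *
 *     struct
 *     {
 *         struct
 *         {
 *             float f;
 *         } inner[4];
 *     } v;
 *
 * Loading v.inner[2].f as a path is {0, 2, 0} -- field "inner", array
 * index 2, field "f" -- i.e. three steps for what is simply component
 * index 2 of "v". */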
- Do we need to always split up matrices into scalars? Should we make
non-contiguous loads a problem for parse time? As long as we're trying to keep backend-specific details out of the HLSL IR, that seems potentially reasonable; it'd just require emitting multiple copy instructions for some cases. That would allow us to simplify split_copy() and add_matrix_scalar_load(), and I believe would thus also get rid of the need to pass multiple offsets to hlsl_new_load_from_deref().
If we go back to making non-contiguous loads a problem for parse time, then I think it would make sense for row_major matrices to retrieve a row, and for column_major matrices to retrieve a column, when indexing with a path.
I will see if I can do it while keeping the component offsets consistent with the order in the initializers.
Best regards, Francisco.
On 6/27/22 11:54, Francisco Casas wrote:
Hello,
On 25-06-22 13:29, Zebediah Figura wrote:
On 6/24/22 23:46, Francisco Casas wrote:
Hello,
I attach the third version of the previous patch, applying the suggestions made by Zeb, and some other changes that arose from them.
The patch is based on the master branch, but I guess I will have to update it if the recent patches from Giovanni are accepted, before sending it for review.
Is it not possible to split up this patch, as I had asked earlier?
I think I can separate some small parts, like hlsl_free_deref() as you suggested, but we would still have a large patch that's hard to split.
Sure. Every little bit helps, though.
I think if we introduce hlsl_new_component_load() and hlsl_new_load_from_deref() [or whatever they end up being called] in a separate patch (or patches) that'll do a lot to help review.
There's also a couple things I notice from skimming:
- I think add_load() should also be split into deref / component
versions. Most of its users only need to count components.
add_record_load() and add_array_load() will require indexes, but yes, add_matrix_scalar_load(), intrinsic_mul(), add_expr(), and add_cast(), in particular the parts where they deal with matrices, could be simplified if we use component offsets.
I will split the function then.
- Actually, should we be passing a deref instead of just a var to
hlsl_new_component_load()? See e.g. add_cast(). That would make it a simple wrapper around hlsl_new_load_from_deref() [and the names would also need to be changed...] I'm not sure why I thought it made sense like this, frankly.
Allowing a deref, instead of just a var, to be passed to hlsl_new_component_load() seems like a good idea. It may save us from creating synthetic variables e.g. in initialize_var_components().
I don't think it can just be made a simple wrapper around hlsl_new_load_from_deref() though, since a single component within a struct may require a path of length greater than 2 to be accessed. It requires a slightly more complex implementation, but it can be done.
Ah, right, of course.
- Do we need to always split up matrices into scalars? Should we make
non-contiguous loads a problem for parse time? As long as we're trying to keep backend-specific details out of the HLSL IR, that seems potentially reasonable; it'd just require emitting multiple copy instructions for some cases. That would allow us to simplify split_copy() and add_matrix_scalar_load(), and I believe would thus also get rid of the need to pass multiple offsets to hlsl_new_load_from_deref().
If we go back to making non-contiguous loads a problem for parse time, then I think it would make sense for row_major matrices to retrieve a row, and for column_major matrices to retrieve a column, when indexing with a path.
I will see if I can do it while keeping the component offsets consistent with the order in the initializers.
Right. To be clear I'm not *sure* if this is the best approach, but it seems possible.
Hello,
JIC, I attach the most recent version of the patch. I still have to split it, but I addressed all the other suggestions.
Best regards, Francisco.
Hi,
I finally managed to read through your messages, and it seems that we have a sort of agreement on trying structured dereferences, which I agree with too.
Il 10/06/22 23:13, Zebediah Figura ha scritto:
On the other hand, the version of (b) I imagine changes hlsl_deref to:
struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int route_len;
    struct hlsl_src *route;
};
Where route is intended to be a variable size array of component offsets.
I was envisioning something more explicit, but this is simpler, so my guess is that this is what we want.
I can live with Francisco's proposal, but my preference would be for something like:
enum hlsl_selector_type
{
    HLSL_SELECTOR_SUBSCRIPT,
    HLSL_SELECTOR_FIELD,
};

struct hlsl_selector
{
    enum hlsl_selector_type type;
    union
    {
        struct hlsl_src *subscript;
        unsigned int field;
    };
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int path_len;
    struct hlsl_selector *path;
};
This way field indices are kept as actual numbers, not other nodes that have to resolve to a constant uint value. In general I like data types to constrain the values that can be stored in them as much as possible (and I don't think there is any way in HLSL to make a dynamic reference to a field of a structure, as is possible in C++).
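To illustrate how a deref would be encoded under this proposal (the concrete field numbers are made up for the example):

/* Hypothetical encoding of something like "v.s[i].c", where "s" is
 * field 1 of v's type and "c" is field 1 of the inner struct. */
struct hlsl_src index_src; /* source pointing at the node that computes "i" */

struct hlsl_selector path[3] =
{
    {.type = HLSL_SELECTOR_FIELD,     .field = 1},              /* field "s" */
    {.type = HLSL_SELECTOR_SUBSCRIPT, .subscript = &index_src}, /* array index "i" */
    {.type = HLSL_SELECTOR_FIELD,     .field = 1},              /* field "c" */
};

Field steps are plain integers, and only genuine subscripts reference other IR nodes.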
Giovanni.
On 6/14/22 09:49, Giovanni Mascellani wrote:
Hi,
I finally managed to read through your messages, and it seems that we have a sort of agreement on trying structured dereferences, which I agree with too.
Il 10/06/22 23:13, Zebediah Figura ha scritto:
On the other hand, the version of (b) I imagine changes hlsl_deref to:
struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int route_len;
    struct hlsl_src *route;
};
Where route is intended to be a variable size array of component offsets.
I was envisioning something more explicit, but this is simpler, so my guess is that this is what we want.
I can live with Francisco's proposal, but my preference would be for something like:
enum hlsl_selector_type
{
    HLSL_SELECTOR_SUBSCRIPT,
    HLSL_SELECTOR_FIELD,
};

struct hlsl_selector
{
    enum hlsl_selector_type type;
    union
    {
        struct hlsl_src *subscript;
        unsigned int field;
    };
};

struct hlsl_deref
{
    struct hlsl_ir_var *var;
    unsigned int path_len;
    struct hlsl_selector *path;
};
This way field indices are kept as actual numbers, not other nodes that have to resolve to a constant uint value. In general I like data types to constrain the values that can be stored in them as much as possible (and I don't think there is any way in HLSL to make a dynamic reference to a field of a structure, as is possible in C++).
I agree in principle, but this does make the structure more complex to deal with. I'm inclined to stick to Francisco's proposal for now and just store it as a constant either way. We could certainly revisit it later, though.