Hello,
I am writing this update on what I am working on because I think it introduces some architectural changes that may be worth discussing, or at least mentioning.
Sorry if this text is a bit dense. I hope to send all of this as a merge request in a couple of days, although we may want to hold off on merging until the code freeze ends.
More than a week ago, I started trying to implement SM1 resource loads. For this, in particular, it is necessary to infer each sampler's dimensions from usage (a use in tex2D() implies that the sampler is 2D, if it was not previously declared as such). This gets complicated when there are arrays of samplers, because each sampler component can have a different dimension while still being part of the same variable.
To solve this, I tried introducing a field in hlsl_ir_var to store per-component data. However, this turned out to be a complicated solution since, as things stand, we lose the component information in the derefs once we start working with register offsets.
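A minimal sketch of per-component inference, assuming a hypothetical table with one dimension entry per (flattened) sampler component; none of these names exist in vkd3d. A component declared as a plain `sampler` takes the dimension implied by the first intrinsic that uses it, and a later use implying a different dimension is reported as a conflict (how the native compiler handles that case is not established here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sampler dimensions; GENERIC means "declared as a plain sampler". */
enum sampler_dim
{
    SAMPLER_DIM_GENERIC,
    SAMPLER_DIM_1D,
    SAMPLER_DIM_2D,
    SAMPLER_DIM_3D,
    SAMPLER_DIM_CUBE,
};

/* Hypothetical: one entry per component of a (flattened) sampler array. */
struct sampler_components
{
    enum sampler_dim dims[16];
    size_t count;
};

/* Record that component 'index' is used by an intrinsic implying 'dim'
 * (e.g. tex2D() implies SAMPLER_DIM_2D). Returns false on a conflict. */
bool infer_sampler_dim(struct sampler_components *s, size_t index, enum sampler_dim dim)
{
    assert(index < s->count);
    assert(dim != SAMPLER_DIM_GENERIC);

    if (s->dims[index] == SAMPLER_DIM_GENERIC)
    {
        s->dims[index] = dim;
        return true;
    }
    return s->dims[index] == dim;
}
```

For `sampler sam[3][2]`, `count` would be 6, and `tex2D(sam[2][1], ...)` would call `infer_sampler_dim(&s, 5, SAMPLER_DIM_2D)`, leaving the other five components untouched.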
I also wanted to have per-component data because I realized that, in general, we are not handling object components properly when they are part of a larger array/struct.
But then I thought of an alternative solution: adding a compiler pass that promotes each object component within a larger variable to a new standalone variable and prepends a store from this standalone variable to the original path of the component within the larger variable, relying on copy-prop and DCE so that all references to these components end up pointing to the new standalone variable.
This allows us to use the fields of the hlsl_ir_var (in particular, the register allocation) for each object component, which solves the problem.
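A toy model of this pass, to make the mechanism concrete. It assumes a made-up IR in which every deref is just a string, so `promote_component()` and `copy_prop()` are hypothetical stand-ins for the real pass and for vkd3d's copy propagation; dead-code elimination is not modelled. The standalone variable is written with quotes, as in "pou.tex[0]", to distinguish it from the deref pou.tex[0].

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy IR (hypothetical, not vkd3d's): a deref is a string such as
 * "pou.tex[0]"; an instruction stores a whole variable into a deref or loads
 * from a deref. */
enum op_kind { OP_STORE, OP_LOAD };

struct instr
{
    enum op_kind op;
    char deref[32]; /* store destination or load source */
    char src[32];   /* for stores: name of the variable being stored */
};

#define MAX_INSTRS 16

struct program
{
    struct instr instrs[MAX_INSTRS];
    size_t count;
};

/* Prepend "deref = <standalone variable>" for one object component. */
void promote_component(struct program *p, const char *deref)
{
    assert(p->count < MAX_INSTRS);
    memmove(&p->instrs[1], &p->instrs[0], p->count * sizeof(p->instrs[0]));
    p->instrs[0].op = OP_STORE;
    snprintf(p->instrs[0].deref, sizeof(p->instrs[0].deref), "%s", deref);
    snprintf(p->instrs[0].src, sizeof(p->instrs[0].src), "\"%s\"", deref);
    ++p->count;
}

/* Toy copy propagation: a load from a deref is redirected to the variable
 * stored into that deref by the nearest preceding store. */
void copy_prop(struct program *p)
{
    for (size_t i = 0; i < p->count; ++i)
    {
        if (p->instrs[i].op != OP_LOAD)
            continue;
        for (size_t j = i; j-- > 0;)
        {
            if (p->instrs[j].op == OP_STORE && !strcmp(p->instrs[j].deref, p->instrs[i].deref))
            {
                strcpy(p->instrs[i].deref, p->instrs[j].src);
                break;
            }
        }
    }
}
```

After promoting pou.tex[0] and pou.tex[1], a load of pou.tex[1] reads the standalone "pou.tex[1]" variable, while a load of the numeric field pou.foo is left alone.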
While that strategy works, I then realized that we cannot use it when targeting SM1 profiles, because we wouldn't be replicating the CTAB, which doesn't introduce a new variable for each component of a multi-dimensional object array but just increases the variable's register size. SM4, on the other hand, does seem to do this, listing each sampler and texture as its own variable in the RDEF block.
My conclusion, then, is that it is a good idea to separate each object component into its own variable in SM4 (as long as they are named with the correct subscripts, like "foo.tex[3]"). And while we can't do that in SM1, the good news is that SM1 doesn't allow objects as components of larger types except within (possibly multi-dimensional) arrays.
Samplers and textures cannot be components of structs in SM1 profiles, so (possibly multi-dimensional) object arrays are the only case we have to support in hlsl_sm1.c.
This approach also has the benefit that all of the component register allocation information should now be representable using the fields in the hlsl_ir_var struct and the register size data in hlsl_type [1], since each variable only has to care about a single register type. Storing register allocation data per component should not be necessary.
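The SM1 restriction can be checked with a recursive walk over the type. This sketch uses a deliberately simplified, hypothetical type representation rather than vkd3d's hlsl_type: objects are accepted on their own or inside nested arrays, and rejected anywhere inside a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified type representation. */
enum type_kind { TYPE_NUMERIC, TYPE_OBJECT, TYPE_ARRAY, TYPE_STRUCT };

struct type
{
    enum type_kind kind;
    const struct type *element;  /* TYPE_ARRAY */
    const struct type **fields;  /* TYPE_STRUCT */
    size_t field_count;          /* TYPE_STRUCT */
};

bool type_contains_object(const struct type *t)
{
    switch (t->kind)
    {
        case TYPE_NUMERIC:
            return false;
        case TYPE_OBJECT:
            return true;
        case TYPE_ARRAY:
            return type_contains_object(t->element);
        case TYPE_STRUCT:
            for (size_t i = 0; i < t->field_count; ++i)
            {
                if (type_contains_object(t->fields[i]))
                    return true;
            }
            return false;
    }
    assert(0);
    return false;
}

/* Objects may appear alone or inside (possibly nested) arrays, but never
 * inside a struct at any depth. */
bool sm1_type_is_valid(const struct type *t)
{
    switch (t->kind)
    {
        case TYPE_NUMERIC:
        case TYPE_OBJECT:
            return true;
        case TYPE_ARRAY:
            return sm1_type_is_valid(t->element);
        case TYPE_STRUCT:
            return !type_contains_object(t);
    }
    assert(0);
    return false;
}
```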
So I am implementing a series of patches to ensure that:
* When targeting SM4 profiles, all object components are separated into their own variables.
* When targeting SM1 profiles, objects cannot be declared as components of structs, so they will either be a standalone variable or belong to a (possibly multi-dimensional) array of elements of the same type, as declared in the shader.
* SM1 resource loads work.
My plan for this series, in terms of patches is:
* Parse the tex3D() intrinsic. (already done by zf)
* Parse the tex2D() intrinsic. (already done by zf)
* Validate that objects are not components of structs in SM1.
* Properly allocate registers for object arrays in SM1.
* Infer sampler register dimensions and write declarations in SM1.
* Write resource loads in SM1.
* Separate objects as standalone variables in SM4.
* Lower combined samplers to separate sampler and texture objects for SM4. (already done by zf)
* Lower separate sampler and texture objects to combined samplers for SM1. (already done by zf)
It remains to be checked whether object arrays are handled properly by the last two patches [2], or whether these patches need to be modified accordingly.
Best regards, Francisco.
---
[1] I know we intend to remove this field later, when we move register allocation to each sm*_write.c, but I think we probably want to replace those precomputed register offset fields with equivalent SM-specific functions that retrieve the register offset for each component.
[2] In SM1 profiles, when there are texture arrays and they are used together with samplers, only one variable is created (not one for each pair of components).
This is allowed:
```
Texture2D tex[3][2];
sampler sam;

float4 main() : SV_TARGET
{
    return tex[0][1].Sample(sam, float2(1, 2)) + tex[1][0].Sample(sam, float2(1, 2));
}
```
This is allowed (but "not yet implemented" in fxc 9 and 10):
```
Texture2D tex;
sampler sam[2][3];

float4 main() : SV_TARGET
{
    return tex.Sample(sam[0][2], float2(1, 2)) + tex.Sample(sam[1][0], float2(1, 2));
}
```
This is not allowed:
```
Texture2D tex[3];
sampler sam[4];

float4 PSMain() : SV_TARGET
{
    return tex[1].Sample(sam[1], float2(1, 2)) + tex[2].Sample(sam[0], float2(1, 2));
}
```
It gives:
```
error X4581: Cannot use texture arrays on DX9 targets with multiple samplers.
```
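A guess at the rule behind X4581 that is consistent with the three examples above, sketched with a hypothetical `sm1_combination_allowed()`; it has not been checked against fxc beyond these cases. A texture array may only be combined with a single sampler component, while a single texture may be combined with any number of sampler components. Indices are flattened row-major.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One Sample() call: which flattened texture and sampler components it uses. */
struct sample_use
{
    unsigned int texture_index;
    unsigned int sampler_index;
};

/* Guessed rule: a texture array may only be used with one sampler component. */
bool sm1_combination_allowed(unsigned int texture_count,
        const struct sample_use *uses, size_t use_count)
{
    assert(!use_count || uses);

    if (texture_count <= 1)
        return true;
    for (size_t i = 1; i < use_count; ++i)
    {
        if (uses[i].sampler_index != uses[0].sampler_index)
            return false;
    }
    return true;
}
```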
On 8/30/22 18:05, Francisco Casas wrote:
- When targeting SM4 profiles, all object components are separated into their own variables.
- When targeting SM1 profiles, objects cannot be declared as components of structs, so they will either be a standalone variable or belong to a (possibly multi-dimensional) array of elements of the same type, as declared in the shader.
- SM1 resource loads work.
I don't think I understand the point of splitting uniforms for sm4 but not sm1. If we need to determine (and potentially store) the actual type of a uniform per-component anyway, why would we need to do that differently for structs and arrays?
On 31-08-22 14:12, Zebediah Figura wrote:
Well, first, I don't think we should split uniforms completely, only separate the object components as standalone variables.
So, if we have, say
---
struct
{
    float4 foo;
    Texture2D tex[2];
    float4 bar;
} pou;
---
We would effectively end up with 3 variables:
The original:
---
struct
{
    float4 foo;
    Texture2D tex[2];
    float4 bar;
} pou;
---
and:
---
Texture2D "pou.tex[0]";
Texture2D "pou.tex[1]";
---
Because we would be adding a store from the "pou.tex[0]" variable to pou.tex[0], and from the "pou.tex[1]" variable to pou.tex[1], copy-prop should replace all derefs to the Texture components with derefs to these new variables.
The pou variable no longer has to worry about the register allocation of its Texture fields, because the new variables do that.
Generalizing, variables should no longer care about their object components, unless they are objects themselves.
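The standalone variables are named after the component's path, e.g. "pou.tex[1]", or "sam[2][1]" for a component of a two-dimensional array. A minimal sketch of building such a name from a flat, row-major component index, using a hypothetical helper `component_name()` that is not part of vkd3d:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_DIMS 8

/* Write the name of component 'flat' (row-major) of the array 'base' with
 * dimensions dims[0..dim_count-1] into 'buf', e.g. ("sam", {3, 2}, 5) gives
 * "sam[2][1]". Returns false if the name does not fit. */
bool component_name(char *buf, size_t size, const char *base,
        const unsigned int *dims, size_t dim_count, unsigned int flat)
{
    unsigned int indices[MAX_DIMS];
    int len;

    assert(dim_count <= MAX_DIMS);
    /* Peel the indices off starting from the innermost (last) dimension. */
    for (size_t i = dim_count; i-- > 0;)
    {
        indices[i] = flat % dims[i];
        flat /= dims[i];
    }
    assert(!flat); /* 'flat' must be below the total component count */

    len = snprintf(buf, size, "%s", base);
    for (size_t i = 0; i < dim_count; ++i)
    {
        if (len < 0 || (size_t)len >= size)
            return false;
        len += snprintf(buf + len, size - len, "[%u]", indices[i]);
    }
    return len >= 0 && (size_t)len < size;
}
```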
I think something similar happens under the hood in SM4; consider the native compiler's output for the following shader:
---
struct
{
    float4 foo;
    Texture2D tex[2];
    float4 bar;
} pou;

float4 main() : SV_TARGET
{
    return pou.bar + pou.tex[0].Load(int3(1, 2, 3)) + pou.tex[1].Load(int3(1, 2, 3));
}
---
---
// Buffer Definitions:
//
// cbuffer $Globals
// {
//
//   struct <unnamed>
//   {
//
//       float4 foo;                    // Offset:    0
//       float4 bar;                    // Offset:   16
//
//   } pou;                             // Offset:    0 Size:    32
//                                      // Textures: t0-t1
//
// }
//
//
// Resource Bindings:
//
// Name                  Type       Format  Dim      HLSL Bind Count
// --------------------- ---------- ------- -------- --------- ------
// pou.tex[0]            texture    float4  2d       t0        1
// pou.tex[1]            texture    float4  2d       t1        1
// $Globals              cbuffer    NA      NA       cb0       1
---
Object components are listed separately in the resource bindings, and they are effectively removed (or ignored?) in the definition of pou.
Now, regarding SM1: it is not possible to declare objects inside other types; it is only possible to create (possibly multi-dimensional) arrays of a single object type:
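The layout in that listing can be reproduced by skipping object fields when placing numeric fields in the constant buffer and giving the textures their own contiguous t# range. A minimal sketch, assuming a hypothetical flattened field list and only float4 numeric fields; packing rules for smaller types, and the allocation of unused textures, are ignored:

```c
#include <assert.h>
#include <stddef.h>

enum field_kind { FIELD_FLOAT4, FIELD_TEXTURE };

/* Hypothetical flattened field description. */
struct field
{
    const char *name;
    enum field_kind kind;
    unsigned int count;    /* array size; 1 for non-arrays */
    unsigned int location; /* out: byte offset, or first t# register */
};

/* Place float4 fields at 16-byte steps in the cbuffer and give each texture
 * field a contiguous t# range. Returns the cbuffer size in bytes. */
unsigned int layout_fields(struct field *fields, size_t count, unsigned int first_texture)
{
    unsigned int offset = 0, texture = first_texture;

    for (size_t i = 0; i < count; ++i)
    {
        assert(fields[i].count > 0);
        if (fields[i].kind == FIELD_FLOAT4)
        {
            fields[i].location = offset;
            offset += 16 * fields[i].count;
        }
        else
        {
            fields[i].location = texture;
            texture += fields[i].count;
        }
    }
    return offset;
}
```

For pou this yields foo at offset 0, bar at offset 16, a 32-byte struct, and tex in t0-t1, as in the listing.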
---
sampler sam[3][2];

float4 main() : SV_TARGET
{
    return tex2D(sam[2][1], float2(1, 2));
}
---
but they don't appear as different variables in the CTAB section:
---
// Parameters:
//
//   sampler2D sam[6];
//
//
// Registers:
//
//   Name         Reg   Size
//   ------------ ----- ----
//   sam          s0       6
//
---
Unlike in SM4:
---
// Resource Bindings:
//
// Name                     Type     Format  Dim    Slot Elements
// ------------------------ -------- ------- ------ ---- --------
// sam[2][1]                sampler  NA      NA     5    1
// sam[2][1]                texture  float4  2d     5    1
---
(To get this output, I used fxc 9 in compatibility mode.)
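Both listings are consistent with flattening the array in row-major order: sam[3][2] needs 3 * 2 = 6 sampler registers in SM1, and sam[2][1] is component 2 * 2 + 1 = 5, matching slot 5 in the SM4 bindings. A sketch of that arithmetic with hypothetical helper names; that sam[2][1] is read from s5 in SM1 is an assumption, since the CTAB only shows the whole range:

```c
#include <assert.h>
#include <stddef.h>

/* Registers taken by an object array in the SM1 CTAB: one per component. */
unsigned int sm1_object_reg_count(const unsigned int *dims, size_t dim_count)
{
    unsigned int count = 1;

    for (size_t i = 0; i < dim_count; ++i)
        count *= dims[i];
    return count;
}

/* Register of the component at index path 'path', flattened row-major. */
unsigned int sm1_object_reg_index(unsigned int base, const unsigned int *dims,
        const unsigned int *path, size_t dim_count)
{
    unsigned int index = 0;

    for (size_t i = 0; i < dim_count; ++i)
    {
        assert(path[i] < dims[i]);
        index = index * dims[i] + path[i];
    }
    return base + index;
}
```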
So, in summary, my idea is to:
- For SM4, separate object components as standalone variables that take care of the register allocation of these individual objects.
- For SM1, just support (possibly multi-dimensional) object arrays.
With this, for both SM1 and SM4, each variable should only care about the allocation of a single type of register, so we can use the register offsets as we do now. In the future we would probably want to remove the "offset" field from the derefs, and make each hlsl_sm*.c compute the register offsets from the derefs in their index path form.
Sorry for the late reply; I had to think about this for a while...
On 8/31/22 18:52, Francisco Casas wrote:
Generalizing, variables should no longer care about their object components, unless they are objects themselves.
But this goes both ways. The struct variables still contain multiple object types, and they'll need to be able to allocate one while ignoring the other.
This would make more sense if we were to remove the object types from the struct afterward. It's not perfectly clear to me that we want to do that vs. just keeping them there. A more interesting question is how this compares to cases in sm1 where a variable gets allocated to multiple register sets (float/int/bool).
Object components are listed separately in the resource bindings, and they are effectively removed (or ignored?) in the definition of pou.
This is a more interesting reason. There are two caveats, though, one easy to handwave and one harder.
Firstly, it's worth noting that this doesn't happen with 4.0 or 4.1, only 5.0 (and 5.1 is a different beast anyway). Putting objects in a struct is illegal in 4.x, even if you're not mixing them with numeric types, but you can see a similar difference if you just declare a global array of objects.
Secondly, and more importantly, the original struct actually needs to retain some amount of information about the objects it contained, including which ones were actually used. Consider the following shader:
struct
{
    Texture2D one;
    Texture2D unused;
    float3 coords;
} apple;

struct
{
    // Texture2D unused;
    Texture2D three;
    float3 coords;
} banana;

float4 main() : SV_TARGET
{
    return apple.one.Load(apple.coords) + banana.three.Load(banana.coords);
}
If you move "unused" to the "banana" struct (as indicated by the comment), the shader itself stays the same, as does the binding table (in the RDEF section), but the constant buffers do not. The difference can be seen in the disassembly output, and corresponds to the "StartTexture" and "TextureSize" fields of D3D11_SHADER_VARIABLE_DESC.
This means, notably, that I don't think we can just remove object variables from their structs.
With this, for both SM1 and SM4, each variable should only care about the allocation of a single type of register, so we can use the register offsets as we do now.
Maybe, but on the other hand, could we just have multiple registers per struct variable? I.e. one per namespace (numeric, sampler, texture, uav...)
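One way this could look, as a sketch with made-up names (not vkd3d's actual structures): each variable records how many registers it needs in each namespace and gets an independent range in each, so a struct like pou can hold both a numeric range and a texture range. Numeric sizes are counted in 4-component registers here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register namespaces. */
enum regset
{
    REGSET_NUMERIC,
    REGSET_SAMPLER,
    REGSET_TEXTURE,
    REGSET_UAV,
    REGSET_COUNT,
};

struct reg
{
    bool allocated;
    unsigned int index;
    unsigned int count;
};

/* A variable keeps one register range per namespace instead of a single one. */
struct var
{
    const char *name;
    unsigned int reg_size[REGSET_COUNT]; /* registers needed in each set */
    struct reg regs[REGSET_COUNT];
};

/* Allocate every namespace the variable needs from independent counters. */
void allocate_var(struct var *var, unsigned int next[REGSET_COUNT])
{
    assert(var && next);

    for (int set = 0; set < REGSET_COUNT; ++set)
    {
        if (!var->reg_size[set])
            continue;
        var->regs[set].allocated = true;
        var->regs[set].index = next[set];
        var->regs[set].count = var->reg_size[set];
        next[set] += var->reg_size[set];
    }
}
```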
In the future we would probably want to remove the "offset" field from the derefs, and make each hlsl_sm*.c compute the register offsets from the derefs in their index path form.
We definitely want to get rid of "offset" from the derefs.
Hello,
Sorry for the late reply too; I had to bang my head against the wall a couple of times to understand this better.
On 08-09-22 00:37, Zebediah Figura wrote:
This means, notably, that I don't think we can just remove object variables from their structs.
Moreover, the resource bindings actually change too: if "unused" is added to the banana struct, banana.three is bound to 't3' instead of 't2'.
So far I have noticed that if even a single texture within a struct/array is used, all of the textures within that struct/array are allocated.
Sampler components are a little different; they are allocated from the first one to the last one used.
In both cases it is indeed necessary to keep track of the usage of object components; this is also a requirement for inferring SM1 sampler dimensions.
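The two behaviours described above can be written down as a small rule over a per-component usage mask. This is a sketch of the observations only, with a hypothetical helper name; whether fxc behaves this way in all cases is not established.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Registers to allocate for the object components (of one object type) of a
 * single variable, given which components are used. Textures: any use
 * allocates all of them. Samplers: component 0 up to the last one used. */
unsigned int objects_to_allocate(bool is_sampler, const bool *used, size_t count)
{
    size_t last_used = count; /* 'count' means "none used" */

    assert(used || !count);
    for (size_t i = 0; i < count; ++i)
    {
        if (used[i])
            last_used = i;
    }
    if (last_used == count)
        return 0;
    return is_sampler ? (unsigned int)last_used + 1 : (unsigned int)count;
}
```

For the apple/banana example, apple's mask {used, unused} allocates t0-t1, so banana.three lands in t2, or in t3 once "unused" is added in front of it.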
With this, for both SM1 and SM4, each variable should only care about the allocation of a single type of register, so we can use the register offsets as we do now.
Maybe, but on the other hand, could we just have multiple registers per struct variable? I.e. one per namespace (numeric, sampler, texture, uav...)
Well, after trying many things, and checking your example above, I realized that we do indeed have to allow variables to span multiple register spaces if we want to correctly imitate the register indexes assigned to each object by the native compiler.
After going with the approach of separating object components into different variables, so that each variable is allocated in a single register space, my hope was to be able to determine which object components needed to be allocated before the point where they were split into separate variables. That proved to be quite difficult without having to run copy propagation and other passes twice. Also, keeping track of the usage of the struct's sampler components and then transferring that information to the new variables ended up a little ugly.
At least now I am more familiar with register allocation.
I have spent some days implementing a proposal for allocating variables across multiple register spaces. Sadly, this requires changing a lot of code, though I am trying to do it cleanly. I am also translating my SM1 resource loads and the long list of related patches to this approach, to see whether it fits them well.
In the future we would probably want to remove the "offset" field from the derefs, and make each hlsl_sm*.c compute the register offsets from the derefs in their index path form.
We definitely want to get rid of "offset" from the derefs.
Regarding this, after some thought I am actually not sure. Consider a complex non-constant offset dereference such as 'foo[n][m]' in:
---
unsigned int n, m;

float4 main() : sv_target
{
    float foo[3][4] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};

    return foo[n][m];
}
---
So far I haven't seen these being represented with a complex expression within the index of an instruction's register (I doubt that register indexes can be more complex than a reference to another register plus a constant offset).
Instead, a single temporary register is computed and then used as the index (r0 in this case):
---
ishl r0.x, cb0[0].x, l(2)
iadd r0.x, r0.x, cb0[0].y
mov r0.x, x0[r0.x + 0].x
mov o0.xyzw, r0.xxxx
---
So it seems reasonable to me to keep the deref offset instruction for this purpose.
I suggest we revisit this once I send this patch series as a merge request.
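For reference, the arithmetic those instructions perform is just the row-major offset n * 4 + m, with the multiplication by the inner dimension done as a left shift (n and m being cb0[0].x and cb0[0].y). A small C model of the sequence, with a hypothetical function name, that can be compared against ordinary array indexing:

```c
#include <assert.h>

/* Model of the fxc sequence above for foo[n][m], with foo declared as [3][4];
 * returns the register index used to address the indexable temp x0. */
unsigned int sm4_dynamic_index(unsigned int n, unsigned int m)
{
    unsigned int r0;

    assert(n < 3 && m < 4);
    r0 = n << 2; /* ishl r0.x, cb0[0].x, l(2): multiply by the inner dimension, 4 */
    r0 = r0 + m; /* iadd r0.x, r0.x, cb0[0].y */
    return r0;   /* used as x0[r0.x + 0] */
}
```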
Best regards, Francisco
On 9/20/22 09:43, Francisco Casas wrote:
ishl r0.x, cb0[0].x, l(2)
iadd r0.x, r0.x, cb0[0].y
mov r0.x, x0[r0.x + 0].x
mov o0.xyzw, r0.xxxx
So it seems reasonable to me to keep the deref offset instruction for this purpose.
I suggest we revisit this once I send this patch series as a merge request.
Well, when generating sm1 or sm4 IR we are going to need to be able to construct arbitrarily complex offsets, possibly involving new registers. In a sense we're going to have to generate the multiplication and addition instructions at translation time, instead of earlier as now. In essence I expect write_sm4_load() to generate almost exactly those instructions (modulo ishl vs imul or copy-prop optimizations...)
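A sketch of generating that sequence at translation time directly from an index path, assuming every index level is a run-time value already in a register and that r0.x is free as a scratch register; `emit_offset()` is hypothetical and is not the actual write_sm4_load(). It uses ishl when an inner dimension is a power of two and imul otherwise (written in the SM4 `imul dest_hi, dest_lo, src0, src1` form, discarding the high half via null); constant folding and register allocation are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append one instruction line to 'out'. */
static void emit(char *out, size_t size, const char *line)
{
    size_t len = strlen(out);

    assert(len + strlen(line) + 1 < size);
    snprintf(out + len, size - len, "%s\n", line);
}

/* Emit instructions computing, into r0.x, the row-major register offset of
 * the index path regs[0..count-1] into an array with dimensions dims[]. */
void emit_offset(char *out, size_t size, const unsigned int *dims,
        const char *const *regs, size_t count)
{
    char line[64];

    assert(count > 0);
    out[0] = '\0';
    if (count == 1)
    {
        snprintf(line, sizeof(line), "mov r0.x, %s", regs[0]);
        emit(out, size, line);
        return;
    }
    for (size_t i = 1; i < count; ++i)
    {
        const char *src = i == 1 ? regs[0] : "r0.x";
        unsigned int dim = dims[i], shift = 0;

        if (dim && !(dim & (dim - 1)))
        {
            while ((1u << shift) != dim)
                ++shift;
            snprintf(line, sizeof(line), "ishl r0.x, %s, l(%u)", src, shift);
        }
        else
        {
            snprintf(line, sizeof(line), "imul null, r0.x, %s, l(%u)", src, dim);
        }
        emit(out, size, line);
        snprintf(line, sizeof(line), "iadd r0.x, r0.x, %s", regs[i]);
        emit(out, size, line);
    }
}
```

For foo[3][4] indexed by cb0[0].x and cb0[0].y this reproduces the first two instructions of the fxc output quoted earlier in the thread; for a [2][3][5] array it computes (a * 3 + b) * 5 + c.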