If an hlsl_ir_load loads a variable whose components are stored by different instructions, copy propagation doesn't replace it.
But if all those instructions are constants (which is currently the case for value constructors), the load can be replaced with a constant value, which is what the first patch of this series does.
For instance, this shader:
```
sampler s;
Texture2D t;

float4 main() : sv_target
{
    return t.Gather(s, float2(0.6, 0.6), int2(0, 0));
}
```
results in the following IR before applying the patch:
```
float | 6.00000024e-01
float | 6.00000024e-01
uint | 0
| = (<constructor-2>[@4].x @2)
uint | 1
| = (<constructor-2>[@6].x @3)
float2 | <constructor-2>
int | 0
int | 0
uint | 0
| = (<constructor-5>[@11].x @9)
uint | 1
| = (<constructor-5>[@13].x @10)
int2 | <constructor-5>
float4 | gather_red(resource = t, sampler = s, coords = @8, offset = @15)
| return
| = (<output-sv_target0> @16)
```
and this IR afterwards:
```
float2 | {6.00000024e-01 6.00000024e-01 }
int2 | {0 0 }
float4 | gather_red(resource = t, sampler = s, coords = @2, offset = @3)
| return
| = (<output-sv_target0> @4)
```
This is required to write texel_offsets as aoffimmi modifiers in the sm4 backend, since that backend expects the texel_offset arguments to be hlsl_ir_constant.

This series also:
* Allows Gather() methods to use aoffimmi modifiers instead of an additional source register (the only way allowed for shader model 4.1), when possible.
* Adds support for texel_offsets in the Load() method via aoffimmi modifiers (the only allowed method there).
From: Francisco Casas <fcasas@codeweavers.com>
If an hlsl_ir_load loads a variable whose components are stored by different instructions, copy propagation doesn't replace it.
But if all those instructions are constants (which is currently the case for value constructors), the load can be replaced with a constant value. This is expected by some other instructions, e.g. texel_offsets when using aoffimmi modifiers.
For instance, this shader:
```
sampler s;
Texture2D t;

float4 main() : sv_target
{
    return t.Gather(s, float2(0.6, 0.6), int2(0, 0));
}
```
results in the following IR before applying the patch:
```
float | 6.00000024e-01
float | 6.00000024e-01
uint | 0
| = (<constructor-2>[@4].x @2)
uint | 1
| = (<constructor-2>[@6].x @3)
float2 | <constructor-2>
int | 0
int | 0
uint | 0
| = (<constructor-5>[@11].x @9)
uint | 1
| = (<constructor-5>[@13].x @10)
int2 | <constructor-5>
float4 | gather_red(resource = t, sampler = s, coords = @8, offset = @15)
| return
| = (<output-sv_target0> @16)
```
and this IR afterwards:
```
float2 | {6.00000024e-01 6.00000024e-01 }
int2 | {0 0 }
float4 | gather_red(resource = t, sampler = s, coords = @2, offset = @3)
| return
| = (<output-sv_target0> @4)
```
---
 libs/vkd3d-shader/hlsl_codegen.c           | 42 ++++++++++++++++++++++
 tests/hlsl-initializer-objects.shader_test |  8 ++---
 tests/object-references.shader_test        |  6 ++--
 tests/sampler-offset.shader_test           | 12 +++----
 tests/shader_runner_d3d12.c                |  2 +-
 5 files changed, 56 insertions(+), 14 deletions(-)
diff --git a/libs/vkd3d-shader/hlsl_codegen.c b/libs/vkd3d-shader/hlsl_codegen.c
index 58d44c4d..92da9c49 100644
--- a/libs/vkd3d-shader/hlsl_codegen.c
+++ b/libs/vkd3d-shader/hlsl_codegen.c
@@ -718,6 +718,41 @@ static struct hlsl_ir_node *copy_propagation_compute_replacement(struct hlsl_ctx
     return instr;
 }
 
+static struct hlsl_ir_node *copy_propagation_compute_load_constant_replacement(struct hlsl_ctx *ctx,
+        const struct copy_propagation_state *state, const struct hlsl_ir_load *load)
+{
+    const struct hlsl_ir_var *var = load->src.var;
+    union hlsl_constant_value values[4] = {0};
+    struct hlsl_ir_constant *cons;
+    unsigned int start, count, i;
+
+    if (load->node.data_type->type != HLSL_CLASS_SCALAR && load->node.data_type->type != HLSL_CLASS_VECTOR)
+        return NULL;
+
+    if (!hlsl_component_index_range_from_deref(ctx, &load->src, &start, &count))
+        return NULL;
+
+    for (i = 0; i < count; ++i)
+    {
+        struct copy_propagation_value *value = copy_propagation_get_value(state, var, start + i);
+
+        if (!value || value->node->type != HLSL_IR_CONSTANT)
+            return NULL;
+
+        values[i] = hlsl_ir_constant(value->node)->value[value->component];
+    }
+
+    if (!(cons = hlsl_new_constant(ctx, load->node.data_type, &load->node.loc)))
+        return NULL;
+    cons->value[0] = values[0];
+    cons->value[1] = values[1];
+    cons->value[2] = values[2];
+    cons->value[3] = values[3];
+
+    TRACE("Load from %s[%u-%u] turned into a constant %p.\n", var->name, start, start + count, cons);
+    return &cons->node;
+}
+
 static bool copy_propagation_transform_load(struct hlsl_ctx *ctx,
         struct hlsl_ir_load *load, struct copy_propagation_state *state)
 {
@@ -746,6 +781,13 @@ static bool copy_propagation_transform_load(struct hlsl_ctx *ctx,
         return false;
     }
 
+    if ((new_instr = copy_propagation_compute_load_constant_replacement(ctx, state, load)))
+    {
+        list_add_before(&instr->entry, &new_instr->entry);
+        hlsl_replace_node(instr, new_instr);
+        return true;
+    }
+
     if (!(new_instr = copy_propagation_compute_replacement(ctx, state, &load->src, &swizzle)))
         return false;
diff --git a/tests/hlsl-initializer-objects.shader_test b/tests/hlsl-initializer-objects.shader_test
index d40ede46..d9c0bc91 100644
--- a/tests/hlsl-initializer-objects.shader_test
+++ b/tests/hlsl-initializer-objects.shader_test
@@ -29,7 +29,7 @@ draw quad
 probe all rgba (0.2, 0.2, 0.2, 0.1)
 
 
-[pixel shader todo]
+[pixel shader]
 Texture2D tex;
 
 struct foo
@@ -48,11 +48,11 @@ float4 main() : sv_target
 }
 
 [test]
-todo draw quad
-todo probe all rgba (31.1, 41.1, 51.1, 61.1) 1
+draw quad
+probe all rgba (31.1, 41.1, 51.1, 61.1) 1
 
 
-[pixel shader todo]
+[pixel shader]
 Texture2D tex1;
 Texture2D tex2;
diff --git a/tests/object-references.shader_test b/tests/object-references.shader_test
index a33bd14a..dd50f7cf 100644
--- a/tests/object-references.shader_test
+++ b/tests/object-references.shader_test
@@ -120,7 +120,7 @@ float4 main() : sv_target
 }
 
 
-[pixel shader todo]
+[pixel shader]
 Texture2D tex;
 uniform float f;
 
@@ -141,5 +141,5 @@ float4 main() : sv_target
 
 [test]
 uniform 0 float 10.0
-todo draw quad
-todo probe (0, 0) rgba (11.0, 12.0, 13.0, 11.0)
+draw quad
+probe (0, 0) rgba (11.0, 12.0, 13.0, 11.0)
diff --git a/tests/sampler-offset.shader_test b/tests/sampler-offset.shader_test
index 2aa8f9b3..6f8357df 100644
--- a/tests/sampler-offset.shader_test
+++ b/tests/sampler-offset.shader_test
@@ -12,7 +12,7 @@ size (3, 3)
 0.0 0.2 0.0 0.4 0.1 0.2 0.5 0.0 0.2 0.2 0.0 0.4
 
-[pixel shader todo]
+[pixel shader]
 sampler s;
 Texture2D t;
 
@@ -22,11 +22,11 @@ float4 main() : sv_target
 }
 
 [test]
-todo draw quad
+draw quad
 probe all rgba (0.1, 0.2, 0.5, 0.0)
 
 
-[pixel shader todo]
+[pixel shader]
 sampler s;
 Texture2D t;
 
@@ -36,11 +36,11 @@ float4 main() : sv_target
 }
 
 [test]
-todo draw quad
+draw quad
 probe all rgba (0.2, 0.2, 0.0, 0.4)
 
 
-[pixel shader todo]
+[pixel shader]
 sampler s;
 Texture2D t;
 
@@ -50,5 +50,5 @@ float4 main() : sv_target
 }
 
 [test]
-todo draw quad
+draw quad
 probe all rgba (0.0, 0.2, 0.0, 0.4)
diff --git a/tests/shader_runner_d3d12.c b/tests/shader_runner_d3d12.c
index bb4d9c5a..bd94b4c9 100644
--- a/tests/shader_runner_d3d12.c
+++ b/tests/shader_runner_d3d12.c
@@ -167,7 +167,7 @@ static ID3D12RootSignature *d3d12_runner_create_root_signature(struct d3d12_shad
         ID3D12GraphicsCommandList *command_list, unsigned int *uniform_index)
 {
     D3D12_ROOT_SIGNATURE_DESC root_signature_desc = {0};
-    D3D12_ROOT_PARAMETER root_params[3], *root_param;
+    D3D12_ROOT_PARAMETER root_params[4], *root_param;
     D3D12_STATIC_SAMPLER_DESC static_samplers[1];
     ID3D12RootSignature *root_signature;
     HRESULT hr;
From: Francisco Casas <fcasas@codeweavers.com>

---
 libs/vkd3d-shader/hlsl_sm4.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/libs/vkd3d-shader/hlsl_sm4.c b/libs/vkd3d-shader/hlsl_sm4.c
index ae5bb1ac..4059d618 100644
--- a/libs/vkd3d-shader/hlsl_sm4.c
+++ b/libs/vkd3d-shader/hlsl_sm4.c
@@ -2110,11 +2110,19 @@ static void write_sm4_gather(struct hlsl_ctx *ctx, struct vkd3d_bytecode_buffer
     sm4_src_from_node(&instr.srcs[instr.src_count++], coords, VKD3DSP_WRITEMASK_ALL);
 
-    /* FIXME: Use an aoffimmi modifier if possible. */
     if (texel_offset)
     {
-        instr.opcode = VKD3D_SM5_OP_GATHER4_PO;
-        sm4_src_from_node(&instr.srcs[instr.src_count++], texel_offset, VKD3DSP_WRITEMASK_ALL);
+        if (!encode_texel_offset_as_aoffimmi(&instr, texel_offset))
+        {
+            if (ctx->profile->major_version < 5)
+            {
+                hlsl_error(ctx, &texel_offset->loc, VKD3D_SHADER_ERROR_HLSL_INVALID_TEXEL_OFFSET,
+                        "Offset must resolve to integer literal in the range -8 to 7 for profiles < 5.");
+                return;
+            }
+            instr.opcode = VKD3D_SM5_OP_GATHER4_PO;
+            sm4_src_from_node(&instr.srcs[instr.src_count++], texel_offset, VKD3DSP_WRITEMASK_ALL);
+        }
     }
 
     sm4_src_from_deref(ctx, &instr.srcs[instr.src_count++], resource, resource_type, instr.dsts[0].writemask);
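Both this hunk and patch 4/4 call encode_texel_offset_as_aoffimmi(), which isn't shown in the quoted diffs. As a rough sketch of what such a helper has to do (the modifier struct, its field names, and the VKD3D_SM4_MODIFIER_AOFFIMMI enumerant are assumptions here, not the actual definitions from the series):

```
static bool encode_texel_offset_as_aoffimmi(struct sm4_instruction *instr,
        const struct hlsl_ir_node *texel_offset)
{
    struct sm4_instruction_modifier modifier; /* hypothetical modifier type */
    const struct hlsl_ir_constant *offset;
    int u, v = 0, w = 0;

    /* The offset can only be encoded if copy-prop folded it into a single
     * hlsl_ir_constant, which is what patch 1/4 enables. */
    if (!texel_offset || texel_offset->type != HLSL_IR_CONSTANT)
        return false;
    offset = hlsl_ir_constant(texel_offset);

    u = offset->value[0].i;
    if (offset->node.data_type->dimx > 1)
        v = offset->value[1].i;
    if (offset->node.data_type->dimx > 2)
        w = offset->value[2].i;

    /* aoffimmi components are signed 4-bit immediates, i.e. -8 to 7. */
    if (u < -8 || u > 7 || v < -8 || v > 7 || w < -8 || w > 7)
        return false;

    modifier.type = VKD3D_SM4_MODIFIER_AOFFIMMI; /* hypothetical enumerant */
    modifier.u.aoffimmi.u = u;
    modifier.u.aoffimmi.v = v;
    modifier.u.aoffimmi.w = w;
    instr->modifiers[instr->modifier_count++] = modifier;
    return true;
}
```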
From: Francisco Casas <fcasas@codeweavers.com>

---
 Makefile.am                           |  1 +
 tests/texture-load-offset.shader_test | 51 +++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)
 create mode 100644 tests/texture-load-offset.shader_test

diff --git a/Makefile.am b/Makefile.am
index 0db71d67..f6c90911 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -145,6 +145,7 @@ vkd3d_shader_tests = \
 	tests/swizzle-6.shader_test \
 	tests/swizzle-7.shader_test \
 	tests/texture-load.shader_test \
+	tests/texture-load-offset.shader_test \
 	tests/texture-load-typed.shader_test \
 	tests/trigonometry.shader_test \
 	tests/uav.shader_test \
diff --git a/tests/texture-load-offset.shader_test b/tests/texture-load-offset.shader_test
new file mode 100644
index 00000000..ab233c58
--- /dev/null
+++ b/tests/texture-load-offset.shader_test
@@ -0,0 +1,51 @@
+[require]
+shader model >= 4.0
+
+[texture 0]
+size (3, 3)
+0 0 0 1 1 0 0 1 2 0 0 1
+0 1 0 1 1 1 0 1 2 1 0 1
+0 2 0 1 1 2 0 1 2 2 0 1
+
+
+[pixel shader]
+Texture2D t;
+
+float4 main(float4 pos : sv_position) : sv_target
+{
+    return t.Load(int3(pos.xy, 0), int2(0, 1));
+}
+
+
+[test]
+draw quad
+todo probe (0, 0) rgba (0, 1, 0, 1)
+todo probe (1, 0) rgba (1, 1, 0, 1)
+todo probe (0, 1) rgba (0, 2, 0, 1)
+todo probe (1, 1) rgba (1, 2, 0, 1)
+
+
+[pixel shader]
+Texture2D t;
+
+float4 main(float4 pos : sv_position) : sv_target
+{
+    return t.Load(int3(pos.xy, 0), int2(-2, 0));
+}
+
+
+[test]
+draw quad
+todo probe (3, 0) rgba (1, 0, 0, 1)
+todo probe (4, 0) rgba (2, 0, 0, 1)
+todo probe (3, 1) rgba (1, 1, 0, 1)
+todo probe (4, 1) rgba (2, 1, 0, 1)
+
+
+[pixel shader fail todo]
+Texture2D t;
+
+float4 main(float4 pos : sv_position) : sv_target
+{
+    return t.Load(int3(pos.xy, 0), int2(8, 1));
+}
From: Francisco Casas <fcasas@codeweavers.com>

---
 libs/vkd3d-shader/hlsl_sm4.c          | 16 ++++++++++++++--
 tests/texture-load-offset.shader_test | 18 +++++++++---------
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/libs/vkd3d-shader/hlsl_sm4.c b/libs/vkd3d-shader/hlsl_sm4.c
index 4059d618..1595e5be 100644
--- a/libs/vkd3d-shader/hlsl_sm4.c
+++ b/libs/vkd3d-shader/hlsl_sm4.c
@@ -1418,7 +1418,8 @@ static void write_sm4_constant(struct hlsl_ctx *ctx,
 
 static void write_sm4_ld(struct hlsl_ctx *ctx, struct vkd3d_bytecode_buffer *buffer,
         const struct hlsl_type *resource_type, const struct hlsl_ir_node *dst,
-        const struct hlsl_deref *resource, const struct hlsl_ir_node *coords)
+        const struct hlsl_deref *resource, const struct hlsl_ir_node *coords,
+        const struct hlsl_ir_node *texel_offset)
 {
     bool uav = (resource_type->base_type == HLSL_TYPE_UAV);
     struct sm4_instruction instr;
@@ -1427,6 +1428,16 @@ static void write_sm4_ld(struct hlsl_ctx *ctx, struct vkd3d_bytecode_buffer *buf
     memset(&instr, 0, sizeof(instr));
     instr.opcode = uav ? VKD3D_SM5_OP_LD_UAV_TYPED : VKD3D_SM4_OP_LD;
 
+    if (texel_offset)
+    {
+        if (!encode_texel_offset_as_aoffimmi(&instr, texel_offset))
+        {
+            hlsl_error(ctx, &texel_offset->loc, VKD3D_SHADER_ERROR_HLSL_INVALID_TEXEL_OFFSET,
+                    "Offset must resolve to integer literal in the range -8 to 7.");
+            return;
+        }
+    }
+
     sm4_dst_from_node(&instr.dsts[0], dst);
     instr.dst_count = 1;
 
@@ -2179,7 +2190,8 @@ static void write_sm4_resource_load(struct hlsl_ctx *ctx,
     switch (load->load_type)
     {
         case HLSL_RESOURCE_LOAD:
-            write_sm4_ld(ctx, buffer, resource_type, &load->node, &load->resource, coords);
+            write_sm4_ld(ctx, buffer, resource_type, &load->node, &load->resource,
+                    coords, texel_offset);
             break;
 
        case HLSL_RESOURCE_SAMPLE:
diff --git a/tests/texture-load-offset.shader_test b/tests/texture-load-offset.shader_test
index ab233c58..52b6a5f9 100644
--- a/tests/texture-load-offset.shader_test
+++ b/tests/texture-load-offset.shader_test
@@ -19,10 +19,10 @@ float4 main(float4 pos : sv_position) : sv_target
 [test]
 draw quad
-todo probe (0, 0) rgba (0, 1, 0, 1)
-todo probe (1, 0) rgba (1, 1, 0, 1)
-todo probe (0, 1) rgba (0, 2, 0, 1)
-todo probe (1, 1) rgba (1, 2, 0, 1)
+probe (0, 0) rgba (0, 1, 0, 1)
+probe (1, 0) rgba (1, 1, 0, 1)
+probe (0, 1) rgba (0, 2, 0, 1)
+probe (1, 1) rgba (1, 2, 0, 1)
 
 
 [pixel shader]
@@ -36,13 +36,13 @@ float4 main(float4 pos : sv_position) : sv_target
 
 [test]
 draw quad
-todo probe (3, 0) rgba (1, 0, 0, 1)
-todo probe (4, 0) rgba (2, 0, 0, 1)
-todo probe (3, 1) rgba (1, 1, 0, 1)
-todo probe (4, 1) rgba (2, 1, 0, 1)
+probe (3, 0) rgba (1, 0, 0, 1)
+probe (4, 0) rgba (2, 0, 0, 1)
+probe (3, 1) rgba (1, 1, 0, 1)
+probe (4, 1) rgba (2, 1, 0, 1)
 
 
-[pixel shader fail todo]
+[pixel shader fail]
 Texture2D t;
 
 float4 main(float4 pos : sv_position) : sv_target
Giovanni actually sent a very similar patch to 1/4 [1] several months ago. It wasn't accepted, and I'm not sure what discussion there was about it if any (it may have been off-list).
I think that this is, broadly speaking, not a *wrong* solution—and if we care about getting texel offsets working I don't mind accepting it—but I don't think it's quite the right one either. I think there are two other passes that we can do instead that would do everything this pass does and more:
(1) generic vectorization of stores
(2) propagation of load+store sequences into multiple store instructions
Consider specifically the following shader:
```hlsl
uniform float f;

float4 main() : sv_target
{
    return float4(f, 1, 2, 3);
}
```
which generates:
```
2: float | f
3: uint | 0
4: | = (<constructor-2>[@3].x @2)
5: float | 1.00000000e+00
6: uint | 1
7: | = (<constructor-2>[@6].x @5)
8: float | 2.00000000e+00
9: uint | 2
10: | = (<constructor-2>[@9].x @8)
11: float | 3.00000000e+00
12: uint | 3
13: | = (<constructor-2>[@12].x @11)
14: float4 | <constructor-2>
15: | return
16: | = (<output-sv_target0> @14)
```
Copyprop-with-constant-loads won't help this. However, with the first pass I describe, we can simplify that into:
```
2: float | f
3: uint | 0
4: | = (<constructor-2>[@3].x @2)
5: float3 | {1.0 2.0 3.0}
6: uint | 1
7: | = (<constructor-2>[@6].xyz @5)
14: float4 | <constructor-2>
15: | return
16: | = (<output-sv_target0> @14)
```
The second pass would recognize when the rhs of a store instruction is just a load, and rewrite it as multiple store instructions. (These passes could be applied in either order, fwiw):
```
2: float | f
3: uint | 0
4: | = (<output-sv_target0>[@3].x @2)
5: float3 | {1.0 2.0 3.0}
6: uint | 1
7: | = (<output-sv_target0>[@6].xyz @5)
```
(Ignoring the return statement for now. Also, I've automatically applied DCE to these examples.)
Note that this second pass doesn't necessarily have to be limited to whole-variable stores either. (Note also that we can't do the same thing with copy-prop because a load instruction *has* to return the whole vector type; it can't be split.)
This doesn't fully replace copy-prop as it is, to be sure, but I think it replaces everything that patch 1/4 does. Copy-prop is designed to replace X-store-load sequences with X, to oversimplify. If vectorization can turn X into a single instruction, then copy-prop can get rid of the store/load and probably the whole temp variable. If vectorization *can't* do that, then we needed that temp anyway.
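To make that oversimplification concrete (hypothetical shorthand, not actual compiler output):

```
t = X;      /* store the result of some instruction X into a temp */
r = t;      /* load it back */
/* copy-prop rewrites the users of r to use X directly; the store and,
 * after DCE, the temp variable itself then go away */
```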
[1] https://www.winehq.org/pipermail/wine-devel/2022-May/215773.html
On Sat Nov 19 01:27:25 2022 +0000, Zebediah Figura wrote:

> Giovanni actually sent a very similar patch to 1/4 [1] several months ago. [...] This doesn't fully replace copy-prop as it is, to be sure, but I think it replaces everything that patch 1/4 does.
I think I described this second pass poorly. It's actually probably easier to think of it as an *extension* of copy-prop.
The idea is, if you have something like that float4(f, 1, 2, 3) sequence, you can't copy-prop loads from `<constructor-2>` with the current pass because they don't all have the same instruction as a source. So what you do instead is, any time that `<constructor-2>` is itself used as an rhs to a store, you split up that store into multiple stores. You could say one per unique source but you could also do it more simply by saying one per component (and then letting vectorization clean that up later). That way you're guaranteed to be able to copy-prop them. You increase the number of stores but (hopefully) reduce the number of intermediate variables.
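Concretely, for the float4(f, 1, 2, 3) case above, the split would look roughly like this (a sketch in the same shorthand as the earlier examples, not actual pass output):

```
/* before: a whole-vector store whose rhs is a load of <constructor-2> */
out = c;        /* c stands for <constructor-2> */

/* after: one store per component; each rhs now has a single source
 * instruction, so copy-prop can replace it */
out.x = c.x;
out.y = c.y;
out.z = c.z;
out.w = c.w;
```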
Actually, now that I describe it this way, I realize it doesn't even have to be limited to stores, it can be anything that's done per-component, so you could do a similar thing with some HLSL_IR_EXPRs.
Eh, except that the thing that's unique about stores is that you can "just" split them, because they don't produce a result. Expressions do. So ignore that last bit :-)
> The idea is, if you have something like that float4(f, 1, 2, 3) sequence, you can't copy-prop loads from `<constructor-2>` with the current pass because they don't all have the same instruction as a source. [...] You increase the number of stores but (hopefully) reduce the number of intermediate variables.
My way to read your proposal (for the second step) is as a sort of *variable deduplication* (or maybe *dealiasing*?). What's happening in your test program is that `<constructor-2>` and `<output-sv_target0>` are essentially the same variable: as soon as they both are initialized they have the same value, and they will keep it for their whole lifespan. So you can basically replace one with the other one. In theory you can do it either way, but in this case you need `<output-sv_target0>` to survive because it has an externally visible semantic.
> Giovanni actually sent a very similar patch to 1/4 [1] several months ago. It wasn't accepted, and I'm not sure what discussion there was about it if any (it may have been off-list).
The reason I remember was more or less the same thing you say now: there are more generic passes that can be written instead of this trick, and in practice it didn't really enable anything relevant, so you proposed to rather wait for somebody to write the proper thing. Since it could now enable the aoffimmi modifiers, there could be a better reason to have it immediately (or, rather, after the freeze), even if it's not the best possible solution yet.
Though I am not sure what you mean by "generic vectorization of stores". The peculiar feature that is exploited by this patch is the fact that immediate constants can be vectorized, i.e., you can replace
```
a.x = 1;
a.y = 2;
a.z = 3;
```
with
```
a.xyz = {1, 2, 3};
```
But as I said, that's a property of immediate constants rather than of stores. In other words, I would describe a "generic vectorization of stores" as something that replaces
```
a.x = b.x;
a.y = b.y;
a.z = b.z;
```
with
```
a = b;
```
That's not something you can do in Francisco's example unless you know that immediate constants are special, i.e., you can vectorize them even if they come from different sources. Which is basically what my patch was doing (and what I believe this patch does; I say "believe" because I haven't read it yet in detail). So it seems to me that in one way or another you always need to have something like this patch, even if it would be good to also handle the cases in which only some components are immediate constants.
Or maybe you had already disproved the thing I just wrote and I don't remember any more? It wouldn't be unheard of for my brain to randomly drop pieces of information.
> My way to read your proposal (for the second step) is as a sort of *variable deduplication* (or maybe *dealiasing*?). [...] In theory you can do it either way, but in this case you need `<output-sv_target0>` to survive because it has an externally visible semantic.
Kind of, but it's more general than that. In theory the "intermediate" variable (in this case the constructor) can be used elsewhere, and it doesn't have to have the same component count as the "final" one. Although in practice it'll probably just be useful for the more restricted case of equal and "duplicate" variables.
> Though I am not sure of what you mean by "generic vectorization of stores". [...] So it seems to me that in one way or another you always need to have something like this patch, even if it would be good to also handle the cases in which only some components are immediate constants.
No, I think language is just hard :D
By "vectorization of stores" I meant the store *to* `a` in this example. I suppose it's redundant since it's not like it's really possible to vectorize anything else.
The pass I describe is kind of specific to constants, although there are other vectorization passes you can do that might end up reusing the same infrastructure. E.g. in your second example you can vectorize loads from the same variable. There's also a possibility of vectorizing stores from the same node by creating a broadcast swizzle.
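For instance, a broadcast swizzle would allow rewriting two stores of the same scalar node (again, just a sketch in the same shorthand):

```
a.x = f;
a.y = f;
```

as

```
a.xy = f.xx;
```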
On Tue Nov 22 22:24:35 2022 +0000, Zebediah Figura wrote:
> The idea is, if you have something like that float4(f, 1, 2, 3) sequence, you can't copy-prop loads from `<constructor-2>` with the current pass because they don't all have the same instruction as a source. [...] There's also a possibility of vectorizing stores from the same node by creating a broadcast swizzle.
Ok, so, in summary, I identify the following relevant passes:
* **a)** Constant handling in copy-prop (basically, what this first patch and Giovanni's original implementation do).
* **b)** Vectorization of stores with rhs constants.
* **c.1)** Vectorization of stores with rhs loads from the same vector within a variable.
* **c.2)** Alternatively, vectorization of stores with rhs loads from the same register.
* **d.1)** Split of stores with rhs loads in copy-prop, grouping the components according to the instruction node of their copy_propagation_value.
* **d.2)** Alternatively, split of stores with rhs loads component-wise (it seems that this doesn't need to happen during copy-prop).
I decided to separate (c) into (c.1) and (c.2) because the latter would be harder to implement with the current architecture but would have more reach. Also, it may be more convenient to do it after register allocation.
I think Zeb's "generic vectorization of stores" maps to (b), although before her last message I thought she meant a pass that does both (b) and (c) because it had "generic" in the name. Probably Gio thought the same.
Also, I think both (d.1) and (d.2) map to Zeb's "propagation of load+store sequences into multiple store instructions" proposals.
So, the idea is that implementing just (b) and (d) would achieve all the functionality of (a), plus additional vectorization.
I will start writing my thoughts on these alternatives.
On Wed Nov 23 15:58:28 2022 +0000, Francisco Casas wrote:
> Ok, so, in summary, I identify the following relevant passes: [...] So, the idea is that implementing just (b) and (d) would achieve all the functionality of (a), plus additional vectorization. I will start writing my thoughts on these alternatives.
### Regarding (b) and (c), i.e. vectorization:
To achieve vectorization, in particular (c), consider, say:

```
a.x = b.x;
// other instructions
a.y = b.y;
```
If we merge the two operations down, we have to make sure that:
- `a.x` is not read or written by the other instructions in between.
- `b.x` is not written in between.

If we merge the two operations up, we have to make sure that:
- `a.y` is not read or written by the other instructions in between.
- `b.y` is not written in between, and its hlsl_ir_load is before the `a.x =` instruction, or can be moved there.

Here we also have to consider the possibility of non-constant paths accessing these values.
(b) would be easier to achieve than (c) since, if we replace `b.x` and `b.y` with constants in this example, we know that these values will never be written to. Also, (c) is more complex since we also have to keep an eye on the location of the `b.x` and `b.y` hlsl_ir_loads in the IR.
Since this pass wouldn't be helping copy-prop, it probably makes sense to run it after the `do ... while(progress)` that includes copy-prop and friends.
To implement this, we can write a function that checks whether it is possible for a value to be read or written within a list of instructions, given a starting point and an end point, and use it to check whether these conditions are met for each pair of compatible instructions (both being stores from `b` to `a`, in different components).
However, to have the same reach as copy-prop, we would also have to consider how to handle control flow with this pass. There may be ifs or loops in between, or one of the instructions may be inside a block where the other is not. Since I don't have clear answers on how to implement the latter cases, I assume (b) and (c) would only operate for pairs of instructions in the same block.
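A possible shape for that check, with purely hypothetical names (only hlsl_component_index_range_from_deref() and the deref fields appear in the actual series):

```
/* Hypothetical sketch, not part of the patch series: returns true if any
 * instruction strictly between "begin" and "end" (assumed to be in the same
 * block) may access components [first, first + count) of "var". If
 * writes_only is set, loads are ignored. */
static bool var_range_accessed_between(struct hlsl_ctx *ctx, const struct hlsl_ir_var *var,
        unsigned int first, unsigned int count, bool writes_only,
        const struct hlsl_ir_node *begin, const struct hlsl_ir_node *end)
{
    const struct list *entry;

    for (entry = begin->entry.next; entry != &end->entry; entry = entry->next)
    {
        const struct hlsl_ir_node *instr = LIST_ENTRY(entry, struct hlsl_ir_node, entry);
        unsigned int start, c;

        if (instr->type == HLSL_IR_STORE)
        {
            const struct hlsl_ir_store *store = hlsl_ir_store(instr);

            if (store->lhs.var != var)
                continue;
            if (!hlsl_component_index_range_from_deref(ctx, &store->lhs, &start, &c))
                return true; /* non-constant path: assume it may alias */
            if (start < first + count && first < start + c)
                return true;
        }
        else if (!writes_only && instr->type == HLSL_IR_LOAD)
        {
            /* ...the analogous check on hlsl_ir_load(instr)->src... */
        }
    }
    return false;
}
```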
### Regarding (c.1) vs (c.2)
We have the following dilemma:
We can choose between vectorizing at the vector level, or try to do it at the register level. I think the difference only becomes apparent when more than one vector/scalar share the same register.
On one hand, vectorizing register-wise is more general and is a better optimization. On the other, it is more complicated, as we either should be considering register offsets at the IR level when running these passes, identifying in the struct whether two components share the same register, or we should defer the pass until after register allocation.
So far, I don't think we need (c) for any reason other than optimizing the output code and making the IR more compact, but it is worth thinking about it.
Actually, (b) is also not exempt from this dilemma, since the rhs could also be vectorized vector-wise or register-wise (maybe we should call these options (b.1) and (b.2)).
### Regarding (d.1) vs (d.2)
I suspect that (d.1) doesn't have much value over (d.2), since both will split more stores than necessary, and that can only be repaired with (c). The only difference is that (d.1) may generate fewer of them.

If we put (d.1) before or inside the `do ... while(progress)` loop (or make it an extension of copy-prop, which makes sense), it may split with more granularity than necessary, since revealing that values come from the same instruction may take several of these iterations; and if we place it after, it won't help copy-prop.
(d.2) can be put before the `do ... while(progress)` and executed once.
### Regarding (b) and (d) vs. (a)
In my opinion, this reduces to two questions:
1. Do (b) and (d) indeed cover all the cases that (a) does?
2. What cases do (b) and (d) cover that (a) doesn't?
Surprisingly, now I think that the answer to question (1) is no. Consider the following example which compiles in native and with pass (a):
```
Texture2D t;
uniform int i;

float4 main() : sv_target
{
    int2 a = {1, 0};

    if (i < 5)
    {
        a.y = 3;
        return t.Load(int3(0, 1, 2), a);
    }
    else
    {
        a.y = 4;
        return t.Load(int3(0, 1, 2), a);
    }
}
```
In this case, two constant vector2s have to be generated, but as they share a common constant component, the vectorization of the constants alone wouldn't be able to duplicate this component. I didn't put more thought into it, but there may also be more rudimentary examples.
Regarding question (2), at first I thought that the benefit was just a matter of code readability, which alone wouldn't be achieved unless we also implemented (c). However, I then found a case that (b) and (d) would cover and (a) won't; see the following message.
### The swizzle issue
(a) in its current form is not addressing swizzles.
Consider the following shader:
```
Texture2D t;
uniform int i;

float4 main() : sv_target
{
    int3 a = {1, 2, i};

    return t.Load(int3(0, 1, 2), a.xy);
}
```
even with (a), this results in:
```
2: int | i
3: int | 1
4: uint | 0
5: | = (a[@4].x @3)
6: int | 2
7: uint | 1
8: | = (a[@7].x @6)
9: uint | 2
10: | = (a[@9].x @2)
11: int3 | {0 1 2 }
12: int3 | a
13: int2 | @12.xy
14: float4 | load_resource(resource = t, sampler = (nil), coords = @11, offset = @13)
15: | return
16: | = (<output-sv_target0> @14)
```
and the offset cannot be resolved as a constant int2.
This may be a good reason to also include (b) + (d).
### Conclusions
I hope my analysis was useful, but there are many things in play and some of my assessments may be wrong, so be aware that there may be errors, and please point them out if you find them! I will also keep thinking, to see whether I find errors or counterexamples that change my opinion; there is still a lot to think about here.
Having said that, my current opinion is that we should implement (a) + (b.1) + (d.2), but also seek alternatives that just involve (a) and some way of solving the swizzle problems (if we can make sure that they cover what (b) + (d) do), like the mentioned lowering of swizzles into component-wise stores.
Since I don't think that IR code readability alone has priority over functionality, and AFAIK we can trust the drivers to do vectorization for us, I also think we should defer (c) unless we find an actual feature that requires it.
There is also the alternative of using hlsl_derefs instead of mere nodes as texel_offsets, but IIRC Zeb has good reasons to not go in that direction.
On Thu Nov 24 00:07:23 2022 +0000, Francisco Casas wrote:
> Regarding (b) and (c), i.e. vectorization: [...] Since I don't have clear answers on how to implement the latter cases, I assume (b) and (c) would only operate for pairs of instructions in the same block.
You probably want to split, at least at the conceptual level, this into two different steps: first, what optimization can you do when the relevant instructions are already consecutive (i.e., your `other instructions` block is empty); second, how to commute instructions in order to make "interesting" instruction consecutive. You're probably going to reuse the second step for many different passes (again, at the conceptual level; it's not obvious that code will be easy to reuse).
So, when can you commute two IR instructions? And when can you move an IR instruction outside or inside a block? Clearly you always have to keep things consequential, so you can't move x after y if y uses the result of x. Then things become a bit more complicated, and I'm doing little more than brainstorming, so take this with a grain of salt. Maybe two or three grains.

* Constants, expressions and swizzles commute happily with more or less everything else, and can go inside or outside code blocks.
* Loads and stores only touch thread-local stuff (I am not sure that "thread" is the right word in the GPU world, but I guess you get what I mean), so they commute among themselves (provided they don't touch the same registers) and with resource loads and stores. They don't commute (in general! They do in some cases) with blocks and jumps, nor can they go inside or outside code blocks.
* Resource loads and stores are more interesting, because they interact with other threads. The way they commute is dictated by the memory model. Searching the web for "hlsl memory model" or "sm4 memory model" gives basically no useful result, but I guess we can assume that it will be rather weak, so resource loads and stores can commute rather easily, except when they touch the same address. Or maybe even in that case, who knows? The more I discover about GPUs the more I am baffled. At some point we'll bring in barriers to prevent them commuting too much, but for the moment we don't support them.

If in doubt, it's more appropriate to err towards a stronger memory model (allowing fewer optimizations, but avoiding introducing miscompilations).
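A couple of tiny cases of the "keep things consequential" constraint, in the same shorthand as the earlier snippets (hypothetical, just to make the rules concrete):

```
x = a + 1;    /* expression + store to x */
y = b;        /* touches neither x nor a: commutes freely with the line above */
z = x * 2;    /* reads x: must stay after the store to x */
```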
As usual, lots of fun to be had!
> We can choose between vectorizing at the vector level, or try to do it at the register level. I think the difference only becomes apparent when more than one vector/scalar share the same register.
>
> On one hand, vectorizing register-wise is more general and is a better optimization. On the other, it is more complicated, as we either should be considering register offsets at the IR level when running these passes, identifying in the struct whether two components share the same register, or we should defer the pass until after register allocation.
This is a difficult question, actually. I think the options are
(i) vectorize as much as possible, most notably past 4 components, and let translation into smX split up the ops again. This isn't *bad* conceptually, but the current codegen is really built to not do this.
(ii) vectorize up to 4 components and along aligned boundaries. The problem is that this is leaking backend details out of the backends again.
(iii) don't vectorize at all and make the backends do it instead. Disadvantage is that we may then need to code it multiple times.
There are some passes we'll need to do per-backend *anyway*, e.g. getting rid of the redundant movs from HLSL_IR_LOAD instructions. I'm not sure if this is such a case?
> So far, I don't think we need (c) for any reason other than optimizing the output code and making the IR more compact, but it is worth thinking about it.
I'm inclined to agree, yeah. If nothing else it hasn't been *necessary* for anything; it was just a vague idea of something we could extend the pass to eventually.
> Actually, (b) is also not exempt from this dilemma, since the rhs could also be vectorized vector-wise or register-wise (maybe we should call these options (b.1) and (b.2)).
(b) is about vectorizing constants, though? Probably we'd only vectorize them up to 4. Granted, though, you get the same issues as above, and vectorizing constants for sm1 also involves some tradeoffs if we leave that in common code.
> 1. Do (b) and (d) indeed cover all the cases that (a) does?
> 2. What cases do (b) and (d) cover that (a) doesn't?
>
> Surprisingly, now I think that the answer to question (1) is no. Consider the following example which compiles in native and with pass (a):
Hmm, true. The alternate way to solve that is to pull that variable into the loop, but that requires knowing when the variable should be pulled into the loop (in general we probably want to go the other direction), and I can't easily think of another reason we'd want to do that (to alleviate register pressure, maybe?).
So (a) probably does make the most sense after all.
> There is also the alternative of using hlsl_derefs instead of mere nodes as texel_offsets, but IIRC Zeb has good reasons to not go in that direction.
I don't remember this ever being proposed? I don't see any benefit to it, either.
That we use hlsl_deref instead of hlsl_src for resource/sampler arguments to hlsl_ir_resource_load is... odd. Odd because every other argument expression is hlsl_src, and there's no obvious reason that these would be different. It works, though, because object types can always be resolved to a deref (i.e. if they were hlsl_src, they'd be guaranteed to come from HLSL_IR_LOAD), and it's necessary because we can't translate HLSL_IR_LOAD to an IR instruction.