[PATCH v8 0/1] MR723: vkd3d-shader/spirv: Implement MAD in two operations if flagged as precise.
This would eliminate the todo for the precise mad() test in !718. Maybe we need test results on nvidia and intel to decide if we actually want this. -- v8: vkd3d-shader/ir: Implement MAD in two operations if flagged as precise. https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723
From: Conor McCarthy <cmccarthy(a)codeweavers.com> --- libs/vkd3d-shader/ir.c | 61 +++++++++++++++++-- libs/vkd3d-shader/vkd3d_shader_private.h | 25 ++++++++ .../hlsl/arithmetic-float-uniform.shader_test | 2 +- 3 files changed, 81 insertions(+), 7 deletions(-) diff --git a/libs/vkd3d-shader/ir.c b/libs/vkd3d-shader/ir.c index 8af537390..9ba0de940 100644 --- a/libs/vkd3d-shader/ir.c +++ b/libs/vkd3d-shader/ir.c @@ -94,7 +94,48 @@ static bool vsir_instruction_init_with_params(struct vsir_program *program, return true; } -static enum vkd3d_result vsir_program_lower_texkills(struct vsir_program *program) +/* The Shader Model 5 Assembly documentation states: "If components of a mad + * instruction are tagged as precise, the hardware must execute a mad instruction + * or the exact equivalent, and it cannot split it into a multiply followed by an add." + * But DXIL.rst states the opposite: "Floating point multiply & add. This operation is + * not fused for "precise" operations." + * Windows drivers seem to conform with the latter, for SM 4-5 and SM 6. */ +static bool vsir_program_handle_precise_mad(struct vsir_program *program, size_t i, + unsigned int tmp_idx) +{ + struct vkd3d_shader_instruction_array *instructions = &program->instructions; + struct vkd3d_shader_instruction *mul_ins, *add_ins; + struct vkd3d_shader_dst_param *mul_dst; + + if (!shader_instruction_array_insert_at(instructions, i + 1, 1)) + return false; + mul_ins = &instructions->elements[i]; + add_ins = mul_ins + 1; + + mul_ins->handler_idx = VKD3DSIH_MUL; + mul_ins->src_count = 2; + + if (!(vsir_instruction_init_with_params(program, add_ins, &mul_ins->location, VKD3DSIH_ADD, 1, 2))) + return false; + add_ins->flags = mul_ins->flags & VKD3DSI_PRECISE_XYZW; + + mul_dst = mul_ins->dst; + *add_ins->dst = *mul_dst; + + mul_dst->modifiers = 0; + vsir_register_init(&mul_dst->reg, VKD3DSPR_TEMP, mul_ins->src[0].reg.data_type, 1); + mul_dst->reg.dimension = add_ins->dst->reg.dimension; + mul_dst->reg.idx[0].offset = tmp_idx; + + add_ins->src[0].reg = mul_dst->reg; + add_ins->src[0].swizzle = vsir_swizzle_from_writemask(mul_dst->write_mask); + add_ins->src[0].modifiers = 0; + add_ins->src[1] = mul_ins->src[2]; + + return true; +} + +static enum vkd3d_result vsir_program_lower_texkills_and_precise_mad(struct vsir_program *program) { struct vkd3d_shader_instruction_array *instructions = &program->instructions; struct vkd3d_shader_instruction *texkill_ins, *ins; @@ -106,15 +147,23 @@ static enum vkd3d_result vsir_program_lower_texkills(struct vsir_program *progra { texkill_ins = &instructions->elements[i]; - if (texkill_ins->handler_idx != VKD3DSIH_TEXKILL) + if (texkill_ins->handler_idx != VKD3DSIH_TEXKILL + && (texkill_ins->handler_idx != VKD3DSIH_MAD || !(texkill_ins->flags & VKD3DSI_PRECISE_XYZW))) continue; - if (!shader_instruction_array_insert_at(instructions, i + 1, components_read + 1)) - return VKD3D_ERROR_OUT_OF_MEMORY; - if (tmp_idx == ~0u) tmp_idx = program->temp_count++; + if (texkill_ins->handler_idx == VKD3DSIH_MAD) + { + if (!vsir_program_handle_precise_mad(program, i, tmp_idx)) + return VKD3D_ERROR_OUT_OF_MEMORY; + continue; + } + + if (!shader_instruction_array_insert_at(instructions, i + 1, components_read + 1)) + return VKD3D_ERROR_OUT_OF_MEMORY; + /* tmp = ins->dst[0] < 0 */ ins = &instructions->elements[i + 1]; @@ -5461,7 +5510,7 @@ enum vkd3d_result vsir_program_normalise(struct vsir_program *program, uint64_t remove_dcl_temps(program); - if ((result = vsir_program_lower_texkills(program)) < 0) + if ((result = vsir_program_lower_texkills_and_precise_mad(program)) < 0) return result; if (program->shader_version.major >= 6) diff --git a/libs/vkd3d-shader/vkd3d_shader_private.h b/libs/vkd3d-shader/vkd3d_shader_private.h index b07a7bff7..f401fd7a3 100644 --- a/libs/vkd3d-shader/vkd3d_shader_private.h +++ b/libs/vkd3d-shader/vkd3d_shader_private.h @@ -1761,6 +1761,31 @@ static inline unsigned int vkd3d_compact_swizzle(uint32_t swizzle, uint32_t writ return compacted_swizzle; } +static inline uint32_t vsir_swizzle_from_writemask(unsigned int writemask) +{ + static const unsigned int swizzles[16] = + { + 0, + VKD3D_SHADER_SWIZZLE(X, X, X, X), + VKD3D_SHADER_SWIZZLE(Y, Y, Y, Y), + VKD3D_SHADER_SWIZZLE(X, Y, X, X), + VKD3D_SHADER_SWIZZLE(Z, Z, Z, Z), + VKD3D_SHADER_SWIZZLE(X, Z, X, X), + VKD3D_SHADER_SWIZZLE(Y, Z, X, X), + VKD3D_SHADER_SWIZZLE(X, Y, Z, X), + VKD3D_SHADER_SWIZZLE(W, W, W, W), + VKD3D_SHADER_SWIZZLE(X, W, X, X), + VKD3D_SHADER_SWIZZLE(Y, W, X, X), + VKD3D_SHADER_SWIZZLE(X, Y, W, X), + VKD3D_SHADER_SWIZZLE(Z, W, X, X), + VKD3D_SHADER_SWIZZLE(X, Z, W, X), + VKD3D_SHADER_SWIZZLE(Y, Z, W, X), + VKD3D_SHADER_SWIZZLE(X, Y, Z, W), + }; + + return swizzles[writemask & 0xf]; +} + struct vkd3d_struct { enum vkd3d_shader_structure_type type; diff --git a/tests/hlsl/arithmetic-float-uniform.shader_test b/tests/hlsl/arithmetic-float-uniform.shader_test index 8bc3992e7..c6b7caaae 100644 --- a/tests/hlsl/arithmetic-float-uniform.shader_test +++ b/tests/hlsl/arithmetic-float-uniform.shader_test @@ -121,7 +121,7 @@ uniform 0 float4 1.00000007 -42.1 4.0 45.0 uniform 4 float4 1.625 -5.0 4.125 5.0 uniform 8 float4 1.00000007 -1.0 0.5 -0.5 todo(sm<6) draw quad -todo probe all rgba (2.62500048, 209.5, 17.0, 224.5) +probe all rgba (2.62500048, 209.5, 17.0, 224.5) [require] shader model >= 5.0 -- GitLab https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723
On Wed Apr 10 14:22:12 2024 +0000, Conor McCarthy wrote:
We have no other passes which may need to insert instructions. It's a compromise between adding yet another instruction copying pass, versus favouring performance even if it's a little messy. If the former is the priority I'll add a pass for all the small operations. I added it to `lower_texkills()`. Moving instructions for each occurrence takes quadratic time, so we could see pathological cases. For example, Cyberpunk has a colossal compute shader with over 40,000 SSA values, so a shader of similar size which uses precise `MAD` could occur. If necessary we could convert `lower_texkills_and_precise_mad()` to copy instead of insert.
-- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67573
On Thu Apr 11 00:57:31 2024 +0000, Conor McCarthy wrote:
I added it to `lower_texkills()`. Moving instructions for each occurrence takes quadratic time, so we could see pathological cases. For example, Cyberpunk has a colossal compute shader with over 40,000 SSA values, so a shader of similar size which uses precise `MAD` could occur. If necessary we could convert `lower_texkills_and_precise_mad()` to copy instead of insert. Yeah, I still don't like the way that pass is structured and I think it would be a better idea to rewrite it in terms of a full program copy like many other passes are structured. But that's an independent issue that might be addressed at some point: in terms of software design I think it's better to lump all these smaller changes to the shader in a single pass.
However, what do you think of refactoring your MR in a manner [similar to this](https://gitlab.winehq.org/giomasce/vkd3d/-/commits/proposal) (last three commits)? -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67586
On Thu Apr 11 10:45:37 2024 +0000, Giovanni Mascellani wrote: > Yeah, I still don't like the way that pass is structured and I think it > would be a better idea to rewrite it in terms of a full program copy > like many other passes are structured. But that's an independent issue > that might be addressed at some point: in terms of software design I > think it's better to lump all these smaller changes to the shader in a > single pass. > However, what do you think of refactoring your MR in a manner [similar > to this](https://gitlab.winehq.org/giomasce/vkd3d/-/commits/proposal) > (last three commits)? Sorry for the double comment, but let me add another couple of remarks: * At some point it might happen that the instruction written by one of the helpers called by `vsir_program_lower_instruction()` should itself be processed again. To handle this we could have the switch cases return `VKD3D_OK` when they did something and `VKD3D_FALSE` when they touched nothing, so that the loop in `vsir_program_lower_instruction()` can decide whether to increment the position or not. * Both the TEXKILL and the MAD helpers currently use a TEMP for storing the intermediate values. This should probably become a SSA at some point; even for SM4 there shouldn't be problem for passes to introduce SSAs, provided that they are written only once and the definition point dominates the usage points. The SPIR-V backend should handle them just fine. This has the advantage of not requiring to pass a pointer to all the helpers to coordinate the TEMP index, and I guess in general the less we use TEMPs the better. -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67587
On Thu Apr 11 10:51:42 2024 +0000, Giovanni Mascellani wrote:
Sorry for the double comment, but let me add another couple of remarks: * At some point it might happen that the instruction written by one of the helpers called by `vsir_program_lower_instruction()` should itself be processed again. To handle this we could have the switch cases return `VKD3D_OK` when they did something and `VKD3D_FALSE` when they touched nothing, so that the loop in `vsir_program_lower_instruction()` can decide whether to increment the position or not. * Both the TEXKILL and the MAD helpers currently use a TEMP for storing the intermediate values. This should probably become a SSA at some point; even for SM4 there shouldn't be problem for passes to introduce SSAs, provided that they are written only once and the definition point dominates the usage points. The SPIR-V backend should handle them just fine. This has the advantage of not requiring to pass a pointer to all the helpers to coordinate the TEMP index, and I guess in general the less we use TEMPs the better. Your branch looks good to me. Would you like to submit a new MR with those three commits, while I close mine? A couple of comments: there's a typo in the commit message for DCL_TEMPS, and I would rename `vsir_program_lower_mad` to `vsir_program_lower_precise_mad` for accuracy.
To handle this we could have the switch cases return `VKD3D_OK` when they did something and `VKD3D_FALSE` when they touched nothing
That makes sense if the need arises.
This should probably become a SSA at some point
Yes, and the coordinates and texels which the DXIL parser emits should probably be changed at the same time, though we would need a COMPOSITE_CONSTRUCT instruction to make those SSA. Maybe after the release. -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67600
Yeah, I still don't like the way that pass is structured and I think it would be a better idea to rewrite it in terms of a full program copy like many other passes are structured.
I broadly disagree. TEXKILL isn't that common, and all this pre-emptive copying that a number of these passes do can't possibly be cheaper than an efficient implementation of shader_instruction_array_insert_at() would be. The existing implementation of shader_instruction_array_insert_at() certainly isn't optimal, but that's an issue that can be addressed there; it's not a problem with the passes using it.
But that's an independent issue that might be addressed at some point: in terms of software design I think it's better to lump all these smaller changes to the shader in a single pass.
However, what do you think of refactoring your MR in a manner [similar to this](https://gitlab.winehq.org/giomasce/vkd3d/-/commits/proposal) (last three commits)?
I didn't review those commits in detail, but in terms of the structure, that's pretty much exactly what I had in mind. -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67601
This merge request was approved by Giovanni Mascellani. -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723
On Thu Apr 11 13:44:41 2024 +0000, Conor McCarthy wrote:
Your branch looks good to me. Would you like to submit a new MR with those three commits, while I close mine? A couple of comments: there's a typo in the commit message for DCL_TEMPS, and I would rename `vsir_program_lower_mad` to `vsir_program_lower_precise_mad` for accuracy.
To handle this we could have the switch cases return `VKD3D_OK` when they did something and `VKD3D_FALSE` when they touched nothing That makes sense if the need arises. This should probably become a SSA at some point Yes, and the coordinates and texels which the DXIL parser emits should probably be changed at the same time, though we would need a COMPOSITE_CONSTRUCT instruction to make those SSA. Maybe after the release. Ok, submitted at !779.
-- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723#note_67612
This merge request was closed by Conor McCarthy. -- https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/723
participants (4)
-
Conor McCarthy -
Conor McCarthy (@cmccarthy) -
Giovanni Mascellani (@giomasce) -
Henri Verbeet (@hverbeet)