On Wed May 3 16:08:53 2023 +0000, Giovanni Mascellani wrote:
The MR looks fine to me, apart from the small remark about comment wording. I just wanted to note that the new algorithm has higher computational complexity than before, because `get_available_writemask()` used to be constant time and is now linear in the number of register allocations. This already causes a measurable performance hit in a synthetic but still relatively simple shader such as this:

    uniform float4x4 x;
    uniform float4x4 y;

    float4 main(float4 pos : sv_position) : sv_target
    {
        float4x4 a = mul(mul(y, x), mul(x, x));
        float4x4 b = mul(mul(y, x), mul(y, x));
        float4x4 c = mul(mul(y, y), mul(x, x));
        float4x4 d = mul(mul(y, y), mul(y, x));
        float4 ret = 0.0;

        ret += a[0] - b[0] * c[0] / d[0];
        ret += a[1] - b[1] * c[1] / d[1];
        ret += a[2] - b[2] * c[2] / d[2];
        ret += a[3] - b[3] * c[3] / d[3];
        return ret;
    }

Here I am leveraging `mul()` to create a lot of temporaries, and summing everything to prevent DCE from optimizing too much away. On my computer, a shader runner that just compiles this (without executing it) takes 0.1 seconds before this MR and 0.11 seconds after it. I don't claim any significance for my random microbenchmark, so I don't think it's necessary to change the MR, but if and when we go hunting for performance in the HLSL compiler, let's remember to have a look here.
We could probably do better by just recording allocations for the few cases where we need to reserve, and then using the old pass for everything else. But it's probably not worth rewriting this again until we see evidence it matters.
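
For reference, the complexity difference discussed above can be illustrated with a rough C sketch. This is not the actual vkd3d code: `struct allocation`, `available_writemask_linear()` and `available_writemask_const()` are hypothetical names, and the live-range representation (a first write and a last read per allocation) is an assumption made only for illustration.

```c
#include <stddef.h>

/* Hypothetical record of one allocation: a register, a 4-bit component
 * mask, and the live range during which those components are in use. */
struct allocation
{
    unsigned int reg;
    unsigned int writemask;
    unsigned int first_write, last_read;
};

/* Linear lookup (assumed shape): scan every recorded allocation to find
 * which components of "reg" are free during [first_write, last_read]. */
static unsigned int available_writemask_linear(const struct allocation *allocs,
        size_t count, unsigned int reg, unsigned int first_write, unsigned int last_read)
{
    unsigned int mask = 0xf;
    size_t i;

    for (i = 0; i < count; ++i)
    {
        const struct allocation *a = &allocs[i];

        if (a->reg != reg)
            continue;
        /* Overlapping live ranges mean these components are taken. */
        if (a->first_write < last_read && first_write < a->last_read)
            mask &= ~a->writemask;
    }
    return mask;
}

/* Constant-time lookup (assumed shape): a register only remembers the
 * last read of each of its four components, so the cost does not depend
 * on how many allocations were made. */
static unsigned int available_writemask_const(const unsigned int last_read[4],
        unsigned int first_write)
{
    unsigned int mask = 0, i;

    for (i = 0; i < 4; ++i)
    {
        if (last_read[i] <= first_write)
            mask |= 1u << i;
    }
    return mask;
}
```

Under these assumptions, a hybrid along the lines suggested above would presumably keep the per-component bookkeeping for ordinary temporaries and consult the recorded list only for the few reserved registers.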