The MR looks fine to me, apart from the minor remark about the comment wording.
I just wanted to note that the new algorithm has higher computational complexity than before, because `get_available_writemask()` used to be constant time and is now linear in the number of register allocations. This already causes a measurable performance hit in a synthetic but still relatively simple shader like this:

```
uniform float4x4 x;
uniform float4x4 y;

float4 main(float4 pos : sv_position) : sv_target
{
    float4x4 a = mul(mul(y, x), mul(x, x));
    float4x4 b = mul(mul(y, x), mul(y, x));
    float4x4 c = mul(mul(y, y), mul(x, x));
    float4x4 d = mul(mul(y, y), mul(y, x));

    float4 ret = 0.0;
    ret += a[0] - b[0] * c[0] / d[0];
    ret += a[1] - b[1] * c[1] / d[1];
    ret += a[2] - b[2] * c[2] / d[2];
    ret += a[3] - b[3] * c[3] / d[3];

    return ret;
}
```
Here I am leveraging `mul()` to create a lot of temporaries and summing everything to prevent DCE from optimizing too much. On my computer a shader runner that just compiles this (doesn't execute it) takes 0.1 seconds before this MR and 0.11 seconds after it.
I don't claim any significance for my random microbenchmark, so I don't think it's necessary to change the MR, but if and when we go hunting for performance improvements in the HLSL compiler, let's remember to have a look here.
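To make the complexity point concrete, here is a rough C sketch of the two lookup shapes I have in mind. It is only an illustration: `struct allocation`, `available_mask_cached()` and `available_mask_scan()` are hypothetical names and do not reproduce the actual allocator code or its exact liveness rules. The first function answers from a fixed four-entry per-register table, the second has to walk every recorded allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one register allocation (not the real structure). */
struct allocation
{
    unsigned int reg;         /* register index */
    unsigned int writemask;   /* bits 0..3 stand for x, y, z, w */
    unsigned int first_write; /* instruction index of the first write */
    unsigned int last_read;   /* instruction index of the last read */
};

/* Constant-time shape: one cached "last read" per component of a register.
 * A component is considered free if its last read is not after first_write. */
static unsigned int available_mask_cached(const unsigned int last_read[4],
        unsigned int first_write)
{
    unsigned int mask = 0, i;

    for (i = 0; i < 4; ++i)
    {
        if (last_read[i] <= first_write)
            mask |= 1u << i;
    }
    return mask;
}

/* Linear shape: scan every allocation and clear the components of those
 * that live in the same register and overlap the requested lifetime. */
static unsigned int available_mask_scan(const struct allocation *allocs, size_t count,
        unsigned int reg, unsigned int first_write, unsigned int last_read)
{
    unsigned int mask = 0xf;
    size_t i;

    assert(first_write <= last_read);

    for (i = 0; i < count; ++i)
    {
        const struct allocation *a = &allocs[i];

        if (a->reg != reg)
            continue;
        /* Assumed overlap rule: lifetimes that merely touch do not conflict. */
        if (a->first_write < last_read && first_write < a->last_read)
            mask &= ~a->writemask;
    }
    return mask;
}
```

With a shader like the one above, which produces many temporaries, the second shape is called once per allocation and each call walks the allocations made so far, so under these assumptions the total work grows roughly quadratically with the number of temporaries.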