On Mon, Apr 15, 2013 at 10:24:28AM +0200, Rico Schüller wrote:
> You are right. I'm not really sure why the code should be slower
> though. The #defines shouldn't have an impact on the performance,
> but it might be because it is translated to:
>
>     ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
>     tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
>     out[1] = 0.0f + ta * b[1] + tb * a[1];
>     t = a[1] * b[1];
>     out[0] = out[0] + 0.28209479f * t;
>     out[6] = 0.0f + -0.12615662f * t;
>     out[8] = 0.0f + -0.21850968f * t;
>
> instead of:
>
>     ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
>     tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
>     out[1] = ta * b[1] + tb * a[1];
>     t = a[1] * b[1];
>     out[0] += 0.28209479f * t;
>     out[6] = -0.12615662f * t;
>     out[8] = -0.21850968f * t;
If everything is 'float' (no doubles anywhere), then I can't see why the above would compile to different code at any sane optimisation level; it's best to look at the object code.
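For instance, one could wrap the two formulations in functions and compare the assembly the compiler emits, e.g. with 'gcc -O2 -S'. A minimal sketch (the function names and surrounding declarations are made up for the comparison; the bodies are the two snippets above):

    /* Hypothetical wrappers, just for comparing the generated code. */
    void variant_expanded(float *out, const float *a, const float *b)
    {
        float ta, tb, t;

        ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
        tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
        out[1] = 0.0f + ta * b[1] + tb * a[1];
        t = a[1] * b[1];
        out[0] = out[0] + 0.28209479f * t;
        out[6] = 0.0f + -0.12615662f * t;
        out[8] = 0.0f + -0.21850968f * t;
    }

    void variant_plain(float *out, const float *a, const float *b)
    {
        float ta, tb, t;

        ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
        tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
        out[1] = ta * b[1] + tb * a[1];
        t = a[1] * b[1];
        out[0] += 0.28209479f * t;
        out[6] = -0.12615662f * t;
        out[8] = -0.21850968f * t;
    }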
Unless the compiler knows that out[] can't overlap a[] or b[], the generated code is likely to be better if 't' is evaluated before the write to out[1].
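A minimal sketch of that reordering (the function name is made up; the arithmetic is the second snippet above):

    /* Hypothetical wrapper; the point is the statement order. */
    void multiply_band(float *out, const float *a, const float *b)
    {
        float ta, tb, t;

        ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
        tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];

        /* Evaluate t before writing out[1]; otherwise, if out could
         * alias a or b, a[1] and b[1] must be reloaded after the store. */
        t = a[1] * b[1];
        out[1] = ta * b[1] + tb * a[1];

        out[0] += 0.28209479f * t;
        out[6] = -0.12615662f * t;
        out[8] = -0.21850968f * t;
    }

Declaring the pointer parameters 'restrict' (C99) would tell the compiler the arrays can't overlap and have much the same effect without the reordering.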
David