Is there a problem with this patch? http://source.winehq.org/patches/data/95555
Nozomi
On 13.04.2013 01:55, Nozomi Kodama wrote:
Looks pretty much ok, but isn't there a way to reduce the size a bit? See the dirty hack which is attached. D3DXSHMultiply6 will add a lot of lines too...
Also, is there a reason why we use constants with different precision (e.g. 0.28209479f in D3DXSHMultiply4 vs. 0.2820948064f)?
Cheers Rico
Hello
Thanks for the review. I don't think that using defines is the way to go. Indeed, I tested my patch and yours, and yours is about 12% slower than mine on my computer. We are trying to take care of the efficiency of this DLL, so now is not the time to increase latency.
I used 10 digits since there are a lot of computations and I want to avoid big rounding errors as much as possible. If we want to make things uniform, then we should switch D3DXSHMultiply2, 3 and 4 to 10 digits as well. But that is for another patch.
Nozomi.
On 15.04.2013 02:10, Nozomi Kodama wrote:
You are right. I'm not really sure why the code should be slower, though. The #defines themselves shouldn't have an impact on the performance, but it might be because the macro version is translated to:
    ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
    tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
    out[1] = 0.0f + ta * b[1] + tb * a[1];
    t = a[1] * b[1];
    out[0] = out[0] + 0.28209479f * t;
    out[6] = 0.0f + -0.12615662f * t;
    out[8] = 0.0f + -0.21850968f * t;
instead of:

    ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
    tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
    out[1] = ta * b[1] + tb * a[1];
    t = a[1] * b[1];
    out[0] += 0.28209479f * t;
    out[6] = -0.12615662f * t;
    out[8] = -0.21850968f * t;
This may be due to "out[8] = 0.0f + -0.21850968f * t;" vs. "out[8] = -0.21850968f * t;":
1. The extra 0.0f - shouldn't it get optimized away? IMHO the macro could be fixed (e.g. use an if / a separate macro / no macro at all for these cases).
2. "+ -0.21850968f * t" should be as fast as "- 0.21850968f * t", shouldn't it?
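A minimal sketch of the separate-macro fix, with made-up names (they are not the ones from the attached hack): one macro does a plain assignment for the first write to out[i] and another accumulates the following terms, so no "0.0f +" appears in the expansion at all.

    /* Hypothetical macros, for illustration only. */
    #define SH_SET(out, i, c, t) ((out)[i]  = (c) * (t))
    #define SH_ADD(out, i, c, t) ((out)[i] += (c) * (t))

    static void sh_example(float *out, const float *a, const float *b)
    {
        float t = a[1] * b[1];
        SH_ADD(out, 0,  0.28209479f, t); /* out[0] += 0.28209479f * t; */
        SH_SET(out, 6, -0.12615662f, t); /* out[6]  = -0.12615662f * t; */
        SH_SET(out, 8, -0.21850968f, t); /* out[8]  = -0.21850968f * t; */
    }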
Does anyone else have an objection to this? I just feel that the code size and the constantly reused constants could be written a little bit more nicely. I'm not sure the macro usage is really better; it was just a quick idea to avoid that much code duplication. Also, changes like the suggested uniformization are really error-prone, because they change a lot of code, mostly using copy and paste. Still, I vote for the same style and precision in all cases.
Hope this helps a bit Rico
On Mon, Apr 15, 2013 at 10:24:28AM +0200, Rico Schüller wrote:
If everything is 'float' (no doubles anywhere) then I can't see why the above would compile to different code at any sane optimisation level - best to look at the object code.
Unless the compiler knows that out[] can't overlap a[] or b[], the generated code is likely to be better if 't' is evaluated before the write to out[1].
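For illustration, a hypothetical C99 version of the snippet above with both points applied ('restrict' qualifiers, and 't' evaluated before the write to out[1]); this is a sketch, not the actual d3dx9 code:

    /* restrict promises the compiler that out, a and b don't alias, so it
     * may keep a[1] and b[1] in registers across the store to out[1]. */
    static void sh_multiply_band1(float *restrict out,
                                  const float *restrict a,
                                  const float *restrict b)
    {
        float ta, tb, t;

        ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
        tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
        t = a[1] * b[1];            /* read a[1]/b[1] before writing out[1] */
        out[1] = ta * b[1] + tb * a[1];
        out[0] += 0.28209479f * t;
        out[6] = -0.12615662f * t;
        out[8] = -0.21850968f * t;
    }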
David
This bunch of defines is pretty ugly anyway. I prefer plainly duplicated code to these tons of defines. Defines should be avoided as much as possible.
On 20.04.2013 20:49, Nozomi Kodama wrote:
Well, the only reason I proposed something like this was to reduce the code size a bit, since the same constants and the same steps are used multiple times. It doesn't have to be defines; static inline functions would also do the job. I'm fine with either solution (this includes your solution), it was only a suggestion. :-)
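A minimal sketch of the static inline variant, with hypothetical names that are not from the patch:

    /* Hypothetical inline helper: the compiler expands it like the macro
     * would, but with type checking and no "0.0f +" expansion issue. */
    static inline float sh_dot3(const float *v, float c0, unsigned int i0,
                                float c1, unsigned int i1,
                                float c2, unsigned int i2)
    {
        return c0 * v[i0] + c1 * v[i1] + c2 * v[i2];
    }

    static void sh_example(float *out, const float *a, const float *b)
    {
        float ta, tb;

        ta = sh_dot3(a, 0.28209479f, 0, -0.12615662f, 6, -0.21850968f, 8);
        tb = sh_dot3(b, 0.28209479f, 0, -0.12615662f, 6, -0.21850968f, 8);
        out[1] = ta * b[1] + tb * a[1];
    }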
Cheers Rico
2013/4/15 Nozomi Kodama nozomi.kodama@yahoo.com:
Just to chip in about float precision: as already discussed, 32-bit float constants have about 7 decimal digits of precision, so extending these constants from 8 to 10 digits should not change anything. You would have to use doubles if you want more accurate values (but I don't think we should use doubles in the d3dx9 DLLs anyway). What you can do to ensure numeric accuracy is review the computations: the order of the operations (e.g. how you resolve operator associativity, etc.) and storing/reusing intermediate results can make a difference. I wouldn't worry about this stuff unless we get significantly different results compared to native.
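A quick standalone check one could compile to see what is actually stored (not part of the patch):

    /* A float has a 24-bit significand, i.e. about 7 significant decimal
     * digits (FLT_DIG is 6), so printing the stored values shows how much
     * of a long literal actually survives the conversion to float. */
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        float short_form = 0.28209479f;   /* 8-digit literal */
        float long_form  = 0.2820948064f; /* 10-digit literal */

        printf("FLT_DIG = %d\n", FLT_DIG);
        printf("%.10f\n%.10f\n", short_form, long_form);
        return 0;
    }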
I agree with you. I should have kept at most 7 digits.
Nozomi