On 25.02.2013 06:03, Nozomi Kodama wrote:
out.u.m[2][i] = v.z / signed_det;
out.u.m[3][i] = v.w / signed_det; }
*pout = out;
While you are at it, you may fix the indentation of out*, "}", "*pout = out;" and "return pout;".
signed_det = (i % 2)? -det: det;
Couldn't you just use something like "det = -det;" instead of the modulo? This should be a little bit faster.
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
With the change above it should look like:

int i;
D3DXVECTOR4 v, vec[4];
FLOAT det;
...
for (i = 0; i < 4; i++)
{
    vec[i].x = pm->u.m[i][0];
    vec[i].y = pm->u.m[i][1];
    vec[i].z = pm->u.m[i][2];
    vec[i].w = pm->u.m[i][3];
}

for (i = 0; i < 4; i++)
{
    switch (i)
    {
        case 0:
            D3DXVec4Cross(&v, &vec[1], &vec[2], &vec[3]);
            break;
        case 1:
            D3DXVec4Cross(&v, &vec[0], &vec[2], &vec[3]);
            break;
        case 2:
            D3DXVec4Cross(&v, &vec[0], &vec[1], &vec[3]);
            break;
        case 3:
            D3DXVec4Cross(&v, &vec[0], &vec[1], &vec[2]);
            break;
    }
    pout->u.m[0][i] = v.x / det;
    pout->u.m[1][i] = v.y / det;
    pout->u.m[2][i] = v.z / det;
    pout->u.m[3][i] = v.w / det;
    det = -det;
}
return pout;
Maybe we could reuse some calculations from the D3DXVec4Cross function ...
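To make the reuse idea concrete: each call to the 4D cross product shares six 2x2 subdeterminants of its last two argument rows. A minimal plain-C sketch of that factoring follows; the `vec4` struct and `vec4_cross` name are stand-ins for illustration, not the actual d3dx9 declarations:

```c
#include <assert.h>

/* Stand-in for D3DXVECTOR4; the field layout matches the x/y/z/w use above. */
typedef struct { float x, y, z, w; } vec4;

/* 4D "cross product" of three vectors, written so that the six 2x2
 * subdeterminants of rows b and c are computed once and reused four
 * times - this is the sharing opportunity mentioned above. */
static void vec4_cross(vec4 *out, const vec4 *a, const vec4 *b, const vec4 *c)
{
    /* 2x2 subdeterminants of the lower two rows (b, c). */
    float xy = b->x * c->y - b->y * c->x;
    float xz = b->x * c->z - b->z * c->x;
    float xw = b->x * c->w - b->w * c->x;
    float yz = b->y * c->z - b->z * c->y;
    float yw = b->y * c->w - b->w * c->y;
    float zw = b->z * c->w - b->w * c->z;

    /* Cofactor expansion along row a, alternating signs. */
    out->x =   a->y * zw - a->z * yw + a->w * yz;
    out->y = -(a->x * zw - a->z * xw + a->w * xz);
    out->z =   a->x * yw - a->y * xw + a->w * xy;
    out->w = -(a->x * yz - a->y * xz + a->z * xy);
}
```

In the inverse loop above, consecutive cross products share two of their three input rows, so several of these subdeterminants could in principle be carried over between iterations as well.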
Cheers Rico
On 25 February 2013 10:24, Rico Schüller kgbricola@web.de wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
On 25.02.2013 11:08, Henri Verbeet wrote:
On 25 February 2013 10:24, Rico Schüller kgbricola@web.de wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
Is there a reason why we don't use SSE instructions? Or has just no one had a look at it yet?
On 25 February 2013 12:26, Rico Schüller kgbricola@web.de wrote:
Is there a reason why we don't use SSE instructions? Or has just no one had a look at it yet?
I think on the one hand there hasn't been much of a need so far, and on the other hand you'd probably need something along the lines of STT_GNU_IFUNC to make it work properly.
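The runtime-selection idea could also be done in plain C with a lazily resolved function pointer; STT_GNU_IFUNC just moves the same resolution into the dynamic linker. A minimal sketch, where the helper names, the trivial bodies, and the stubbed CPU check are all illustrative assumptions rather than d3dx9 internals:

```c
#include <assert.h>

/* Hypothetical scalar and SSE variants of some hot helper; the names and
 * the trivial bodies are illustrative only. */
static float scale_c(float x)   { return x * 2.0f; }
static float scale_sse(float x) { return x * 2.0f; /* real SSE body here */ }

/* Stand-in CPU check; a real build would use cpuid, or let the dynamic
 * linker pick the implementation via an STT_GNU_IFUNC resolver. */
static int cpu_has_sse2(void) { return 0; }

/* Lazily resolved function pointer: the CPU check runs once, and later
 * calls go straight through the pointer. */
static float (*scale_impl)(float);

static float scale(float x)
{
    if (!scale_impl)
        scale_impl = cpu_has_sse2() ? scale_sse : scale_c;
    return scale_impl(x);
}
```

The ifunc mechanism avoids even the pointer indirection on calls across modules, which is why it was suggested as the way to "make it work properly".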
Rico,
can you give a try to this patch? If it is slightly slower than native, we could at first merge it.
Anyway, if the application is well coded, this function should not be called often. Usually an application uses transformation matrices that are a lot easier to invert.
Nozomi
________________________________ From: Henri Verbeet hverbeet@gmail.com To: Rico Schüller kgbricola@web.de Cc: wine-devel@winehq.org; Nozomi Kodama nozomi.kodama@yahoo.com Sent: Monday, 25 February 2013, 0:08 Subject: Re: d3dx9: Avoid expensive computations
On 25 February 2013 10:24, Rico Schüller kgbricola@web.de wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
Hi Nozomi,
this is pretty fast. Just some numbers (run time on my machine, so it might not be that representative)...
before: 43s
previous patch: 27s
this patch: 21s
native: 16s
So from the speed point of view, it's a lot closer to native than the previous versions.
Though, I would split this into 2 patches, one for D3DXMatrixDeterminant and one for D3DXMatrixInverse. I think it's a nice step forward. Though, we might test the speed of an SSE version and maybe use it later ...
Are there any other opinions?
Cheers Rico
On 25.02.2013 12:34, Nozomi Kodama wrote:
Rico,
can you give a try to this patch? If it is slightly slower than native, we could at first merge it.
Anyway, if the application is well coded, this function should not be called often. Usually an application uses transformation matrices that are a lot easier to invert.
Nozomi
*From:* Henri Verbeet hverbeet@gmail.com *To:* Rico Schüller kgbricola@web.de *Cc:* wine-devel@winehq.org; Nozomi Kodama nozomi.kodama@yahoo.com *Sent:* Monday, 25 February 2013, 0:08 *Subject:* Re: d3dx9: Avoid expensive computations
On 25 February 2013 10:24, Rico Schüller <kgbricola@web.de mailto:kgbricola@web.de> wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
2013/2/26 Rico Schüller kgbricola@web.de:
Hi Nozomi,
this is pretty fast. Just some numbers (run time on my machine, so it might not be that representative)...
before: 43s
previous patch: 27s
this patch: 21s
native: 16s
So from the speed point of view, it's a lot closer to native than the previous versions.
Though, I would split this into 2 patches, one for D3DXMatrixDeterminant and one for D3DXMatrixInverse.
That's probably a good idea.
I think it's a nice step forward. Though, we might test the speed of an SSE version and maybe use it later ...
Are there any other opinions?
My main concern is that the effort in optimizing further those two functions might not have significant effects on actual application execution times (think diminishing returns). I'm not against making the code faster, especially if that doesn't make the code unreadable, but it might not be the best place to work on if you want to optimize d3dx9. You might want to profile some applications and see what the actual bottlenecks are.
Specifically on these functions, an SSE-based version will probably run significantly faster, but you need to solve the compatibility issues with older CPUs, e.g. by selecting the correct function implementation at runtime in some fashion, as Henri mentioned. BTW there might be other potential problems, such as applications setting the SSE control register in some unexpected way (although that happens with the FPU control word too). You can also give GCC optimization options a shot, such as "-mfpmath=sse" (and a suitable -march value). Obviously we don't want to use them in general, but it might be interesting to see what GCC can do there. Keep in mind that the compiler has to stay on the safe side when optimizing, so you might need to add attributes here and there to allow more aggressive optimizations. From a quick Google search I found http://locklessinc.com/articles/vectorize/ which seems to show the general idea.
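As a taste of what hand-written SSE buys in this kind of loop, here is a minimal sketch (the function name and array-based interface are illustrative, not d3dx9 API) that replaces four scalar divisions by det with a single packed division:

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE1 intrinsics; needs an x86 target */

/* Divide four floats by det with one divps instruction - the sort of win
 * an SSE build of the matrix-inverse loop could get over four scalar
 * divisions per column. */
static void div4_by_det(float out[4], const float in[4], float det)
{
    __m128 v = _mm_loadu_ps(in);          /* load 4 floats (unaligned) */
    __m128 d = _mm_set1_ps(det);          /* broadcast det into all 4 lanes */
    _mm_storeu_ps(out, _mm_div_ps(v, d)); /* 4 divisions in parallel */
}
```

With -mfpmath=sse and a suitable -march, GCC can sometimes generate code like this on its own from the scalar loop, which is what the vectorization article linked above demonstrates.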
Cheers, Matteo.
Cheers Rico
On 25.02.2013 12:34, Nozomi Kodama wrote:
Rico,
can you give a try to this patch? If it is slightly slower than native, we could at first merge it.
Anyway, if the application is well coded, this function should not be called often. Usually an application uses transformation matrices that are a lot easier to invert.
Nozomi
*From:* Henri Verbeet hverbeet@gmail.com *To:* Rico Schüller kgbricola@web.de *Cc:* wine-devel@winehq.org; Nozomi Kodama nozomi.kodama@yahoo.com *Sent:* Monday, 25 February 2013, 0:08 *Subject:* Re: d3dx9: Avoid expensive computations
On 25 February 2013 10:24, Rico Schüller <kgbricola@web.de mailto:kgbricola@web.de> wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
On Mon, Feb 25, 2013 at 11:08:02AM +0100, Henri Verbeet wrote:
On 25 February 2013 10:24, Rico Schüller kgbricola@web.de wrote:
I did some small tests for speed with the following results. You may also avoid such a lot of variable assignments like *pout = out and you may use 4 vecs instead. This should save ~48 assignments and it should also improve the speed a bit more (~10%). Though, native is still 40% faster than that.
I'd somewhat expect native to use SSE versions of this kind of thing when the CPU supports those instructions. You also generally want to pay attention to the order in which you access memory, although perhaps it doesn't matter so much here because an entire matrix should be able to fit in a single cacheline, provided it's properly aligned.
Also make sure that the memory being written to can't be aliased by the memory being read. If aliasing is possible, the compiler has to sequence the code to ensure the writes and reads happen in the correct order.
That function probably has too many live values for the general-purpose registers, so some values will get spilled to the stack. SSE might be better.
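The standard way to hand the compiler that no-aliasing guarantee is C99 `restrict` on the pointer parameters. A minimal sketch, with an illustrative helper rather than the actual d3dx9 function:

```c
#include <assert.h>

/* C99 "restrict" promises the compiler that out and in never overlap, so
 * it may reorder and vectorize the loads and stores freely. Without that
 * promise, a store to out[i] could feed a later load of in[j], forcing a
 * conservative instruction ordering. */
static void scale_floats(float *restrict out, const float *restrict in,
                         float s, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```

In D3DXMatrixInverse the pout and pm arguments may legally alias (that's why the current code builds the result in a local and copies it at the end), so `restrict` could only be applied after that copy-out structure is settled.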
David