Hi,
Just did not feel like chasing bugs the other day. I decided to have some fun with something that I have been wondering about for a long time: the usefulness of inline i86 assembly in string functions.
This is the test program as.c:
---------------------------------8<-------------------------------------
#include <stdlib.h>
#include <string.h>

typedef unsigned short WCHAR, *PWCHAR;

static inline WCHAR *strcpyW( WCHAR *dst, const WCHAR *src )
{
#ifdef ASM
    /* lodsw/stosw loop: copy one 16-bit character per pass until the
     * terminating zero has been stored */
    int dummy1, dummy2, dummy3;
    __asm__ __volatile__( "cld\n"
                          "1:\tlodsw\n\t"
                          "stosw\n\t"
                          "testw %%ax,%%ax\n\t"
                          "jne 1b"
                          : "=&S" (dummy1), "=&D" (dummy2), "=&a" (dummy3)
                          : "0" (src), "1" (dst)
                          : "memory" );
#else
    WCHAR *p = dst;
    while ((*p++ = *src++));
#endif
    return dst;
}

#define SZ 3000

int main(void)
{
    int i;
    PWCHAR s, d;

    s = malloc(SZ * sizeof(WCHAR));
    d = malloc(SZ * sizeof(WCHAR));
    memset(s, 'x', SZ);   /* memset counts bytes: this fills the first SZ bytes of s */
    s[SZ-1] = 0;
    for (i = 0; i < 1000000; i++)
        strcpyW(d, s);
    return 0;
}
---------------------------------8<-------------------------------------
The function strcpyW is a copy from Wine with the #ifdef modified.
I used the following commands:
gcc-3.3 -O2 as.c -o as -DASM ; time ./as;time ./as; time ./as
and
gcc-3.3 -O2 as.c -o as ; time ./as;time ./as; time ./as
The resulting times are (all user time):
test#    asm       C
-----------------------
1      15.970   15.899
2      15.966   15.943
3      15.959   15.941
       ------   ------
ave    15.964   15.928
Notes:
- tested on a PII 450 MHz;
- I tested with gcc 2.95 and 3.4.2 as well; the results are essentially the same;
- size of main() is 0x7a (assembly) vs 0x82 (C code) bytes;
- I experimented with longer strings to see if there were any memory cache hit/miss effects and found none.
Conclusions:
1. these routines are so fast that it is hard to imagine that these functions will be a bottleneck, justifying such optimization;
2. nothing shows here that inline assembly brings any advantage.
Rein.
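For anyone who wants to repeat the measurement without relying on the shell's time, here is a minimal sketch that times the plain C loop in-process with clock(). The names strcpyW_c and time_copies are made up for this illustration and are not part of Wine or of the test program above; the volatile read is only an attempt to keep an optimizing compiler from discarding the copies, and it is not guaranteed to do so.

```c
#include <stdlib.h>
#include <time.h>

typedef unsigned short WCHAR;

/* Plain C copy loop, same shape as the non-ASM branch of the test program. */
static WCHAR *strcpyW_c( WCHAR *dst, const WCHAR *src )
{
    WCHAR *p = dst;
    while ((*p++ = *src++));
    return dst;
}

/* Hypothetical helper: CPU seconds spent on `iterations` copies of a string
 * holding len-1 characters plus the terminator (len must be at least 1).
 * Returns -1.0 if allocation fails. */
static double time_copies( size_t len, long iterations )
{
    volatile WCHAR sink = 0;  /* read back each copy so the loop is not obviously dead code */
    WCHAR *s, *d;
    clock_t start;
    double elapsed;
    size_t i;
    long n;

    s = malloc( len * sizeof(WCHAR) );
    d = malloc( len * sizeof(WCHAR) );
    if (!s || !d) { free( s ); free( d ); return -1.0; }

    /* Fill every character explicitly, then terminate the string. */
    for (i = 0; i + 1 < len; i++) s[i] = 'x';
    s[len - 1] = 0;

    start = clock();
    for (n = 0; n < iterations; n++)
    {
        strcpyW_c( d, s );
        sink = d[0];
    }
    elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

    (void)sink;
    free( s );
    free( d );
    return elapsed;
}
```

Calling time_copies( 3000, 1000000 ) and printing the result corresponds roughly to one run of the test above; clock() reports processor time, so the figure is comparable to the user time from the shell only approximately.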
On Tue, Sep 21, 2004 at 02:57:39PM +0200, Rein Klazes wrote:
> Hi,
> Just did not feel like chasing bugs the other day. I decided to have some fun with something that I have been wondering about for a long time: the usefulness of inline i86 assembly in string functions.
Well, you could do unrolling and larger block moves and the like.
However, more speed would be gained from other algorithmic changes.
The redrawing speed is still very painful for some apps ;)
Ciao, Marcus
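As an illustration of the unrolling idea mentioned above, a rough sketch in plain C follows. The name strcpyW_unrolled and the unroll factor of four are assumptions made for this example; nothing here was measured, and the thread gives no evidence that it would beat the simple loop.

```c
typedef unsigned short WCHAR;

/* Hypothetical variant: copy a zero-terminated WCHAR string, handling up to
 * four characters per loop pass. Each character is checked as it is stored,
 * so nothing is read beyond the source terminator. */
static WCHAR *strcpyW_unrolled( WCHAR *dst, const WCHAR *src )
{
    WCHAR *p = dst;
    for (;;)
    {
        if (!(p[0] = src[0])) break;
        if (!(p[1] = src[1])) break;
        if (!(p[2] = src[2])) break;
        if (!(p[3] = src[3])) break;
        p += 4;
        src += 4;
    }
    return dst;
}
```

The unrolled body saves three pointer updates and three backward branches per four characters, at the price of a larger loop.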
On Tue, Sep 21, 2004 at 03:53:35PM +0200, Marcus Meissner wrote:
> On Tue, Sep 21, 2004 at 02:57:39PM +0200, Rein Klazes wrote:
> > Hi,
> > Just did not feel like chasing bugs the other day. I decided to have some fun with something that I have been wondering about for a long time: the usefulness of inline i86 assembly in string functions.
> Well, you could do unrolling and larger block moves and the like.
Mmm, displacing the rest of the program's working set from the I-cache.
> However, more speed would be gained from other algorithmic changes.
Like avoiding the use of 'rep movs' for small values of %ecx.
David
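The point about small counts can be sketched as a length-based dispatch: a simple loop for short copies and a block copy for longer ones. The function name copy_wchars and the threshold of 16 are invented for this illustration and have not been tuned or measured; whether memcpy ends up as 'rep movs', and whether the compiler rewrites the short loop itself, depends on the compiler and C library in use.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned short WCHAR;

#define SMALL_COPY_LIMIT 16  /* hypothetical threshold, not measured */

/* Hypothetical helper: copy `count` WCHARs from src to dst (non-overlapping).
 * Short copies use a plain loop; longer ones are handed to memcpy. */
static WCHAR *copy_wchars( WCHAR *dst, const WCHAR *src, size_t count )
{
    if (count < SMALL_COPY_LIMIT)
    {
        size_t i;
        for (i = 0; i < count; i++) dst[i] = src[i];
    }
    else
    {
        memcpy( dst, src, count * sizeof(WCHAR) );
    }
    return dst;
}
```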
On Tue, Sep 21, 2004 at 11:37:44PM +0100, David Laight wrote:
> On Tue, Sep 21, 2004 at 03:53:35PM +0200, Marcus Meissner wrote:
> > On Tue, Sep 21, 2004 at 02:57:39PM +0200, Rein Klazes wrote:
> > > Hi,
> > > Just did not feel like chasing bugs the other day. I decided to have some fun with something that I have been wondering about for a long time: the usefulness of inline i86 assembly in string functions.
> > Well, you could do unrolling and larger block moves and the like.
> Mmm, displacing the rest of the program's working set from the I-cache.
> > However, more speed would be gained from other algorithmic changes.
> Like avoiding the use of 'rep movs' for small values of %ecx.
Yes.
However, the compiler will slowly start to do that anyway (if it doesn't already), so we should not bother.
Ciao, Marcus
Rein Klazes <rklazes@xs4all.nl> writes:
> Conclusions:
> 1. these routines are so fast that it is hard to imagine that these functions will be a bottleneck, justifying such optimization;
> 2. nothing shows here that inline assembly brings any advantage.
You are right, that assembly code is more confusing than helpful. I've removed it.