Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
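(For reference, a minimal standalone sketch, not part of the patch, that reproduces the codegen comparison above; the file name and flags, e.g. "gcc -O1 -S ctz_cmp.c", are only an example:)

    /* ctz_cmp.c: compare the code GCC emits for the two variants.
       __builtin_ctz() is undefined for x == 0, which is exactly why
       the ffs() version has to pay for the extra cmove. */
    #include <strings.h>   /* ffs() */

    int ctz_builtin(unsigned int x)
    {
        return __builtin_ctz(x);    /* a single bsf on x86 */
    }

    int ctz_via_ffs(unsigned int x)
    {
        return ffs(x) - 1;          /* bsf + cmove for the x == 0 case, then -1 */
    }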
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
You are optimizing in the wrong spot.
If this is not in performance-relevant code, readability is always better than hacks.
ciao, Marcus
On 03/17/2011 02:54 AM, Marcus Meissner wrote:
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
You are optimizing in the wrong spot.
If this is not in performance-relevant code, readability is always better than hacks.
ciao, Marcus
I agree 100% with the 2nd part; this is for some of the functions that top the CPU time charts in wined3d.
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
...
So yes, there is a reason: ctz() is at least 50% faster.
I'm not sure where you get 50% from!
I've read both the Intel and AMD x86 instruction performance manuals (but can't claim to remember all of it!). The 'bsf' will be a slow instruction (with constraints on where it executes, and what can execute in parallel). The 'cmove' has even worse constraints since it can't execute until the 'flags' from the previous instruction are known. cmove is only slightly better than a mis-predicted branch! In this case there will be a complete pipeline stall between the 'bsf' and the 'cmove'.
ffs would probably execute faster with a forwards conditional branch (predicted not taken) in the 'return -1' path.
David
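(A minimal sketch of the branch-based approach David describes; this is an illustration only, not an actual libc ffs implementation, and the name ffs_branchy is made up:)

    /* Handle the zero case with a forward branch that is normally
       predicted not taken, instead of a cmove that has to wait for
       the flags produced by bsf. */
    static inline int ffs_branchy(unsigned int x)
    {
        if (x == 0)
            return 0;                   /* the rarely-taken path */
        return __builtin_ctz(x) + 1;    /* a single bsf on x86 */
    }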
On 03/17/2011 03:23 AM, David Laight wrote:
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
...
So yes, there is a reason: ctz() is at least 50% faster.
I'm not sure where you get 50% from!
I've read both the Intel and AMD x86 instruction performance manuals (but can't claim to remember all of it!). The 'bsf' will be a slow instruction (with constraints on where it executes, and what can execute in parallel). The 'cmove' has even worse constraints since it can't execute until the 'flags' from the previous instruction are known. cmove is only slightly better than a mis-predicted branch! In this case there will be a complete pipeline stall between the 'bsf' and the 'cmove'.
ffs would probably execute faster with a forwards conditional branch (predicted not taken) in the 'return -1' path.
David
For my purposes the 'return -1' path will never be taken, so it's not needed. I want to be able to do:
while (bitmap)
{
    i = ctz(bitmap);     /* get the LSB index */
    bitmap ^= 1 << i;    /* zero LSB */

    /* for each set bit i... */
}
If there's a faster way to do this, I'm all ears.
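(For context, a self-contained version of that loop; a sketch only, which substitutes __builtin_ctz for the patch's ctz helper and uses an arbitrary example bitmap:)

    #include <stdio.h>

    int main(void)
    {
        unsigned int bitmap = 0x00120041u;   /* example: bits 0, 6, 17 and 20 set */

        while (bitmap)
        {
            int i = __builtin_ctz(bitmap);   /* index of the lowest set bit */
            bitmap ^= 1u << i;               /* zero that bit */
            printf("bit %d is set\n", i);    /* "for each set bit i..." */
        }
        return 0;
    }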