Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
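(For reference, a minimal standalone sketch, not part of the patch, that reproduces the codegen comparison above; the file name and flags, e.g. "gcc -O1 -S ctz_cmp.c", are only an example:)

    /* ctz_cmp.c: compare the code GCC emits for the two variants.
       __builtin_ctz() is undefined for x == 0, which is exactly why
       the ffs() version has to pay for the extra cmove. */
    #include <strings.h>   /* ffs() */

    int ctz_builtin(unsigned int x)
    {
        return __builtin_ctz(x);    /* a single bsf on x86 */
    }

    int ctz_via_ffs(unsigned int x)
    {
        return ffs(x) - 1;          /* bsf + cmove for the x == 0 case, then -1 */
    }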
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
You are optimizing in the wrong spot.
If this is not in performance-relevant code, readability is always better than hacks.
ciao, Marcus
On 03/17/2011 02:54 AM, Marcus Meissner wrote:
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
On 03/16/2011 08:34 AM, Alexandre Julliard wrote:
Adam Martinson <amartinson@codeweavers.com> writes:
@@ -239,6 +243,19 @@ extern int getopt_long_only (int ___argc, char *const *___argv,
 int ffs( int x );
 #endif
+#if defined(__GNUC__) && (GCC_VERSION >= 30406)
+#define ctz(x) __builtin_ctz(x)
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
+static inline int ctz( unsigned int x )
+{
+    int ret;
+    __asm__("bsfl %1, %0" : "=r" (ret) : "r" (x));
+    return ret;
+}
+#else
+#define ctz(x) (ffs(x)-1)
+#endif
There's no reason to add this. Just use ffs().
If I thought ffs() was adequate, I would. I need this for iterating sparse bitsets.
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
    add    $0x1,%eax
    sub    $0x1,%eax
... Fortunately -O2 catches the add/sub. So yes, there is a reason: ctz() is at least 50% faster.
You are optimizing in the wrong spot.
If this is not in performance-relevant code, readability is always better than hacks.
ciao, Marcus
I agree 100% with the 2nd part; this is for some of the functions that top the CPU time charts in wined3d.
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
...
So yes, there is a reason: ctz() is at least 50% faster.
I'm not sure where you get 50% from!
I've read both the Intel and AMD x86 instruction performance manuals (but can't claim to remember all of it!). The 'bsf' will be a slow instruction (with constraints on where it executes, and what can execute in parallel). The 'cmove' has even worse constraints since it can't execute until the 'flags' from the previous instruction are known. cmove is only slightly better than a mis-predicted branch! In this case there will be a complete pipeline stall between the 'bsf' and the 'cmove'.
ffs would probably execute faster with a forwards conditional branch (predicted not taken) in the 'return -1' path.
David
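(A minimal sketch of the branch-based approach David describes; this is an illustration only, not an actual libc ffs implementation, and the name ffs_branchy is made up:)

    /* Handle the zero case with a forward branch that is normally
       predicted not taken, instead of a cmove that has to wait for
       the flags produced by bsf. */
    static inline int ffs_branchy(unsigned int x)
    {
        if (x == 0)
            return 0;                   /* the rarely-taken path */
        return __builtin_ctz(x) + 1;    /* a single bsf on x86 */
    }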
On 03/17/2011 03:23 AM, David Laight wrote:
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to:
    mov    0x8(%ebp),%eax
    bsf    %eax,%eax
(ffs()-1) compiles to:
    mov    $0xffffffff,%edx
    bsf    0x8(%ebp),%eax
    cmove  %edx,%eax
...
So yes, there is a reason: ctz() is at least 50% faster.
I'm not sure where you get 50% from!
I've read both the Intel and AMD x86 instruction performance manuals (but can't claim to remember all of it!). The 'bsf' will be a slow instruction (with constraints on where it executes, and what can execute in parallel). The 'cmove' has even worse constraints since it can't execute until the 'flags' from the previous instruction are known. cmove is only slightly better than a mis-predicted branch! In this case there will be a complete pipeline stall between the 'bsf' and the 'cmove'.
ffs would probably execute faster with a forwards conditional branch (predicted not taken) in the 'return -1' path.
David
For my purposes the 'return -1' path will never be taken, so it's not needed. I want to be able to do:
while (bitmap)
{
    i = ctz(bitmap);     /* get the LSB index */
    bitmap ^= 1 << i;    /* zero LSB */

    /* for each set bit i... */
}
If there's a faster way to do this, I'm all ears.
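(For context, a self-contained version of that loop; a sketch only, which substitutes __builtin_ctz for the patch's ctz helper and uses an arbitrary example bitmap:)

    #include <stdio.h>

    int main(void)
    {
        unsigned int bitmap = 0x00120041u;   /* example: bits 0, 6, 17 and 20 set */

        while (bitmap)
        {
            int i = __builtin_ctz(bitmap);   /* index of the lowest set bit */
            bitmap ^= 1u << i;               /* zero that bit */
            printf("bit %d is set\n", i);    /* "for each set bit i..." */
        }
        return 0;
    }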