On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to: mov 0x8(%ebp),%eax bsf %eax,%eax
(ffs()-1) compiles to: mov $0xffffffff,%edx bsf 0x8(%ebp),%eax cmove %edx,%eax
...
So yes, there is a reason, ctz() is at least 50% faster.
I'm not where you get 50% from!
I've read both the intel and amd x86 instruction performance manuals (but can't clain to remember all of it!). The 'bsf' will be a slow instruction (with constraints on where it exectutes, and what can execute in parallel). The 'cmove' has even worse constraints since it can't execute until the 'flags' from the previous instruction are known. cmove is only slightly better than a mis-predicted branch! In this case there will be complete pipeline stall between the 'bsf' and the 'cmove'.
ffs would probably execute faster with a forwards conditional branch (predicted not taken) in the 'return -1' path.
David