Hi,
I have a suggestion for a faster implementation of the zero-bit scan in RtlFindClearBits [NTDLL.@] (rtlbitmap.c), which is used by e.g. TlsAlloc(). The main idea is to use the instruction 'bsf eax, eax'.
I have implemented this in the new experimental odinxp tree to find the first zero bit in the first 'bytecount' bytes of the bitmap at addr.
Dietrich
        public  _search_zero_bit
; int CDECL search_zero_bit(int bytecount, void *addr);
_search_zero_bit proc near
        push    esi
        push    edx
        mov     esi, [esp+16]           ; addr
        mov     edx, [esp+12]           ; bytecount
        inc     edx
loop:
        add     esi, 4
        mov     eax, -1
        dec     edx
        ; not found
        jz      short found
        xor     eax, dword ptr [esi - 4]
        jz      loop
        bsf     eax, eax
        jz      short loop
        sub     esi, 4
        sub     esi, [esp+16]
        shl     esi, 5-2
        add     eax, esi
found:
        pop     edx
        pop     esi
        ret
_search_zero_bit endp
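For comparison, the same scan can be sketched in C. This is only an illustration, not code from Wine or Odin: the name search_zero_bit_c is hypothetical, and the count is taken in 32-bit words, on the assumption that this is the intent, because the assembly loop decrements its counter once per 4-byte step. __builtin_ctz is a GCC/Clang extension that typically compiles to bsf or tzcnt on x86; other compilers fall back to a plain shift loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first clear bit in the
 * first `dword_count` 32-bit words at `bits`, or -1 if all bits are set. */
static int search_zero_bit_c(size_t dword_count, const uint32_t *bits)
{
    for (size_t i = 0; i < dword_count; i++) {
        uint32_t clear = ~bits[i];      /* 1 wherever the bitmap holds a 0 */
        int bit = 0;

        if (clear == 0)
            continue;                   /* word is full, keep scanning */
#if defined(__GNUC__)
        bit = __builtin_ctz(clear);     /* index of the lowest set bit */
#else
        while (!(clear & 1u)) {         /* portable fallback */
            clear >>= 1;
            bit++;
        }
#endif
        return (int)(i * 32) + bit;
    }
    return -1;                          /* not found */
}
```

Like the assembly version, the sketch inverts each word so that the search for a zero bit becomes a search for the lowest set bit, and it returns -1 when the bitmap is full.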
On Thu, 10 Feb 2005 18:59:21 +0100, Dietrich Teickner wrote:
> I have a suggestion for a faster implementation of the zero-bit scan in RtlFindClearBits [NTDLL.@] (rtlbitmap.c), which is used by e.g. TlsAlloc(). The main idea is to use the instruction 'bsf eax, eax'.
> I have implemented this in the new experimental odinxp tree to find the first zero bit in the first 'bytecount' bytes of the bitmap at addr.
Does this actually make a noticeable difference? Rewriting stuff in assembly for theoretical performance improvements isn't so great, as far fewer people can read/write assembly than C.
thanks -mike
On Thu, Feb 10, 2005 at 08:12:39PM +0000, Mike Hearn wrote:
> On Thu, 10 Feb 2005 18:59:21 +0100, Dietrich Teickner wrote:
>> I have a suggestion for a faster implementation of the zero-bit scan in RtlFindClearBits [NTDLL.@] (rtlbitmap.c), which is used by e.g. TlsAlloc(). The main idea is to use the instruction 'bsf eax, eax'.
>> I have implemented this in the new experimental odinxp tree to find the first zero bit in the first 'bytecount' bytes of the bitmap at addr.
> Does this actually make a noticeable difference? Rewriting stuff in assembly for theoretical performance improvements isn't so great, as far fewer people can read/write assembly than C.
I'd also add that you need to check that using 'bsf' is EVER a gain! An i386 might execute it faster than the corresponding C, but there is no guarantee that a P4i/Athlon will. Oh, and you need to do any tests with the code out of the cache.
David
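One way to check whether a bsf-style scan is a gain on a given CPU is to time it against a plain C loop. The following is a rough sketch, not a rigorous benchmark: all names (scan_portable, scan_builtin, time_scan) are hypothetical, __builtin_ctz is a GCC/Clang extension, and the repeated loop runs with warm caches, so it does not cover the out-of-cache case raised above.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef int (*scan_fn)(size_t, const uint32_t *);

/* Variant A: plain C, tests one bit at a time. */
static int scan_portable(size_t n, const uint32_t *bits)
{
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 32; b++)
            if (!((bits[i] >> b) & 1u))
                return (int)(i * 32) + b;
    return -1;
}

/* Variant B: skips full words, then uses count-trailing-zeros
 * where the compiler provides it (falls back to variant A otherwise). */
static int scan_builtin(size_t n, const uint32_t *bits)
{
#if defined(__GNUC__)
    for (size_t i = 0; i < n; i++) {
        uint32_t clear = ~bits[i];
        if (clear)
            return (int)(i * 32) + __builtin_ctz(clear);
    }
    return -1;
#else
    return scan_portable(n, bits);
#endif
}

/* Calls `fn` `reps` times and returns the elapsed CPU time in clock()
 * ticks. Results are added to *sink so the calls are not optimized away. */
static clock_t time_scan(scan_fn fn, size_t n, const uint32_t *bits,
                         int reps, volatile int *sink)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        *sink += fn(n, bits);
    return clock() - start;
}
```

Calling time_scan once with scan_portable and once with scan_builtin on the same bitmap gives two tick counts to compare. The numbers depend heavily on the compiler, the optimization level and the CPU, so they are only meaningful for the machine they were measured on.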