Faster TlsAlloc() or zero_bit_scan
Hi,

I have a suggestion for a faster implementation of the zero_bit_scan in RtlFindClearBits [NTDLL.@] (rlbitmap.c) for e.g. TlsAlloc(). The main point is the use of the instruction 'bsf eax, eax'.

This I have implemented in the new experimental odinxp-tree for finding the first zero_bit in the first 'bytecount' bytes of the bitmap addr.

Dietrich

    public _search_zero_bit
    ;int CDECL search_zero_bit(int bytecount, void *addr);
    _search_zero_bit proc near
        push esi
        push edx
        mov esi, [esp+16]
        mov edx, [esp+12]
        inc edx
    loop:
        add esi, 4
        mov eax, -1
        dec edx ;not found
        jz short found
        xor eax, dword ptr [esi - 4]
        jz loop
        bsf eax, eax
        jz short loop
        sub esi, 4
        sub esi, [esp+16]
        shl esi, 5-2
        add eax, esi
    found:
        pop edx
        pop esi
        ret
    _search_zero_bit endp
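For readers who do not read assembly, the same scan can be sketched in C. This is only an illustration of the technique in the listing (invert each 32-bit word, then find its lowest set bit), not the code in Wine's RtlFindClearBits. The names find_first_zero_bit, lowest_set_bit and the word_count parameter are assumptions made for this sketch; __builtin_ctz is a GCC/Clang builtin that those compilers typically turn into a bit-scan instruction on x86.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest set bit in v. The caller must pass v != 0. */
static int lowest_set_bit(uint32_t v)
{
#if defined(__GNUC__) || defined(__clang__)
    /* GCC/Clang builtin: count trailing zero bits. */
    return __builtin_ctz(v);
#else
    /* Portable fallback: shift right until bit 0 is set. */
    int n = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
#endif
}

/* Hypothetical helper: returns the index of the first clear bit in the
 * first word_count 32-bit words of bitmap, or -1 if every bit is set. */
int find_first_zero_bit(const uint32_t *bitmap, size_t word_count)
{
    for (size_t i = 0; i < word_count; i++) {
        /* Inverting the word turns clear bits into set bits,
         * like the 'xor eax, [mem]' with eax = -1 in the listing. */
        uint32_t inverted = ~bitmap[i];
        if (inverted != 0)
            return (int)(i * 32) + lowest_set_bit(inverted);
    }
    return -1; /* not found */
}
```

For a two-word bitmap { 0xFFFFFFFF, 0xFFFFFFFB } the first word is full and bit 2 of the second word is clear, so the function returns 34.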
On Thu, 10 Feb 2005 18:59:21 +0100, Dietrich Teickner wrote:
> I have a suggestion for a faster implementation of the zero_bit_scan in RtlFindClearBits [NTDLL.@] (rlbitmap.c) for e.g. TlsAlloc(). The main point is the use of the instruction 'bsf eax, eax'.
> This I have implemented in the new experimental odinxp-tree for finding the first zero_bit in the first 'bytecount' bytes of the bitmap addr.
Does this actually make a noticeable difference? Rewriting stuff in assembly for theoretical performance improvements isn't so great, as far fewer people can read/write assembly than C.

thanks -mike
On Thu, Feb 10, 2005 at 08:12:39PM +0000, Mike Hearn wrote:
> On Thu, 10 Feb 2005 18:59:21 +0100, Dietrich Teickner wrote:
>> I have a suggestion for a faster implementation of the zero_bit_scan in RtlFindClearBits [NTDLL.@] (rlbitmap.c) for e.g. TlsAlloc(). The main point is the use of the instruction 'bsf eax, eax'.
>> This I have implemented in the new experimental odinxp-tree for finding the first zero_bit in the first 'bytecount' bytes of the bitmap addr.
> Does this actually make a noticeable difference? Rewriting stuff in assembly for theoretical performance improvements isn't so great, as far fewer people can read/write assembly than C.
I'd also add that you need to check that using 'bsf' is EVER a gain! An i386 might execute it faster than the corresponding C, but there is no guarantee that a P4i/Athlon will. Oh, and you need to do any tests with the code out of the cache.

David

--
David Laight: david(a)l8s.co.uk
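One way to check whether a bit-scan variant is a gain on a given machine is to time it against plain C. The sketch below is only a starting point under stated assumptions: scan_plain, scan_ctz and time_scan are hypothetical names, __builtin_ctz is a GCC/Clang builtin, and the timing uses the standard clock() function. It repeats the scan on the same bitmap, so it measures the warm-cache case only and does not cover the out-of-cache case raised above.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef int (*scan_fn)(const uint32_t *bitmap, size_t word_count);

/* Plain C: skip full words, then test each bit of the first non-full word. */
int scan_plain(const uint32_t *bitmap, size_t word_count)
{
    for (size_t i = 0; i < word_count; i++) {
        if (bitmap[i] == 0xFFFFFFFFu)
            continue;
        for (int bit = 0; bit < 32; bit++) {
            if ((bitmap[i] & ((uint32_t)1 << bit)) == 0)
                return (int)(i * 32) + bit;
        }
    }
    return -1;
}

/* Bit-scan variant: relies on the GCC/Clang builtin __builtin_ctz
 * (an assumption of this sketch; other compilers need a replacement). */
int scan_ctz(const uint32_t *bitmap, size_t word_count)
{
    for (size_t i = 0; i < word_count; i++) {
        uint32_t inverted = ~bitmap[i];
        if (inverted != 0)
            return (int)(i * 32) + __builtin_ctz(inverted);
    }
    return -1;
}

/* CPU seconds spent on reps calls of fn over the same (warm) bitmap. */
double time_scan(scan_fn fn, const uint32_t *bitmap, size_t word_count, long reps)
{
    volatile int sink = 0; /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        sink = fn(bitmap, word_count);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_scan(scan_plain, ...) and time_scan(scan_ctz, ...) with the same bitmap and repetition count gives two numbers to compare. The result depends on the CPU, the compiler and its optimization flags, so it should be measured on each target rather than assumed.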
participants (3)
- David Laight
- Dietrich Teickner
- Mike Hearn