I think this version should be much better, less clumsy and more performant.
I dropped the idea of thread-local subheaps entirely, as more testing shows that native LFH just creates LFH blocks in the middle of the other blocks. It creates them in batches, so I'm assuming it carves them out of a bigger non-LFH block, and I'm doing the same here, currently allocating 31 blocks at a time.
I don't really know where native keeps the block group metadata: the LFH blocks seem to reference each other, but there's still apparently some out-of-band metadata somewhere. For this implementation I've introduced block `group`s, which are simply standard blocks that are instead kept internally in per-block-size-category linked lists.
Each `group` has a small header to keep out-of-band data, such as its group list entry and a free block bitmap, and its data is then split into smaller LFH blocks, each with the same usual heap block header but with a specific LFH flag set.
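As a rough illustration, the group header could look something like the following. This is only a minimal sketch: the names, the 31-block count, and the single 32-bit bitmap are assumptions, not the actual layout, and the real header would sit in front of the carved blocks inside an ordinary heap block.

```c
#include <stdint.h>
#include <stddef.h>

#define GROUP_BLOCK_COUNT 31  /* blocks carved out per group (assumed from the text) */

/* Hypothetical group header: out-of-band data kept ahead of the carved blocks. */
struct group
{
    struct group *next;   /* entry in the category's group list */
    uint32_t free_bits;   /* bit i set => LFH block i is free */
    size_t block_size;    /* size of each carved LFH block */
    /* the carved LFH blocks, each with its own header, follow */
};

/* Claim the first free block, returning its index, or -1 if the group is full. */
static int group_alloc_block(struct group *g)
{
    if (!g->free_bits) return -1;
    int i = __builtin_ctz(g->free_bits);  /* lowest set bit = first free block */
    g->free_bits &= g->free_bits - 1;     /* clear it: block i is now in use */
    return i;
}

/* Return block i to the group's free bitmap. */
static void group_free_block(struct group *g, int i)
{
    g->free_bits |= 1u << i;
}
```

The bitmap makes both "find a free block" and "is the group fully free / fully used" cheap single-word tests.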
Instead of the large and complex thread-local data and per-thread subheaps, each thread that tries to allocate a block is simply given an affinity, modulo some fixed limit. Each block size category uses an interlocked list to keep shared groups with free blocks, as well as affinity-indexed group pointers as a fast path over the interlocked list.
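The affinity scheme could be sketched as below. Everything here is hypothetical naming (`AFFINITY_LIMIT`, `category_take_group`, etc.), and C11 atomics stand in for the Win32 `Interlocked*` operations the real code would use; the point is just that a thread exchanges the group pointer out of its affinity slot to own it, and puts it back when done.

```c
#include <stdatomic.h>
#include <stddef.h>

#define AFFINITY_LIMIT 4  /* hypothetical fixed limit on distinct affinities */

static _Atomic unsigned next_affinity;
static _Thread_local unsigned affinity_plus1;  /* 0 means not yet assigned */

/* Lazily assign the calling thread an affinity, modulo the fixed limit. */
static unsigned get_thread_affinity(void)
{
    if (!affinity_plus1)
        affinity_plus1 = atomic_fetch_add(&next_affinity, 1) % AFFINITY_LIMIT + 1;
    return affinity_plus1 - 1;
}

/* Per-category fast path: one cached group pointer per affinity.  The slot
 * is NULL while a thread owns the group (or nothing is cached), in which
 * case the shared interlocked list (not shown) is the fallback. */
struct category
{
    void *_Atomic affinity_group[AFFINITY_LIMIT];
};

/* Atomically take exclusive ownership of the affinity slot's group, if any. */
static void *category_take_group(struct category *cat)
{
    return atomic_exchange(&cat->affinity_group[get_thread_affinity()], NULL);
}

/* Hand the group back so other threads with the same affinity can reuse it. */
static void category_release_group(struct category *cat, void *group)
{
    atomic_store(&cat->affinity_group[get_thread_affinity()], group);
}
```

Because the slot is taken with an atomic exchange, at most one thread can hold a given cached group at a time, so no lock is needed on this path.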
With both the affinity slots and the interlocked shared list there shouldn't be much contention, and the heap lock only needs to be taken when new groups have to be allocated or are fully released. This avoids entering the CS as much as possible, using atomic interlocked operations and careful thread ownership of the block groups.
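For reference, the shared list can stay lock-free along the lines of a Treiber stack, which is roughly what a Windows interlocked SLIST is. This is a simplified stand-in, not the actual code: a real `SLIST_HEADER` also packs a depth/sequence counter next to the head pointer to defend against ABA, which this sketch omits.

```c
#include <stdatomic.h>
#include <stddef.h>

/* List entry embedded in each shared group (hypothetical name). */
struct gentry { struct gentry *next; };

/* Push a group with free blocks onto the category's shared list. */
static void shared_push(struct gentry *_Atomic *head, struct gentry *e)
{
    e->next = atomic_load(head);
    /* retry until we swing head from e->next to e */
    while (!atomic_compare_exchange_weak(head, &e->next, e)) {}
}

/* Pop a shared group, or return NULL if none is available; only then
 * would the caller fall back to taking the heap lock to carve a new group. */
static struct gentry *shared_pop(struct gentry *_Atomic *head)
{
    struct gentry *e = atomic_load(head);
    while (e && !atomic_compare_exchange_weak(head, &e, e->next)) {}
    return e;
}
```

Only the empty-list case (allocate a new group) and the fully-freed case (return the group's carrier block) fall through to the heap lock.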