https://bugs.winehq.org/show_bug.cgi?id=50292
--- Comment #17 from Zebediah Figura z.figura12@gmail.com --- (In reply to Jacek Caban from comment #16)
I meant it as a replacement, not an addition. I may have been an overly optimistic about just 2 dim array, 3 dim could be less wasteful (although there may be a better structure). The point is to make a single tid slot cheap enough (4 bytes seems more than enough, but it could be even larger if needed) so that we may over-allocate and never free. Over-allocation makes allocation of a single slot cheap (because it will usually be already allocated) and never freeing avoids any need for locking.
It may seem like a wasteful strategy, but note that server gives us a good locality of tid values and reuses previously freed ids. In practice, I'd expect that a single block of a few thousands entries should be enough for majority of applications. If we optimize for that (while being able to handle even a theoretical case of 2**30 created threads with enough allocations), I'd hope that we can get decent results. I can't be sure without trying, maybe I'm missing something.
Okay, that's good, that sounds like the ideal strategy to me too. I just wasn't sure if there was a reasonable tradeoff between maximum number of threads possible + amount of memory unnecessarily reserved + lookup time. (I am less than completely enthused about the idea of a 3-dimensional list—I think we should probably be optimizing for lookup time above all else.)
I think we do need 8 bytes, at least for 64-bit, or perhaps just for 64-bit Mac. A futex is 4, but on Mac we use semaphore_t, which is a pointer (though I haven't looked to see if there's a better solution), and if neither is available we use an event (we could possibly use a keyed event, though we'd have to introduce some locking—but we're already on the slow path).
Thinking out loud here: a two-dimensional array of 65536 bytes * 4096 entries would give us 2**26 threads on 32-bit, 2**25 on 64-bit, while using 16+(4 or 8) pages in the common case. 32-bit has a much lower limit imposed by the stack size, so that's not a concern, but is that enough threads for 64-bit? Raising that to 2**30 for 64-bit would be a reservation of 65536 entries = 128 pages, which is kind of a lot, though in practice only the first is ever faulted in and we have VA space to spare.
Does that sound like a reasonable compromise? I think we could even shrink the 32-bit reservations so to use less address space.