I think it's an improvement too, so I'll approve this. I do think there's further room for improvement though, both in terms of performance and in terms of code quality, and I'd prefer to see those addressed sooner rather than later. (E.g., I don't like the magic "16"; I don't like that we're rolling our own spinlocks here; I don't like the number of atomic operations in what's supposed to be a hot path.)
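For reference, here's a minimal sketch of the kind of pattern I mean (names and layout are made up for illustration, not the actual code in this PR): a hand-rolled spinlock, a magic stripe count, and at least two atomic operations per push on the hot path.

```cpp
// Illustrative only; not the code under review.
#include <atomic>
#include <cstddef>

// The "magic 16": at minimum this wants a named constant plus a comment
// explaining where the number comes from (core count? cache-line budget?).
constexpr std::size_t kStripeCount = 16;

// Hand-rolled spinlock: no backoff, no coordination with the OS.
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct Node { Node* next; };

struct StripedFreeList {
    SpinLock locks[kStripeCount];
    Node*    heads[kStripeCount] = {};

    void push(Node* n, std::size_t hint) {
        std::size_t i = hint % kStripeCount;
        locks[i].lock();              // atomic RMW on the hot path
        n->next = heads[i];
        heads[i] = n;
        locks[i].unlock();            // atomic store on the hot path
    }
};
```

At the very least I'd like the 16 to become a named, documented constant.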
I gave this some more thought, and I'm not sure I like the idea of using spinlocks (either implemented by us or by others) anymore. They're not wait-free and they don't even coordinate with the operating system, meaning that if a thread is suspended while holding a spinlock, any other thread trying to acquire the same spinlock will spin busily for an entire scheduling quantum (or more). In our case that's slightly different because of the striping, but there are still scenarios in which that mitigation fails (depending on the number of active threads, CPUs and stripe buckets), so I don't like it. While I understand the engineering problems of the wait-free option with the CPU-specific code, I can't help thinking that, after some initial investment, it would remove some opportunities for stuttering that may be even harder to reproduce and debug later.
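To make the concern concrete, here's a rough sketch (illustrative only, not a proposal for this exact code) contrasting a pure spinlock with a variant that at least hands the CPU back to the scheduler after a bounded spin:

```cpp
#include <atomic>
#include <thread>

// Pure spinlock: if the holder is preempted, every waiter burns a full
// scheduling quantum (or more), because the kernel has no idea they are
// waiting on anything.
struct PureSpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { f.clear(std::memory_order_release); }
};

// Spin-then-yield: still not wait-free, but after a bounded spin it at
// least yields, so a preempted holder can be rescheduled.
struct SpinThenYieldLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock() {
        for (int i = 0; f.test_and_set(std::memory_order_acquire); ++i) {
            if (i >= 64) std::this_thread::yield();   // coordinate with the OS
        }
    }
    void unlock() { f.clear(std::memory_order_release); }
};

// Or simply std::mutex, which on mainstream platforms spins briefly and
// then parks the waiter in the kernel.
```

Neither variant is wait-free, of course; the yield only bounds how badly a preempted holder hurts the waiters, which is exactly the stuttering scenario I'm worried about.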
I don't particularly like the spinlocks either, but the striping at least somewhat mitigates the issues here. As mentioned before, I think the more problematic parts are "next_index" and "free_count", which I'd expect to bounce between cores in exactly the cases we care about. The lock-free list wouldn't really avoid that issue either; it would have similar contention on the list head. The main benefit of thread-local schemes would be that they keep data local to the thread as much as possible. Or that's the theory anyway; we'd also want some careful benchmarking before committing to that...
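Roughly what I have in mind for the thread-local scheme, as a sketch only (the shared-pool functions and the batch size are hypothetical placeholders, not anything in the current code): a per-thread cache that batches refills and flushes, so the shared counters are only touched on the slow path.

```cpp
#include <cstddef>

struct Node { Node* next; };

constexpr std::size_t kBatch = 32;   // made-up batch size

// Stand-ins for the shared pool; a real version would touch the shared
// structure here (this is where next_index / free_count would live).
Node* global_acquire_batch(std::size_t n) {
    Node* head = nullptr;
    for (std::size_t i = 0; i < n; ++i) head = new Node{head};
    return head;
}
void global_release_batch(Node* head) {
    while (head) { Node* next = head->next; delete head; head = next; }
}

struct LocalCache {
    Node*       head  = nullptr;
    std::size_t count = 0;

    Node* pop() {
        if (head == nullptr) {                 // slow path: batch refill
            head  = global_acquire_batch(kBatch);
            count = kBatch;
        }
        Node* n = head;                        // fast path: thread-local only
        head = n->next;
        --count;
        return n;
    }

    void push(Node* n) {
        n->next = head;                        // fast path: thread-local only
        head    = n;
        if (++count >= 2 * kBatch) {           // slow path: flush half back
            Node* give = head;
            Node* keep = head;
            for (std::size_t i = 1; i < kBatch; ++i) keep = keep->next;
            head = keep->next;
            keep->next = nullptr;
            count -= kBatch;
            global_release_batch(give);
        }
    }
};

// One cache per thread: the common case touches no shared cache lines
// and does no atomic operations at all.
thread_local LocalCache tls_cache;
```

Whether that actually wins in practice, and for which workloads, is precisely what the benchmarking would need to show.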