I think it's an improvement too, so I'll approve this. I do think there's further room for improvement though, both in terms of performance and in terms of code quality, and I'd prefer seeing those sooner rather than later. (E.g., I don't like the magic "16"; I don't like that we're rolling our own spinlocks here; I don't like the number of atomic operations in what's supposed to be a hot path.)
I gave this some more thought, and I'm not sure I like the idea of using spinlocks (either implemented by us or by others) any more. They're not wait-free and they don't even coordinate with the operating system, meaning that if a thread is suspended while a spinlock is held, any other thread trying to acquire the same spinlock will spin busily for an entire scheduling quantum (or more). Our case is slightly different because of the striping, but there are still scenarios in which that can fail (depending on the number of active threads, CPUs and stripe buckets), so I don't like it. While I understand the engineering problems of the wait-free option with its CPU-specific code, I can't help thinking that, after some initial investment, it would remove some opportunities for stuttering that may be even harder to reproduce and debug later.
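To make the contrast concrete, here is a rough sketch of what a lock-free striped update could look like. It is not the code in this PR, and it is not the CPU-specific variant either: it assumes the striped state is something like a counter (the actual data may well be different), and every name in it (`StripedCounter`, `PaddedCounter`, `kStripes`, `stripe_index`) is made up for illustration. The update is a single `fetch_add`, which is wait-free only on hardware with a native atomic add; on other targets it may compile to a retry loop.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical named constant standing in for the magic "16".
constexpr std::size_t kStripes = 16;

// One counter per stripe, padded to an assumed 64-byte cache line
// so that neighbouring stripes do not share a line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class StripedCounter {
public:
    // Hot path: one relaxed atomic add and no lock, so a suspended
    // thread can never make another thread spin.
    void add(std::uint64_t n) noexcept {
        stripes_[stripe_index()].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Cold path: sum all stripes. The result is not a consistent
    // snapshot while writers are still active.
    std::uint64_t read() const noexcept {
        std::uint64_t total = 0;
        for (const auto& s : stripes_) {
            total += s.value.load(std::memory_order_relaxed);
        }
        return total;
    }

private:
    // Assumption: stripes are chosen by hashing the thread id once per
    // thread. Two threads can still land on the same stripe; that costs
    // cache-line contention but never blocks anyone.
    static std::size_t stripe_index() noexcept {
        static thread_local const std::size_t idx =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % kStripes;
        return idx;
    }

    std::array<PaddedCounter, kStripes> stripes_;
};
```

With this shape a stripe collision only degrades throughput; it cannot stall a thread for a scheduling quantum the way a held spinlock can. It only works if the protected state fits in a single atomic operation, which is exactly where the CPU-specific approach would have to take over.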