Spinning is the big performance killer. That seems to be the case for mutexes too because entry uses spinlocking. I see no measurable performance gain from a 64-byte alignment,
I imagine at least part of that is due to the atomic operations on cache->next_index and cache->free_count in vkd3d_desc_object_cache_get() and vkd3d_desc_object_cache_push().
I did some measurements with Cyberpunk 2077 to see how many times we need to spin (i.e., execute the `for` loop) on average for each call to `vkd3d_desc_object_cache_get()`. The results look good: the ratio never reaches 2. It starts at 1, grows a bit towards 1.5-1.6, then decreases again, seemingly converging to 1. That means that after an initial transient we basically never spin more than once for each call to `vkd3d_desc_object_cache_get()`.
So it essentially gets rid of the contention; that's great to know.
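For anyone who wants to repeat the spin-count measurement on a different workload, it could be sketched roughly as below. This is a simplified stand-in and not the vkd3d code: every `demo_*` name, the two instrumentation counters, and the bounded loop that returns NULL instead of allocating are assumptions made for illustration only.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_HEAD_COUNT 16u                     /* assumed size; must be a power of two */
#define DEMO_HEAD_MASK (DEMO_HEAD_COUNT - 1u)

struct demo_node { struct demo_node *next; };

struct demo_head
{
    atomic_flag lock;           /* per-head spinlock */
    struct demo_node *head;     /* free list, only touched while lock is held */
};

struct demo_cache
{
    struct demo_head heads[DEMO_HEAD_COUNT];
    atomic_uint next_index;     /* round-robin starting point */
    atomic_ullong calls;        /* instrumentation: calls to demo_cache_get() */
    atomic_ullong iterations;   /* instrumentation: loop iterations in demo_cache_get() */
};

static void demo_cache_init(struct demo_cache *cache)
{
    unsigned int i;

    for (i = 0; i < DEMO_HEAD_COUNT; ++i)
    {
        atomic_flag_clear(&cache->heads[i].lock);
        cache->heads[i].head = NULL;
    }
    atomic_init(&cache->next_index, 0u);
    atomic_init(&cache->calls, 0);
    atomic_init(&cache->iterations, 0);
}

static void demo_cache_push(struct demo_cache *cache, struct demo_node *node)
{
    unsigned int i = atomic_fetch_add(&cache->next_index, 1u) & DEMO_HEAD_MASK;

    for (;;)
    {
        /* test_and_set returns the previous value: false means we took the lock. */
        if (!atomic_flag_test_and_set_explicit(&cache->heads[i].lock, memory_order_acquire))
        {
            node->next = cache->heads[i].head;
            cache->heads[i].head = node;
            atomic_flag_clear_explicit(&cache->heads[i].lock, memory_order_release);
            return;
        }
        i = (i + 1u) & DEMO_HEAD_MASK;          /* head is busy, try the next one */
    }
}

/* Returns a cached node, or NULL after one full pass over the heads. */
static struct demo_node *demo_cache_get(struct demo_cache *cache)
{
    unsigned int i = atomic_fetch_add(&cache->next_index, 1u) & DEMO_HEAD_MASK;
    struct demo_node *node;
    unsigned int attempt;

    atomic_fetch_add_explicit(&cache->calls, 1, memory_order_relaxed);
    for (attempt = 0; attempt < DEMO_HEAD_COUNT; ++attempt)
    {
        /* Count every iteration, whether or not the lock is acquired. */
        atomic_fetch_add_explicit(&cache->iterations, 1, memory_order_relaxed);
        if (!atomic_flag_test_and_set_explicit(&cache->heads[i].lock, memory_order_acquire))
        {
            if ((node = cache->heads[i].head))
                cache->heads[i].head = node->next;
            atomic_flag_clear_explicit(&cache->heads[i].lock, memory_order_release);
            if (node)
                return node;
        }
        i = (i + 1u) & DEMO_HEAD_MASK;
    }
    return NULL;
}

/* Average number of loop iterations per demo_cache_get() call. */
static double demo_cache_spin_ratio(struct demo_cache *cache)
{
    unsigned long long calls = atomic_load(&cache->calls);

    return calls ? (double)atomic_load(&cache->iterations) / (double)calls : 0.0;
}
```

Note that in this sketch an iteration is counted both when a head is locked by another thread and when it is simply empty, so a ratio near 1 means that calls almost always succeed on the first head they try. The two extra counters are themselves atomics on the hot path, so they would only make sense in a measurement build.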
I think the MR is already good enough to be accepted. Further optimizations, like tuning the cache size or adding thread-local caches, could be considered in the future if more performance has to be squeezed out (though I wouldn't be opposed to having them immediately if anybody wants to implement them right away).
I think it's an improvement too, so I'll approve this. I do think there's further room for improvement though, both in terms of performance and in terms of code quality, and I'd prefer seeing those sooner rather than later. (E.g., I don't like the magic "16"; I don't like that we're rolling our own spinlocks here; I don't like the number of atomic operations in what's supposed to be a hot path.)