I previously talked about parts of this with Giovanni on IRC, but the productive thing to do is probably to continue that conversation here.
I have a fairly strong dislike for the direction this is going in. In particular, in no specific order:
- Implementing lock-free data structures correctly is notoriously hard, and we're probably seeing an example of that here. Perhaps more importantly, while implementing these correctly may be hard, reviewing the code is generally even harder.
- Adding inline assembly and architecture-specific code doesn't help.
- Neither does inlining the linked list implementation in the device and object cache code.
- If the issue with using a regular mutex or even a spinlock is contention, perhaps we should try to address that, instead of attempting to make the synchronisation primitives faster. (Do these caches need to be global to the device? Could we e.g. make them local to the CPU core or thread accessing them?)
- In the case of CBVs in particular, given the number of them that applications appear to create and destroy per frame, as well as the fact that these are fairly small structures, allocating them individually using vkd3d_malloc() seems less than ideal. (I.e., I imagine we'd want to use slab allocation for these in order to improve locality and reduce allocation overhead.)
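
To make the slab allocation suggestion a bit more concrete, something along these lines is roughly what I have in mind. This is only a minimal sketch: every name in it (slab_allocator, slab_alloc(), SLAB_OBJECT_COUNT, and so on) is hypothetical and does not exist in vkd3d, the 64-byte payload is a stand-in for whatever the actual descriptor structure turns out to be, and it uses plain malloc() where the real code would presumably use vkd3d_malloc(). It also does no locking at all, on the assumption that each allocator instance is only touched by a single thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical; none of them exist in vkd3d. */
#define SLAB_OBJECT_COUNT 64

struct slab_object
{
    union
    {
        struct slab_object *next_free; /* Only valid while the object is free. */
        unsigned char payload[64];     /* Stand-in for a small descriptor structure. */
    } u;
};

/* A slab is one contiguous allocation holding many objects. */
struct slab
{
    struct slab *next;
    struct slab_object objects[SLAB_OBJECT_COUNT];
};

struct slab_allocator
{
    struct slab *slabs;            /* All slabs, kept so they can be released. */
    struct slab_object *free_list; /* Free objects across all slabs. */
};

static void slab_allocator_init(struct slab_allocator *allocator)
{
    allocator->slabs = NULL;
    allocator->free_list = NULL;
}

/* Pop an object off the free list, adding a new slab if the list is empty. */
static void *slab_alloc(struct slab_allocator *allocator)
{
    struct slab_object *object;
    struct slab *slab;
    size_t i;

    if (!allocator->free_list)
    {
        if (!(slab = malloc(sizeof(*slab))))
            return NULL;
        slab->next = allocator->slabs;
        allocator->slabs = slab;
        for (i = 0; i < SLAB_OBJECT_COUNT; ++i)
        {
            slab->objects[i].u.next_free = allocator->free_list;
            allocator->free_list = &slab->objects[i];
        }
    }

    object = allocator->free_list;
    allocator->free_list = object->u.next_free;
    return object;
}

/* Push an object back onto the free list; no memory is returned to the heap. */
static void slab_free(struct slab_allocator *allocator, void *ptr)
{
    struct slab_object *object = ptr;

    object->u.next_free = allocator->free_list;
    allocator->free_list = object;
}

/* Release all slabs. Any objects still in use become invalid. */
static void slab_allocator_cleanup(struct slab_allocator *allocator)
{
    struct slab *slab, *next;

    for (slab = allocator->slabs; slab; slab = next)
    {
        next = slab->next;
        free(slab);
    }
    allocator->slabs = NULL;
    allocator->free_list = NULL;
}
```

With something like this, creating and destroying a descriptor is a pointer swap in the common case, and recently freed objects get reused first. If each thread had its own allocator instance, this path would need no synchronisation either, although objects freed on a different thread than the one that allocated them would need extra handling that this sketch doesn't attempt.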