On Wed Aug 16 02:00:17 2023 +0000, Henri Verbeet wrote:
> Making caches local to the thread has the potential for some thorny edge cases where one thread creates more than it frees, and another does the opposite (on copied descriptors), so its cache grows until the system is out of memory.

Right, thread-local caches would need occasional rebalancing against a global cache. The nice thing about them is that they're essentially wait-free though, and even with such worst-case behaviour you'd have less contention than with just the global cache. In principle we could actually get per-CPU caches on Linux using RSEQs (restartable sequences), but unfortunately I'm not aware of any equivalent Win32 mechanism.
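For illustration, a minimal sketch of that kind of per-thread cache with spill-over to a global one; the names, the pthread locking, and the spill threshold are invented for the example rather than taken from vkd3d:

```c
#include <pthread.h>
#include <stdlib.h>

#define LOCAL_CACHE_MAX 64 /* spill to the global cache beyond this; arbitrary */

struct cache_entry
{
    struct cache_entry *next;
};

static struct
{
    pthread_mutex_t lock;
    struct cache_entry *head;
} global_cache = {PTHREAD_MUTEX_INITIALIZER, NULL};

/* The per-thread cache is touched only by its owning thread, so pushes
 * and pops on it are plain loads/stores, i.e. wait-free. */
static _Thread_local struct cache_entry *local_head;
static _Thread_local size_t local_count;

void cache_free(struct cache_entry *entry)
{
    if (local_count < LOCAL_CACHE_MAX)
    {
        entry->next = local_head;
        local_head = entry;
        ++local_count;
        return;
    }

    /* Rebalancing: a thread that frees far more than it allocates hands
     * the excess to the global cache instead of growing without bound. */
    pthread_mutex_lock(&global_cache.lock);
    entry->next = global_cache.head;
    global_cache.head = entry;
    pthread_mutex_unlock(&global_cache.lock);
}

struct cache_entry *cache_alloc(void)
{
    struct cache_entry *entry;

    if ((entry = local_head))
    {
        local_head = entry->next;
        --local_count;
        return entry;
    }

    /* Local cache empty: try the global cache before falling back to
     * the system allocator. */
    pthread_mutex_lock(&global_cache.lock);
    if ((entry = global_cache.head))
        global_cache.head = entry->next;
    pthread_mutex_unlock(&global_cache.lock);

    return entry ? entry : calloc(1, sizeof(*entry));
}
```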
> The new implementation is much simpler than the 128-bit CAS version and has about the same performance. It's somewhat similar to the old mutex array scheme.

Yeah, conceptually I like this much better. (And FWIW, that's a fairly standard scheme, usually referred to as mutex/lock striping.) This does still have quite a number of atomic operations in the hot path, though.
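For reference, a generic lock-striping sketch (not the code from this series; the stripe count, the pointer hash and the hard-coded 64-byte padding are arbitrary choices for the example):

```c
#include <pthread.h>
#include <stdint.h>

#define STRIPE_COUNT 64 /* must be a power of two for the mask below */

/* Each stripe is padded out to an assumed 64-byte cache line so that
 * contention on one stripe doesn't false-share with its neighbours. */
struct stripe
{
    _Alignas(64) pthread_mutex_t lock;
};

static struct stripe stripes[STRIPE_COUNT];

void stripes_init(void)
{
    unsigned int i;

    for (i = 0; i < STRIPE_COUNT; ++i)
        pthread_mutex_init(&stripes[i].lock, NULL);
}

static pthread_mutex_t *stripe_for(const void *object)
{
    uintptr_t h = (uintptr_t)object;

    /* Drop the low bits, which are mostly zero due to alignment, and mix
     * a little so neighbouring objects spread across stripes. */
    h ^= h >> 16;
    return &stripes[(h >> 4) & (STRIPE_COUNT - 1)].lock;
}

void object_lock(const void *object)
{
    pthread_mutex_lock(stripe_for(object));
}

void object_unlock(const void *object)
{
    pthread_mutex_unlock(stripe_for(object));
}
```

Any given object always hashes to the same mutex, so the fixed array bounds the amount of lock state while still letting unrelated objects proceed in parallel.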
> Unfortunately C doesn't (as far as I know) offer a portable way to query the cache line size at compilation time ([as C++17 does](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_s...)). [Experimenting a little bit with the compiler explorer](https://godbolt.org/z/fEe4rK74s), it seems that most architectures are either 32 or 64 bytes, with PowerPC being 128 bytes and ARM64 possibly even 256 bytes. Given that we mostly care about Intel and ARM, I guess we can just settle for 64, but 256 for ARM64.

Recent gcc versions have `__GCC_DESTRUCTIVE_SIZE`. That's not portable of course, but we could easily do something along the lines of:
```c
#ifdef __GCC_DESTRUCTIVE_SIZE
# define VKD3D_DESTRUCTIVE_SIZE __GCC_DESTRUCTIVE_SIZE
#elif ...
# define VKD3D_DESTRUCTIVE_SIZE ...
#else
# define VKD3D_DESTRUCTIVE_SIZE 64
#endif
```
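and then, presumably, use the constant wherever false sharing matters, e.g. to keep each striped lock on its own cache line (the struct name here is only for illustration):

```c
#include <pthread.h>

#ifndef VKD3D_DESTRUCTIVE_SIZE
#define VKD3D_DESTRUCTIVE_SIZE 64 /* fallback, as in the #if block above */
#endif

/* Aligning (and thereby padding) to the destructive interference size keeps
 * each lock on its own cache line, so different stripes don't false-share. */
struct padded_mutex
{
    _Alignas(VKD3D_DESTRUCTIVE_SIZE) pthread_mutex_t lock;
};
```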
Spinning is the big performance killer. That seems to be the case for mutexes too, because mutex entry uses spinning. I see no measurable performance gain from a 64-byte alignment, but there is always the chance of gains on other hardware. FWIW, the old 128-bit CAS implementation was only very slightly slower than this despite using a single atomic value.