On Wed Aug 16 02:00:17 2023 +0000, Henri Verbeet wrote:
> Making caches local to the thread has the potential for some thorny edge cases where one thread creates more than it frees, and another does the opposite (on copied descriptors), so its cache grows until the system is out of memory.

Right, thread-local caches would need occasional rebalancing against a global cache. The nice thing about them is that they're essentially wait-free though, and even with such worst-case behaviour you'd have less contention than with just the global cache. In principle we could actually get per-CPU caches on Linux using RSEQs (restartable sequences), but unfortunately I'm not aware of any equivalent Win32 mechanism.
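For illustration, a minimal sketch of that kind of per-thread cache with spill-over to a global one; the names, the pthread locking, and the spill threshold are invented for the example rather than taken from vkd3d:

```c
#include <pthread.h>
#include <stdlib.h>

#define LOCAL_CACHE_MAX 64 /* spill to the global cache beyond this; arbitrary */

struct cache_entry
{
    struct cache_entry *next;
};

static struct
{
    pthread_mutex_t lock;
    struct cache_entry *head;
} global_cache = {PTHREAD_MUTEX_INITIALIZER, NULL};

/* The per-thread cache is touched only by its owning thread, so pushes
 * and pops on it are plain loads/stores, i.e. wait-free. */
static _Thread_local struct cache_entry *local_head;
static _Thread_local size_t local_count;

void cache_free(struct cache_entry *entry)
{
    if (local_count < LOCAL_CACHE_MAX)
    {
        entry->next = local_head;
        local_head = entry;
        ++local_count;
        return;
    }

    /* Rebalancing: a thread that frees far more than it allocates hands
     * the excess to the global cache instead of growing without bound. */
    pthread_mutex_lock(&global_cache.lock);
    entry->next = global_cache.head;
    global_cache.head = entry;
    pthread_mutex_unlock(&global_cache.lock);
}

struct cache_entry *cache_alloc(void)
{
    struct cache_entry *entry;

    if ((entry = local_head))
    {
        local_head = entry->next;
        --local_count;
        return entry;
    }

    /* Local cache empty: try the global cache before falling back to
     * the system allocator. */
    pthread_mutex_lock(&global_cache.lock);
    if ((entry = global_cache.head))
        global_cache.head = entry->next;
    pthread_mutex_unlock(&global_cache.lock);

    return entry ? entry : calloc(1, sizeof(*entry));
}
```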
> The new implementation is much simpler than the 128-bit CAS version and has about the same performance. It's somewhat similar to the old mutex array scheme.

Yeah, conceptually I like this much better. (And FWIW, that's a fairly standard scheme, usually referred to as mutex/lock striping.) This does still have quite a number of atomic operations in the hot path, though.
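For reference, a generic lock-striping sketch (not the code from this series; the stripe count, the pointer hash and the hard-coded 64-byte padding are arbitrary choices for the example):

```c
#include <pthread.h>
#include <stdint.h>

#define STRIPE_COUNT 64 /* must be a power of two for the mask below */

/* Each stripe is padded out to an assumed 64-byte cache line so that
 * contention on one stripe doesn't false-share with its neighbours. */
struct stripe
{
    _Alignas(64) pthread_mutex_t lock;
};

static struct stripe stripes[STRIPE_COUNT];

void stripes_init(void)
{
    unsigned int i;

    for (i = 0; i < STRIPE_COUNT; ++i)
        pthread_mutex_init(&stripes[i].lock, NULL);
}

static pthread_mutex_t *stripe_for(const void *object)
{
    uintptr_t h = (uintptr_t)object;

    /* Drop the low bits, which are mostly zero due to alignment, and mix
     * a little so neighbouring objects spread across stripes. */
    h ^= h >> 16;
    return &stripes[(h >> 4) & (STRIPE_COUNT - 1)].lock;
}

void object_lock(const void *object)
{
    pthread_mutex_lock(stripe_for(object));
}

void object_unlock(const void *object)
{
    pthread_mutex_unlock(stripe_for(object));
}
```

Any given object always hashes to the same mutex, so the fixed array bounds the amount of lock state while still letting unrelated objects proceed in parallel.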
> Unfortunately C doesn't (as far as I know) offer a portable way to query the cache line size at compilation time ([as C++17 does](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_s...)). [Experimenting a little bit with the compiler explorer](https://godbolt.org/z/fEe4rK74s), it seems that most architectures are either 32 or 64 bytes, with PowerPC being 128 bytes and ARM64 possibly even 256 bytes. Given that we mostly care about Intel and ARM, I guess we can just settle for 64, but 256 for ARM64.

Recent gcc versions have `__GCC_DESTRUCTIVE_SIZE`. That's not portable of course, but we could easily do something along the lines of:
```c
#ifdef __GCC_DESTRUCTIVE_SIZE
# define VKD3D_DESTRUCTIVE_SIZE __GCC_DESTRUCTIVE_SIZE
#elif ...
# define VKD3D_DESTRUCTIVE_SIZE ...
#else
# define VKD3D_DESTRUCTIVE_SIZE 64
#endif
```
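and then, presumably, use the constant wherever false sharing matters, e.g. to keep each striped lock on its own cache line (the struct name here is only for illustration):

```c
#include <pthread.h>

#ifndef VKD3D_DESTRUCTIVE_SIZE
#define VKD3D_DESTRUCTIVE_SIZE 64 /* fallback, as in the #if block above */
#endif

/* Aligning (and thereby padding) to the destructive interference size keeps
 * each lock on its own cache line, so different stripes don't false-share. */
struct padded_mutex
{
    _Alignas(VKD3D_DESTRUCTIVE_SIZE) pthread_mutex_t lock;
};
```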
Spinning is the big performance killer. That seems to be the case for mutexes too, because mutex entry uses spinning. I see no measurable performance gain from a 64-byte alignment, but there is always the chance of gains on other hardware. FWIW, the old 128-bit CAS implementation was only very slightly slower than this despite using a single atomic value.