A TLS index is allocated for each device object. We could use a global TLS, but that would share caches for all devices. I'm not sure if that's a problem. One index per device means freed indices must be recycled, which requires a global mutex, but a global TLS index would also need a global mutex.
This currently leaks cache memory when a device is freed. To fix this, allocations must be tracked. A global cache is more difficult to free, which is not normally an issue, but it is when using valgrind.