I'm mainly looking for comments on a global cache versus per-device caches. Windows lacks an equivalent of pthread's thread-exit callback for freeing memory, so we would need to track devices and free the cache memory when the last one is released.
I don't think there's necessarily a problem with using per-device caches, although at first sight it seems slightly harder than the global option. If the only issue is getting notified about thread exit on Windows: DllMain() is called with DLL_THREAD_DETACH when a thread exits.
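To make the two notification mechanisms concrete, here is a rough sketch of a per-thread cache that is freed on thread exit. It is not taken from the patch: `thread_cache`, `cache_get()`, `cache_free()`, `caches_freed` and `demo_worker_frees_cache()` are hypothetical names, and I am assuming the pthread callback in question is the destructor registered with pthread_key_create(). The Windows branch only applies if the code is built as a DLL and thread notifications have not been turned off with DisableThreadLibraryCalls().

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

/* Hypothetical per-thread cache; the real layout would come from the library. */
typedef struct thread_cache { void *buf; size_t size; } thread_cache;

/* Demo-only counter, not thread-safe; used to observe that cleanup ran. */
static int caches_freed = 0;

/* Shared cleanup: pthread destructor on POSIX, called from DllMain on Windows. */
static void cache_free(void *p)
{
    thread_cache *c = p;
    if (c == NULL) return;
    free(c->buf);
    free(c);
    caches_freed++;
}

#ifdef _WIN32
static DWORD cache_tls = TLS_OUT_OF_INDEXES;

BOOL WINAPI DllMain(HINSTANCE inst, DWORD reason, LPVOID reserved)
{
    (void)inst;
    switch (reason) {
    case DLL_PROCESS_ATTACH:
        cache_tls = TlsAlloc();
        return cache_tls != TLS_OUT_OF_INDEXES;
    case DLL_THREAD_DETACH:            /* runs on the exiting thread */
        cache_free(TlsGetValue(cache_tls));
        TlsSetValue(cache_tls, NULL);
        break;
    case DLL_PROCESS_DETACH:
        if (reserved == NULL) {        /* FreeLibrary, not process exit */
            cache_free(TlsGetValue(cache_tls));
            TlsFree(cache_tls);
        }
        break;
    }
    return TRUE;
}

static void *tls_get(void) { return TlsGetValue(cache_tls); }
static int tls_set(void *p) { return TlsSetValue(cache_tls, p) ? 0 : -1; }
#else
static pthread_key_t cache_key;
static pthread_once_t cache_once = PTHREAD_ONCE_INIT;

/* The destructor is invoked at thread exit for each non-NULL value. */
static void make_key(void) { pthread_key_create(&cache_key, cache_free); }

static void *tls_get(void)
{
    pthread_once(&cache_once, make_key);
    return pthread_getspecific(cache_key);
}

static int tls_set(void *p)
{
    pthread_once(&cache_once, make_key);
    return pthread_setspecific(cache_key, p);
}
#endif

/* Return the calling thread's cache, allocating it on first use. */
thread_cache *cache_get(size_t size)
{
    thread_cache *c = tls_get();
    assert(size > 0);
    if (c != NULL) return c;
    c = calloc(1, sizeof *c);
    if (c == NULL) return NULL;
    c->buf = malloc(size);
    if (c->buf == NULL || tls_set(c) != 0) {
        free(c->buf);
        free(c);
        return NULL;
    }
    c->size = size;
    return c;
}

#ifndef _WIN32
static void *worker(void *arg)
{
    (void)arg;
    return cache_get(4096) != NULL ? (void *)1 : NULL;
}

/* Start one thread that touches its cache; return the cleanup count. */
int demo_worker_frees_cache(void)
{
    pthread_t t;
    void *ok = NULL;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
    pthread_join(t, &ok);
    return ok != NULL ? caches_freed : -1;
}
#endif
```

With something like this the cleanup path is the same function on both platforms, and only the notification differs. The sketch keeps one cache per thread; a per-device variant would presumably need one slot or key per device, or a small per-thread table indexed by device.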