This is pretty much the same kind of thing Proton does; for now it only covers non-shared semaphores.
This creates timeline events, backed by host timeline semaphores, for every wait and signal operation on a client timeline semaphore. Events are host timeline semaphores with monotonically increasing values: one event is a unique `(host semaphore, value)` tuple, and the value is incremented every time the semaphore gets signaled, so the same semaphore can be reused right away as a different event with the new value. Signaled events are queued to a per-device event list for reuse, as we cannot safely destroy them [^1] and creating them might be costly.
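As a rough illustration, here is a minimal sketch of what such an event pool could look like; the structure and function names are hypothetical and not taken from the actual patch, and locking is omitted:

```c
#include <stdint.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

/* One event is a unique (host timeline semaphore, value) tuple. */
struct timeline_event
{
    VkSemaphore host_semaphore;   /* host timeline semaphore backing the event */
    uint64_t value;               /* value this event will be signaled with */
    struct timeline_event *next;  /* link in the per-device reuse list */
};

/* Per-device list of signaled events kept around for reuse. */
struct event_pool
{
    struct timeline_event *free_list;
};

static struct timeline_event *acquire_event(struct event_pool *pool, VkDevice device)
{
    struct timeline_event *event = pool->free_list;

    if (event)
    {
        /* Reuse a signaled event: bumping the value turns the same host
         * semaphore into a new, distinct (semaphore, value) event. */
        pool->free_list = event->next;
        event->value++;
        return event;
    }

    /* No signaled event available: create a fresh host timeline semaphore
     * starting at 0, so its first event is (semaphore, 1). */
    VkSemaphoreTypeCreateInfo type_info =
    {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo create_info =
    {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };

    if (!(event = calloc(1, sizeof(*event)))) return NULL;
    if (vkCreateSemaphore(device, &create_info, NULL, &event->host_semaphore) != VK_SUCCESS)
    {
        free(event);
        return NULL;
    }
    event->value = 1;
    return event;
}

/* Once the event has been observed as signaled, queue it for reuse instead
 * of destroying it (see the footnote about destruction, plus creation cost). */
static void release_event(struct event_pool *pool, struct timeline_event *event)
{
    event->next = pool->free_list;
    pool->free_list = event;
}
```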
A thread is spawned for every device that uses timeline semaphores, to monitor the events and semaphore value changes. The semaphore wrapper keeps the current client semaphore value, which `vkGetSemaphoreCounterValue` reads directly.
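To show what "reads directly" means, here is a hedged sketch of a wrapper carrying the tracked client value; the wrapper layout, the handle-unwrapping cast and the function name are assumptions made for illustration, not the actual code:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* Hypothetical wrapper around a client timeline semaphore. The per-device
 * timeline thread updates `value` as events signal and as CPU signals come
 * in, so a counter query never has to reach the host driver. */
struct semaphore_wrapper
{
    VkSemaphore host_semaphore;  /* host semaphore (also the one exported/imported) */
    uint64_t value;              /* current client timeline value */
};

/* Sketch of the vkGetSemaphoreCounterValue entry point: the client handle is
 * assumed to wrap a pointer to the structure above (simplified handle
 * unwrapping), and the tracked value is returned without a host call. */
VkResult wine_vkGetSemaphoreCounterValue(VkDevice device, VkSemaphore semaphore,
                                         uint64_t *value)
{
    struct semaphore_wrapper *wrapper = (struct semaphore_wrapper *)(uintptr_t)semaphore;

    (void)device;
    *value = __atomic_load_n(&wrapper->value, __ATOMIC_ACQUIRE); /* GCC/Clang builtin */
    return VK_SUCCESS;
}
```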
CPU and GPU waits on the client semaphore are replaced with waits on a timeline event, and GPU signals with a signal on a timeline event. CPU signals simply update the client semaphore value on the wrapper and notify the timeline thread so it can check for and wake any waiter. The timeline thread waits on signal events coming from the GPU, and on a per-device semaphore for CPU notifications (wait/signal list updates, CPU-side signals). It then signals any pending wait whose client semaphore value has been reached and removes timed-out waits.
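The waiting side of the timeline thread could, for instance, be built on a single `vkWaitSemaphores` call over both the signal events and the notification semaphore; the helper below is only a sketch under that assumption, with hypothetical names:

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

/* One blocking step of the per-device timeline thread: wait until either a
 * pending GPU signal event reaches its value, or the per-device notification
 * semaphore is bumped by the CPU (wait/signal list update, CPU-side signal).
 * The caller builds the arrays with the event semaphores/values plus the
 * notification semaphore at its next expected value. */
static VkResult timeline_thread_wait(VkDevice device, const VkSemaphore *semaphores,
                                     const uint64_t *values, uint32_t count)
{
    VkSemaphoreWaitInfo wait_info =
    {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        /* wake as soon as ANY of the semaphores reaches its target value */
        .flags = VK_SEMAPHORE_WAIT_ANY_BIT,
        .semaphoreCount = count,
        .pSemaphores = semaphores,
        .pValues = values,
    };

    /* After this returns, the thread re-reads the signaled event values,
     * updates the client semaphore wrappers, wakes pending waits whose
     * target value has been reached, and drops timed-out waits. */
    return vkWaitSemaphores(device, &wait_info, UINT64_MAX);
}
```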
---
For shared semaphores my idea is to use the client timeline semaphore host handle itself as a signal to notify other device threads of semaphore value changes, as the host semaphore is what gets exported and imported into other devices. The timeline threads would also wait on that semaphore, in addition to the signal events and the thread notification semaphore.
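Folded into the previous wait sketch, that would just mean adding one more entry to the "wait any" arrays, targeting the exported/imported host semaphore at one past the last value seen; again a hypothetical sketch reusing the wrapper structure from above:

```c
/* Hypothetical helper: also wait on the shared client semaphore's host handle
 * itself, so a signal performed by another device or another process wakes
 * this device's timeline thread too. Reuses struct semaphore_wrapper from
 * the earlier sketch. */
static uint32_t add_shared_semaphore_wait(const struct semaphore_wrapper *wrapper,
                                          VkSemaphore *semaphores, uint64_t *values,
                                          uint32_t count)
{
    semaphores[count] = wrapper->host_semaphore;
    values[count] = wrapper->value + 1; /* wake on any change past the last value seen */
    return count + 1;
}
```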
The main issue with shared semaphores actually comes from sharing the client semaphore values, so that they can be read from each device's timeline thread as well as from other processes. There are two scenarios to support: in-process sharing, which could perhaps keep the information locally, and cross-process sharing, for which I think there's no other way than to involve wineserver.
My idea then is to move the shared semaphore client values to wineserver, which will cost a request on every signal and wait (reading the value for waits could later be moved to a shared memory object). This would also allow us to implement `D3DKMTWaitForSynchronizationObjectFromCpu`, which might be useful for the D3D12 fence implementation as it would let us translate a timeline semaphore wait into an asynchronous NT event signal.
[^1]: The Vulkan spec indicates that semaphores may only be destroyed *after* every operation that uses them has fully executed, and it's unclear whether signaling or waiting on a semaphore is enough of an indicator of full execution, or whether we would need to wait on a submit fence.