Right, we could probably limit this to cases where the semaphores are shared (and/or used as D3D12 fences), since that is the practical case we need for proper interop. However, together with the other points you mention, I'm worried that it's not going to be possible to implement this properly.
The ordering problem, for instance, seems tricky to handle correctly. The ordering of some signals might indeed be well defined on the GPU, but it becomes undefined once we wait on and process the signals in the order they were queued. I had a hunch about this before, and this seems to confirm it: we won't be able to respect the ordering without tracking all the event dependencies, and that doesn't seem realistic.
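To illustrate what I mean, here is a toy model (all names hypothetical, not our actual code): the GPU retires two semaphore signals in a well-defined order, but a proxy thread drains its FIFO of pending signal operations in the order they were queued, which can differ and so inverts the order in which the CPU-visible values get updated:

```python
# Hypothetical sketch: GPU signal order vs. proxy processing order.
gpu_signal_order = ["semA", "semB"]   # GPU guarantees semA signals before semB
queued_order     = ["semB", "semA"]   # order the signal ops were queued to the proxy

completed = []        # GPU-side completion log, in GPU order
visible_order = []    # order in which CPU-visible values get updated

# The GPU retires its signals in its defined order.
for sem in gpu_signal_order:
    completed.append(sem)

# The proxy thread processes its queue front-to-back: it cannot update
# semA's CPU-visible value until it is done waiting on semB, even though
# the GPU signalled semA first.
for sem in queued_order:
    assert sem in completed           # proxy blocks until the GPU signal lands
    visible_order.append(sem)

print(visible_order)  # ['semB', 'semA'] — the GPU-defined order is inverted
```

Any CPU waiter polling the visible values observes B before A, so the GPU-side ordering guarantee is lost.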
In a similar fashion, I think it also introduces race conditions between the submission fences and the signaled semaphores: the fences are signaled by the GPU directly as soon as it has executed the submission and signaled all the events, but the thread may not have processed those signals yet, so the CPU-visible semaphore values have not been updated to the values they should have.
This also seems difficult to solve without either swapping the submit fences too, which would require tracking their semaphore dependencies, or adding ping-pongs between the timeline thread and the GPU signals to make sure the thread has processed all the signals before the GPU signals the fence.
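A minimal sketch of the race and the ping-pong idea, again with made-up names rather than our actual code:

```python
# Hypothetical sketch of the fence/shadow-value race and the ping-pong fix.

def wait_fence_then_read(shadow, fence_signaled):
    # A waiter that trusts the submit fence and then reads the
    # CPU-visible (shadow) semaphore value.
    assert fence_signaled
    return shadow["sem"]

# Without a ping-pong: the GPU signals the fence as soon as the submission
# retires, before the proxy thread has applied the semaphore signal.
shadow = {"sem": 0}
fence_signaled = True            # GPU is done...
# ...but the proxy has not run yet, so the waiter reads a stale value.
stale = wait_fence_then_read(shadow, fence_signaled)

# With a ping-pong: the fence only fires after the proxy thread
# acknowledges it has applied all semaphore signals from the submission.
shadow = {"sem": 0}
fence_signaled = False
shadow["sem"] = 1                # proxy processes the signal...
proxy_acked = True
fence_signaled = proxy_acked     # ...and only then does the fence fire
fresh = wait_fence_then_read(shadow, fence_signaled)

print(stale, fresh)  # 0 1
```

The ping-pong closes the race but adds a round-trip per submission, which is exactly the extra machinery I'm uneasy about.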
Overall, I'm starting to wonder whether it's a good idea to add all this logic now, only to end up with a half-assed implementation that will never be able to support everything we need properly. It seems to me that the only real option here is host Vulkan driver support for rewinding.