Well, it depends on the specifics, I think. In particular:
- How long does a d3d12_desc_flush_vk_heap_updates_locked() call typically take? How long does it take at most? If the answer to the previous question is some approximation of "infinity", could we put an upper bound on that by e.g. limiting the maximum number of descriptors we process in a single d3d12_desc_flush_vk_heap_updates_locked() call?
- How long does waiting for a fence typically take once we know it has been submitted? Would it be terrible to poll fences with e.g. a 1ms timeout?
- What is the worst case behaviour? If descriptor writes were to get stuck behind a fence we'd need to wait for d3d12_command_queue_ExecuteCommandLists() to process them, but that should be no worse than what we're currently doing. The reverse might be worse, but we should be able to avoid that by polling fences inside d3d12_desc_flush_vk_heap_updates_locked() if needed.
- Are there any nasty edge cases?
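
To make the proposal in the list above concrete, here is a minimal sketch of a loop that alternates a bounded descriptor flush with a fence poll. It is only a model: struct worker_state, flush_bounded() and run_worker() are hypothetical names, not vkd3d code, and the fence poll is simulated by retiring one fence per iteration instead of doing a real wait with a short timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the worker state; none of these names exist in vkd3d. */
struct worker_state
{
    size_t pending_descriptors; /* Descriptor writes not yet flushed. */
    size_t pending_fences;      /* Submitted fences not yet signalled. */
};

/* Flush at most max_batch descriptor writes and return the number flushed.
 * The cap bounds how long a single flush call can take. */
size_t flush_bounded(struct worker_state *state, size_t max_batch)
{
    size_t n;

    assert(max_batch > 0);
    n = state->pending_descriptors < max_batch ? state->pending_descriptors : max_batch;
    /* Real code would write n descriptors to the Vulkan heap here. */
    state->pending_descriptors -= n;
    return n;
}

/* Alternate one bounded flush with one fence poll until nothing is pending.
 * Returns the number of iterations needed. */
size_t run_worker(struct worker_state *state, size_t max_batch)
{
    size_t iterations = 0;

    while (state->pending_descriptors || state->pending_fences)
    {
        flush_bounded(state, max_batch);
        /* Assumption: stands in for a fence wait with a short timeout
         * that happens to succeed; a real poll may time out instead. */
        if (state->pending_fences)
            --state->pending_fences;
        ++iterations;
    }
    return iterations;
}
```

With 6 pending descriptor writes, 2 pending fences and a batch cap of 4, the loop finishes in 2 iterations. Both the cap and the timeout are exactly the kind of constants that would need tuning.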
My problem with this approach is that it depends on a lot of magic constants that would require tuning, and the right values depend on the computer, on the game, possibly on the specific scene of the game, on the game settings and possibly on many other factors. Polling with a 1 ms timeout can cost nothing, if for some random reason the fences are immediately ready, or it can cost a lot, if you have 16 ms of budget per frame and waste 5 of them just polling for a handful of fences that manage to block every other operation. Getting into this sort of business to save a thread per device doesn't seem ideal to me, though I'll admit you have more experience than I do.
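
For what it's worth, the arithmetic behind that concern can be sketched as follows. The helper names are hypothetical, and the model assumes the worst case where every poll runs to its full timeout and blocks all other work while it waits.

```c
#include <assert.h>

/* Worst-case time in ms spent blocked in polling, assuming every
 * poll runs to its full timeout. */
unsigned int worst_case_poll_ms(unsigned int fence_count, unsigned int timeout_ms)
{
    return fence_count * timeout_ms;
}

/* Frame time in ms left for other work after polling; clamps at zero
 * when polling alone exceeds the frame budget. */
unsigned int remaining_budget_ms(unsigned int frame_budget_ms, unsigned int poll_ms)
{
    assert(frame_budget_ms > 0);
    return poll_ms >= frame_budget_ms ? 0 : frame_budget_ms - poll_ms;
}
```

With 5 fences and a 1 ms timeout, worst_case_poll_ms(5, 1) is 5, and remaining_budget_ms(16, 5) leaves 11 ms of a 16 ms frame, so nearly a third of the frame could be lost to polling alone.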