The last commit fixes glitches in Horizon Zero Dawn with vsync on, for reasons stated in the comment there.
The wined3d implementation already blocks the client's call to `Present()` when `vkAcquireNextImageKHR()` blocks, so is roughly equivalent to this when frame latency and buffer count are equal. It has a frame latency implementation which unblocks after calling `vkQueuePresentKHR()`, rather than waiting for its execution to complete. I think the d3d12 implementation is supposed to do the latter, though using `KHR_present_wait` may be unnecessary overkill. Waiting on queue execution comes with a performance cost either way, so without a compelling reason for a strict implementation, something similar to wined3d makes sense.
Another potential issue is after the call to `ID3D12CommandQueue_Signal()` in `d3d12_swapchain_present()`, the client is free to enqueue more commands when `Present()` returns, so by the time the signal is executed and the swapchain worker blits the image and calls `vkQueuePresentKHR()`, these commands are not necessarily enqueued immediately after the `Signal()`. This may not always be harmless.
From: Conor McCarthy cmccarthy@codeweavers.com
--- dlls/dxgi/swapchain.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index 4797fcecd8d..d68788b8fc5 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -1214,7 +1214,7 @@ struct d3d12_swapchain_op { unsigned int sync_interval; VkImage vk_image; - unsigned int frame_number; + uint64_t frame_number; } present; struct {
From: Conor McCarthy cmccarthy@codeweavers.com
It is generally good practice to flush all work, and this avoids the need to be certain that only one op can be queued per wake. --- dlls/dxgi/swapchain.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index d68788b8fc5..ca7e89e1e56 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -1258,7 +1258,13 @@ static DWORD WINAPI d3d12_swapchain_worker_proc(void *data)
while (swapchain->worker_running) { - if (!list_empty(&swapchain->worker_ops)) + if (!SleepConditionVariableCS(&swapchain->worker_cv, &swapchain->worker_cs, INFINITE)) + { + ERR("Cannot sleep on condition variable, last error %ld.\n", GetLastError()); + break; + } + + while (swapchain->worker_running && !list_empty(&swapchain->worker_ops)) { struct d3d12_swapchain_op *op = LIST_ENTRY(list_head(&swapchain->worker_ops), struct d3d12_swapchain_op, entry);
@@ -1281,11 +1287,6 @@ static DWORD WINAPI d3d12_swapchain_worker_proc(void *data)
EnterCriticalSection(&swapchain->worker_cs); } - else if (!SleepConditionVariableCS(&swapchain->worker_cv, &swapchain->worker_cs, INFINITE)) - { - ERR("Cannot sleep on condition variable, last error %ld.\n", GetLastError()); - break; - } }
LeaveCriticalSection(&swapchain->worker_cs);
From: Conor McCarthy cmccarthy@codeweavers.com
Microsoft recommends waking after unlocking for performance reasons. That may or may not apply here but it is good practice. --- dlls/dxgi/swapchain.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index ca7e89e1e56..413935db869 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -2346,8 +2346,8 @@ static HRESULT d3d12_swapchain_present(struct d3d12_swapchain *swapchain,
EnterCriticalSection(&swapchain->worker_cs); list_add_tail(&swapchain->worker_ops, &op->entry); - WakeAllConditionVariable(&swapchain->worker_cv); LeaveCriticalSection(&swapchain->worker_cs); + WakeAllConditionVariable(&swapchain->worker_cv);
++swapchain->frame_number; if ((frame_latency_event = swapchain->frame_latency_event))
From: Conor McCarthy cmccarthy@codeweavers.com
--- dlls/dxgi/swapchain.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index 413935db869..3a1650103e8 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -2631,8 +2631,8 @@ static HRESULT d3d12_swapchain_resize_buffers(struct d3d12_swapchain *swapchain,
EnterCriticalSection(&swapchain->worker_cs); list_add_tail(&swapchain->worker_ops, &op->entry); - WakeAllConditionVariable(&swapchain->worker_cv); LeaveCriticalSection(&swapchain->worker_cs); + WakeAllConditionVariable(&swapchain->worker_cv);
swapchain->current_buffer_index = 0;
From: Conor McCarthy cmccarthy@codeweavers.com
If the timeout is UINT64_MAX, vkAcquireNextImageKHR() blocks until the next image is acquired. Waiting on a fence is unnecessary. --- dlls/dxgi/swapchain.c | 27 +-------------------------- 1 file changed, 1 insertion(+), 26 deletions(-)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index 3a1650103e8..78cd2697b75 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -1138,7 +1138,6 @@ struct d3d12_swapchain struct wined3d_swapchain_state_parent state_parent;
VkSurfaceKHR vk_surface; - VkFence vk_fence; VkInstance vk_instance; VkDevice vk_device; VkPhysicalDevice vk_physical_device; @@ -1754,25 +1753,15 @@ static VkResult d3d12_swapchain_acquire_next_vulkan_image(struct d3d12_swapchain { const struct dxgi_vk_funcs *vk_funcs = &swapchain->vk_funcs; VkDevice vk_device = swapchain->vk_device; - VkFence vk_fence = swapchain->vk_fence; VkResult vr;
swapchain->vk_image_index = INVALID_VK_IMAGE_INDEX;
if ((vr = vk_funcs->p_vkAcquireNextImageKHR(vk_device, swapchain->vk_swapchain, UINT64_MAX, - VK_NULL_HANDLE, vk_fence, &swapchain->vk_image_index)) < 0) + VK_NULL_HANDLE, VK_NULL_HANDLE, &swapchain->vk_image_index)) < 0) { WARN("Failed to acquire next Vulkan image, vr %d.\n", vr); - return vr; - } - - if ((vr = vk_funcs->p_vkWaitForFences(vk_device, 1, &vk_fence, VK_TRUE, UINT64_MAX)) != VK_SUCCESS) - { - ERR("Failed to wait for fence, vr %d.\n", vr); - return vr; } - if ((vr = vk_funcs->p_vkResetFences(vk_device, 1, &vk_fence)) < 0) - ERR("Failed to reset fence, vr %d.\n", vr);
return vr; } @@ -2068,7 +2057,6 @@ static void d3d12_swapchain_destroy(struct d3d12_swapchain *swapchain)
if (swapchain->vk_device) { - vk_funcs->p_vkDestroyFence(swapchain->vk_device, swapchain->vk_fence, NULL); vk_funcs->p_vkDestroySwapchainKHR(swapchain->vk_device, swapchain->vk_swapchain, NULL); }
@@ -3183,14 +3171,12 @@ static HRESULT d3d12_swapchain_init(struct d3d12_swapchain *swapchain, IWineDXGI VkWin32SurfaceCreateInfoKHR surface_desc; VkPhysicalDevice vk_physical_device; struct dxgi_factory *dxgi_factory; - VkFenceCreateInfo fence_desc; uint32_t queue_family_index; VkSurfaceKHR vk_surface; VkInstance vk_instance; IDXGIOutput *output; VkBool32 supported; VkDevice vk_device; - VkFence vk_fence; bool fullscreen; VkResult vr; HRESULT hr; @@ -3342,17 +3328,6 @@ static HRESULT d3d12_swapchain_init(struct d3d12_swapchain *swapchain, IWineDXGI return hr; }
- fence_desc.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO; - fence_desc.pNext = NULL; - fence_desc.flags = 0; - if ((vr = vk_funcs->p_vkCreateFence(vk_device, &fence_desc, NULL, &vk_fence)) < 0) - { - WARN("Failed to create Vulkan fence, vr %d.\n", vr); - d3d12_swapchain_destroy(swapchain); - return hresult_from_vk_result(vr); - } - swapchain->vk_fence = vk_fence; - swapchain->current_buffer_index = 0;
if (swapchain_desc->Flags & DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT)
From: Conor McCarthy cmccarthy@codeweavers.com
--- dlls/dxgi/swapchain.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/dlls/dxgi/swapchain.c b/dlls/dxgi/swapchain.c index 78cd2697b75..b1438eb97b7 100644 --- a/dlls/dxgi/swapchain.c +++ b/dlls/dxgi/swapchain.c @@ -1248,6 +1248,7 @@ static void d3d12_swapchain_op_destroy(struct d3d12_swapchain *swapchain, struct
static HRESULT d3d12_swapchain_op_present_execute(struct d3d12_swapchain *swapchain, struct d3d12_swapchain_op *op); static HRESULT d3d12_swapchain_op_resize_buffers_execute(struct d3d12_swapchain *swapchain, struct d3d12_swapchain_op *op); +static VkResult d3d12_swapchain_acquire_next_vulkan_image(struct d3d12_swapchain *swapchain);
static DWORD WINAPI d3d12_swapchain_worker_proc(void *data) { @@ -1269,6 +1270,17 @@ static DWORD WINAPI d3d12_swapchain_worker_proc(void *data)
list_remove(&op->entry);
+ if (op->type == D3D12_SWAPCHAIN_OP_PRESENT && swapchain->vk_image_index == INVALID_VK_IMAGE_INDEX) + { + /* vkAcquireNextImageKHR() will block if an image is not available. Let it block inside + * the critical section to delay a new present op being queued, otherwise the client will + * continue rendering and overwrite a pending image. Doing this at least approximates + * D3D12 frame latency behaviour if the frame latency equals the buffer count. + * TODO: implement frame latency when it differs from the buffer count. + * Do not bother handling failure here, as vk_image_index remains invalid in that case. */ + d3d12_swapchain_acquire_next_vulkan_image(swapchain); + } + LeaveCriticalSection(&swapchain->worker_cs);
switch (op->type)
!5851 will make this redundant if it goes ahead.
Wrt 2/6, don't we already effectively flush the whole list? Note that the sleep is inside an "else if", so if the list was non-empty we'll check it again immediately.
I think it's also incorrect to sleep immediately; we need to check the CV predicate first. (Consider if an op is queued before the thread even executes.)
The rest of this series makes sense to me, I think, although I'd appreciate Giovanni's review since he's touched this area recently and will probably notice something I didn't.
On Mon Jun 17 10:52:12 2024 +0000, Elizabeth Figura wrote:
Wrt 2/6, don't we already effectively flush the whole list? Note that the sleep is inside an "else if", so if the list was non-empty we'll check it again immediately. I think it's also incorrect to sleep immediately; we need to check the CV predicate first. (Consider if an op is queued before the thread even executes.) The rest of this series makes sense to me, I think, although I'd appreciate Giovanni's review since he's touched this area recently and will probably notice something I didn't.
Oops, yeah that's true. I'll update it if !5851 doesn't go ahead.
Giovanni Mascellani (@giomasce) commented about dlls/dxgi/swapchain.c:
{ const struct dxgi_vk_funcs *vk_funcs = &swapchain->vk_funcs; VkDevice vk_device = swapchain->vk_device;
VkFence vk_fence = swapchain->vk_fence; VkResult vr;
swapchain->vk_image_index = INVALID_VK_IMAGE_INDEX;
if ((vr = vk_funcs->p_vkAcquireNextImageKHR(vk_device, swapchain->vk_swapchain, UINT64_MAX,
VK_NULL_HANDLE, vk_fence, &swapchain->vk_image_index)) < 0)
VK_NULL_HANDLE, VK_NULL_HANDLE, &swapchain->vk_image_index)) < 0)
I'm not sure where you're reading that if `timeout` is `UINT64_MAX` then `vkAcquireNextImageKHR()` blocks until the next image is acquired, but [the Vulkan specs explicitly forbids the semaphore and fence to be null at the same time](https://registry.khronos.org/vulkan/specs/1.3-extensions/html/chap34.html#VU...), so I don't think this is a good solution. It's probably true that we shouldn't wait on the fence, but rather use the semaphore to synchronize directly with `vkQueuePresentKHR()` (and leave the synchronization inside the GPU, without involving the CPU). My recalling of last year is that should be pretty doable now, but I didn't embark in actually doing it, so there might be some details I'm missing.
For context, my understanding of `vkAcquireNextImageKHR()` is that you have to wait both on the call itself and on the fence or semaphore that is provided. That's because calling the function directly will wait until the Vulkan implementation is able to tell which image you're going to use, but it doesn't guarantee that that image can be used at once. It might still be used for presenting an earlier frame. So you don't have to touch it until the fence or semaphore is signaled (which is the reason why at least one must be provided: otherwise you'll never know when you can use that image).
Giovanni Mascellani (@giomasce) commented about dlls/dxgi/swapchain.c:
EnterCriticalSection(&swapchain->worker_cs); list_add_tail(&swapchain->worker_ops, &op->entry);
- WakeAllConditionVariable(&swapchain->worker_cv); LeaveCriticalSection(&swapchain->worker_cs);
- WakeAllConditionVariable(&swapchain->worker_cv);
I'm not against this. The opinions on whether that's good for performance varies across the internet and the answer ultimately depends on the underlying implementation, which Zeb probably knows better than me. But did you measure an actual performance gain with that change?
On Mon Jun 17 10:52:12 2024 +0000, Conor McCarthy wrote:
Oops, yeah that's true. I'll update it if !5851 doesn't go ahead.
Agreed, you always have to re-check the condition before sleeping. Honestly I don't like any more the way I wrote that loop, but it doesn't seem wrong. I think my current preferred boilerplate for a loop like that is: ```c enter_cs(); for (;;) { if (!work && !stop) wait();
if (stop) break;
do_work(); // Possibly leaving and re-entering the CS } leave_cs(); ```
Also re-entering the CS for each work item is probably not optimal in general, though probably work frequency here is sufficiently low to not matter too much. That's another matter, anyway.
Giovanni Mascellani (@giomasce) commented about dlls/dxgi/swapchain.c:
list_remove(&op->entry);
if (op->type == D3D12_SWAPCHAIN_OP_PRESENT && swapchain->vk_image_index == INVALID_VK_IMAGE_INDEX)
{
/* vkAcquireNextImageKHR() will block if an image is not available. Let it block inside
* the critical section to delay a new present op being queued, otherwise the client will
* continue rendering and overwrite a pending image. Doing this at least approximates
* D3D12 frame latency behaviour if the frame latency equals the buffer count.
* TODO: implement frame latency when it differs from the buffer count.
* Do not bother handling failure here, as vk_image_index remains invalid in that case. */
d3d12_swapchain_acquire_next_vulkan_image(swapchain);
}
I might be misunderstanding what this is supposed to do, but it doesn't look particularly solid. It seems that you want to somehow delay some `Present()` to return to the caller, so that the caller itself doesn't assume that some buffer is valid, but if that's the problem simply blocking while acquiring a Vulkan image is not going to help. There is not other synchronization between the queue producer and consumer, so as soon as the caller calls `Present()` a little bit in advance (while the queue is still processing the earlier present request) it can enqueue all the `Present()`'s it wants. Therefore the fact that `Present()` blocks or not just depends on some timing detail, and that doesn't look like a solution to anything.
More in general, my understanding of the philosophy behind the D3D12 swapchain is that we want to keep a fair amount of separation between the D3D12 and Vulkan swapchain images, because the rules governing them make it hard to do otherwise. That's unfortunate performance-wise, because it requires an additional blit for each presented frame, but it enables us to (potentially, there are surely bugs around) offer the same behavior that is seen on Windows. For example AFAICT it's totally fine, even if unusual, to request a swapchain with 10 buffers (I seem to recall that Windows will start complaining at 16, but don't quote me on that). Even if our Vulkan has 3 images, we still can emulate the 10 D3D12 buffers, because we keep not relationship between the two sets. The application will happily paint to the D3D12 buffer, then when one of them has to be presented we'll only then select a Vulkan image to blit the D3D12 buffer to and then present it. So there must be no link between the avai lability of Vulkan and D3D12 buffers/images. If we want to implement a certain behavioral element of the D3D12 swapchain we should do it without depending on how the Vulkan swapchain works.
On Mon Jun 17 11:12:08 2024 +0000, Giovanni Mascellani wrote:
I'm not sure where you're reading that if `timeout` is `UINT64_MAX` then `vkAcquireNextImageKHR()` blocks until the next image is acquired, but [the Vulkan specs explicitly forbids the semaphore and fence to be null at the same time](https://registry.khronos.org/vulkan/specs/1.3-extensions/html/chap34.html#VU...), so I don't think this is a good solution. It's probably true that we shouldn't wait on the fence, but rather use the semaphore to synchronize directly with `vkQueuePresentKHR()` (and leave the synchronization inside the GPU, without involving the CPU). My recalling of last year is that should be pretty doable now, but I didn't embark in actually doing it, so there might be some details I'm missing. For context, my understanding of `vkAcquireNextImageKHR()` is that you have to wait both on the call itself and on the fence or semaphore that is provided. That's because calling the function directly will wait until the Vulkan implementation is able to tell which image you're going to use, but it doesn't guarantee that that image can be used at once. It might still be used for presenting an earlier frame. So you don't have to touch it until the fence or semaphore is signaled (which is the reason why at least one must be provided: otherwise you'll never know when you can use that image).
I stand corrected on a detail: the semaphore signaled by `vkAcquireNextImageKHR()` mustn't be waited by `vkQueuePresentKHR()`, but by the blitting operation.
On Mon Jun 17 11:12:09 2024 +0000, Giovanni Mascellani wrote:
I might be misunderstanding what this is supposed to do, but it doesn't look particularly solid. It seems that you want to somehow delay some `Present()` to return to the caller, so that the caller itself doesn't assume that some buffer is valid, but if that's the problem simply blocking while acquiring a Vulkan image is not going to help. There is not other synchronization between the queue producer and consumer, so as soon as the caller calls `Present()` a little bit in advance (while the queue is still processing the earlier present request) it can enqueue all the `Present()`'s it wants. Therefore the fact that `Present()` blocks or not just depends on some timing detail, and that doesn't look like a solution to anything. More in general, my understanding of the philosophy behind the D3D12 swapchain is that we want to keep a fair amount of separation between the D3D12 and Vulkan swapchain images, because the rules governing them make it hard to do otherwise. That's unfortunate performance-wise, because it requires an additional blit for each presented frame, but it enables us to (potentially, there are surely bugs around) offer the same behavior that is seen on Windows. For example AFAICT it's totally fine, even if unusual, to request a swapchain with 10 buffers (I seem to recall that Windows will start complaining at 16, but don't quote me on that). Even if our Vulkan has 3 images, we still can emulate the 10 D3D12 buffers, because we keep not relationship between the two sets. The application will happily paint to the D3D12 buffer, then when one of them has to be presented we'll only then select a Vulkan image to blit the D3D12 buffer to and then present it. So there must be no link between the availability of Vulkan and D3D12 buffers/images. If we want to implement a certain behavioral element of the D3D12 swapchain we should do it without depending on how the Vulkan swapchain works.
I probably need to improve the comment. With vsync on, the driver delays present commands until the next vblank, which can easily cause presents to accumulate until all three images are pending, at which point `vkAcquireNextImageKHR()` will block. But because the block occurs in a worker, our `Present()` implementation will happily add more ops to the list without limit. `Present()` must block for the same reason `vkAcquireNextImageKHR()` blocks - to wait until an earlier present meets a vblank and completes. This does not solve the problem of frame latency, which should be patched separately.
If rendering is very fast compared to sync rate, it's theoretically possible for any number of presents to be added to the list while the worker leaves the mutex unlocked. A frame latency implementation is probably the answer to that issue.
On Mon Jun 17 11:12:08 2024 +0000, Giovanni Mascellani wrote:
I'm not against this. The opinions on whether that's good for performance varies across the internet and the answer ultimately depends on the underlying implementation, which Zeb probably knows better than me. But did you measure an actual performance gain with that change?
Without doing any research myself... the other side is going to try to enter the logic immediately [within SleepConditionVariableCS()] assuming it's waiting, and I think usually wakes are fast enough [at least on Linux] that the other thread will generally start executing before you even return from the wake function.
Besides a poorly communicated warning that doesn't apply here, the first comment in [1] also claims that it defeats a "wait morphing" optimization, where instead of actually waking threads they're just moved from the CV waitqueue to the mutex wait queue [if their associated mutex is locked [by the calling thread?]]. That's not something we currently implement in Wine; it may be possible to implement later though...
[1] https://stackoverflow.com/questions/52503361/unlock-the-mutex-after-conditio...
On Mon Jun 17 14:15:30 2024 +0000, Conor McCarthy wrote:
I probably need to improve the comment. With vsync on, the driver delays present commands until the next vblank, which can easily cause presents to accumulate until all three images are pending, at which point `vkAcquireNextImageKHR()` will block. But because the block occurs in a worker, our `Present()` implementation will happily add more ops to the list without limit. `Present()` must block for the same reason `vkAcquireNextImageKHR()` blocks - to wait until an earlier present meets a vblank and completes. This does not solve the problem of frame latency, which should be patched separately. If rendering is very fast compared to sync rate, it's theoretically possible for any number of presents to be added to the list while the worker leaves the mutex unlocked. A frame latency implementation is probably the answer to that issue.
`Present()` doesn't necessarily have to wait if the user requested more buffers than there are images in the Vulkan swapchain. If the user requests 10 buffers (`DXGI_SWAP_CHAIN_DESC1::BufferCount`) but the Vulkan swapchain only supports three, after having submitted three buffers the Vulkan queue will be full and will have to wait for the first frame to finish presentation, but the D3D12 will still be able to hand out buffers until the count reaches 10.