The D3D12 spec guarantees that lists submitted in ExecuteCommandLists() will complete execution before any subsequent commands begin execution.
Based on a vkd3d-proton patch by Hans-Kristian Arntzen.
From: Conor McCarthy <cmccarthy@codeweavers.com>
The D3D12 spec guarantees that lists submitted in ExecuteCommandLists() will complete execution before any subsequent commands begin execution.
Based on a vkd3d-proton patch by Hans-Kristian Arntzen.
---
 libs/vkd3d/command.c       | 61 ++++++++++++++++++++++++++++++++++++--
 libs/vkd3d/vkd3d_private.h |  3 ++
 2 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/libs/vkd3d/command.c b/libs/vkd3d/command.c
index e5ead7d3..ff96ef52 100644
--- a/libs/vkd3d/command.c
+++ b/libs/vkd3d/command.c
@@ -33,7 +33,13 @@ HRESULT vkd3d_queue_create(struct d3d12_device *device,
         uint32_t family_index, const VkQueueFamilyProperties *properties, struct vkd3d_queue **queue)
 {
     const struct vkd3d_vk_device_procs *vk_procs = &device->vk_procs;
+    VkCommandBufferAllocateInfo allocate_info;
+    VkCommandPoolCreateInfo pool_create_info;
+    VkCommandBufferBeginInfo begin_info;
+    VkMemoryBarrier memory_barrier;
     struct vkd3d_queue *object;
+    VkResult vr;
+    HRESULT hr;
 
     if (!(object = vkd3d_malloc(sizeof(*object))))
         return E_OUTOFMEMORY;
@@ -55,11 +61,55 @@ HRESULT vkd3d_queue_create(struct d3d12_device *device,
 
     VK_CALL(vkGetDeviceQueue(device->vk_device, family_index, 0, &object->vk_queue));
 
+    /* Create a reusable full barrier command buffer. This is used in submissions
+     * to reproduce the guaranteed serialised behavior of D3D12 queues. */
+    pool_create_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
+    pool_create_info.pNext = NULL;
+    pool_create_info.flags = 0;
+    pool_create_info.queueFamilyIndex = family_index;
+    if ((vr = VK_CALL(vkCreateCommandPool(device->vk_device, &pool_create_info, NULL, &object->barrier_pool))) < 0)
+        goto fail_destroy_mutex;
+
+    allocate_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
+    allocate_info.pNext = NULL;
+    allocate_info.commandPool = object->barrier_pool;
+    allocate_info.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
+    allocate_info.commandBufferCount = 1;
+    if ((vr = VK_CALL(vkAllocateCommandBuffers(device->vk_device, &allocate_info,
+            &object->barrier_command_buffer))) < 0)
+        goto fail_free_command_pool;
+
+    begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
+    begin_info.pNext = NULL;
+    /* Allow simultaneous use of this command buffer. */
+    begin_info.flags = VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT;
+    begin_info.pInheritanceInfo = NULL;
+    if ((vr = VK_CALL(vkBeginCommandBuffer(object->barrier_command_buffer, &begin_info))) < 0)
+        goto fail_free_command_pool;
+
+    /* To avoid unnecessary tracking, just emit a host barrier on every submit. */
+    memory_barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
+    memory_barrier.pNext = NULL;
+    memory_barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
+    memory_barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_HOST_READ_BIT;
+    VK_CALL(vkCmdPipelineBarrier(object->barrier_command_buffer,
+            VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
+            VK_PIPELINE_STAGE_ALL_COMMANDS_BIT | VK_PIPELINE_STAGE_HOST_BIT, 0,
+            1, &memory_barrier, 0, NULL, 0, NULL));
+    if ((vr = VK_CALL(vkEndCommandBuffer(object->barrier_command_buffer))) < 0)
+        goto fail_free_command_pool;
+
     TRACE("Created queue %p for queue family index %u.\n", object, family_index);
 
     *queue = object;
 
     return S_OK;
+
+fail_free_command_pool:
+    VK_CALL(vkDestroyCommandPool(device->vk_device, object->barrier_pool, NULL));
+fail_destroy_mutex:
+    vkd3d_mutex_destroy(&object->mutex);
+    return hresult_from_vk_result(vr);
 }
 
 void vkd3d_queue_destroy(struct vkd3d_queue *queue, struct d3d12_device *device)
@@ -80,6 +130,8 @@ void vkd3d_queue_destroy(struct vkd3d_queue *queue, struct d3d12_device *device)
             VK_CALL(vkDestroySemaphore(device->vk_device, queue->old_vk_semaphores[i], NULL));
     }
 
+    VK_CALL(vkDestroyCommandPool(device->vk_device, queue->barrier_pool, NULL));
+
     vkd3d_mutex_unlock(&queue->mutex);
 
     vkd3d_mutex_destroy(&queue->mutex);
@@ -6181,7 +6233,7 @@ static void STDMETHODCALLTYPE d3d12_command_queue_ExecuteCommandLists(ID3D12Comm
     if (!command_list_count)
         return;
 
-    if (!(buffers = vkd3d_calloc(command_list_count, sizeof(*buffers))))
+    if (!(buffers = vkd3d_calloc(command_list_count + 1, sizeof(*buffers))))
     {
         ERR("Failed to allocate command buffer array.\n");
         return;
@@ -6202,6 +6254,11 @@ static void STDMETHODCALLTYPE d3d12_command_queue_ExecuteCommandLists(ID3D12Comm
         buffers[i] = cmd_list->vk_command_buffer;
     }
 
+    /* The lists submitted in a call to ExecuteCommandLists() are guaranteed to complete
+     * before execution begins of the next command submitted to the queue. Append a full
+     * GPU barrier between submissions. This command buffer has SIMULTANEOUS_BIT. */
+    buffers[i++] = command_queue->vkd3d_queue->barrier_command_buffer;
+
     vkd3d_mutex_lock(&command_queue->op_mutex);
 
     if (!(op = d3d12_command_queue_op_array_require_space(&command_queue->op_queue)))
@@ -6211,7 +6268,7 @@ static void STDMETHODCALLTYPE d3d12_command_queue_ExecuteCommandLists(ID3D12Comm
     }
     op->opcode = VKD3D_CS_OP_EXECUTE;
     op->u.execute.buffers = buffers;
-    op->u.execute.buffer_count = command_list_count;
+    op->u.execute.buffer_count = i;
 
     d3d12_command_queue_submit_locked(command_queue);
 
diff --git a/libs/vkd3d/vkd3d_private.h b/libs/vkd3d/vkd3d_private.h
index 1a277a47..a757f9c4 100644
--- a/libs/vkd3d/vkd3d_private.h
+++ b/libs/vkd3d/vkd3d_private.h
@@ -1327,6 +1327,9 @@ struct vkd3d_queue
     size_t semaphore_count;
 
     VkSemaphore old_vk_semaphores[VKD3D_MAX_VK_SYNC_OBJECTS];
+
+    VkCommandPool barrier_pool;
+    VkCommandBuffer barrier_command_buffer;
 };
 
 VkQueue vkd3d_queue_acquire(struct vkd3d_queue *queue);
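The array handling in the ExecuteCommandLists() hunks can be looked at in isolation. The following is only a minimal sketch, not vkd3d code: `cmd_handle` is a hypothetical stand-in for `VkCommandBuffer` so that it builds without the Vulkan headers, and `build_submission()` is a hypothetical helper name. It shows the two points the patch relies on: allocating one extra slot, and taking the final count from the running index rather than from `command_list_count`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for VkCommandBuffer (assumption, not a Vulkan type). */
typedef const void *cmd_handle;

/* Copy the caller's command buffers in order and append the queue's
 * pre-recorded barrier buffer as the last element. Returns NULL if
 * list_count is zero or the allocation fails; the caller frees the result. */
cmd_handle *build_submission(const cmd_handle *lists, size_t list_count,
        cmd_handle barrier, size_t *out_count)
{
    cmd_handle *buffers;
    size_t i;

    assert(out_count != NULL);

    if (!list_count)
        return NULL;

    /* One extra slot for the trailing barrier command buffer. */
    if (!(buffers = calloc(list_count + 1, sizeof(*buffers))))
        return NULL;

    for (i = 0; i < list_count; ++i)
        buffers[i] = lists[i];

    /* After the loop i == list_count, so the barrier lands in the extra slot. */
    buffers[i++] = barrier;
    *out_count = i;

    return buffers;
}
```

Called with two lists, the helper returns a three-element array whose last entry is the barrier handle, which corresponds to what the patch stores in `op->u.execute.buffers` and `op->u.execute.buffer_count`.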
What could happen without this? Could you give an example?
The original vkd3d-proton patch 1ade8c0cc referred to an AMD CACAO demo, "which does not emit a barrier between the AO compute passes and the tone mapping pass in the next command buffer." I assume it's this: https://gpuopen.com/fidelityfx-cacao/

But it crashes in RADV.
On Thu Apr 6 11:50:42 2023 +0000, Henri Verbeet wrote:
> What could happen without this? Could you give an example?

My understanding from the "Remarks" section in https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12com... is this: suppose you create a command list that paints a resource all red and another one that paints the same texture all green. If you submit the two command lists with two separate `ExecuteCommandLists()` invocations, then you're sure the resource will be painted in whatever color is submitted last; if instead you submit with a single `ExecuteCommandLists()` invocation, then you don't know.
I am not sure what would be the weakest Vulkan memory barrier that guarantees whatever D3D12 guarantees, though, but my feeling is that we'd need a fairly strong one.
On Thu Apr 6 11:50:42 2023 +0000, Giovanni Mascellani wrote:
> My understanding from the "Remarks" section in https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12com... is this: suppose you create a command list that paints a resource all red and another one that paints the same texture all green. If you submit the two command lists with two separate `ExecuteCommandLists()` invocations, then you're sure the resource will be painted in whatever color is submitted last; if instead you submit with a single `ExecuteCommandLists()` invocation, then you don't know.
>
> I am not sure what would be the weakest Vulkan memory barrier that guarantees whatever D3D12 guarantees, though, but my feeling is that we'd need a fairly strong one.
To add a side note: this is why [implicit state transitions](https://learn.microsoft.com/en-us/windows/win32/direct3d12/using-resource-ba...) work. No actual barriers are needed because writes are already guaranteed complete. In vkd3d we will need to add Vulkan layout transitions for these cases, but no memory barriers.
> My understanding from the "Remarks" section in https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12com... is this: suppose you create a command list that paints a resource all red and another one that paints the same texture all green. If you submit the two command lists with two separate `ExecuteCommandLists()` invocations, then you're sure the resource will be painted in whatever color is submitted last;
Well, sure, but command buffers submitted to a Vulkan queue don't quite execute in random order either; submission order, primitive order, rasterisation order, etc. are all defined by the spec.
Conor's comment seems to suggest this may be about ordering compute and graphics commands, which indeed normally don't wait for each other in Vulkan. Handling that wouldn't quite require a full memory barrier after each submission though. (Incidentally, we have barriers in adapter_vk_dispatch_compute() and wined3d_context_vk_end_current_render_pass() in wined3d for essentially this reason; there may be some room for optimisation there as well.)
In any case, it would be nice if we were at least able to reproduce the issue this is trying to fix...
On Tue Apr 11 07:02:30 2023 +0000, Henri Verbeet wrote:
> > My understanding from the "Remarks" section in https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12com... is this: suppose you create a command list that paints a resource all red and another one that paints the same texture all green. If you submit the two command lists with two separate `ExecuteCommandLists()` invocations, then you're sure the resource will be painted in whatever color is submitted last;
>
> Well, sure, but command buffers submitted to a Vulkan queue don't quite execute in random order either; submission order, primitive order, rasterisation order, etc. are all defined by the spec.
>
> Conor's comment seems to suggest this may be about ordering compute and graphics commands, which indeed normally don't wait for each other in Vulkan. Handling that wouldn't quite require a full memory barrier after each submission though. (Incidentally, we have barriers in adapter_vk_dispatch_compute() and wined3d_context_vk_end_current_render_pass() in wined3d for essentially this reason; there may be some room for optimisation there as well.)
>
> In any case, it would be nice if we were at least able to reproduce the issue this is trying to fix...
I think the issue is:
> Unless otherwise specified, and without explicit synchronization, the various commands submitted to a queue via command buffers may execute in arbitrary order relative to each other, and/or concurrently. Also, the memory side effects of those commands may not be directly visible to other commands without explicit memory dependencies. This is true within a command buffer, and across command buffers submitted to a given queue.
The barrier is not needed if a semaphore is submitted after a command buffer. We could track that and only emit the barrier if necessary.
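The tracking idea could be sketched roughly as below. This is a hypothetical illustration under stated assumptions, not vkd3d code: `struct queue_sync_state` and the `queue_sync_*()` functions are made-up names, and the sketch takes at face value the statement above that a semaphore submitted after a command buffer makes the barrier unnecessary. It also assumes the barrier would be emitted lazily, ahead of the next submission, instead of after every one as the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue tracking state (assumption, not a vkd3d structure). */
struct queue_sync_state
{
    /* True while command buffers have been submitted and no semaphore has
     * been submitted after them. */
    bool pending_unordered_work;
};

void queue_sync_init(struct queue_sync_state *state)
{
    assert(state != NULL);
    state->pending_unordered_work = false;
}

/* Call when command buffers are about to be submitted. Returns true if the
 * barrier command buffer should be placed ahead of them. */
bool queue_sync_on_execute(struct queue_sync_state *state)
{
    bool need_barrier;

    assert(state != NULL);
    need_barrier = state->pending_unordered_work;
    state->pending_unordered_work = true;
    return need_barrier;
}

/* Call when a semaphore is submitted to the queue. Under the assumption
 * stated in the lead-in, earlier work no longer needs a barrier after this. */
void queue_sync_on_semaphore(struct queue_sync_state *state)
{
    assert(state != NULL);
    state->pending_unordered_work = false;
}
```

With this state machine, two back-to-back submissions request a barrier before the second one, while a submission, a semaphore, and another submission request none. Whether that covers everything D3D12 guarantees is exactly the open question in this thread.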
I've been unable to get the cacao demo to do anything. It just exits immediately.
> I think the issue is:
>
> > Unless otherwise specified, and without explicit synchronization, the various commands submitted to a queue via command buffers may execute in arbitrary order relative to each other, and/or concurrently. Also, the memory side effects of those commands may not be directly visible to other commands without explicit memory dependencies. This is true within a command buffer, and across command buffers submitted to a given queue.
>
> The barrier is not needed if a semaphore is submitted after a command buffer. We could track that and only emit the barrier if necessary.
The "Unless otherwise specified" there is important though; as mentioned earlier, the Vulkan spec does in fact specify an ordering for a lot of operations. The linked MSDN text is quite a bit less precise, but at first sight it doesn't appear to give any additional guarantees about availability or visibility of the results of operations; I imagine the number of affected operations to be fairly limited.
To stick with Giovanni's earlier example, suppose that in Vulkan we have a command buffer A containing a render pass with a vkCmdClearAttachments() call that clears a particular render target to red, and a command buffer B containing a similar render pass clearing that same render target to green. If we submit those with a single vkQueueSubmit() in the order {A, B}, the render target colour will end up being green. Specifically, submission order is specified to be {A, B} in that case, primitive order doesn't allow reordering them, and rasterisation order specifies that the resulting writes happen in primitive order. That's generally true for other drawing commands as well. On the other hand, transfer operations (AFAIK) don't generally have such ordering relative to each other, or relative to draw commands, but they do generally require layout transitions in order to switch between using them as transfer source, transfer destination, or with draw commands. The question really ends up being which guarantees d3d12 provides that we don't already get from Vulkan.
I got the CACAO demo running. It looks exactly like the screenshot in the docs, has no issues, and the additional barrier has no effect. Maybe there was a bug in the demo.
This merge request was closed by Conor McCarthy.