This function is the main reason performance is much lower than on Windows, at least in HZD. Not having to load device from dst_heap when we can already have it in a register is an advantage. I've added a comment to clarify.
In that case, should we be calling d3d12_desc_copy() in a loop from device.c at all? Or would it be better to introduce e.g. a "d3d12_descriptor_heap_copy()" function that takes care of that loop and inlines d3d12_desc_copy() (and quite possibly d3d12_desc_write_atomic())? If we go that way, we may also want to consider placing "device" in the same cache line as "use_vk_heaps" inside struct d3d12_descriptor_heap.
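
As a rough illustration of that idea, here is a minimal sketch. The struct layouts, field names and function names below are simplified, hypothetical stand-ins and not the actual vkd3d definitions; the real d3d12_desc_copy() does considerably more work. The point is only the shape: the heap-level function loads "device" once, keeps it in a local, and the per-descriptor helper is a static inline that the compiler can fold into the loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device; counts copies so the effect is observable. */
struct device
{
    unsigned int copy_count;
};

/* Hypothetical stand-in for a descriptor. */
struct desc
{
    int payload;
};

/* Hypothetical stand-in for struct d3d12_descriptor_heap. The fields read on
 * the copy path are declared next to each other so that they are likely to
 * end up in the same cache line. */
struct descriptor_heap
{
    bool use_vk_heaps;
    struct device *device;
    struct desc descriptors[16];
};

/* Per-descriptor copy. Declared static inline so the loop below can absorb it.
 * The device is passed in instead of being reloaded from the heap. */
static inline void desc_copy(struct desc *dst, const struct desc *src, struct device *device)
{
    dst->payload = src->payload;
    ++device->copy_count;
}

/* Heap-level copy: loads the device pointer once, then runs the loop.
 * The caller is assumed to have validated the offsets and count. */
static void descriptor_heap_copy(struct descriptor_heap *dst_heap, size_t dst_offset,
        const struct descriptor_heap *src_heap, size_t src_offset, size_t count)
{
    struct device *device = dst_heap->device;
    size_t i;

    for (i = 0; i < count; ++i)
        desc_copy(&dst_heap->descriptors[dst_offset + i],
                &src_heap->descriptors[src_offset + i], device);
}
```

With something along these lines, the caller in device.c would make one call per descriptor range rather than one call per descriptor. Whether the compiler actually inlines the helper and keeps the pointer in a register would still need to be checked in the generated code.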