Here's another suggestion on how to track resources: Don't. Or at least, not unconditionally.
In d3d11 we have default, immutable, dynamic and staging resources. Default and immutable resources can't be mapped. Dynamic ones should be mapped with no-overwrite or discard, which doesn't require tracking on our end. That leaves staging resources, but they can't be used for drawing.
So if a resource is mapped synchronously, just stall the pipeline. If that happens too often (or we're running a d3d <= 9 client) we can always start to track draws. I see at least World of Tanks sync-mapping staging resources, so presumably we need to track CopyResource and friends, but those are easier than draws.
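To make the policy above concrete, here is a minimal sketch of the decision at map time. The enums and the classify_map() helper are hypothetical stand-ins for illustration, not wined3d or d3d11 definitions, and the copy tracking itself is left out:

```c
/* Hypothetical stand-ins for the d3d11 usage and map types. */
enum usage { USAGE_DEFAULT, USAGE_IMMUTABLE, USAGE_DYNAMIC, USAGE_STAGING };
enum map_type { MAP_READ, MAP_WRITE, MAP_READ_WRITE, MAP_WRITE_DISCARD, MAP_WRITE_NO_OVERWRITE };

/* What the map path would do; names are assumptions for this sketch. */
enum map_action { MAP_REJECT, MAP_NO_WAIT, MAP_STALL, MAP_CHECK_COPIES };

static enum map_action classify_map(enum usage usage, enum map_type type)
{
    /* Default and immutable resources cannot be mapped at all. */
    if (usage == USAGE_DEFAULT || usage == USAGE_IMMUTABLE)
        return MAP_REJECT;

    /* Discard and no-overwrite maps never wait for pending work. */
    if (type == MAP_WRITE_DISCARD || type == MAP_WRITE_NO_OVERWRITE)
        return MAP_NO_WAIT;

    /* Staging resources are never used for drawing, so only copies matter. */
    if (usage == USAGE_STAGING)
        return MAP_CHECK_COPIES;

    /* A synchronous map of a dynamic resource: stall instead of tracking draws. */
    return MAP_STALL;
}
```

Only the MAP_CHECK_COPIES case needs any per-resource bookkeeping in this scheme; everything else is decided from the usage and the map flags alone.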
The same consideration applies to d3d9ex, since there's no managed pool (well, except for that 0x6 re-add-D3DPOOL-MANAGED pool .NET stuff uses). Track copies and stall if a D3DPOOL_DEFAULT resource is mapped without async map flags.
I know of d3d9 and earlier applications (e.g. 3DMark2001) abusing managed resources and expecting maps to be fast if the resource hasn't been used lately. So we need a fallback to the current mode. We already have the context->ops->acquire_resource indirection, so we should be able to add it without another indirection. Eventually I think we should pull that indirection into acquire_shader_resources (write three versions of acquire_shader_resources, one that calls resource_acquire directly, one for deferred contexts, one no-op) and set the right one for the context. That way we don't do an indirect jump for each resource and can no-op the entire thing if we know we won't track those resources.
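Roughly what I have in mind for the per-context function, as a sketch only. The struct layouts, the deferred_count field and context_init() are made up for illustration; the real wined3d structures look different:

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the wined3d structures. */
struct resource { unsigned long access_time; };

struct context;
typedef void (*acquire_shader_resources_fn)(struct context *ctx,
        struct resource **bound, size_t count);

struct context
{
    unsigned long queue_head;   /* current submit position of the queue */
    size_t deferred_count;      /* resources referenced by a deferred context */
    acquire_shader_resources_fn acquire_shader_resources;
};

/* Immediate context with tracking: touch each resource directly. */
static void acquire_direct(struct context *ctx, struct resource **bound, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        if (bound[i])
            bound[i]->access_time = ctx->queue_head;
}

/* Deferred context: only note the reference, the fence is set on execution. */
static void acquire_deferred(struct context *ctx, struct resource **bound, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        if (bound[i])
            ++ctx->deferred_count;
}

/* Tracking disabled: nothing to do for the whole binding set. */
static void acquire_nop(struct context *ctx, struct resource **bound, size_t count)
{
    (void)ctx; (void)bound; (void)count;
}

/* Pick the variant once, when the context is set up. */
static void context_init(struct context *ctx, int deferred, int track_draws)
{
    ctx->queue_head = 0;
    ctx->deferred_count = 0;
    if (deferred)
        ctx->acquire_shader_resources = acquire_deferred;
    else if (track_draws)
        ctx->acquire_shader_resources = acquire_direct;
    else
        ctx->acquire_shader_resources = acquire_nop;
}
```

The draw path then makes one indirect call per draw instead of one per resource, and the no-op variant skips the loop altogether.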
On Friday, 11 February 2022, 13:13:09 EAT you wrote:
Hi, below is an email I drafted that explains the acquiring reduction idea.
—
Perhaps for draw and dispatch calls we could track the most recent times each resource is bound and unbound, as well as, globally, the most recent time of a draw/dispatch call. We wouldn’t iterate over all bound resources during each draw/dispatch. Instead, to see if a resource is busy, we would check a) whether it is currently bound or will be bound, and if so b) whether there are any draws/dispatches in the queue.
To be more specific, in struct wined3d_resource we’d keep “most_recent_bind_time” and “most_recent_unbind_time”, and in struct wined3d_cs (or somewhere else) “most_recent_draw_time” and “most_recent_dispatch_time”. Then check:
1) if tail < most_recent_bind_time or tail < most_recent_unbind_time or most_recent_unbind_time < most_recent_bind_time, the resource is considered bound
2) if tail < most_recent_bind_time and tail > most_recent_unbind_time, and most_recent_draw/dispatch_time < most_recent_bind_time, we’re idle
3) if the resource is bound, and tail < most_recent_draw/dispatch_time, the resource is busy
4) otherwise we're idle.
That would be in addition to what you proposed, which I think works fine for other calls.
To reduce graphics/compute false positives, “most_recent_bind_time” and “most_recent_unbind_time” could be tracked separately for both use cases.
There’s still some potential for false positives: if the queue contains unbind “A”, …, draw, …, bind “A”, we’d consider resource “A” to be busy until the unbind is executed. But maybe that case is benign enough to ignore.
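A minimal sketch of the check described above, written as a literal reading of the four rules. The struct and function names are hypothetical, a single draw timestamp stands in for the draw/dispatch pair, timestamps are plain queue positions, and wraparound is ignored for brevity:

```c
#include <stdbool.h>

/* Hypothetical per-resource timestamps (queue positions). */
struct resource_times
{
    unsigned long bind_time;    /* most_recent_bind_time */
    unsigned long unbind_time;  /* most_recent_unbind_time */
};

/* Rule 1: a bind or unbind is still pending, or the last bind came
 * after the last unbind. */
static bool resource_is_bound(const struct resource_times *r, unsigned long tail)
{
    return tail < r->bind_time || tail < r->unbind_time
            || r->unbind_time < r->bind_time;
}

static bool resource_is_busy(const struct resource_times *r,
        unsigned long tail, unsigned long draw_time)
{
    if (!resource_is_bound(r, tail))
        return false;

    /* Rule 2: the unbind has executed, the bind is pending, and no draw
     * was queued before that bind. */
    if (tail < r->bind_time && tail > r->unbind_time && draw_time < r->bind_time)
        return false;

    /* Rules 3 and 4: busy only while a draw is still in the queue. */
    return tail < draw_time;
}
```

With the queue from the false-positive example (unbind at position 1, draw at 2, bind at 3), resource_is_busy() returns true while tail is 0 and false once tail has reached 2, i.e. after the unbind has been executed.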
Also, another thing to think about is how to better handle acquiring resources in deferred contexts.
On 4 Jan 2022, at 16:03, Stefan Dösinger stefandoesinger@gmail.com wrote:
Hi,
Before the holidays I spent some time optimizing the cs resource fencing code. The current state is attached for review. I'll send it for upstreaming after the code freeze.
The basic idea is to use the default queue head and tail for fencing. This completely removes any work on the command stream thread side, and the main thread work goes from an interlocked op to a simple assignment. Together with the technically unrelated patch 4 it improves a microbenchmark I wrote for this (https://github.com/stefand/perftest/tree/main/resource_tracking_d3d11) from ~200 fps to ~700 fps on my Ryzen CPU. Other CPUs have lower gains, but still more than double the framerate. It also produces a measurable improvement in Rocket League once other known CS issues are hacked away.
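For readers who have not looked at the patches, the idea can be sketched as follows. This is an illustration under simplifying assumptions, not the attached code: the struct layouts are made up, head is assumed to already count the command that uses the resource, and no atomics or memory barriers are shown:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified queue: head is advanced by the main thread on
 * submit, tail by the command stream thread after execution. */
struct queue { uint32_t head; uint32_t tail; };
struct resource { uint32_t access_time; };

/* Main thread: a plain assignment, no interlocked op and no CS-side work. */
static void resource_acquire(struct resource *r, const struct queue *q)
{
    r->access_time = q->head;
}

/* Busy while the tail has not reached the recorded position. The difference
 * is computed in unsigned arithmetic, so wraparound is well defined; this
 * assumes fewer than 2^31 positions between tail and access_time. */
static bool resource_is_busy(const struct resource *r, const struct queue *q)
{
    uint32_t diff = r->access_time - q->tail;

    return diff != 0 && diff < 0x80000000u;
}
```

Keeping the comparison in unsigned arithmetic is one way to sidestep the signed overflow question raised in the discussion items.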
Items for discussion:
- I am not entirely sure I do the ULONG / LONG handling correctly. I guess we could get away with just keeping everything as signed LONGs, but technically signed int overflow is undefined behavior. Interlocked ops accept LONG * though...
- resource_acquire could be renamed to something else
- Separate read and write timestamps. This should be easy to add on top of the current code.
- Traversing resource->device->cs->queue in wined3d_resource_acquire is ugly. I'm contemplating passing const struct wined3d_cs or the timestamp to it explicitly.
- We still iterate over a huge number of resources. Does anyone have ideas how to cut this down?
Happy new Year,
Stefan
<0001-wined3d-Use-extra-bits-in-the-queue-head-and-tail-co.patch>
<0002-wined3d-Don-t-acquire-the-resource-in-update_sub_res.patch>
<0003-wined3d-Use-the-default-queue-index-for-resource-fen.patch>
<0004-Move-resource-type-away-from-the-access-time-field.patch>
<0005-wined3d-Remove-the-no-op-wined3d_resource_release.patch>