Re: RFC: Rework of wined3d cs fencing - wine-devel

23 Feb 2022

Here's another suggestion on how to track resources: Don't. Or at least, not 
unconditionally.
In d3d11 we have default, immutable, dynamic and staging resources. Default 
and immutable resources can't be mapped. Dynamic ones should be mapped with 
no-overwrite or discard, which doesn't require tracking on our end. That 
leaves staging resources, but they can't be used for drawing.
So in the event that a resource is mapped synchronously, just stall the 
pipeline. If it happens too often (or we're running a d3d <= 9 client) we can 
always start to track draws. I see at least World of Tanks sync-mapping 
staging resources, so presumably we need to track CopyResource and friends, 
but those are easier than draws.
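
As a rough illustration of the per-usage policy described above, the decision could be sketched like this. All names here (sketch_usage, sketch_map_policy, the flag values) are made up for the example and are not wined3d identifiers; the mapping rules simply follow the paragraph above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the d3d11 usage classes; not wined3d types. */
enum sketch_usage
{
    SKETCH_USAGE_DEFAULT,
    SKETCH_USAGE_IMMUTABLE,
    SKETCH_USAGE_DYNAMIC,
    SKETCH_USAGE_STAGING,
};

/* Hypothetical map flags, chosen only for this sketch. */
#define SKETCH_MAP_DISCARD     0x1u
#define SKETCH_MAP_NOOVERWRITE 0x2u

enum sketch_map_action
{
    SKETCH_MAP_REJECT,          /* resource can't be mapped at all */
    SKETCH_MAP_NO_WAIT,         /* async map, nothing to track */
    SKETCH_MAP_WAIT_FOR_COPIES, /* only copy operations need tracking */
    SKETCH_MAP_STALL,           /* synchronous map: drain the pipeline */
};

static enum sketch_map_action sketch_map_policy(enum sketch_usage usage, unsigned int flags)
{
    switch (usage)
    {
        case SKETCH_USAGE_DEFAULT:
        case SKETCH_USAGE_IMMUTABLE:
            return SKETCH_MAP_REJECT;

        case SKETCH_USAGE_DYNAMIC:
            /* Discard / no-overwrite maps need no tracking on our end. */
            if (flags & (SKETCH_MAP_DISCARD | SKETCH_MAP_NOOVERWRITE))
                return SKETCH_MAP_NO_WAIT;
            /* Anything else is a synchronous map: just stall. */
            return SKETCH_MAP_STALL;

        case SKETCH_USAGE_STAGING:
            /* Staging resources can't be used for drawing, so only
             * CopyResource-style operations have to be waited for. */
            return SKETCH_MAP_WAIT_FOR_COPIES;
    }
    return SKETCH_MAP_STALL;
}
```

The point of the sketch is that none of the branches needs per-draw tracking; only the staging case needs any bookkeeping at all, and only for copies.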
The same consideration applies to d3d9ex, since there's no managed pool (well, 
except for that 0x6 re-add-D3DPOOL-MANAGED pool .NET stuff uses). Track copies 
and stall if a D3DPOOL_DEFAULT resource is mapped without async map flags.
I know of d3d9 and earlier applications (e.g. 3DMark2001) abusing managed 
resources and expecting maps to be fast if the resource hasn't been used 
lately. So we need a fallback to the current mode. We already have the 
context->ops->acquire_resource indirection, so we should be able to add it 
without another indirection. Eventually I think we should pull that 
indirection into acquire_shader_resources (write three versions of 
acquire_shader_resources, one that calls resource_acquire directly, one for 
deferred contexts, one no-op) and set the right one for the context. That way 
we don't do an indirect jump for each resource and can no-op the entire thing 
if we know we won't track those resources.
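
A minimal sketch of the "three versions, pick one per context" idea, assuming a simplified context with a single function pointer. The struct layouts and every sketch_* name are invented for illustration; in particular the deferred variant only counts resources as a placeholder for whatever bookkeeping a deferred context really needs.

```c
#include <stddef.h>

/* Simplified stand-ins; these are not the wined3d structures. */
struct sketch_resource { unsigned long access_time; };

struct sketch_context;
typedef void (*sketch_acquire_fn)(struct sketch_context *ctx,
        struct sketch_resource *const *resources, size_t count);

struct sketch_context
{
    unsigned long queue_head;   /* current position in the command queue */
    size_t deferred_refs;       /* placeholder bookkeeping for deferred contexts */
    sketch_acquire_fn acquire_shader_resources;
};

/* Version 1: stamp every bound resource directly, no per-resource indirection. */
static void sketch_acquire_direct(struct sketch_context *ctx,
        struct sketch_resource *const *resources, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i)
        if (resources[i])
            resources[i]->access_time = ctx->queue_head;
}

/* Version 2: deferred contexts; here we only count what would be recorded. */
static void sketch_acquire_deferred(struct sketch_context *ctx,
        struct sketch_resource *const *resources, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i)
        if (resources[i])
            ++ctx->deferred_refs;
}

/* Version 3: tracking disabled, the whole walk is skipped. */
static void sketch_acquire_noop(struct sketch_context *ctx,
        struct sketch_resource *const *resources, size_t count)
{
    (void)ctx; (void)resources; (void)count;
}

enum sketch_context_kind { SKETCH_CTX_TRACKED, SKETCH_CTX_DEFERRED, SKETCH_CTX_UNTRACKED };

/* One indirect call per draw instead of one per resource. */
static void sketch_context_init(struct sketch_context *ctx, enum sketch_context_kind kind)
{
    ctx->queue_head = 0;
    ctx->deferred_refs = 0;
    switch (kind)
    {
        case SKETCH_CTX_TRACKED:   ctx->acquire_shader_resources = sketch_acquire_direct; break;
        case SKETCH_CTX_DEFERRED:  ctx->acquire_shader_resources = sketch_acquire_deferred; break;
        case SKETCH_CTX_UNTRACKED: ctx->acquire_shader_resources = sketch_acquire_noop; break;
    }
}
```

A draw would then call ctx->acquire_shader_resources(ctx, bound, count) once, and a context that never tracks resources pays only for that single call.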
On Friday, 11 February 2022, 13:13:09 EAT you wrote:
...
Hi, below is an email I drafted that explains the acquiring reduction idea.
—
Perhaps for draw and dispatch calls we could track the most recent times any
resource is bound and unbound, as well as, globally, the most recent time
of a draw/dispatch call. We wouldn’t iterate over all bound resources
during each draw/dispatch. Instead, to see if a resource is busy, we would
check a) if it is currently bound or will be bound, and if so b) if there
are any draws/dispatches in the queue.
To be more specific, in struct wined3d_resource we’d keep
“most_recent_bind_time” and “most_recent_unbind_time”, and in struct
wined3d_cs (or somewhere else) “most_recent_draw_time” and
“most_recent_dispatch_time”. Then check:

1) if tail < most_recent_bind_time or tail < most_recent_unbind_time or
most_recent_unbind_time < most_recent_bind_time, the resource is considered
bound
2) if tail < most_recent_bind_time and tail > most_recent_unbind_time, and
most_recent_draw/dispatch_time < most_recent_bind_time, we’re idle
3) if the resource is bound, and tail < most_recent_draw/dispatch_time, the
resource is busy
4) otherwise we're idle.
That would be in addition to what you proposed, which I think works fine for
other calls.
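
The four checks above could be written roughly as follows. This is only a sketch under simplifying assumptions: a single combined draw/dispatch timestamp, plain integer comparisons with no counter wraparound, and invented names (sketch_times, sketch_resource_is_busy).

```c
#include <stdbool.h>

/* Hypothetical container for the timestamps named above. */
struct sketch_times
{
    unsigned long most_recent_bind_time;    /* per resource */
    unsigned long most_recent_unbind_time;  /* per resource */
    unsigned long most_recent_draw_time;    /* global: last draw or dispatch */
};

static bool sketch_resource_is_busy(const struct sketch_times *t, unsigned long tail)
{
    /* 1) Bound now, or a bind/unbind for it is still pending in the queue. */
    bool bound = tail < t->most_recent_bind_time
            || tail < t->most_recent_unbind_time
            || t->most_recent_unbind_time < t->most_recent_bind_time;

    if (!bound)
        return false;

    /* 2) The bind is still queued, the previous unbind already executed,
     * and no draw has been queued after the bind: idle. */
    if (tail < t->most_recent_bind_time
            && tail > t->most_recent_unbind_time
            && t->most_recent_draw_time < t->most_recent_bind_time)
        return false;

    /* 3) Bound with a draw/dispatch still in the queue: busy.
     * 4) Otherwise idle. */
    return tail < t->most_recent_draw_time;
}
```

With unbind at 1, a draw at 2 and a bind at 3, this reports the resource as busy while tail is 0 and as idle once tail has reached 2, which is the conservative behavior for the "unbind, draw, bind" sequence.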
To reduce graphics/compute false positives, “most_recent_bind_time” and
“most_recent_unbind_time” could be tracked separately for both use cases.
There’s still some potential for false positives, if the queue contains:
unbind “A”, …, draw, …, bind “A”, we’d consider resource “A” to be busy
until the unbind is executed. But maybe that case is benign enough to
ignore.
Also, another thing to think about is how to better handle acquiring
resources in deferred contexts.
...
On 4 Jan 2022, at 16:03, Stefan Dösinger <stefandoesinger@gmail.com> wrote:
Hi,
Before the holidays I spent some time optimizing the cs resource fencing
code. The current state is attached for review. I'll send it for
upstreaming after the code freeze.
The basic idea is to use the default queue head and tail for fencing. This
completely removes any work on the command stream thread side, and the main
thread work goes from an interlocked op to a simple assignment. Together
with the technically unrelated patch 4 it improves a microbenchmark I
wrote for this
(https://github.com/stefand/perftest/tree/main/resource_tracking_d3d11)
from ~200 fps to ~700 fps on my Ryzen CPU. Other CPUs have lower gains,
but still more than double the framerate. It also produces a measurable
improvement in Rocket League once other known CS issues are hacked away.
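
To make the basic idea concrete, here is a small sketch of fencing against the queue head and tail. It is not taken from the attached patches: the struct layouts, the sketch_* names and the 32-bit counter width are assumptions for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified queue: head is advanced by the main thread when it submits a
 * command, tail by the command stream thread when it finishes one. */
struct sketch_queue
{
    uint32_t head;
    uint32_t tail;
};

struct sketch_resource
{
    uint32_t access_time;   /* queue head after the last command using it */
};

/* Main thread: submit a command of `size` units that uses `resource`.
 * Fencing is a plain assignment, no interlocked operation. */
static void sketch_submit(struct sketch_queue *queue, struct sketch_resource *resource,
        uint32_t size)
{
    queue->head += size;
    resource->access_time = queue->head;
}

/* The CS thread does no per-resource work; it only advances the tail. */
static void sketch_execute_up_to(struct sketch_queue *queue, uint32_t position)
{
    queue->tail = position;
}

/* Idle once the tail has reached the recorded position. The unsigned
 * difference keeps the comparison valid when the counters wrap around. */
static bool sketch_resource_is_idle(const struct sketch_queue *queue,
        const struct sketch_resource *resource)
{
    return (uint32_t)(queue->tail - resource->access_time) < 0x80000000u;
}
```

A map would then check sketch_resource_is_idle() and wait for the tail to advance only if it returns false.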
Items for discussion:

1) I am not entirely sure I do the ULONG / LONG handling correctly. I guess
we could get away with just keeping everything as signed LONGs, but
technically signed int overflow is undefined behavior. Interlocked ops
accept LONG * though...

2) resource_acquire could be renamed to something else

3) Separate read and write timestamps. This should be easy to add on top of
the current code.

4) Traversing resource->device->cs->queue in wined3d_resource_acquire is
ugly. I'm contemplating passing const struct wined3d_cs or the timestamp
to it explicitly.

5) We still iterate over a huge number of resources. Does anyone have ideas
how to cut this down?
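
For the separate read and write timestamps item, one possible shape is sketched below, assuming 32-bit unsigned timestamps and made-up names (sketch_resource, sketch_time_passed). Keeping the counters unsigned and comparing their difference sidesteps the signed overflow concern, since unsigned arithmetic wraps in a well-defined way.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical resource with one timestamp per access type. */
struct sketch_resource
{
    uint32_t read_time;     /* queue position of the last queued read */
    uint32_t write_time;    /* queue position of the last queued write */
};

/* True if the queue tail has reached `time`. Unsigned subtraction is
 * well defined on wraparound, unlike signed overflow. */
static bool sketch_time_passed(uint32_t tail, uint32_t time)
{
    return (uint32_t)(tail - time) < 0x80000000u;
}

/* A read-only map only has to wait for pending writes. */
static bool sketch_can_map_for_read(const struct sketch_resource *resource, uint32_t tail)
{
    return sketch_time_passed(tail, resource->write_time);
}

/* A write map has to wait for pending reads and pending writes. */
static bool sketch_can_map_for_write(const struct sketch_resource *resource, uint32_t tail)
{
    return sketch_time_passed(tail, resource->read_time)
            && sketch_time_passed(tail, resource->write_time);
}
```

The gain is that a resource which is only being read by queued draws can still be mapped for reading without a stall.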
Happy New Year,
Stefan
<0001-wined3d-Use-extra-bits-in-the-queue-head-and-tail-co.patch>
<0002-wined3d-Don-t-acquire-the-resource-in-update_sub_res.patch>
<0003-wined3d-Use-the-default-queue-index-for-resource-fen.patch>
<0004-Move-resource-type-away-from-the-access-time-field.patch>
<0005-wined3d-Remove-the-no-op-wined3d_resource_release.patch>