Hi,
Before the holidays I spent some time optimizing the CS resource fencing code. The current state is attached for review. I'll send it upstream after the code freeze.
The basic idea is to use the default queue head and tail for fencing. This completely removes any work on the command stream thread side, and the main thread work goes from an interlocked op to a simple assignment. Together with the technically unrelated patch 4 it improves a microbenchmark I wrote for this (https://github.com/stefand/perftest/tree/main/resource_tracking_d3d11) from ~200 fps to ~700 fps on my Ryzen CPU. Other CPUs have lower gains, but still more than double the framerate. It also produces a measurable improvement in Rocket League once other known CS issues are hacked away.
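To illustrate the idea for readers who have not opened the patches, here is a minimal single-threaded sketch. All names (sketch_queue, sketch_resource, sketch_resource_acquire and so on) are hypothetical stand-ins and are not taken from the attached code. It assumes head counts submitted commands and tail counts retired ones, and it leaves out the memory ordering a real two-thread implementation needs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the default CS queue. */
struct sketch_queue
{
    volatile uint32_t head; /* advanced by the main thread on submit */
    volatile uint32_t tail; /* advanced by the CS thread on retire */
};

/* Hypothetical stand-in for a resource that needs fencing. */
struct sketch_resource
{
    uint32_t access_time; /* queue head when the resource was last used */
};

/* Main thread: one more command has been queued. */
static inline void sketch_queue_submit(struct sketch_queue *queue)
{
    ++queue->head;
}

/* CS thread: one more command has finished. The CS thread never
 * touches the resource itself. */
static inline void sketch_queue_retire(struct sketch_queue *queue)
{
    ++queue->tail;
}

/* Main thread: remember which command last referenced the resource.
 * This is a plain store, not an interlocked operation. */
static inline void sketch_resource_acquire(struct sketch_resource *resource,
        const struct sketch_queue *queue)
{
    resource->access_time = queue->head;
}

/* The resource is idle once the tail has reached its access time.
 * The unsigned difference keeps the test valid across counter wraparound,
 * as long as the two values are less than 2^31 apart. */
static inline bool sketch_resource_is_idle(const struct sketch_resource *resource,
        const struct sketch_queue *queue)
{
    return (uint32_t)(queue->tail - resource->access_time) < UINT32_C(0x80000000);
}
```

In this sketch the per-resource cost on the main thread is one store in sketch_resource_acquire, and the CS thread only advances the tail it has to advance anyway.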
Items for discussion:
1) I am not entirely sure I am doing the ULONG / LONG handling correctly. I guess we could get away with just keeping everything as signed LONGs, but technically signed integer overflow is undefined behavior. Interlocked ops accept LONG * though...
2) resource_acquire could be renamed to something else.
3) Separate read and write timestamps. This should be easy to add on top of the current code.
4) Traversing resource->device->cs->queue in wined3d_resource_acquire is ugly. I'm contemplating passing const struct wined3d_cs or the timestamp to it explicitly.
5) We still iterate over a huge number of resources. Does anyone have ideas how to cut this down?
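On item 1, one possible way to avoid the signed overflow question is to store the timestamp as a signed value, so it can still be handed to interlocked ops, and do all arithmetic and comparisons in the unsigned type. The sketch below is only an illustration: SKETCH_LONG and SKETCH_ULONG are hypothetical stand-ins for LONG and ULONG, and the helper names are made up. It relies on the conversion from unsigned back to signed being a two's complement wrap, which is implementation-defined in C rather than undefined, and which I assume holds on every compiler we care about.

```c
#include <stdint.h>

typedef int32_t SKETCH_LONG;   /* hypothetical stand-in for LONG */
typedef uint32_t SKETCH_ULONG; /* hypothetical stand-in for ULONG */

/* Advance a timestamp stored as a signed value. The addition is done in
 * the unsigned type, where wraparound is well defined. */
static inline SKETCH_LONG sketch_timestamp_next(SKETCH_LONG t)
{
    return (SKETCH_LONG)(SKETCH_ULONG)((SKETCH_ULONG)t + 1u);
}

/* Return nonzero if timestamp a is at or after timestamp b. The unsigned
 * difference tolerates wraparound, provided a and b are less than 2^31
 * apart. */
static inline int sketch_timestamp_reached(SKETCH_LONG a, SKETCH_LONG b)
{
    return (SKETCH_ULONG)((SKETCH_ULONG)a - (SKETCH_ULONG)b) < UINT32_C(0x80000000);
}
```

With this, a timestamp that wraps from INT32_MAX to INT32_MIN still compares as newer than its predecessor, and no signed addition or subtraction is ever performed.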
Happy New Year,
Stefan