On Tue, 29 Mar 2022 at 18:51, Jan Sikorski jsikorski@codeweavers.com wrote:
My test subject is Final Fantasy XIV, which stresses this functionality quite a bit. It’s executing many draw calls in total, tightly, and between the calls it almost always changes the shader, input layout and/or vertex buffers. This patch increases the frame rate of a particular scene from 73 to 85 (approximately). In another test I changed the comparison function to be a plain memcmp of the entire pipeline key (with adjustments to make it work), which performed at 52 fps. Previously I also tried using the hash for early rejection only, but it didn’t improve the frame rate too much (sorry, no number now). As it turns out most of the comparisons, in FFXIV anyway, are decided early, with the notable exception of the last one in case of a cache hit. Last thing I tried was separating tracking STATE_STREAMSRC from the stuff that only requires us to rebind vertex buffers, i.e. buffers and offsets, which didn’t have a measurable effect on the frame rate in this case. The search took a significant percentage of adapter_vk_draw_primitive call, which I eyeballed to be around 30% on average, about half of which is the final full comparison. It looks like it’s slow because the keys are big, also a profiler is telling me that cache and branch predictor misses are impacting this code (though I don’t trust this reading much). In that case what you mentioned would likely improve it, but maybe not quite as much using just the hash, especially when compared to using the shrinked key for computing the hash too. Given that the lookup is taking up a sizeable chunk of the entire draw call, and this is a hot path, I’m in favor of making sure it’s about as fast as it can be. Regarding controversy, as long as the probability of a collision is of about the same magnitude as probability of hardware failure, I don’t really care. 64 bits might be on the edge, but 128 would suffice?
I suppose that's fair enough. For what it's worth, I'm not usually that lucky, so I tend to view these things as "if it can collide you might as well assume it's going to collide". And of course it's going to be a total pain to debug when it does. That said, I'd be willing to give the 128-bit hash a try.
A couple of other thoughts:
- 73 fps to 85 fps would be about a 2 ms improvement. The size of the wined3d_graphics_pipeline_key_vk structure is about 1.5 kB. (We have two of these structures when doing the compare, but we still need to read one of them with the proposed patch.) How many draw calls are there per frame on average in this application? 5000 may be a bit on the high side, but that would get us to about 3.6 GiB/s. That seems plausible for DDR4, but a bit on the low side. Any chance some well-placed __builtin_prefetch() calls would improve things? It's clearly not going to be as fast as simply skipping the compare, but perhaps we can get a bit closer to actual DDR4 speeds.
- This seems to imply that hashing the key is still going to take a significant amount of time; about on the order of the 2 ms the proposed patch would save. Reducing the key size would still seem helpful in that case.
- If we can get the key to fit in a cacheline, we're probably not going to see further improvement by comparing only the hashes. This may be non-trivial though.