On 28 Mar 2022, at 14:57, Henri Verbeet hverbeet@gmail.com wrote:
On Fri, 25 Mar 2022 at 13:31, Jan Sikorski jsikorski@codeweavers.com wrote:
Avoid doing full comparisons, which are expensive. There's some possible improvements that can be made if necessary:
- Select a better hash function. Murmur64A was picked mainly because it's small and simple.
- Don't always hash the entire pipeline key, hash incrementally.
- Use an actual hash table instead of a tree, or at least allocate the nodes more contiguously.
Note that it is assumed that no hash collision will happen. The probability of one is low (under an assumption that the hash function is good), but we could reasonably use a 128 bit hash to make it even less probable.
Hash function implementation based on the one published in SMHasher. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
I think the first part of this, including a hash of the pipeline key for easy rejection, is uncontroversial, although it would be helpful to include some performance information. I.e., how much does this actually help?
Skipping the full compare entirely seems quite a bit more questionable. I know from our earlier conversation on IRC that the impact of that was significant, although it wasn't immediately clear why that's the case. I.e., is the issue excessive pipeline invalidation, or is this simply about the size of the key, or is it e.g. about chasing the pointers contained in the key? Those are potentially all issues we could mitigate. For example, instead of including e.g. a VkPipelineColorBlendStateCreateInfo structure in the key, we could include a pointer to one and then deduplicate those. That would mean introducing a lookup for "blend_desc" in wined3d_context_vk_update_blend_state(), but would allows us to do a simple pointer comparison in wined3d_graphics_pipeline_vk_compare(), reducing the size of the pipeline key, and potentially helping the cases where STATE_BLEND and STATE_FRAMEBUFFER weren't invalidated.
Somewhat similarly, wined3d state objects like wined3d_blend_state are effectively unique. In the case of d3d9 and earlier that's because of the way we do state tracking there (i.e., see wined3d_device_apply_stateblock()), and in the case of d3d10+ that's because state objects are already deduplicated on that level. That means it may be advantageous to use the D3D state as key instead of the Vulkan state in some cases, because we could simply use pointer compares for the D3D state.
My test subject is Final Fantasy XIV, which stresses this functionality quite a bit. It’s executing many draw calls in total, tightly, and between the calls it almost always changes the shader, input layout and/or vertex buffers. This patch increases the frame rate of a particular scene from 73 to 85 (approximately). In another test I changed the comparison function to be a plain memcmp of the entire pipeline key (with adjustments to make it work), which performed at 52 fps. Previously I also tried using the hash for early rejection only, but it didn’t improve the frame rate too much (sorry, no number now). As it turns out most of the comparisons, in FFXIV anyway, are decided early, with the notable exception of the last one in case of a cache hit. Last thing I tried was separating tracking STATE_STREAMSRC from the stuff that only requires us to rebind vertex buffers, i.e. buffers and offsets, which didn’t have a measurable effect on the frame rate in this case. The search took a significant percentage of adapter_vk_draw_primitive call, which I eyeballed to be around 30% on average, about half of which is the final full comparison. It looks like it’s slow because the keys are big, also a profiler is telling me that cache and branch predictor misses are impacting this code (though I don’t trust this reading much). In that case what you mentioned would likely improve it, but maybe not quite as much using just the hash, especially when compared to using the shrinked key for computing the hash too. Given that the lookup is taking up a sizeable chunk of the entire draw call, and this is a hot path, I’m in favor of making sure it’s about as fast as it can be. Regarding controversy, as long as the probability of a collision is of about the same magnitude as probability of hardware failure, I don’t really care. 64 bits might be on the edge, but 128 would suffice?
Cheers, Jan.