Avoid doing full comparisons, which are expensive. There are some possible improvements that could be made if necessary:
- Select a better hash function. Murmur64A was picked mainly because it's small and simple.
- Don't always hash the entire pipeline key; hash incrementally.
- Use an actual hash table instead of a tree, or at least allocate the nodes more contiguously.
Note that it is assumed that no hash collision will happen. The probability of one is low (assuming the hash function is good), but we could reasonably use a 128-bit hash to make it even less probable.
Hash function implementation based on the one published in SMHasher. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
Signed-off-by: Jan Sikorski <jsikorski@codeweavers.com>
---
 dlls/wined3d/context_vk.c      | 128 ++++++++++++++++-----------------
 dlls/wined3d/wined3d_private.h |   2 +-
 2 files changed, 64 insertions(+), 66 deletions(-)
diff --git a/dlls/wined3d/context_vk.c b/dlls/wined3d/context_vk.c
index 436973fe76f..e853bd568e2 100644
--- a/dlls/wined3d/context_vk.c
+++ b/dlls/wined3d/context_vk.c
@@ -1844,82 +1844,78 @@ static int wined3d_pipeline_layout_vk_compare(const void *key, const struct wine
     return memcmp(a->bindings, b->bindings, a->binding_count * sizeof(*a->bindings));
 }
-static int wined3d_graphics_pipeline_vk_compare(const void *key, const struct wine_rb_entry *entry)
+uint64_t hash_murmur64a(const void *key, int length, uint64_t seed)
 {
-    const struct wined3d_graphics_pipeline_key_vk *a = key;
-    const struct wined3d_graphics_pipeline_key_vk *b = &WINE_RB_ENTRY_VALUE(entry,
-            const struct wined3d_graphics_pipeline_vk, entry)->key;
-    unsigned int i;
-    int ret;
+    const uint64_t *qwords = (const uint64_t *)key;
+    const uint64_t factor = 0xc6a4a7935bd1e995LLU;
+    const uint64_t *qend = &qwords[length >> 3];
+    const uint8_t *bytes = (const uint8_t *)qend;
+    uint64_t hash = seed ^ (length * factor);
+    const int shift = 47;
-    if ((ret = wined3d_uint32_compare(a->pipeline_desc.stageCount, b->pipeline_desc.stageCount)))
-        return ret;
-    for (i = 0; i < a->pipeline_desc.stageCount; ++i)
+    while (qwords != qend)
     {
-        if ((ret = wined3d_uint64_compare(a->stages[i].module, b->stages[i].module)))
-            return ret;
-    }
+        uint64_t value = *qwords++;
-    if ((ret = wined3d_uint32_compare(a->divisor_desc.vertexBindingDivisorCount,
-            b->divisor_desc.vertexBindingDivisorCount)))
-        return ret;
-    if ((ret = memcmp(a->divisors, b->divisors,
-            a->divisor_desc.vertexBindingDivisorCount * sizeof(*a->divisors))))
-        return ret;
+        value *= factor;
+        value ^= value >> shift;
+        value *= factor;
-    if ((ret = wined3d_uint32_compare(a->input_desc.vertexAttributeDescriptionCount,
-            b->input_desc.vertexAttributeDescriptionCount)))
-        return ret;
-    if ((ret = memcmp(a->attributes, b->attributes,
-            a->input_desc.vertexAttributeDescriptionCount * sizeof(*a->attributes))))
-        return ret;
-    if ((ret = wined3d_uint32_compare(a->input_desc.vertexBindingDescriptionCount,
-            b->input_desc.vertexBindingDescriptionCount)))
-        return ret;
-    if ((ret = memcmp(a->bindings, b->bindings,
-            a->input_desc.vertexBindingDescriptionCount * sizeof(*a->bindings))))
-        return ret;
-
-    if ((ret = wined3d_uint32_compare(a->ia_desc.topology, b->ia_desc.topology)))
-        return ret;
-    if ((ret = wined3d_uint32_compare(a->ia_desc.primitiveRestartEnable, b->ia_desc.primitiveRestartEnable)))
-        return ret;
-
-    if ((ret = wined3d_uint32_compare(a->ts_desc.patchControlPoints, b->ts_desc.patchControlPoints)))
-        return ret;
+        hash ^= value;
+        hash *= factor;
+    }
-    if ((ret = memcmp(&a->viewport, &b->viewport, sizeof(a->viewport))))
-        return ret;
+    switch (length & 7)
+    {
+        case 7: hash ^= (uint64_t)(bytes[6]) << 48;
+        case 6: hash ^= (uint64_t)(bytes[5]) << 40;
+        case 5: hash ^= (uint64_t)(bytes[4]) << 32;
+        case 4: hash ^= (uint64_t)(bytes[3]) << 24;
+        case 3: hash ^= (uint64_t)(bytes[2]) << 16;
+        case 2: hash ^= (uint64_t)(bytes[1]) << 8;
+        case 1: hash ^= (uint64_t)(bytes[0]);
-    if ((ret = memcmp(&a->scissor, &b->scissor, sizeof(a->scissor))))
-        return ret;
+                hash *= factor;
+    };
-    if ((ret = memcmp(&a->rs_desc, &b->rs_desc, sizeof(a->rs_desc))))
-        return ret;
+    hash ^= hash >> shift;
+    hash *= factor;
+    hash ^= hash >> shift;
-    if ((ret = wined3d_uint32_compare(a->ms_desc.rasterizationSamples, b->ms_desc.rasterizationSamples)))
-        return ret;
-    if ((ret = wined3d_uint32_compare(a->ms_desc.alphaToCoverageEnable, b->ms_desc.alphaToCoverageEnable)))
-        return ret;
-    if ((ret = wined3d_uint32_compare(a->sample_mask, b->sample_mask)))
-        return ret;
+    return hash;
+}
-    if ((ret = memcmp(&a->ds_desc, &b->ds_desc, sizeof(a->ds_desc))))
-        return ret;
+static uint64_t wined3d_graphics_pipeline_vk_key_hash(struct wined3d_graphics_pipeline_key_vk *key)
+{
+    uint64_t result = 0, i;
-    if ((ret = wined3d_uint32_compare(a->blend_desc.attachmentCount, b->blend_desc.attachmentCount)))
-        return ret;
-    if ((ret = memcmp(a->blend_attachments, b->blend_attachments,
-            a->blend_desc.attachmentCount * sizeof(*a->blend_attachments))))
-        return ret;
+    for (i = 0; i < key->pipeline_desc.stageCount; ++i)
+        result = hash_murmur64a(&key->stages[i].module, sizeof(key->stages[i].module), result);
-    if ((ret = wined3d_uint64_compare(a->pipeline_desc.layout, b->pipeline_desc.layout)))
-        return ret;
+    result = hash_murmur64a(key->divisors, key->divisor_desc.vertexBindingDivisorCount * sizeof(*key->divisors), result);
+    result = hash_murmur64a(key->attributes, key->input_desc.vertexAttributeDescriptionCount * sizeof(*key->attributes), result);
+    result = hash_murmur64a(key->bindings, key->input_desc.vertexBindingDescriptionCount * sizeof(*key->bindings), result);
+    result = hash_murmur64a(&key->viewport, sizeof(key->viewport), result);
+    result = hash_murmur64a(&key->scissor, sizeof(key->scissor), result);
+    result = hash_murmur64a(&key->sample_mask, sizeof(key->sample_mask), result);
+    result = hash_murmur64a(key->blend_attachments, key->blend_desc.attachmentCount * sizeof(*key->blend_attachments), result);
+    result = hash_murmur64a(&key->ia_desc.topology, sizeof(key->ia_desc.topology), result);
+    result = hash_murmur64a(&key->ia_desc.primitiveRestartEnable, sizeof(key->ia_desc.primitiveRestartEnable), result);
+    result = hash_murmur64a(&key->ts_desc.patchControlPoints, sizeof(key->ts_desc.patchControlPoints), result);
+    result = hash_murmur64a(&key->rs_desc, sizeof(key->rs_desc), result);
+    result = hash_murmur64a(&key->ms_desc.rasterizationSamples, sizeof(key->ms_desc.rasterizationSamples), result);
+    result = hash_murmur64a(&key->ms_desc.alphaToCoverageEnable, sizeof(key->ms_desc.alphaToCoverageEnable), result);
+    result = hash_murmur64a(&key->ds_desc, sizeof(key->ds_desc), result);
+    result = hash_murmur64a(&key->pipeline_desc.layout, sizeof(key->pipeline_desc.layout), result);
+    result = hash_murmur64a(&key->pipeline_desc.renderPass, sizeof(key->pipeline_desc.renderPass), result);
-    if ((ret = wined3d_uint64_compare(a->pipeline_desc.renderPass, b->pipeline_desc.renderPass)))
-        return ret;
+    return result;
+}
-    return 0;
+static int wined3d_graphics_pipeline_vk_compare(const void *hash, const struct wine_rb_entry *entry)
+{
+    return wined3d_uint64_compare(*(uint64_t *)hash,
+            WINE_RB_ENTRY_VALUE(entry, const struct wined3d_graphics_pipeline_vk, entry)->hash);
 }
 static int wined3d_bo_slab_vk_compare(const void *key, const struct wine_rb_entry *entry)
@@ -3120,15 +3116,17 @@ static VkPipeline wined3d_context_vk_get_graphics_pipeline(struct wined3d_contex
     struct wined3d_graphics_pipeline_vk *pipeline_vk;
     struct wined3d_graphics_pipeline_key_vk *key;
     struct wine_rb_entry *entry;
+    uint64_t hash;
     VkResult vr;
     key = &context_vk->graphics.pipeline_key_vk;
-    if ((entry = wine_rb_get(&context_vk->graphics_pipelines, key)))
+    hash = wined3d_graphics_pipeline_vk_key_hash(key);
+    if ((entry = wine_rb_get(&context_vk->graphics_pipelines, &hash)))
         return WINE_RB_ENTRY_VALUE(entry, struct wined3d_graphics_pipeline_vk, entry)->vk_pipeline;
     if (!(pipeline_vk = heap_alloc(sizeof(*pipeline_vk))))
         return VK_NULL_HANDLE;
-    pipeline_vk->key = *key;
+    pipeline_vk->hash = hash;
     if ((vr = VK_CALL(vkCreateGraphicsPipelines(device_vk->vk_device, VK_NULL_HANDLE,
             1, &key->pipeline_desc, NULL, &pipeline_vk->vk_pipeline))) < 0)
@@ -3138,7 +3136,7 @@ static VkPipeline wined3d_context_vk_get_graphics_pipeline(struct wined3d_contex
         return VK_NULL_HANDLE;
     }
-    if (wine_rb_put(&context_vk->graphics_pipelines, &pipeline_vk->key, &pipeline_vk->entry) == -1)
+    if (wine_rb_put(&context_vk->graphics_pipelines, &pipeline_vk->hash, &pipeline_vk->entry) == -1)
         ERR("Failed to insert pipeline.\n");
     return pipeline_vk->vk_pipeline;
diff --git a/dlls/wined3d/wined3d_private.h b/dlls/wined3d/wined3d_private.h
index 5b3eb0136b0..34b7c9eab29 100644
--- a/dlls/wined3d/wined3d_private.h
+++ b/dlls/wined3d/wined3d_private.h
@@ -2521,7 +2521,7 @@ struct wined3d_graphics_pipeline_key_vk
 struct wined3d_graphics_pipeline_vk
 {
     struct wine_rb_entry entry;
-    struct wined3d_graphics_pipeline_key_vk key;
+    uint64_t hash;
     VkPipeline vk_pipeline;
 };
On Fri, 25 Mar 2022 at 13:31, Jan Sikorski jsikorski@codeweavers.com wrote:
[...]
I think the first part of this, including a hash of the pipeline key for easy rejection, is uncontroversial, although it would be helpful to include some performance information. I.e., how much does this actually help?
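For concreteness, a minimal sketch of what that "hash for easy rejection" variant might look like follows; the lookup struct, the stored hash field next to the existing key, and the wined3d_graphics_pipeline_vk_key_compare() helper (the current field-by-field comparison factored out) are all hypothetical, not existing code:

struct wined3d_graphics_pipeline_lookup_vk
{
    uint64_t hash;                                      /* computed once per lookup */
    const struct wined3d_graphics_pipeline_key_vk *key;
};

static int wined3d_graphics_pipeline_vk_compare(const void *key, const struct wine_rb_entry *entry)
{
    const struct wined3d_graphics_pipeline_lookup_vk *a = key;
    const struct wined3d_graphics_pipeline_vk *b = WINE_RB_ENTRY_VALUE(entry,
            const struct wined3d_graphics_pipeline_vk, entry);
    int ret;

    /* Order the tree primarily by hash; almost every mismatch is rejected
     * here without touching the large key. */
    if ((ret = wined3d_uint64_compare(a->hash, b->hash)))
        return ret;

    /* Hashes are equal: fall back to the full comparison to rule out collisions. */
    return wined3d_graphics_pipeline_vk_key_compare(a->key, &b->key);
}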
Skipping the full compare entirely seems quite a bit more questionable. I know from our earlier conversation on IRC that the impact of that was significant, although it wasn't immediately clear why that's the case. I.e., is the issue excessive pipeline invalidation, or is this simply about the size of the key, or is it e.g. about chasing the pointers contained in the key? Those are potentially all issues we could mitigate. For example, instead of including e.g. a VkPipelineColorBlendStateCreateInfo structure in the key, we could include a pointer to one and then deduplicate those. That would mean introducing a lookup for "blend_desc" in wined3d_context_vk_update_blend_state(), but would allow us to do a simple pointer comparison in wined3d_graphics_pipeline_vk_compare(), reducing the size of the pipeline key, and potentially helping the cases where STATE_BLEND and STATE_FRAMEBUFFER weren't invalidated.
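A rough sketch of that deduplication idea, reusing the existing rb-tree helpers; the entry struct and function names below are purely illustrative, and for brevity the compare only keys on the attachment data:

struct wined3d_blend_desc_entry_vk
{
    struct wine_rb_entry entry;
    VkPipelineColorBlendStateCreateInfo desc;
    VkPipelineColorBlendAttachmentState attachments[WINED3D_MAX_RENDER_TARGETS];
};

static int wined3d_blend_desc_vk_compare(const void *key, const struct wine_rb_entry *entry)
{
    const VkPipelineColorBlendStateCreateInfo *a = key;
    const struct wined3d_blend_desc_entry_vk *b = WINE_RB_ENTRY_VALUE(entry,
            const struct wined3d_blend_desc_entry_vk, entry);
    int ret;

    /* A complete version would also compare logicOpEnable, logicOp, etc. */
    if ((ret = wined3d_uint32_compare(a->attachmentCount, b->desc.attachmentCount)))
        return ret;
    return memcmp(a->pAttachments, b->attachments, a->attachmentCount * sizeof(*a->pAttachments));
}

/* Return a canonical copy of "desc"; the pipeline key would then store this
 * pointer, and wined3d_graphics_pipeline_vk_compare() could compare pointers. */
static const VkPipelineColorBlendStateCreateInfo *wined3d_context_vk_dedup_blend_desc(
        struct wine_rb_tree *blend_descs, const VkPipelineColorBlendStateCreateInfo *desc)
{
    struct wined3d_blend_desc_entry_vk *e;
    struct wine_rb_entry *entry;

    if ((entry = wine_rb_get(blend_descs, desc)))
        return &WINE_RB_ENTRY_VALUE(entry, struct wined3d_blend_desc_entry_vk, entry)->desc;

    if (!(e = heap_alloc(sizeof(*e))))
        return NULL;
    e->desc = *desc;
    memcpy(e->attachments, desc->pAttachments, desc->attachmentCount * sizeof(*e->attachments));
    e->desc.pAttachments = e->attachments;
    if (wine_rb_put(blend_descs, &e->desc, &e->entry) == -1)
        ERR("Failed to insert blend description.\n");
    return &e->desc;
}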
Somewhat similarly, wined3d state objects like wined3d_blend_state are effectively unique. In the case of d3d9 and earlier that's because of the way we do state tracking there (i.e., see wined3d_device_apply_stateblock()), and in the case of d3d10+ that's because state objects are already deduplicated on that level. That means it may be advantageous to use the D3D state as key instead of the Vulkan state in some cases, because we could simply use pointer compares for the D3D state.
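As a rough illustration of keying on the D3D state instead (the struct and helper below are purely illustrative, not existing wined3d code), the key could carry the deduplicated state object pointers, and the compare reduces to pointer comparisons:

struct wined3d_graphics_pipeline_d3d_key_vk
{
    const struct wined3d_blend_state *blend_state;
    const struct wined3d_depth_stencil_state *depth_stencil_state;
    const struct wined3d_rasterizer_state *rasterizer_state;
    /* Shaders, vertex declaration, render pass, etc. would follow. */
};

static int wined3d_graphics_pipeline_d3d_key_compare(const struct wined3d_graphics_pipeline_d3d_key_vk *a,
        const struct wined3d_graphics_pipeline_d3d_key_vk *b)
{
    int ret;

    /* The state objects are effectively unique, so pointer identity is enough. */
    if ((ret = wined3d_uint64_compare((uintptr_t)a->blend_state, (uintptr_t)b->blend_state)))
        return ret;
    if ((ret = wined3d_uint64_compare((uintptr_t)a->depth_stencil_state, (uintptr_t)b->depth_stencil_state)))
        return ret;
    return wined3d_uint64_compare((uintptr_t)a->rasterizer_state, (uintptr_t)b->rasterizer_state);
}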
+uint64_t hash_murmur64a(const void *key, int length, uint64_t seed) {
- const struct wined3d_graphics_pipeline_key_vk *a = key;
- const struct wined3d_graphics_pipeline_key_vk *b = &WINE_RB_ENTRY_VALUE(entry,
const struct wined3d_graphics_pipeline_vk, entry)->key;
- unsigned int i;
- int ret;
- const uint64_t *qwords = (const uint64_t *)key;
- const uint64_t factor = 0xc6a4a7935bd1e995LLU;
We tend to use the lowercase type suffixes.
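(That is, the constant above would presumably become something like:)

    const uint64_t factor = 0xc6a4a7935bd1e995ull;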
On 28 Mar 2022, at 14:57, Henri Verbeet hverbeet@gmail.com wrote:
[...]
My test subject is Final Fantasy XIV, which stresses this functionality quite a bit. It executes many draw calls per frame, in tight succession, and between calls it almost always changes the shader, the input layout and/or the vertex buffers. This patch increases the frame rate of a particular scene from roughly 73 to 85 fps. In another test I changed the comparison function to a plain memcmp of the entire pipeline key (with adjustments to make it work), which performed at 52 fps. Previously I also tried using the hash for early rejection only, but it didn't improve the frame rate much (sorry, I don't have the number anymore). As it turns out, most of the comparisons, in FFXIV anyway, are decided early, with the notable exception of the last one in the case of a cache hit. The last thing I tried was separating the tracking of STATE_STREAMSRC from the state that only requires us to rebind vertex buffers, i.e. buffers and offsets, which didn't have a measurable effect on the frame rate in this case.

The search took a significant percentage of the adapter_vk_draw_primitive() call, which I eyeballed to be around 30% on average, about half of which is the final full comparison. It looks like it's slow because the keys are big; a profiler is also telling me that cache and branch predictor misses are impacting this code (though I don't trust that reading much). In that case what you mentioned would likely improve it, but maybe not quite as much as using just the hash, especially when compared to using the shrunk key for computing the hash too.

Given that the lookup takes up a sizeable chunk of the entire draw call, and this is a hot path, I'm in favor of making it about as fast as it can be. Regarding controversy: as long as the probability of a collision is of about the same magnitude as the probability of hardware failure, I don't really care. 64 bits might be on the edge, but 128 would suffice?
Cheers, Jan.
On Tue, 29 Mar 2022 at 18:51, Jan Sikorski jsikorski@codeweavers.com wrote:
[...]
I suppose that's fair enough. For what it's worth, I'm not usually that lucky, so I tend to view these things as "if it can collide you might as well assume it's going to collide". And of course it's going to be a total pain to debug when it does. That said, I'd be willing to give the 128-bit hash a try.
A couple of other thoughts:
- 73 fps to 85 fps would be about a 2 ms improvement. The size of the wined3d_graphics_pipeline_key_vk structure is about 1.5 kB. (We have two of these structures when doing the compare, but we still need to read one of them with the proposed patch.) How many draw calls are there per frame on average in this application? 5000 may be a bit on the high side, but that would get us to about 3.6 GiB/s. That seems plausible for DDR4, but a bit on the low side. Any chance some well-placed __builtin_prefetch() calls would improve things (see the sketch after this list)? It's clearly not going to be as fast as simply skipping the compare, but perhaps we can get a bit closer to actual DDR4 speeds.
- This seems to imply that hashing the key is still going to take a significant amount of time; on the order of the 2 ms the proposed patch would save. Reducing the key size would still seem helpful in that case.
- If we can get the key to fit in a cacheline, we're probably not going to see further improvement by comparing only the hashes. This may be non-trivial though.
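A rough sketch of where such a prefetch could go (untested; it assumes the full key is still stored in the tree node, relies on struct wine_rb_entry exposing its left/right children, and uses a plain memcmp as a stand-in for the existing field-by-field comparison):

static int wined3d_graphics_pipeline_vk_compare(const void *key, const struct wine_rb_entry *entry)
{
    const struct wined3d_graphics_pipeline_key_vk *a = key;
    const struct wined3d_graphics_pipeline_vk *b = WINE_RB_ENTRY_VALUE(entry,
            const struct wined3d_graphics_pipeline_vk, entry);

    /* Start pulling the children's keys into the cache while the current node
     * is being compared; whichever child the search descends into is then
     * already in flight.  Only the first cache line of each key is hinted,
     * which matches the observation that most comparisons are decided early. */
    if (entry->left)
        __builtin_prefetch(&WINE_RB_ENTRY_VALUE(entry->left,
                const struct wined3d_graphics_pipeline_vk, entry)->key, 0);
    if (entry->right)
        __builtin_prefetch(&WINE_RB_ENTRY_VALUE(entry->right,
                const struct wined3d_graphics_pipeline_vk, entry)->key, 0);

    /* The existing field-by-field comparison of *a against b->key would follow;
     * a plain memcmp() stands in for it here. */
    return memcmp(a, &b->key, sizeof(*a));
}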