This is probably the most ugly and controversial bit of API. I don't really know if this is the right approach to solving this.
sm4 registers match by register index, such that shaders can mostly be compiled in isolation. sm1 does not—registers may be specified in any order in the vertex and pixel shaders, and will be matched by usage and usage index.
By itself this is not much of a problem. Where it gets hairy is that we want to do some degree of caching, as well as pre-compilation, and avoid recompiling either shader every time it's matched with a new one.
Wine currently deals with this problem, for GLSL, by generating a "main" GLSL shader for the vertex shader, and then an extra function setup_vs_output(), and linking the two together every time a new pixel shader is used. This could in theory be used for SPIR-V, but it requires the use of extra, probably external, code to link SPIR-V shaders together, which I do not particularly anticipate being well-received.
(I'm not sure how Wine deals with this problem in the ARB backend. It seems to take the pixel shader signature into account when generating the vertex shader— cf. init_output_registers()—but it doesn't take it into account when looking up a vertex shader variant? I didn't look too closely at the code, so maybe I'm missing something.)
--
The vkd3d parts of this patch are quite straightforward, and looking at them, I think the design is quite intuitive in isolation. There may be some room for internal refactoring (in particular with an eye to not so much overloading the "register_index" field of struct shader_signature_element) but I'm relatively happy with the way it turned out. In isolation, that is.
The Wine part is worse. I've uploaded branches for vkd3d and Wine that use this API, and correctly handle shaders with some nontrivial reordering:
https://gitlab.winehq.org/zfigura/vkd3d/-/commits/himavant5
https://gitlab.winehq.org/zfigura/wine/-/commits/himavant_cb
The test can be run, as before, like so:
make tests/shader_runner.exe && WINE_D3D_CONFIG=renderer=vulkan wine tests/shader_runner.exe ../vkd3d/tests/hlsl/sm1-interstage-interface.shader_test
The interesting Wine parts are concentrated in a single patch, 5cfb9d930f11e. The patch takes a few shortcuts, partly because I wanted not to block the vkd3d API design questions, but also because while writing it I came up with a couple of problems that I wasn't sure how to fix. There are two main problems I see:
(1) This patch has the user pass the signature from the pixel shader when compiling the vertex shader, and looks up register indices already arbitrarily allocated by the pixel shader. This is problematic when trying to use this signature as a cache key, by virtue of it not being clear (or even defined) which fields are key elements and which aren't. It's also not particularly kind to lookup, on account of not being directly comparable with memcmp(). There are a few options I see:
(a) Provide an internal function to compare key elements. This feels... odd, like a very special-purpose function, but perhaps workable.
(b) Just make the user deal with it, and assert that all fields are key elements.
(c) Use some alternative, perhaps shortened structure as a field of vkd3d_shader_next_stage_info. This has the disadvantage that it is not as simple for a hypothetical user to retrieve from the pixel shader, but we would presumably provide a function to generate one from a shader signature. This would probably also be kinder to cache lookup if it's shorter.
(d) Make caching vkd3d's responsibility, to some degree. This seems daunting, but the more we optimize, the more difficult it may be to design API that allows for nice caching.
(2) Assuming we use signatures, there is a memory management problem that 5cfb9d930f11e spells out. This is probably a matter of "just fix it", but I suppose another option is to take the GLSL or ARB architecture.
--
April is the cruellest month, breeding Lilacs out of the dead land, mixing Memory and desire, stirring Dull roots with spring rain.
-- v2: vkd3d-shader: Implement a function to build a varying map between sm1 stages. vkd3d-shader: Implement remapping shader output registers to match the next shader's semantics.
From: Zebediah Figura zfigura@codeweavers.com
spirv will need this. --- libs/vkd3d-shader/d3dbc.c | 7 ++++++- tests/vkd3d_shader_api.c | 6 +++--- 2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/libs/vkd3d-shader/d3dbc.c b/libs/vkd3d-shader/d3dbc.c index 9a147e3d..22e463bf 100644 --- a/libs/vkd3d-shader/d3dbc.c +++ b/libs/vkd3d-shader/d3dbc.c @@ -524,6 +524,8 @@ static struct signature_element *find_signature_element_by_register_index( return NULL; }
+#define SM1_COLOR_REGISTER_OFFSET 8 + static bool add_signature_element(struct vkd3d_shader_sm1_parser *sm1, bool output, const char *name, unsigned int index, enum vkd3d_shader_sysval_semantic sysval, unsigned int register_index, bool is_dcl, unsigned int mask) @@ -606,7 +608,7 @@ static bool add_signature_element_from_register(struct vkd3d_shader_sm1_parser * return true; } return add_signature_element(sm1, false, "COLOR", register_index, - VKD3D_SHADER_SV_NONE, register_index, is_dcl, mask); + VKD3D_SHADER_SV_NONE, SM1_COLOR_REGISTER_OFFSET + register_index, is_dcl, mask);
case VKD3DSPR_TEXTURE: /* For vertex shaders, this is ADDR. */ @@ -633,6 +635,9 @@ static bool add_signature_element_from_register(struct vkd3d_shader_sm1_parser * /* fall through */
case VKD3DSPR_ATTROUT: + return add_signature_element(sm1, true, "COLOR", register_index, + VKD3D_SHADER_SV_NONE, SM1_COLOR_REGISTER_OFFSET + register_index, is_dcl, mask); + case VKD3DSPR_COLOROUT: return add_signature_element(sm1, true, "COLOR", register_index, VKD3D_SHADER_SV_NONE, register_index, is_dcl, mask); diff --git a/tests/vkd3d_shader_api.c b/tests/vkd3d_shader_api.c index 8bd728d9..4893f44f 100644 --- a/tests/vkd3d_shader_api.c +++ b/tests/vkd3d_shader_api.c @@ -505,7 +505,7 @@ static void test_scan_signatures(void) { {"POSITION", 0, 0, VKD3D_SHADER_SV_POSITION, VKD3D_SHADER_COMPONENT_FLOAT, 0, 0xf, 0xf}, {"TEXCOORD", 2, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 2, 0xf, 0xf}, - {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 0, 0xf, 0xf}, + {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 8, 0xf, 0xf}, {"FOG", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 1, 0x1, 0x1}, {"PSIZE", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 2, 0x1, 0x1}, }; @@ -582,7 +582,7 @@ static void test_scan_signatures(void) static const struct vkd3d_shader_signature_element ps2_inputs[] = { {"TEXCOORD", 2, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 2, 0xf, 0xf}, - {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 0, 0xf, 0xf}, + {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 8, 0xf, 0xf}, };
static const struct vkd3d_shader_signature_element ps2_outputs[] = @@ -603,7 +603,7 @@ static void test_scan_signatures(void)
static const struct vkd3d_shader_signature_element ps3_inputs[] = { - {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 0, 0xf, 0xf}, + {"COLOR", 0, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 8, 0xf, 0xf}, {"TEXCOORD", 2, 0, VKD3D_SHADER_SV_NONE, VKD3D_SHADER_COMPONENT_FLOAT, 2, 0xf, 0xf}, };
From: Zebediah Figura zfigura@codeweavers.com
--- include/vkd3d_shader.h | 74 ++++++++++++++++++++++++++++++++++++++- libs/vkd3d-shader/spirv.c | 35 ++++++++++++++++++ 2 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/include/vkd3d_shader.h b/include/vkd3d_shader.h index 8133d240..d4ad2d07 100644 --- a/include/vkd3d_shader.h +++ b/include/vkd3d_shader.h @@ -90,6 +90,11 @@ enum vkd3d_shader_structure_type * \since 1.9 */ VKD3D_SHADER_STRUCTURE_TYPE_SCAN_SIGNATURE_INFO, + /** + * The structure is a vkd3d_shader_next_stage_info structure. + * \since 1.9 + */ + VKD3D_SHADER_STRUCTURE_TYPE_NEXT_STAGE_INFO,
VKD3D_FORCE_32_BIT_ENUM(VKD3D_SHADER_STRUCTURE_TYPE), }; @@ -1659,6 +1664,72 @@ struct vkd3d_shader_scan_signature_info struct vkd3d_shader_signature patch_constant; };
+/** + * A structure describing the mapping of a source varying (output) register in + * a shader stage, to a destination varying (input) register in the following + * shader stage. + * + * This structure is provided as an array, and therefore the source register is + * named implicitly by its index in the array. + * + * This structure is used in struct vkd3d_shader_next_stage_info. + */ +struct vkd3d_shader_varying_map +{ + /** The index of the destination varying to map this register to. */ + unsigned int dst_index; + /** The mask consumed by the destination register. */ + unsigned int dst_mask; +}; + +/** + * A chained structure which describes the next shader in the pipeline. + * + * This structure is optional, and should only be provided if there is in fact + * another shader in the pipeline. + * However, depending on the input and output formats, this structure may be + * necessary in order to generate shaders which correctly match each other. + * If the structure or its individual fields are not provided, vkd3d-shader + * will generate shaders which may be correct in isolation, but are not + * guaranteed to correctly match each other. + * + * For example, when compiling legacy Direct3D bytecode to SPIR-V, vkd3d-shader + * needs to assign a mapping of Direct3D usages to SPIR-V varying locations, + * and because legacy Direct3D bytecode need not specify locations in the same + * order between stages (unlike e.g. TPF) vkd3d-shader needs to know that + * mapping when compiling linked stages. + * + * This structure is passed to vkd3d_shader_compile() and extends + * vkd3d_shader_compile_info. + * + * This structure contains only input parameters. + * + * \since 1.9 + */ +struct vkd3d_shader_next_stage_info +{ + /** Must be set to VKD3D_SHADER_STRUCTURE_TYPE_NEXT_STAGE_INFO. */ + enum vkd3d_shader_structure_type type; + /** Optional pointer to a structure containing further parameters. */ + const void *next; + + /** + * A mapping of output varying registers in this shader stage to input + * varying registers in the next shader stage. + * + * If absent, vkd3d-shader will map registers directly based on their + * register index. + * + * This field should be provied when compiling from legacy Direct3D + * bytecode. + */ + const struct vkd3d_shader_varying_map *varying_map; + /** + * The number of registers provided in \ref varying_map. + */ + unsigned int varying_map_size; +}; + #ifdef LIBVKD3D_SHADER_SOURCE # define VKD3D_SHADER_API VKD3D_EXPORT #else @@ -1731,13 +1802,14 @@ VKD3D_SHADER_API const enum vkd3d_shader_target_type *vkd3d_shader_get_supported * * Depending on the source and target types, this function may support the * following chained structures: + * - vkd3d_shader_hlsl_source_info * - vkd3d_shader_interface_info + * - vkd3d_shader_next_stage_info * - vkd3d_shader_scan_descriptor_info * - vkd3d_shader_scan_signature_info * - vkd3d_shader_spirv_domain_shader_target_info * - vkd3d_shader_spirv_target_info * - vkd3d_shader_transform_feedback_info - * - vkd3d_shader_hlsl_source_info * * \param compile_info A chained structure containing compilation parameters. * diff --git a/libs/vkd3d-shader/spirv.c b/libs/vkd3d-shader/spirv.c index dcf957ea..2d0783f5 100644 --- a/libs/vkd3d-shader/spirv.c +++ b/libs/vkd3d-shader/spirv.c @@ -9455,6 +9455,38 @@ static void spirv_compiler_emit_sm1_constant_buffer(struct spirv_compiler *compi spirv_compiler_emit_constant_buffer(compiler, count, &range, ®); }
+static void remap_output_signature(struct shader_signature *signature, + const struct vkd3d_shader_compile_info *compile_info) +{ + const struct vkd3d_shader_next_stage_info *next_stage; + unsigned int i; + + if (!(next_stage = vkd3d_find_struct(compile_info->next, NEXT_STAGE_INFO)) + || !next_stage->varying_map_size) + return; + + for (i = 0; i < signature->element_count; ++i) + { + struct signature_element *e = &signature->elements[i]; + unsigned int reg_index = e->register_index; + unsigned int dst_mask; + + if (reg_index >= next_stage->varying_map_size) + { + ERR("No mapping provided for output varying register %u.\n", reg_index); + continue; + } + + e->register_index = next_stage->varying_map[reg_index].dst_index; + + /* It is illegal in Vulkan if the next shader uses the same varying + * location, but with a different mask. */ + dst_mask = next_stage->varying_map[reg_index].dst_mask; + if (dst_mask && dst_mask != e->mask) + FIXME("Output mask %#x does not match input mask %#x.\n", e->mask, dst_mask); + } +} + static int spirv_compiler_generate_spirv(struct spirv_compiler *compiler, const struct vkd3d_shader_compile_info *compile_info, struct vkd3d_shader_parser *parser, struct vkd3d_shader_code *spirv) @@ -9486,6 +9518,9 @@ static int spirv_compiler_generate_spirv(struct spirv_compiler *compiler, instructions = parser->instructions; memset(&parser->instructions, 0, sizeof(parser->instructions));
+ if (parser->shader_version.major < 4 && parser->shader_version.type == VKD3D_SHADER_TYPE_VERTEX) + remap_output_signature(&shader_desc->output_signature, compile_info); + compiler->input_signature = shader_desc->input_signature; compiler->output_signature = shader_desc->output_signature; compiler->patch_constant_signature = shader_desc->patch_constant_signature;
From: Zebediah Figura zfigura@codeweavers.com
--- configure.ac | 1 + include/private/vkd3d_common.h | 13 +++++++++ include/vkd3d_shader.h | 35 ++++++++++++++++++++++ libs/vkd3d-shader/vkd3d_shader_main.c | 42 +++++++++++++++++++++++++++ 4 files changed, 91 insertions(+)
diff --git a/configure.ac b/configure.ac index b58c2f4d..4a5e1f05 100644 --- a/configure.ac +++ b/configure.ac @@ -138,6 +138,7 @@ AS_IF([test "x$with_xcb" != "xno"],
dnl Check for functions VKD3D_CHECK_FUNC([HAVE_BUILTIN_CLZ], [__builtin_clz], [__builtin_clz(0)]) +VKD3D_CHECK_FUNC([HAVE_BUILTIN_FFS], [__builtin_ffs], [__builtin_ffs(1)]) VKD3D_CHECK_FUNC([HAVE_BUILTIN_POPCOUNT], [__builtin_popcount], [__builtin_popcount(0)]) VKD3D_CHECK_FUNC([HAVE_BUILTIN_ADD_OVERFLOW], [__builtin_add_overflow], [__builtin_add_overflow(0, 0, (int *)0)]) VKD3D_CHECK_FUNC([HAVE_SYNC_ADD_AND_FETCH], [__sync_add_and_fetch], [__sync_add_and_fetch((int *)0, 0)]) diff --git a/include/private/vkd3d_common.h b/include/private/vkd3d_common.h index f3242272..7098ff54 100644 --- a/include/private/vkd3d_common.h +++ b/include/private/vkd3d_common.h @@ -148,6 +148,19 @@ static inline unsigned int vkd3d_log2i(unsigned int x) #endif }
+static inline int vkd3d_bit_scan(unsigned int *x) +{ + int bit_offset; +#ifdef HAVE_BUILTIN_FFS + bit_offset = __builtin_ffs(*x) - 1; +#else + for (bit_offset = 0; bit_offset < 32; bit_offset++) + if (*x & (1u << bit_offset)) break; +#endif + *x ^= 1u << bit_offset; + return bit_offset; +} + static inline void *vkd3d_memmem( const void *haystack, size_t haystack_len, const void *needle, size_t needle_len) { const char *str = haystack; diff --git a/include/vkd3d_shader.h b/include/vkd3d_shader.h index d4ad2d07..6f1eaf15 100644 --- a/include/vkd3d_shader.h +++ b/include/vkd3d_shader.h @@ -1717,6 +1717,8 @@ struct vkd3d_shader_next_stage_info * A mapping of output varying registers in this shader stage to input * varying registers in the next shader stage. * + * This mapping should be constructed by vkd3d_shader_build_varying_map(). + * * If absent, vkd3d-shader will map registers directly based on their * register index. * @@ -2243,6 +2245,39 @@ VKD3D_SHADER_API int vkd3d_shader_serialize_dxbc(size_t section_count, */ VKD3D_SHADER_API void vkd3d_shader_free_scan_signature_info(struct vkd3d_shader_scan_signature_info *info);
+/** + * Build a mapping of output varyings in a shader stage to input varyings in + * the following shader stage. + * + * This mapping should be used in struct vkd3d_shader_next_stage_info to + * compile the first shader. + * + * This function should be called twice: once, with an input count of zero, in + * order to retrieve the necessary size of the varyings map, and again, with an + * input count containing the actual number of varyings to retrieve. + * + * \remark In practice, since register indices are assigned (from scanning) + * according to the actual Direct3D register index, it is safe to assume that + * the highest register index for a legacy Direct3D shader is no higher than 11, + * and hence one may simply call this function with a pre-allocated array of 12 + * varyings. + * + * \param output_signature The output signature of the first shader. + * + * \param input_signature The input signature of the second shader. + * + * \param count On input, this should point to an integer describing the + * allocated size, in elements, of the \ref varyings array. On output, contains + * the number of entries required. + * + * \param varyings Pointer to an output array of varyings. + * + * \since 1.9 + */ +VKD3D_SHADER_API void vkd3d_shader_build_varying_map(const struct vkd3d_shader_signature *output_signature, + const struct vkd3d_shader_signature *input_signature, + unsigned int *count, struct vkd3d_shader_varying_map *varyings); + #endif /* VKD3D_SHADER_NO_PROTOTYPES */
/** Type of vkd3d_shader_get_version(). */ diff --git a/libs/vkd3d-shader/vkd3d_shader_main.c b/libs/vkd3d-shader/vkd3d_shader_main.c index e8259181..0f2f3b7e 100644 --- a/libs/vkd3d-shader/vkd3d_shader_main.c +++ b/libs/vkd3d-shader/vkd3d_shader_main.c @@ -1891,3 +1891,45 @@ void shader_instruction_array_destroy(struct vkd3d_shader_instruction_array *ins vkd3d_free(instructions->icbs[i]); vkd3d_free(instructions->icbs); } + +void vkd3d_shader_build_varying_map(const struct vkd3d_shader_signature *output_signature, + const struct vkd3d_shader_signature *input_signature, + unsigned int *count, struct vkd3d_shader_varying_map *varyings) +{ + uint32_t unused_register_mask = ~0u; + unsigned int i; + + TRACE("output_signature %p, input_signature %p, count %u, varyings %p.\n", + output_signature, input_signature, *count, varyings); + + for (i = 0; i < input_signature->element_count; ++i) + { + unsigned int register_index = input_signature->elements[i].register_index; + + if (register_index < 32) + bitmap_clear(&unused_register_mask, register_index); + } + + for (i = 0; i < output_signature->element_count; ++i) + { + const struct vkd3d_shader_signature_element *input_element, *output_element; + unsigned int output_reg; + + output_element = &output_signature->elements[i]; + output_reg = output_element->register_index; + if ((input_element = vkd3d_shader_find_signature_element(input_signature, + output_element->semantic_name, output_element->semantic_index, 0))) + { + varyings[output_reg].dst_index = input_element->register_index; + varyings[output_reg].dst_mask = input_element->mask; + } + else + { + varyings[output_reg].dst_mask = 0; + if (unused_register_mask) + varyings[output_reg].dst_index = vkd3d_bit_scan(&unused_register_mask); + else + ERR("No unused register index available.\n"); + } + } +}
+/** + * A structure describing the mapping of a source varying (output) register in + * a shader stage, to a destination varying (input) register in the following + * shader stage. + * + * This structure is provided as an array, and therefore the source register is + * named implicitly by its index in the array. + * + * This structure is used in struct vkd3d_shader_next_stage_info. + */ +struct vkd3d_shader_varying_map +{ + /** The index of the destination varying to map this register to. */ + unsigned int dst_index; + /** The mask consumed by the destination register. */ + unsigned int dst_mask; +};
How do we represent registers not read by the next stage? vkd3d_shader_build_varying_map() seems to suggest we should map them to a free register, but I'm not sure that's ideal. It also sets "dst_mask" to 0, which is perhaps easier to work with. (So far that and the FIXME also seem like the only usage of "dst_mask" though...) In any case, the API documentation here leaves it unspecified.
Can we split registers? E.g., map o0.xyz in the vertex shader to v0.x and v1.zw in the pixel shader. Is that something we need to be able to do?
+struct vkd3d_shader_next_stage_info +{ + /** Must be set to VKD3D_SHADER_STRUCTURE_TYPE_NEXT_STAGE_INFO. */ + enum vkd3d_shader_structure_type type; + /** Optional pointer to a structure containing further parameters. */ + const void *next; + + /** + * A mapping of output varying registers in this shader stage to input + * varying registers in the next shader stage. + * + * If absent, vkd3d-shader will map registers directly based on their + * register index. + * + * This field should be provied when compiling from legacy Direct3D + * bytecode. + */ + const struct vkd3d_shader_varying_map *varying_map;
"provided"
Is this strictly necessary for d3dbc shaders? I suppose "should" leaves the possibility of not providing it open, but remap_output_signature() seems to care less about it being absent than "should" may imply. Note also that shader model 1 and 2 d3dbc shaders have a natural mapping between vertex shader outputs and pixel shader inputs.
```diff + /** + * The number of registers provided in \ref varying_map. + */ + unsigned int varying_map_size; +}; ```
So, "register_count"? :)
+static inline int vkd3d_bit_scan(unsigned int *x) +{ + int bit_offset; +#ifdef HAVE_BUILTIN_FFS + bit_offset = __builtin_ffs(*x) - 1; +#else + for (bit_offset = 0; bit_offset < 32; bit_offset++) + if (*x & (1u << bit_offset)) break; +#endif + *x ^= 1u << bit_offset; + return bit_offset; +}
"x" should probably be a uint32_t, although I think there's also an argument for making it uintptr_t sized.
I don't think we want the #else implementation; it seems preferable to #error out. On Windows we could potentially use _BitScanForward(), and POSIX provides ffs(), but even without either of those we could probably do better in plain C.
+ * This mapping should be constructed by vkd3d_shader_build_varying_map().
Should it? If we're going to prescribe that, we might as well just pass the nest stage input signature in struct vkd3d_shader_next_stage_info, and only provide a way to retrieve the output map used, whether that's vkd3d_shader_build_varying_map() or simply vkd3d_shader_scan().
+ * This function should be called twice: once, with an input count of zero, in + * order to retrieve the necessary size of the varyings map, and again, with an + * input count containing the actual number of varyings to retrieve. + * + * \remark In practice, since register indices are assigned (from scanning) + * according to the actual Direct3D register index, it is safe to assume that + * the highest register index for a legacy Direct3D shader is no higher than 11, + * and hence one may simply call this function with a pre-allocated array of 12 + * varyings.
It might be nice to guarantee an upper bound of output_signature->element_count. That might require representing "varyings" in a way that omits unused registers, but that might not be a bad thing.
Of course, an alternative way to avoid calling vkd3d_shader_build_varying_map() twice would be to have vkd3d_shader_build_varying_map() allocate the output map itself.
How do we represent registers not read by the next stage? vkd3d_shader_build_varying_map() seems to suggest we should map them to a free register, but I'm not sure that's ideal. It also sets "dst_mask" to 0, which is perhaps easier to work with. (So far that and the FIXME also seem like the only usage of "dst_mask" though...) In any case, the API documentation here leaves it unspecified.
dst_mask of zero, yes. Mapping them to a free register is done so that the current spirv code outputs them to a free location.Ideally we wouldn't output them at all, but I left that out of scope of the current patch.
Similarly, we *should* be outputting only the varyings that the PS uses [unless we have some Vulkan extension I forget the name of], but I left that out of scope of this patch as well. Point is, we have those things in the API so we can use them going forward.
Sort-of-underspecifying this struct, and how vkd3d_shader_build_varying_map() builds its output, was a deliberate choice, since I wanted to basically say "you should use those functions", and give us a bit of freedom at least as to what registers to actually use.
Can we split registers? E.g., map o0.xyz in the vertex shader to v0.x and v1.zw in the pixel shader. Is that something we need to be able to do?
Well, sm4 doesn't need it. As for sm1... I can get the native assembler to output two dcl instructions with the same semantic, but I don't know how that's interpreted. Is that something that you've encountered before?
I suspect that even if it is, the right thing to do here is handle that in the pixel shader, and convert it to a single semantic. I don't think our current spirv code is set up to handle it otherwise.
Is this strictly necessary for d3dbc shaders? I suppose "should" leaves the possibility of not providing it open, but remap_output_signature() seems to care less about it being absent than "should" may imply.
It's not necessary if the registers happen to match, but that is probably unlikely. I wouldn't object to making it a hard requirement for d3dbc.
Note also that shader model 1 and 2 d3dbc shaders have a natural mapping between vertex shader outputs and pixel shader inputs.
They do, though there is a snag. 1.x/2.x can output as many as 10 registers (8 texcoord and 2 colour. Also, is fog an issue here?), but early versions of GL might not be able to support that many locations (4.6 requires 16, as does Vulkan, but 3.0 seems to require only 8), which would force us to do remapping—and if the varyings aren't the same, I don't think we can remap in a consistent manner. So we may end up needing this struct for 1.x/2.x as well.
+ /** + * The number of registers provided in \ref varying_map. + */ + unsigned int varying_map_size; +};
So, "register_count"? :)
Sure, though that means it's not named like varying_map...
+static inline int vkd3d_bit_scan(unsigned int *x) +{ + int bit_offset; +#ifdef HAVE_BUILTIN_FFS + bit_offset = __builtin_ffs(*x) - 1; +#else + for (bit_offset = 0; bit_offset < 32; bit_offset++) + if (*x & (1u << bit_offset)) break; +#endif + *x ^= 1u << bit_offset; + return bit_offset; +}
"x" should probably be a uint32_t, although I think there's also an argument for making it uintptr_t sized.
It was done on the model of wined3d and the other functions here, but sure, I can change it.
I don't think we want the #else implementation; it seems preferable to #error out. On Windows we could potentially use _BitScanForward(), and POSIX provides ffs(), but even without either of those we could probably do better in plain C.
Same applies here, though more saliently, I'm not sure how to interpret the statements "I don't think we want the #else implementation" alongside "we could probably do better in plain C"?
+ * This mapping should be constructed by vkd3d_shader_build_varying_map().
Should it? If we're going to prescribe that, we might as well just pass the nest stage input signature in struct vkd3d_shader_next_stage_info, and only provide a way to retrieve the output map used, whether that's vkd3d_shader_build_varying_map() or simply vkd3d_shader_scan().
Well... "should", not "must", but moreover, the whole point here is that passing the signature to struct vkd3d_shader_next_stage_info got complicated in some ways, and this is nicer. Keeping this function around solves the caching problem, but:
(1) we'd have to reorganize some things in wined3d to allow passing the signature; it currently expects to use *only* the cache key as compile arguments. Minor problem, I guess, but if we *can* pass the cache key, I don't see a reason not to.
(2) this mapping is already kind of in optimal ("compiled", in a sense) form for vkd3d_shader_compile(), since we just look up the mapping in an array.
It might be nice to guarantee an upper bound of output_signature->element_count. That might require representing "varyings" in a way that omits unused registers, but that might not be a bad thing.
I originally was going to write that, but decided that since I was indexing the array by register_index, it wouldn't work. But on reflection there's no reason we can't index the array by the *signature* index, so yeah, that'd be better.
Can we split registers? E.g., map o0.xyz in the vertex shader to v0.x and v1.zw in the pixel shader. Is that something we need to be able to do?
Well, sm4 doesn't need it. As for sm1... I can get the native assembler to output two dcl instructions with the same semantic, but I don't know how that's interpreted. Is that something that you've encountered before?
I seem to recall seeing applications doing that, yes, though it's also long enough ago that I don't remember for sure.
Right now, I'm just looking for either "that can't possibly happen" or "we could handle that in this way, if needed". The backup is of course doing a v2 of the structure, but it would be nice to avoid that if we can.
Is this strictly necessary for d3dbc shaders? I suppose "should" leaves the possibility of not providing it open, but remap_output_signature() seems to care less about it being absent than "should" may imply.
It's not necessary if the registers happen to match, but that is probably unlikely. I wouldn't object to making it a hard requirement for d3dbc.
Note also that shader model 1 and 2 d3dbc shaders have a natural mapping between vertex shader outputs and pixel shader inputs.
They do, though there is a snag. 1.x/2.x can output as many as 10 registers (8 texcoord and 2 colour. Also, is fog an issue here?), but early versions of GL might not be able to support that many locations (4.6 requires 16, as does Vulkan, but 3.0 seems to require only 8), which would force us to do remapping—and if the varyings aren't the same, I don't think we can remap in a consistent manner. So we may end up needing this struct for 1.x/2.x as well.
Right, a caller may want to remap, but it may also not necessarily need to. Also, does that imply we may need to remap pixel shader inputs as well, along the lines of wined3d_pixel_shader's "input_reg_map"?
+ /** + * The number of registers provided in \ref varying_map. + */ + unsigned int varying_map_size; +};
So, "register_count"? :)
Sure, though that means it's not named like varying_map...
We could also say "varying_count", I suppose. The issue I have with using "_size" for fields like these is that it's often ambiguous whether it refers to the size of the allocation or the number of elements.
+static inline int vkd3d_bit_scan(unsigned int *x) +{ + int bit_offset; +#ifdef HAVE_BUILTIN_FFS + bit_offset = __builtin_ffs(*x) - 1; +#else + for (bit_offset = 0; bit_offset < 32; bit_offset++) + if (*x & (1u << bit_offset)) break; +#endif + *x ^= 1u << bit_offset; + return bit_offset; +}
"x" should probably be a uint32_t, although I think there's also an argument for making it uintptr_t sized.
It was done on the model of wined3d and the other functions here, but sure, I can change it.
Yeah, I figured. The reason wined3d's wined3d_bit_scan() takes an unsigned int is essentially just "history", but it would be nice to make this consistent with the other bitmap functions like bitmap_set() here.
I don't think we want the #else implementation; it seems preferable to #error out. On Windows we could potentially use _BitScanForward(), and POSIX provides ffs(), but even without either of those we could probably do better in plain C.
Same applies here, though more saliently, I'm not sure how to interpret the statements "I don't think we want the #else implementation" alongside "we could probably do better in plain C"?
We seem to have grown the #else path in wined3d_bit_scan() at some point during Wine's PE move, but I'd argue we never want to use it; iterating over all bits is just not a sufficient implementation for something like wined3d_stateblock_apply() or context_apply_draw_state().
As for doing better than a linear scan without using either POSIX or Win32 functions, plenty has been written about that by other people. One fairly decent approach is to use a de Bruijn sequence together with a small lookup table. Still, I think ffs() and _BitScanForward() should cover everything we care about.
+ * This mapping should be constructed by vkd3d_shader_build_varying_map().
Should it? If we're going to prescribe that, we might as well just pass the nest stage input signature in struct vkd3d_shader_next_stage_info, and only provide a way to retrieve the output map used, whether that's vkd3d_shader_build_varying_map() or simply vkd3d_shader_scan().
Well... "should", not "must",
I'd argue it's more of a "may".
Can we split registers? E.g., map o0.xyz in the vertex shader to v0.x and v1.zw in the pixel shader. Is that something we need to be able to do?
Well, sm4 doesn't need it. As for sm1... I can get the native assembler to output two dcl instructions with the same semantic, but I don't know how that's interpreted. Is that something that you've encountered before?
I seem to recall seeing applications doing that, yes, though it's also long enough ago that I don't remember for sure.
Right now, I'm just looking for either "that can't possibly happen" or "we could handle that in this way, if needed". The backup is of course doing a v2 of the structure, but it would be nice to avoid that if we can.
Sure.
Unless actually doing that does something unexpected, I think it would be covered from an API perspective by emitting multiple dcls as a single entry in the signature (with a combined mask), and then adding code in the spirv to blit to the individual register variables. Actually I think our signature scanning code is already written that way, and depending on how the I/O lowering pass is written, it may be as well.
Right, a caller may want to remap, but it may also not necessarily need to. Also, does that imply we may need to remap pixel shader inputs as well, along the lines of wined3d_pixel_shader's "input_reg_map"?
Broadly. The way I was picturing this would actually work would be that you'd pass the varying limit as an argument to v-s (as a compile option?), vs would automatically remap PS registers according to the limit (setting their register_index in the signature accordingly), and then vkd3d_shader_build_varying_map() would use those mapped register indices to build the varying map for the VS.
+ /** + * The number of registers provided in \ref varying_map. + */ + unsigned int varying_map_size; +};
So, "register_count"? :)
Sure, though that means it's not named like varying_map...
We could also say "varying_count", I suppose. The issue I have with using "_size" for fields like these is that it's often ambiguous whether it refers to the size of the allocation or the number of elements.
I think varying_count is close enough, yeah.
+static inline int vkd3d_bit_scan(unsigned int *x) +{ + int bit_offset; +#ifdef HAVE_BUILTIN_FFS + bit_offset = __builtin_ffs(*x) - 1; +#else + for (bit_offset = 0; bit_offset < 32; bit_offset++) + if (*x & (1u << bit_offset)) break; +#endif + *x ^= 1u << bit_offset; + return bit_offset; +}
"x" should probably be a uint32_t, although I think there's also an argument for making it uintptr_t sized.
It was done on the model of wined3d and the other functions here, but sure, I can change it.
Yeah, I figured. The reason wined3d's wined3d_bit_scan() takes an unsigned int is essentially just "history", but it would be nice to make this consistent with the other bitmap functions like bitmap_set() here.
I don't think we want the #else implementation; it seems preferable to #error out. On Windows we could potentially use _BitScanForward(), and POSIX provides ffs(), but even without either of those we could probably do better in plain C.
Same applies here, though more saliently, I'm not sure how to interpret the statements "I don't think we want the #else implementation" alongside "we could probably do better in plain C"?
We seem to have grown the #else path in wined3d_bit_scan() at some point during Wine's PE move, but I'd argue we never want to use it; iterating over all bits is just not a sufficient implementation for something like wined3d_stateblock_apply() or context_apply_draw_state().
As for doing better than a linear scan without using either POSIX or Win32 functions, plenty has been written about that by other people. One fairly decent approach is to use a de Bruijn sequence together with a small lookup table. Still, I think ffs() and _BitScanForward() should cover everything we care about.
Makes sense, thanks.
+ * This mapping should be constructed by vkd3d_shader_build_varying_map().
Should it? If we're going to prescribe that, we might as well just pass the nest stage input signature in struct vkd3d_shader_next_stage_info, and only provide a way to retrieve the output map used, whether that's vkd3d_shader_build_varying_map() or simply vkd3d_shader_scan().
Well... "should", not "must",
I'd argue it's more of a "may".
Sure, that's reasonable. I'll make it more clear what the API for this is, then.
It might be nice to guarantee an upper bound of output_signature->element_count. That might require representing "varyings" in a way that omits unused registers, but that might not be a bad thing.
I originally was going to write that, but decided that since I was indexing the array by register_index, it wouldn't work. But on reflection there's no reason we can't index the array by the *signature* index, so yeah, that'd be better.
Hmm, there is one extra snag here: PS inputs not written (at all) by the VS. wined3d cares enough to explicitly zero these, so I assume there are applications that care.
From the perspective of next_stage_info, I think the simple thing to do is say that varying_map is still indexed by output signature index, but also that it can have extra entries at the end for PS inputs not written by the VS.
From the perspective of vkd3d_shader_build_varying_map(), that means we can't just say that the size is output_signature->element_count; it might be greater than that. Allocation is possible but I'm a little resistant to that, since in practice the user should just be able to say that sm1 has a maximum inter-stage varying count of 12, provide that array, and call the function once [as my wined3d PoC does].
Unless you meant to write "an upper bound of input_signature->element_count" [which may make more sense given your second sentence]? That'd work, though it'd be meaner to remap_output_signature().
I originally was going to write that, but decided that since I was indexing the array by register_index, it wouldn't work. But on reflection there's no reason we can't index the array by the *signature* index, so yeah, that'd be better.
Ugh, but the annoying thing about doing this is that I/O normalization passes reorder the signature...
Unless you meant to write "an upper bound of input_signature->element_count" [which may make more sense given your second sentence]? That'd work, though it'd be meaner to remap_output_signature().
I think I did. It shouldn't be any meaner than "output_signature->element_count + some extra stuff" though; it's a different bound, but it doesn't say anything about the order by itself. I wouldn't necessarily be opposed to specifying this as e.g. a sorted array of "{src_index, dst_index, dst_mask}" either though.
I originally was going to write that, but decided that since I was indexing the array by register_index, it wouldn't work. But on reflection there's no reason we can't index the array by the *signature* index, so yeah, that'd be better.
Ugh, but the annoying thing about doing this is that I/O normalization passes reorder the signature...
I think it should only do that temporarily? shader_signature_merge() has a qsort() at the end that should restore the original order. It does of course potentially merge elements, which is perhaps just as bad.
Unless you meant to write "an upper bound of input_signature->element_count" [which may make more sense given your second sentence]? That'd work, though it'd be meaner to remap_output_signature().
I think I did. It shouldn't be any meaner than "output_signature->element_count + some extra stuff" though; it's a different bound, but it doesn't say anything about the order by itself. I wouldn't necessarily be opposed to specifying this as e.g. a sorted array of "{src_index, dst_index, dst_mask}" either though.
Well, the current order allows remap_output_signature() just to look up by register_index or by signature index. Either of the above options would instead require us to do a secondary loop. From a performance standpoint that may not meaningfully matter, though; I guess we're going to do that loop once *somewhere*.
I originally was going to write that, but decided that since I was indexing the array by register_index, it wouldn't work. But on reflection there's no reason we can't index the array by the *signature* index, so yeah, that'd be better.
Ugh, but the annoying thing about doing this is that I/O normalization passes reorder the signature...
I think it should only do that temporarily? shader_signature_merge() has a qsort() at the end that should restore the original order. It does of course potentially merge elements, which is perhaps just as bad.
Yeah, in theory it does, but that's not working out for some reason. I don't think I fully investigated why, but if we're messing with the signature order at all then I think we want to remap *before* doing I/O normalization. Which isn't actually ugly, I just had some amount of tunnel vision there.