I suppose it depends on your standard for how long a file can be. Anything more than a few thousand lines starts to feel like an awful lot to me, and at that point the compilation time does start to show.
Personally I think modularity is the most important thing, and I find it somehow easier to mentally work with files that separate their components. In this case I think separating the reader and writer would not have been a bad idea, and even if not, keeping the sm4 definitions in a separate file might also have been nice.
Obviously, by that principle, any sufficiently modular piece of code can be split into its own file, however small. If it's only a few hundred lines, or a couple of functions (honestly, the number of [well-formed?] functions may matter less to me than the line count), I'd be inclined against it, but I think I have a lower threshold for what seems reasonable to split up. Take HLSL copy propagation, for instance: it's only about 500 lines, but on the other hand it spans a whole 20 functions; that feels to me like a large enough self-contained chunk of code that it's worth at least considering splitting it out.
I don't want to be insensitive to the concern, particularly since both Giovanni and you raised it, but I just don't get it? Navigating within a file seems at least as easy as navigating between files, and files on the order of ~2k lines in the case of d3dbc.c and ~5k lines in the case of tpf.c don't seem particularly unusual in either vkd3d or Wine. (I had been thinking we might as well merge hlsl.c, hlsl_codegen.c and hlsl_constant_ops.c, but I guess not then...)
In any case, while it's perhaps not quite true at the moment, parsing and serialising the different bytecode formats should be closely related and use the same constants, structures, and so on. There's currently some HLSL-specific code intermixed that would more properly belong in hlsl.c/hlsl_codegen.c, but we can deal with that once we get to it.
...Possibly. I would have appreciated a bit more discussion on this first. I had been vaguely thinking of moving more code *to* the backend files, so now I need to stop and re-plan.
And while I think we'd always been talking about using the same structures in hlsl_smX, I had rather assumed we would perform that conversion before moving things around. That would have allowed things to be done more gradually, and avoided some unpleasant rebases.
Anyway, this generally ties into the question I raised in [1]. I gave a relatively complete summary of our options for the IR there, but it doesn't seem to have garnered any discussion. At best we seem to have committed to a "maximally CISC" instruction set with Conor's normalization patches, which rules out option (2).
Well, yeah, the impression I came away with from that conversation was that there were no particularly strong preferences, and not much of a general consensus either; few people seem to care. I don't think we've ruled out any of the options presented there at this point. As noted there, (1) is essentially the current state, but we could certainly move away from that towards one of the other options.
And ultimately I think the distinction between the options is a bit artificial to begin with; it ends up largely being a matter of doing particular transformation passes on either the frontend IR, the intermediate IR, or the backend IR. That's not a decision we need to make upfront for every conceivable transformation, and we can change our minds if a particular choice doesn't quite work out. In the case of Conor's tessellation shader work, the consideration is simply that these transformations are more practical to do on the vkd3d_shader_instruction level than on the SPIR-V level, and while doing them on the TPF level would also be plausible, that would require doing a bunch of disassembler work first.