For a shader with many `texkill` instructions, inserting in place is going to copy most of the instruction array many times over. Rather than inserting in place, I'd suggest simply rewriting the whole array. This is, at least, the approach I'm following in the preliminary passes for my CFG structurizer; see for example https://gitlab.winehq.org/giomasce/vkd3d/-/commit/b771e30fd5c5f52478c9f92f8c... (but note that it's still a work in progress).
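To make the suggestion concrete, here is a minimal sketch of the "rewrite the whole array" approach. All names in it (`struct instruction_array`, `lower_texkill()`, `LOWERED_COUNT`, the opcodes) are hypothetical stand-ins for illustration, not the actual vkd3d structures or the code in the linked commit, and the assumption that each `texkill` expands to a fixed number of instructions is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the actual vkd3d types. */
enum opcode { OP_OTHER, OP_TEXKILL, OP_LOWERED };

struct instruction
{
    enum opcode opcode;
};

struct instruction_array
{
    struct instruction *elements;
    size_t count;
};

/* Assumption: each texkill is replaced by this many instructions. */
#define LOWERED_COUNT 3

/* Rewrite the whole array in one pass. Every instruction is copied exactly
 * once, so the cost is linear in the array size no matter how many texkill
 * instructions it contains. Inserting in place would instead shift the tail
 * of the array once per texkill. */
static bool lower_texkill(struct instruction_array *array)
{
    size_t i, j, k, new_count, texkill_count = 0;
    struct instruction *dst;

    /* First pass: count the instructions to expand, to size the new array. */
    for (i = 0; i < array->count; ++i)
    {
        if (array->elements[i].opcode == OP_TEXKILL)
            ++texkill_count;
    }

    if (!texkill_count)
        return true;

    new_count = array->count + texkill_count * (LOWERED_COUNT - 1);
    if (!(dst = calloc(new_count, sizeof(*dst))))
        return false;

    /* Second pass: copy everything else through, expanding each texkill. */
    for (i = 0, j = 0; i < array->count; ++i)
    {
        if (array->elements[i].opcode != OP_TEXKILL)
        {
            dst[j++] = array->elements[i];
            continue;
        }

        for (k = 0; k < LOWERED_COUNT; ++k)
            dst[j++].opcode = OP_LOWERED;
    }
    assert(j == new_count);

    free(array->elements);
    array->elements = dst;
    array->count = new_count;
    return true;
}
```

The trade-off is one extra allocation and a full copy even when there is only a single `texkill`, in exchange for never moving an instruction more than once.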
Pre-sm4 shaders tend not to be especially large, simply because of the limitations of the hardware at the time. They tend not to have a large number of `texkill` instructions either, so I'm not too concerned about that scenario in practice. More importantly, the API introduced here seems flexible enough to allow for more efficient implementations.