General comments:
* This will take up to 32 MiB per wg_parser, which seems like an awful lot, especially since applications can often create more than one of these at a time. I'm concerned about the memory usage and address space implications. Can the chunk size get any smaller without negatively impacting performance again?
* The code as written does handle loads that span chunk boundaries, although I feel like it's done in a non-obvious way. Reading through the loop in src_getrange_cb() I'm left wondering "why would read_parser_cache() return less than the requested size?" and "why are read_parser_cache() and load_parser_cache() separate functions?"
* And along similar lines, is it even worth falling back to the old path for larger read requests? If not, we should factor out a helper to actually perform the read pseudo-callback.
* Instead of a "rank" field, I think it would be simpler to make the index itself be the rank, and memmove the entries every time.
* I don't like the naming of "parser_cache"; it's both redundant (everything in the file has to do with the parser) and not specific enough (caching what?). I'd propose "input_cache" or "read_cache" here.
* The parser mutex doesn't seem to be taken around everything that it should be.
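
To illustrate the index-as-rank suggestion, here is a rough sketch of what I have in mind. It is not based on the actual patch: the struct layout, the chunk count, and the names input_cache_promote(), input_cache_lookup() and input_cache_evict() are all hypothetical, and locking and the actual read are left out. The idea is only that chunks[0] is always the most recently used entry and the last element is always the eviction candidate, so no separate rank needs to be stored or updated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_CHUNK_COUNT 4 /* hypothetical; not the value used in the patch */

struct input_cache_chunk
{
    uint64_t position; /* file offset at which this chunk starts */
    void *data;        /* chunk buffer, or NULL if the slot is unused */
};

struct input_cache
{
    /* Ordered by recency: chunks[0] is the most recently used entry. */
    struct input_cache_chunk chunks[CACHE_CHUNK_COUNT];
};

/* Move the entry at idx to the front, shifting entries [0, idx) back by one. */
static void input_cache_promote(struct input_cache *cache, size_t idx)
{
    struct input_cache_chunk chunk;

    assert(idx < CACHE_CHUNK_COUNT);
    chunk = cache->chunks[idx];
    memmove(&cache->chunks[1], &cache->chunks[0], idx * sizeof(chunk));
    cache->chunks[0] = chunk;
}

/* Return the chunk starting at position, promoted to the front, or NULL on a miss. */
static struct input_cache_chunk *input_cache_lookup(struct input_cache *cache, uint64_t position)
{
    size_t i;

    for (i = 0; i < CACHE_CHUNK_COUNT; ++i)
    {
        if (cache->chunks[i].data && cache->chunks[i].position == position)
        {
            input_cache_promote(cache, i);
            return &cache->chunks[0];
        }
    }
    return NULL;
}

/* Recycle the least recently used slot (the last one) for a new position.
 * The caller is expected to (re)fill the returned chunk's buffer. */
static struct input_cache_chunk *input_cache_evict(struct input_cache *cache, uint64_t position)
{
    input_cache_promote(cache, CACHE_CHUNK_COUNT - 1);
    cache->chunks[0].position = position;
    return &cache->chunks[0];
}
```

With only a handful of entries the memmove costs next to nothing, and the lookup and eviction paths no longer have to keep a rank field consistent.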