Also, I don't think we lose any actual parallelism by not using separate threads for the different streams: they do their work while holding the single pulse mutex anyway, so even if their processing were to overlap, it would be serialized by that lock. So getting rid of the extra threads (along with the simplification from removing the adjustment logic for relative delay) looks like a minor improvement in its own right to me.
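To illustrate the serialization argument, here is a minimal sketch, not the actual driver code. The names `pulse_lock`, `process_stream`, and `max_overlap` are hypothetical stand-ins, and it assumes each stream thread does all of its processing while holding one shared lock. It counts how many threads are ever inside the locked region at once.

```python
import threading


def max_overlap(num_streams=4, iterations=1000):
    """Return the peak number of stream threads processing at the same time."""
    pulse_lock = threading.Lock()  # hypothetical stand-in for the single pulse mutex
    active = 0  # threads currently inside the locked region
    peak = 0    # highest value of `active` ever observed

    def process_stream():
        nonlocal active, peak
        for _ in range(iterations):
            # Assumption: all per-stream work happens with the lock held.
            with pulse_lock:
                active += 1
                peak = max(peak, active)
                # ... per-stream processing would go here ...
                active -= 1

    threads = [threading.Thread(target=process_stream) for _ in range(num_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print(max_overlap())
```

Under that assumption the peak can never exceed one, however many stream threads there are, so the extra threads add scheduling overhead without adding concurrency in the processing itself.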