If a message doesn't match, and does not exist in the expected sequence, only the actual message will be skipped. Otherwise, only the expected message will be skipped. This isn't perfect, but I think it will help make the output from failures more readable.
It might sound a bit complicated, but I think it would be better to actually compute the longest common subsequence (like diff) than add more heuristics.