I just wanted to add that exactly the same problem exists with graphics drivers. There, we use approximately this scheme:
-> If there's a driver difference, and both driver behaviors are somewhat sane, then we accept both results. An application couldn't depend on a specific result anyway.
-> If the behavior of one driver is not sane, and the functionality tested is somewhat exotic (e.g. fixed function vertex processing with non-standard attribute types), then we accept the failure as well, provided there is no known application that uses the feature.
-> If a feature is obviously broken and it is a more common feature (e.g. texdepth or texkill on a Radeon 9000), then I just let the test fail on Windows; any game that uses those features will fail as well. (Luckily for the card, no ps_1_4 game uses texdepth and texkill; only later ones do.)
The reference rasterizer is just another "driver" for me. In some cases that MSDN leaves undefined, the refrast shows a behavior that is known to cause problems with a game. In those cases we let the test fail on the refrast as well. I don't care much about the Intel and VMware drivers, because many games are known to fail on them on Windows.
Whenever a test is known to fail, I add a comment to the test implementation.
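To make the scheme a bit more concrete, here is a rough sketch in C of how such a check could look. It is only an illustration: the type and function names (verdict, accepted_result, classify_result) are made up for this mail and are not taken from the actual tests, and the color values in the usage note are invented.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not from any real test suite. */
enum verdict
{
    RESULT_OK,           /* a sane result, accepted without complaint */
    RESULT_KNOWN_BROKEN, /* insane result on an exotic feature, tolerated */
    RESULT_FAIL          /* everything else, including broken common features */
};

struct accepted_result
{
    unsigned int color;  /* value read back from the render target */
    bool broken_driver;  /* true if this result is not sane, but the feature is
                          * exotic and no known application uses it */
};

/* Compare the observed value against every result we decided to accept.
 * Results that are not listed fail the test; per the policy above, such a
 * known failure should get a comment next to the test. */
static enum verdict classify_result(unsigned int got,
        const struct accepted_result *accepted, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i)
    {
        if (got == accepted[i].color)
            return accepted[i].broken_driver ? RESULT_KNOWN_BROKEN : RESULT_OK;
    }
    return RESULT_FAIL;
}
```

A test would then list, say, two sane colors with broken_driver set to false and one tolerated bad color with broken_driver set to true. A common feature that is broken on one card simply gets no entry, so the test fails there just like the games do.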