In the past when I've added tests for behavior that differs from one Windows version to another, I've been asked to designate one behavior as the "correct" behavior and mark the other as broken.
That doesn't mean it's always the right thing to do, though. It takes some judgement, I think. Behaviour that's specific to certain Windows versions is an argument in favour of broken(), as is behaviour that simply doesn't make sense or that contradicts the documentation. Conversely, behaviour that is consistent across versions, sensible, and documented is an argument against it. But I don't think there are any hard rules.
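
For anyone not familiar with the pattern, here is a rough, self-contained sketch of the idea. It does not use Wine's test headers: broken_sketch() and check_result() are hypothetical stand-ins, and the result values 42 and 0 are made up for illustration. The point it tries to show is that a result marked as broken is tolerated on Windows but still counts as a failure on Wine, so Wine is held to the behaviour designated as correct.

```c
#include <stdio.h>

/* Hypothetical API results, invented for this sketch: 42 is the behaviour
 * designated "correct", 0 is what some Windows versions are assumed to return. */
#define EXPECTED_RESULT 42
#define LEGACY_RESULT   0

/* Hypothetical stand-in for broken(): the condition is accepted only when
 * the test is NOT running on Wine. */
static int broken_sketch(int running_on_wine, int condition)
{
    return !running_on_wine && condition;
}

/* Returns 1 if the check passes, 0 if it fails. */
static int check_result(int running_on_wine, int result)
{
    int passed = result == EXPECTED_RESULT
                 || broken_sketch(running_on_wine, result == LEGACY_RESULT);

    if (!passed)
        printf("Test failed: got %d, expected %d\n", result, EXPECTED_RESULT);
    return passed;
}
```

In an actual Wine test the same shape usually sits inside an ok() call, along the lines of ok(ret == expected || broken(ret == legacy), ...), with a comment naming the Windows versions that show the legacy result.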