The problem with "flaky", especially in a situation like this where it's apparently being misused for a test that isn't flaky at all but fails every time, is that it makes the test meaningless.
That might be okay, especially if Windows doesn't consistently pass either, but in that case we should just remove the line entirely.
Stefan, is this test important at all? I assume that WM_WINDOWPOSCHANGED is important enough that applications care about it, since we bother to test for it explicitly. If so, can we figure out how to rewrite the test to match the application behaviour closely enough that it passes consistently?