https://bugs.winehq.org/show_bug.cgi?id=50680
Zebediah Figura z.figura12@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Fixed by SHA1| |45230b51db703933dc6108c526a | |9eb872416981a Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
--- Comment #24 from Zebediah Figura z.figura12@gmail.com --- Yeah.
Okay, so here is my working theory:
The game misuses condition variables. I've disassembled three or four different call sites and I see code like this:
EnterCriticalSection(&apple->cs); if (apple->x || SleepConditionVariableCS(&apple->cv, &apple->cs, INFINITE) != ERROR_TIMEOUT) --apple->x; LeaveCriticalSection(&apple->cs);
Which is wrong for multiple reasons, but the most important one is that it doesn't call SleepConditionVariableCS() in a loop. Which means that a stolen or spurious wakeup will result in incorrect behaviour.
(Granted, it's possible that I'm wrong, and that the relevant functions are being called in a loop somehow, or that "x" is used in a way that is ultimately safe against spurious wakeups. I don't think this is likely, especially given the other errors made around condition variables [e.g. in one place they call SleepConditionVariableCS() without holding the CS], but it's always possible.)
I suspect that it works consistently on Windows because the Windows implementation of condition variables does not cause spurious wakeups (at least, not if there's only one waiting thread). This fact is hard to corroborate. I ended up writing a test that I think roughly checks whether spurious wakeups matter in the way that Detroit cares about, which I'll attach. I get spurious wakeups every 10 seconds or so on Windows. On the other hand, they show up pretty constantly on Linux, with the tid alert patches.
The reason for the behaviour we've seen, then, is as follows:
(1) With current upstream wine, we use the condition variable word as a futex directly. As far as I can tell, the Linux implementation of futexes doesn't actually cause any spurious wakeups under "normal" circumstances. I'm not sure of this, but I can't find any reason in the code that it does.
(2) With current wine-staging, i.e. with all of the tid alert patches applied, we implement condvars on top of win32 futexes on top of tid alerts (on top of linux futexes, but that ends up not mattering). There are basically three ways we can get a wakeup without going all the way into the scheduler:
(a) Return early from RtlSleepConditionVariable*() because the condition didn't match. This isn't a problem, because we're not queued, and thus a subsequent wake will wake nothing.
(b) Return early from RtlWaitOnAddress() because the address didn't match. This can be a problem, because we *are* queued, and if a subsequent wake happens before we unqueue ourselves, we get wake type (c), which is...
(c) A "buffered" wake from a previous RtlWaitOnAddress() call. In the above scenario, the waking thread calls NtAlertThreadByThreadId() after the waiter is woken up (either from a type (b) wake or a timeout), but before it unqueues itself. TID alerts are basically just fancy events, so the wake is buffered until the next call to NtWaitForAlertByThreadId(). In theory Windows suffers from this as well. In practice my attempts to reproduce it have been met with failure, and I'm not sure why, although I have at least one guess.
(3) With patch 0010 applied by itself, we don't get the problem from (2). What happens instead is that we get spurious wakes due to hash collision. In order to verify this, I tried increasing the hash table size, to avoid collision, hence the patch from comment 21. And this indeed made the spurious wakes go away, and let the game start working.
Because we currently use spinlocks in the PE side anyway, the easiest way to avoid buffered wakes is actually to do the comparison inside the spinlock (much like we currently do upstream on the Unix side when futexes aren't available). This is what was committed in wine-staging as 45230b51db703933dc6108c526a9eb872416981a, and it seems to consistently fix this bug.