https://bugs.winehq.org/show_bug.cgi?id=46208
Bug ID: 46208
Summary: Elite Dangerous is unstable in wine-devel
Product: Wine
Version: 3.21
Hardware: x86-64
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: ntdll
Assignee: wine-bugs@winehq.org
Reporter: brendan@redmandi.com
Distribution: ---
When using wine-devel v3.21, Elite Dangerous is unstable and will usually freeze before finishing the opening video (but will sometimes get as far as shader compilation).
When attaching a debugger, it looks like all processes are waiting for an event.
The program needs to be killed with a 'kill -9'.
The workaround is to run Elite with 'taskset -c 0', but it also appears to progress further if you run with 'WINEDEBUG=+relay'.
This leads me to believe the issue is one of timing (as opposed to multi-threading).
When using wine-staging v3.21, Elite has no such problems and runs perfectly.
A git bisect showed that the reason for this is a single patch: https://github.com/wine-staging/wine-staging/tree/master/patches/ntdll-futex...
Applying just this patch to wine-devel v3.21 results in the game running fine.
Zebediah Figura z.figura12@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |z.figura12@gmail.com
--- Comment #1 from Zebediah Figura z.figura12@gmail.com ---
Created attachment 62890
  --> https://bugs.winehq.org/attachment.cgi?id=62890
debug traces
It seems quite likely that there is a missed wake somewhere. Can you please try running with the attached patch, and attach a log with +timestamp?
--- Comment #2 from Brendan McGrath brendan@redmandi.com ---
Created attachment 62891
  --> https://bugs.winehq.org/attachment.cgi?id=62891
Logs of Elite Dangerous failing
The attached tar file has five files in it:
- ED_wake.log
- ED_failv1.log
- ED_failv2.log
- ED_failv3.log
- ED_fail_log.patch
ED_wake.log was produced with your patch; the other three log files were produced with ED_fail_log.patch (which is just your patch with a few more values included).
I also have one other log, taken with the 'taskset -c 0' work-around, which I will attach next (as it didn't fit within the 5MB limit).
I found that with taskset, the loop in 'WakeAll' doesn't appear to be pre-empted; so it completes before anyone else wants to sleep on that event.
But with multiple CPUs, it seems both sleep and wake requests occur against this event whilst the 'WakeAll' is in progress. This results in the 'wake' requests being skipped.
If you run:

sed '/0x510418, 0 (skipped)/!d;N;/(skipped).*(skipped)/!d;N;/(skipped).*(skipped).*(skipped)/!d'
against any of the failed logs you'll find multiple occasions where there are consecutive 'WakeAll' requests being skipped. But this doesn't happen even once in the 'taskset' log.
Once a failure occurs, you just get an infinite loop of timeouts on the wait request.
--- Comment #3 from Brendan McGrath brendan@redmandi.com ---
Created attachment 62892
  --> https://bugs.winehq.org/attachment.cgi?id=62892
Logs of Elite Dangerous working (using 'taskset -c 0' work-around)
--- Comment #4 from Zebediah Figura z.figura12@gmail.com ---
Thanks. Starting from about line 219679 of your ED_failv1.log I see some suspicious behaviour: thread 0037 waits on the lock, and then is not subsequently woken up despite an RtlWakeAllConditionVariable() in line 219702. In fact, thread 0037 seems to get stuck forever, as it never times out either, which seems suspicious, as it has timed out before, although perhaps in that case NULL was passed?
--- Comment #5 from Zebediah Figura z.figura12@gmail.com ---
I think I see the bug, or at least a bug. We can run into the following race (simplified slightly):
wake-all          waiter 1          waiter 2
----------------------------------------
                  timeout
val 2 -> 0
                  val 0 -> 0
ReleaseKeyed
                                    wakeup
                                    WaitForKeyed
ReleaseKeyed
                                    wakeup
                  WaitForKeyed
                  (WaitForKeyed)    (WaitForKeyed)
----------------------------------------
If this is what's going on here, waiter 1 (i.e. in this case 0037) gets stuck forever waiting for a Release() that doesn't happen.
Our implementation of ReleaseKeyedEvent() relies on the fact that the first N waiters on the queue are the N waiters on the event. Normally this is true. Even when one waiter times out and subsequently requeues itself (n.b. the requeue is so that a release doesn't block trying to wake someone who isn't waiting), it'll normally requeue itself as the Nth waiter on the event. The problem arises when the second waiter thread calls RtlSleepConditionVariable() again, before RtlWakeAllConditionVariable() is done executing; then that waiter gets woken up twice, and the first waiter not at all; and more importantly the first waiter ends up waiting forever, which probably wasn't intended.
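To make the failure mode concrete, here is a heavily simplified sketch of the keyed-event path described above (an illustration only, not Wine's actual source; the keyed_event handle, the interlocked_dec_if_nonzero helper and the exact signatures are stand-ins):

/* Wake side: grab the waiter count, reset it to zero, then release the
 * keyed event that many times.  This assumes the first 'val' threads queued
 * on the event are exactly the threads that were counted. */
void wake_all( RTL_CONDITION_VARIABLE *variable )
{
    int val = InterlockedExchange( (LONG *)&variable->Ptr, 0 );
    while (val-- > 0)
        NtReleaseKeyedEvent( keyed_event, &variable->Ptr, FALSE, NULL );
}

/* Sleep side: count ourselves, then wait on the keyed event. */
NTSTATUS sleep_on( RTL_CONDITION_VARIABLE *variable, const LARGE_INTEGER *timeout )
{
    NTSTATUS status;

    InterlockedIncrement( (LONG *)&variable->Ptr );
    status = NtWaitForKeyedEvent( keyed_event, &variable->Ptr, FALSE, timeout );
    if (status == STATUS_TIMEOUT)
    {
        /* If a waker already consumed our count, requeue with no timeout so
         * that its pending release has someone to wake. */
        if (!interlocked_dec_if_nonzero( (int *)&variable->Ptr ))
            status = NtWaitForKeyedEvent( keyed_event, &variable->Ptr, FALSE, NULL );
    }
    return status;
}

The race follows from the assumption in wake_all(): if another thread re-enters sleep_on() before the release loop has finished, it queues ahead of the requeued, timed-out waiter and absorbs the second release, and the timed-out waiter then blocks forever in its untimed requeue.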
--- Comment #6 from Brendan McGrath brendan@redmandi.com ---
That certainly explains what we see and why the problem doesn't occur in the futex implementation.
I guess the current implementation needs a fix regardless (given kernel futex support won't always be available).
--- Comment #7 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Brendan McGrath from comment #6)
> That certainly explains what we see and why the problem doesn't occur in the futex implementation.
> I guess the current implementation needs a fix regardless (given kernel futex support won't always be available).
Yes, definitely. I'm not entirely sure how, though. I'll have to give it some thought.
Possibly we can implement it on top of RtlWaitOnAddress(), which deserves a futex path anyway.
--- Comment #8 from Zebediah Figura z.figura12@gmail.com ---
Created attachment 62905
  --> https://bugs.winehq.org/attachment.cgi?id=62905
ntdll: Always release all threads in RtlWakeAllConditionVariable().
Does this patch (applied to vanilla wine) fix the bug?
--- Comment #9 from Brendan McGrath brendan@redmandi.com ---
Created attachment 62907
  --> https://bugs.winehq.org/attachment.cgi?id=62907
Logs of Elite Dangerous failing with new patch
I actually tried that exact patch myself - but it locks up even earlier. At the time I didn't investigate why - but this time I did.
The attached tar has two files:
- ED_loop.log
- ED_loop_trun.log
ED_loop_trun.log is just a truncation of ED_loop.log showing the bit that I think is important - which is the beginning of a loop that never ends:

2823.864:0037:fixme:ntdll:RtlSleepConditionVariableSRW woken: 00000000
2823.864:0037:fixme:ntdll:RtlWakeAllConditionVariable 0x510418, 6
2823.864:0037:fixme:ntdll:RtlWakeAllConditionVariable waking 0x510418,6
2823.864:004a:fixme:ntdll:RtlSleepConditionVariableSRW 0x510418, 6
2823.864:004c:fixme:ntdll:RtlSleepConditionVariableSRW woken: 00000000
2823.864:0037:fixme:ntdll:RtlWakeAllConditionVariable waking 0x510418,5
2823.864:004c:fixme:ntdll:RtlSleepConditionVariableSRW 0x510418, 5
2823.864:0037:fixme:ntdll:RtlWakeAllConditionVariable waking 0x510418,5
2823.864:004d:fixme:ntdll:RtlSleepConditionVariableSRW woken: 00000000
2823.864:004d:fixme:ntdll:RtlSleepConditionVariableSRW 0x510418, 5
2823.864:004f:fixme:ntdll:RtlSleepConditionVariableSRW woken: 00000000
2823.864:0037:fixme:ntdll:RtlWakeAllConditionVariable waking 0x510418,5
2823.864:004f:fixme:ntdll:RtlSleepConditionVariableSRW 0x510418, 5
What looks to be happening is that every time a thread is woken, it quickly adds itself back to the queue, which means the loop never gets to finish (as the value in variable->Ptr keeps going up as quickly as it's going down).
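In other words, if the wake loop keeps draining variable->Ptr until it reaches zero, newly arriving sleepers can keep it from ever reaching zero. A minimal sketch of that shape (an assumption about what the patch roughly does, not the actual attachment; keyed_event is again a stand-in handle):

/* Drain-until-zero wake loop (sketch).  Each woken thread immediately calls
 * SleepConditionVariable* again and increments Ptr, so on a busy condition
 * variable the outer loop can chase the counter indefinitely. */
void wake_all_drain( RTL_CONDITION_VARIABLE *variable )
{
    int val;
    while ((val = InterlockedExchange( (LONG *)&variable->Ptr, 0 )) > 0)
    {
        while (val-- > 0)
            NtReleaseKeyedEvent( keyed_event, &variable->Ptr, FALSE, NULL );
        /* By the time we loop around, re-queued sleepers may have pushed
         * Ptr back up again. */
    }
}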
The next thing I tried was to move the WakeAll down to the server (in the hope that I could then provide a locking mechanism to make it atomic). But I never got the WakeAll working at the server level. I think I need to look more closely at what SELECT_KEYED_EVENT_WAIT and SELECT_KEYED_EVENT_RELEASE are doing.
--- Comment #10 from Zebediah Figura z.figura12@gmail.com ---
Yeah, I was admittedly sort of afraid something like that might happen. Needs more thought.
--- Comment #11 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Brendan McGrath from comment #9)
> The next thing I tried was to move the WakeAll down to the server (in the hope that I could then provide a locking mechanism to make it atomic). But I never got the WakeAll working at the server level. I think I need to look more closely at what SELECT_KEYED_EVENT_WAIT and SELECT_KEYED_EVENT_RELEASE are doing.
This would probably work, but it's maybe not worthwhile. One alternative is to have an implementation almost identical to the futex path, but which uses RtlWaitOnAddress()/RtlWakeAddress() instead. It'd be horrendously slow, but I think we should be able to use native sync objects on all the platforms we care about.
--- Comment #12 from Brendan McGrath brendan@redmandi.com ---
(In reply to Zebediah Figura from comment #11)
> This would probably work, but it's maybe not worthwhile.
Yeah - I didn't pursue it very hard. Mainly because:
a) I don't know what I'm doing; and
b) You guys do
--- Comment #13 from Brendan McGrath brendan@redmandi.com ---
(In reply to Zebediah Figura from comment #11)
> One alternative is to have an implementation almost identical to the futex path, but which uses RtlWaitOnAddress()/RtlWakeAddress() instead. It'd be horrendously slow, but I think we should be able to use native sync objects on all the platforms we care about.
This works. As a proof of concept I applied the futex patch and modified the futex_wait and futex_wake to be:
static inline int futex_wait( int *addr, int val, struct timespec *timeout )
{
    LARGE_INTEGER li_timeout;
    LARGE_INTEGER *p_li_timeout;

    if (timeout)
    {
        li_timeout.LowPart = timeout->tv_sec * TICKSPERSEC + timeout->tv_nsec;
        p_li_timeout = &li_timeout;
    }
    else
    {
        p_li_timeout = NULL;
    }

    return RtlWaitOnAddress( addr, &val, sizeof(val), p_li_timeout );
}

static inline int futex_wake( int *addr, int val )
{
    if (val == INT_MAX)
        RtlWakeAddressAll(addr);
    else
        RtlWakeAddressSingle(addr);

    return STATUS_SUCCESS;
}
Obviously my 'struct timespec' to LARGE_INTEGER conversion is prone to overflow and my formatting is terrible - but this was just proof of concept.
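For what it's worth, a sketch of what a less fragile conversion might look like (my own illustration, assuming the usual NT convention that a negative LARGE_INTEGER is a relative timeout in 100-nanosecond units; TICKSPERSEC is the 10000000 ticks-per-second constant used above, and the helper name is hypothetical):

/* Convert a relative struct timespec into a relative NT timeout (sketch).
 * Negative means 'relative' and the unit is 100ns; do the arithmetic in
 * 64 bits to avoid the overflow mentioned above. */
static void timespec_to_nt_timeout( const struct timespec *ts, LARGE_INTEGER *out )
{
    out->QuadPart = -((LONGLONG)ts->tv_sec * TICKSPERSEC + (ts->tv_nsec + 99) / 100);
}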
It worked without needing the two recent patches you submitted for RtlWaitOnAddress too.
Andrew Eikum aeikum@codeweavers.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aeikum@codeweavers.com
--- Comment #14 from Brendan McGrath brendan@redmandi.com ---
I'm still hoping to see resolution to this bug prior to the release of version 4.0.

I'm not sure if at this late stage I should concede defeat or if I should:
a) resubmit the Futex patch from staging; or
b) create a patch using RtlWaitOnAddress()/RtlWakeAddress()
The Futex patch does have a noticeable performance benefit over the proof of concept I created (but the proof of concept has plenty of room for improvement).
Or are you currently working on a resolution to this?
--- Comment #15 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Brendan McGrath from comment #14)
> I'm still hoping to see resolution to this bug prior to the release of version 4.0.
>
> I'm not sure if at this late stage I should concede defeat or if I should:
> a) resubmit the Futex patch from staging; or
> b) create a patch using RtlWaitOnAddress()/RtlWakeAddress()
>
> The Futex patch does have a noticeable performance benefit over the proof of concept I created (but the proof of concept has plenty of room for improvement).
>
> Or are you currently working on a resolution to this?
Yes, I am currently working on a resolution to this. It's likely that a path using RtlWaitOnAddress() is the best way to go, but that will need bug 46099 fixed first.
It's likely this won't be ready for 4.0, which is unfortunate, but in that case it'll just have to wait until after code freeze. If nothing else the futex patch will stay in Staging until it's ready for upstream.
--- Comment #16 from Brendan McGrath brendan@redmandi.com ---
OK - thanks for the update Zebediah. wine-staging has served us well up until now so I'm sure it will continue to do so.
--- Comment #17 from Zebediah Figura z.figura12@gmail.com ---
This should be fixed now:
https://source.winehq.org/git/wine.git/commitdiff/954ed39584ba30fe194558cfbafa84aa1d3b2844
Please retest with current git or when 4.2 is released.
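For illustration, a rough sketch of the general wait-on-address approach discussed in comments #7 and #11 (this is a sketch of the idea, not necessarily the code in the linked commit): the condition variable value acts as a generation counter, so a wake that lands between snapshotting the counter and calling RtlWaitOnAddress() simply makes the wait return immediately instead of being lost.

/* Sketch only: wait-on-address style condition variable with Ptr used as a
 * generation counter. */
void wake_all_cv( RTL_CONDITION_VARIABLE *variable )
{
    InterlockedIncrement( (LONG *)&variable->Ptr );   /* bump the generation */
    RtlWakeAddressAll( &variable->Ptr );              /* wake every waiter */
}

NTSTATUS sleep_cv( RTL_CONDITION_VARIABLE *variable, RTL_SRWLOCK *lock,
                   const LARGE_INTEGER *timeout )
{
    int val = *(volatile int *)&variable->Ptr;        /* snapshot before unlocking */
    NTSTATUS status;

    RtlReleaseSRWLockExclusive( lock );
    /* If wake_all_cv() ran after the snapshot, the value no longer matches
     * and RtlWaitOnAddress() returns immediately, so the wake cannot be lost. */
    status = RtlWaitOnAddress( &variable->Ptr, &val, sizeof(val), timeout );
    RtlAcquireSRWLockExclusive( lock );
    return status;
}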
Brendan McGrath brendan@redmandi.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED
--- Comment #18 from Brendan McGrath brendan@redmandi.com ---
Thanks Zebediah. I can confirm I've had 100% success with this patch; so I'm happy to classify this bug as fixed.
Zebediah Figura z.figura12@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
      Fixed by SHA1|                            |954ed39584ba30fe194558cfbafa84aa1d3b2844
--- Comment #19 from Zebediah Figura z.figura12@gmail.com ---
Thanks for testing (and apologies for the delay in upstreaming).
Alexandre Julliard julliard@winehq.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED
--- Comment #20 from Alexandre Julliard julliard@winehq.org ---
Closing bugs fixed in 4.2.