https://bugs.winehq.org/show_bug.cgi?id=53372
--- Comment #9 from Zeb Figura z.figura12@gmail.com ---
I managed to get access to an NVidia machine and was able to reproduce the same deadlock on startup.
As described above, the changes to d3d don't make a lot of sense, so I tried double-checking, and I think I found a different commit that's to blame. It's hard to be sure, because the deadlock is tetchy and sometimes won't reproduce until the dozenth run, but I think the offending commit is:
commit 18ae96e5fb3cbbd53f1a022ba81203de6b431228
Author: Zhiyi Zhang zzhang@codeweavers.com
Date: Mon Apr 25 17:22:16 2022 +0800

    winex11.drv: Lock display when expecting error events.

    If the display is not locked, another thread could take the error event and handle it with the default error handlers and thus not handled by the current thread with the specified error handlers.

    Fix Cladun X2 crash at start.

    Signed-off-by: Zhiyi Zhang zzhang@codeweavers.com
More interestingly, if I look at the process state when the game is hung, I notice that, while the main thread is pegged at 100% CPU (waiting for the CS thread), the CS thread is sleeping, and another CS thread which was already shut down is also sleeping. Further tracing shows that the "old" CS thread is terminated and no longer running any win32 code, but hasn't actually exited. And I'm unable (after a few tries) to reproduce with csmt=0.
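For anyone who wants to look at the same kind of per-thread state on Linux, here is a minimal sketch that reads it from /proc. It is not necessarily how the observation above was made; the helper names parse_stat and thread_states are made up for illustration. In the output, "R" means running and "S" means sleeping.

```python
import os
from pathlib import Path


def parse_stat(line):
    """Return (tid, comm, state) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    tid, _, comm = head.partition(" (")
    return int(tid), comm, tail.split()[0]


def thread_states(pid):
    """List (tid, comm, state) for every thread of a process."""
    states = []
    for stat in sorted(Path(f"/proc/{pid}/task").glob("*/stat")):
        try:
            states.append(parse_stat(stat.read_text()))
        except OSError:
            continue  # the thread went away while we were reading
    return states


if __name__ == "__main__":
    # Inspect this script's own process; substitute the PID of the hung game.
    for tid, comm, state in thread_states(os.getpid()):
        print(tid, comm, state)
```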
My suspicion, although I have no way to verify this, is that the NVidia driver is deadlocking because of a lock ordering problem. I am guessing that it does thread cleanup with pthread_cleanup_push(), and that inside of that it grabs some internal lock (a GLX context lock?) and then calls XLockDisplay(), and that glXCreateContext() grabs the same lock, resulting in a lock inversion when the latter is called while already in XLockDisplay().
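To make the suspected ordering problem concrete, here is a small sketch using plain Python locks. It models only the guess above, not the actual driver: "display" stands in for XLockDisplay(), "driver" stands in for the hypothetical internal driver lock, and run_two_threads is a made-up helper. Timeouts are used so that the demonstration gives up instead of really hanging.

```python
import threading


def run_two_threads(order_a, order_b, timeout=0.3):
    """Run threads A and B, each taking two locks in the given order.

    Returns a dict mapping "A"/"B" to True if that thread got both locks.
    """
    locks = {"display": threading.Lock(), "driver": threading.Lock()}
    holds_first = {"A": threading.Event(), "B": threading.Event()}
    finished = {"A": threading.Event(), "B": threading.Event()}
    got_both = {}

    def worker(me, peer, order):
        first, second = locks[order[0]], locks[order[1]]
        with first:
            holds_first[me].set()
            # Give the peer a chance to take its own first lock, so that an
            # inverted order collides instead of passing by luck.
            holds_first[peer].wait(timeout)
            ok = second.acquire(timeout=timeout)
            if ok:
                second.release()
            got_both[me] = ok
            finished[me].set()
            if not ok:
                # Keep the first lock until the peer has given up as well;
                # a real deadlock would simply never get past this point.
                finished[peer].wait(4 * timeout)

    threads = [
        threading.Thread(target=worker, args=("A", "B", order_a)),
        threading.Thread(target=worker, args=("B", "A", order_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_both


if __name__ == "__main__":
    # A: display lock held, then a context is created (needs the driver lock).
    # B: thread cleanup holds the driver lock, then wants the display lock.
    print(run_two_threads(("display", "driver"), ("driver", "display")))
    # Same order in both threads: no inversion.
    print(run_two_threads(("display", "driver"), ("display", "driver")))
```

With the inverted order neither thread should manage to take its second lock; with a consistent order both should.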
If I'm right, I don't know whose bug this really is. XLockDisplay() is part of libx11, not the X11 protocol, and while libx11 is documented, the behaviour of threading like this doesn't seem to be specified. If I had to give a reading, though, I'd say that since there's nothing in the documentation preventing us from calling glXCreateContext() with a locked display (and since we have a good reason to do so) this is NVidia's bug.
Patrick, does reverting 18ae96e5fb help? I'd expect it to at least get rid of the deadlock (although it's possible that I haven't sufficiently tested and that my analysis is wrong) but it may not get rid of the OOM errors—those may have a separate cause (e.g. the streaming buffer from 66f37aae7e2 is somehow growing too large, which would make a lot more sense...)
And if not, does turning off CSMT help?
I was able to reproduce some OOM errors, but they went away after applying both of the merge requests I linked earlier. (Which are, by the way, both upstream by now.)
Probably the hang on startup should be split out to a different bug in any case.