https://bugs.winehq.org/show_bug.cgi?id=53682
--- Comment #11 from Martin Storsjö <martin@martin.st> ---
Thanks for the pointers! As discussed on irc, now I was able to reproduce it - it didn't reproduce when building the ELF code with Clang, and it didn't reproduce with Ubuntu 20.04's GCC 9.4.0, but it did reproduce with Ubuntu 22.04's GCC 11, and with vanilla GCC 9.2 built on Ubuntu 20.04.
I think I understand the issue now. So basically, __wine_syscall_dispatcher assumes that syscall_frame is at the very bottom of the syscall stack.
When setting up a new syscall_frame on the current stack, and calling __wine_syscall_dispatcher_return, we'd need to make sure that the new syscall_frame is at the exact bottom of the stack.
With local variables in C, there is effectively no such guarantee - the compiler is free to choose whatever stack layout it wants. With the attached patch, which adds the separate "DECLSPEC_NORETURN __attribute__((noinline)) call_user_mode_callback", there's a much greater chance of this still holding up (with a noreturn function, there's little point for the compiler to store anything else on the stack). But the compiler still could. (E.g., I recently looked at how MSVC allocates aligned stack objects, and if e.g. syscall_frame had a larger alignment than the default 16 bytes, MSVC could leave a gap at the bottom of the stack.)
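To illustrate the alignment point with plain arithmetic, here's a minimal sketch. The helper names, the frame size and the addresses are all made up for the example, and it only models a compiler that places the object as high as the alignment allows; it doesn't describe what any particular compiler actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: address of an object of `size` bytes placed as high
 * as possible below `stack_top` while honoring `align` (a power of two). */
static uintptr_t place_below(uintptr_t stack_top, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (stack_top - size) & ~(uintptr_t)(align - 1);
}

/* Bytes left unused between the end of the object and `stack_top`. */
static size_t gap_above(uintptr_t stack_top, size_t size, size_t align)
{
    return (size_t)(stack_top - (place_below(stack_top, size, align) + size));
}
```

With a made-up frame size of 0x330 bytes and a stack top of 0x10000, 16-byte alignment leaves no gap, while 64-byte alignment pushes the object down to 0xfcc0 and leaves 0x10 bytes unused above it, so the frame no longer ends at the exact bottom of the stack.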
So one way of guaranteeing the stack layout is to use an assembly wrapper. Either that's a lot of assembly, or we do a little bit of duplicate work with a small and neat assembly wrapper (setting up a user_callback_frame in a C function, then having the small assembly function just memcpy it to the right stack location and call __wine_syscall_dispatcher_return).
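The "memcpy it to the right stack location" step could look roughly like this when modelled in C. This is only a sketch: struct demo_frame and copy_frame_to_stack_bottom are hypothetical stand-ins (the real frame layout is different), and the real thing would have to be assembly precisely because C gives no control over where the stack pointer ends up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real frame; the fields are made up. */
struct demo_frame
{
    unsigned long regs[4];
    unsigned long restore_flags;
};

/* Copy `frame` so that the copy ends exactly at stack_base + stack_size,
 * i.e. at the very bottom of a downward-growing stack. Returns the address
 * of the copy. */
static void *copy_frame_to_stack_bottom(void *stack_base, size_t stack_size,
                                        const struct demo_frame *frame)
{
    char *dst;

    assert(stack_size >= sizeof(*frame));
    dst = (char *)stack_base + stack_size - sizeof(*frame);
    memcpy(dst, frame, sizeof(*frame));
    return dst;
}
```

The C function can build the frame wherever the compiler likes; only the final copy decides where it lives, which is what makes the layout independent of the compiler's choices.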
There's another small fix that does seem to work for me; it's not a watertight solution, but it can be made good enough. When __wine_syscall_dispatcher gets its syscall_frame, it then also goes on to call the syscall with the stack pointer set exactly to this value. But we can just as well add a small gap between the previous syscall_frame and the stack pointer we set before handing over to the actual syscall implementation. This would waste a couple of bytes of stack, but seems to work for me.
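As a rough sketch of the gap idea, again as address arithmetic only: syscall_stack_pointer is a hypothetical name, the gap size is left to the caller, and rounding down to 16 bytes is my assumption about the ABI's stack alignment, not something taken from the dispatcher code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: instead of starting the syscall with the stack
 * pointer equal to the frame address, start it `gap` bytes lower, rounded
 * down to 16 bytes (assumed ABI stack alignment). */
static uintptr_t syscall_stack_pointer(uintptr_t frame_addr, size_t gap)
{
    assert(frame_addr >= gap);
    return (frame_addr - gap) & ~(uintptr_t)15;
}
```

Anything the syscall implementation then places on the stack starts at least `gap` bytes below the previous frame.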