-- v3: ntdll: Do not abort the thread if the server pipe is broken.
From: Zebediah Figura zfigura@codeweavers.com
Keil UV4 C51 starts a thread and then immediately terminates it, usually while it's in the middle of an NtClose(). This results in the thread dying while holding the fd_cache_handle mutex, which results in a hang from other threads in the process.
We don't need to process SIGQUIT immediately, so just mask it off until we're out of the uninterrupted section.
This will require the next commit to fix the bug.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=54807 --- dlls/ntdll/unix/server.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/dlls/ntdll/unix/server.c b/dlls/ntdll/unix/server.c index 53fcc61ccf3..057efb5e917 100644 --- a/dlls/ntdll/unix/server.c +++ b/dlls/ntdll/unix/server.c @@ -1579,6 +1579,7 @@ size_t server_init_process(void) sigaddset( &server_block_set, SIGIO ); sigaddset( &server_block_set, SIGINT ); sigaddset( &server_block_set, SIGHUP ); + sigaddset( &server_block_set, SIGQUIT ); sigaddset( &server_block_set, SIGUSR1 ); sigaddset( &server_block_set, SIGUSR2 ); sigaddset( &server_block_set, SIGCHLD );
From: Zebediah Figura zfigura@codeweavers.com
Keil UV4 C51 starts a thread and then immediately terminates it, usually while it's in the middle of an NtClose().
The previous commit prevents the SIGQUIT from killing the thread before it exits the fd_cache_section mutex. However, the server will still have closed its side of the socket, resulting in the close_handle server request receiving EPIPE from write(), or zero from read(), at which point the client calls abort_thread() anyway, while still holding the mutex.
As far as I can tell, the point of aborting here is to avoid letting a client process remain forever if the server dies. This is a sensible goal; however, it cannot always work in theory (a thread which never makes a server call will never reach this point and will remain anyway) and, speaking from experience, it almost never works in practice (every time the server dies I have to manually kill most processes).
In summary, if the server intentionally closes a client thread's socket fd, we can rely on the accompanying SIGQUIT to kill the client thread, and if the server dies unexpectedly, the client was most likely going to hang anyway. Since the abort is not really helping, and is now causing problems, just remove it.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=54807 --- dlls/ntdll/unix/server.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/dlls/ntdll/unix/server.c b/dlls/ntdll/unix/server.c index 057efb5e917..ef6f5290efd 100644 --- a/dlls/ntdll/unix/server.c +++ b/dlls/ntdll/unix/server.c @@ -206,7 +206,7 @@ static unsigned int send_request( const struct __server_request_info *req ) }
if (ret >= 0) server_protocol_error( "partial write %d\n", ret ); - if (errno == EPIPE) abort_thread(0); + if (errno == EPIPE) return STATUS_THREAD_IS_TERMINATING; if (errno == EFAULT) return STATUS_ACCESS_VIOLATION; server_protocol_perror( "write" ); } @@ -217,7 +217,7 @@ static unsigned int send_request( const struct __server_request_info *req ) * * Read data from the reply buffer; helper for wait_reply. */ -static void read_reply_data( void *buffer, size_t size ) +static unsigned int read_reply_data( void *buffer, size_t size ) { int ret;
@@ -225,7 +225,7 @@ static void read_reply_data( void *buffer, size_t size ) { if ((ret = read( ntdll_get_thread_data()->reply_fd, buffer, size )) > 0) { - if (!(size -= ret)) return; + if (!(size -= ret)) return STATUS_SUCCESS; buffer = (char *)buffer + ret; continue; } @@ -234,8 +234,7 @@ static void read_reply_data( void *buffer, size_t size ) if (errno == EPIPE) break; server_protocol_perror("read"); } - /* the server closed the connection; time to die... */ - abort_thread(0); + return STATUS_THREAD_IS_TERMINATING; }
@@ -246,9 +245,11 @@ static void read_reply_data( void *buffer, size_t size ) */ static inline unsigned int wait_reply( struct __server_request_info *req ) { - read_reply_data( &req->u.reply, sizeof(req->u.reply) ); - if (req->u.reply.reply_header.reply_size) - read_reply_data( req->reply_data, req->u.reply.reply_header.reply_size ); + unsigned int ret; + ret = read_reply_data( &req->u.reply, sizeof(req->u.reply) ); + if (!ret && req->u.reply.reply_header.reply_size) + ret = read_reply_data( req->reply_data, req->u.reply.reply_header.reply_size ); + if (ret) return ret; return req->u.reply.reply_header.error; }
`read_reply_data` can now return failure. Shall we update its only caller, `wait_reply`, to handle it?
I meant to do that but accidentally forgot; fixed now.
The abandoned thread, detached from the server, is no longer forcibly aborted and thus let loose to do whatever it wants as long as it doesn't involve a server call, and in case of abnormally terminated server, a SIGQUIT might not have even arrived.
The commit message describes why I think this is fine. If the hangup was intentional, the thread will process SIGQUIT once it's ready; if it wasn't, letting the thread hang doesn't make things (much) worse in practice, i.e. the user is already going to have to manually kill a bunch of stuck Wine processes.
On Wed Jan 10 12:07:17 2024 +0000, Zebediah Figura wrote:
`read_reply_data` can now return failure. Shall we update its only
caller, `wait_reply`, to handle it? I meant to do that but accidentally forgot; fixed now.
The abandoned thread, detached from the server, is no longer forcibly
aborted and thus let loose to do whatever it wants as long as it doesn't involve a server call, and in case of abnormally terminated server, a SIGQUIT might not have even arrived. The commit message describes why I think this is fine. If the hangup was intentional, the thread will process SIGQUIT once it's ready; if it wasn't, letting the thread hang doesn't make things (much) worse in practice, i.e. the user is already going to have to manually kill a bunch of stuck Wine processes.
IMHO my problem isn't about actually hung (blocking) processes; they are doomed to get stuck anyway as you said.
My problem is processes that *continue* to do something externally visible without assistance from the server. A nontrivial portion of syscalls don't actually require the server to work; for example, NtWriteFile() can work with fd cache alone as long as it doesn't need to send completion to the server. Other syscalls that *do* contact server will just return `STATUS_THREAD_IS_TERMINATING`, which has ample possibility to confuse the app which won't voluntarily terminate by itself.
I think calling `raise(SIGQUIT);` just before returning `STATUS_THREAD_IS_TERMINATING` will fix all that, while still achieving your purpose. If the server has already sent SIGQUIT, this has virtually no effect; otherwise, the signal will be delievered once we unblock it later.
`receive_fd` is called while `fd_cache_mutex` is held and has `abort_thread(0);` too; maybe it needs similar treatment?
My problem is processes that *continue* to do something externally visible without assistance from the server. A nontrivial portion of syscalls doesn't actually require the server to work; for example, NtWriteFile() can work with fd cache alone as long as it doesn't need to send completion to the server.
I don't think this is a meaningful concern?
Other syscalls that *do* contact server will just return `STATUS_THREAD_IS_TERMINATING`, which has ample possibility to confuse the app which won't voluntarily terminate by itself and will continue doing something weird externally.
I think calling `raise(SIGQUIT);` just before returning `STATUS_THREAD_IS_TERMINATING` will fix all that, while still achieving your purpose. If the server has already sent SIGQUIT, this has virtually no effect; otherwise, the signal will be delievered once we unblock it later.
That probably makes sense, though.
`receive_fd` is called while `fd_cache_mutex` is held and has `abort_thread(0);` too; maybe it needs similar treatment?
It does, but it doesn't affect this application. I don't know if should be added to the series for code freeze in that case.