From: Paul Gofman <pgofman@codeweavers.com>
---
 dlls/kernel32/tests/pipe.c |  2 +-
 server/thread.c            | 36 ++++++++++++++++++++++--------------
 2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/dlls/kernel32/tests/pipe.c b/dlls/kernel32/tests/pipe.c
index aff65fe2864..74cb726110b 100644
--- a/dlls/kernel32/tests/pipe.c
+++ b/dlls/kernel32/tests/pipe.c
@@ -127,7 +127,7 @@ static void _test_not_signaled(unsigned line, HANDLE handle)
 #define test_signaled(h) _test_signaled(__LINE__,h)
 static void _test_signaled(unsigned line, HANDLE handle)
 {
-    DWORD res = WaitForSingleObject(handle, 0);
+    DWORD res = WaitForSingleObject(handle, 50);
     ok_(__FILE__,line)(res == WAIT_OBJECT_0, "WaitForSingleObject returned %lu\n", res);
 }
diff --git a/server/thread.c b/server/thread.c
index 56f57cefd8f..5da980c6c75 100644
--- a/server/thread.c
+++ b/server/thread.c
@@ -1090,9 +1090,23 @@ static inline int is_in_apc_wait( struct thread *thread )
             (thread->wait && (thread->wait->flags & SELECT_INTERRUPTIBLE)));
 }
 
+/* try to find a thread in APC wait */
+static struct thread *find_thread_in_apc_wait( struct process *process )
+{
+    struct thread *thread;
+
+    LIST_FOR_EACH_ENTRY( thread, &process->thread_list, struct thread, proc_entry )
+    {
+        if (thread->state == TERMINATED || !is_in_apc_wait( thread )) continue;
+        return thread;
+    }
+    return NULL;
+}
+
 /* queue an existing APC to a given thread */
 static int queue_apc( struct process *process, struct thread *thread, struct thread_apc *apc )
 {
+    struct thread *candidate;
     struct list *queue;
 
     if (thread && thread->state == TERMINATED && process)
@@ -1100,19 +1114,7 @@ static int queue_apc( struct process *process, struct thread *thread, struct thr
 
     if (!thread)  /* find a suitable thread inside the process */
     {
-        struct thread *candidate;
-
-        /* first try to find a waiting thread */
-        LIST_FOR_EACH_ENTRY( candidate, &process->thread_list, struct thread, proc_entry )
-        {
-            if (candidate->state == TERMINATED) continue;
-            if (is_in_apc_wait( candidate ))
-            {
-                thread = candidate;
-                break;
-            }
-        }
-        if (!thread)
+        if (!(thread = find_thread_in_apc_wait( process )))
        {
             /* then use the first one that accepts a signal */
             LIST_FOR_EACH_ENTRY( candidate, &process->thread_list, struct thread, proc_entry )
@@ -1134,7 +1136,13 @@ static int queue_apc( struct process *process, struct thread *thread, struct thr
     /* send signal for system APCs if needed */
     if (queue == &thread->system_apc && list_empty( queue ) && !is_in_apc_wait( thread ))
     {
-        if (!send_thread_signal( thread, SIGUSR1 )) return 0;
+        if (process && (candidate = find_thread_in_apc_wait( process )))
+        {
+            /* System APC which can be queued to any thread, found a thread in APC wait first. */
+            thread = candidate;
+            if (!(queue = get_apc_queue( thread, apc->call.type ))) return 1;
+        }
+        else if (!send_thread_signal( thread, SIGUSR1 )) return 0;
     }
     /* cancel a possible previous APC with the same owner */
     if (apc->owner) thread_cancel_apc( thread, apc->owner, apc->call.type );
If some thread is already in an APC wait (in server_select), skip the signal and the additional server_select overhead.
Notes:

1. It is tempting to just trivially pass a NULL thread to thread_queue_apc() in async_terminate. But if no thread in APC wait is found, that means a signal will be sent to some random thread, which might be a time-critical thread not doing any I/O; I think that would be very unfortunate (a sketch of this alternative follows below).
2. The added timeout in kernel32/tests/pipe.c. There is a pre-existing minor discrepancy in async I/O handling compared to Windows. If an I/O call satisfies a previously queued overlapped I/O, the overlapped I/O's event is signaled before the I/O call returns on Windows, but not necessarily on Wine, because before the event can be signaled the async has to be processed in the client process and the reply has to be processed on the server. These tests succeeded before the patch only because queuing the async write and then satisfying it with a read is performed in the same thread: when the I/O completing the previous async runs in the same thread, SIGUSR1 was sent during the completing I/O server call and handled on the client before returning from the I/O call. This is probably the least practically interesting case; a somewhat more interesting case is when the prior async read or write is queued from another thread or process, but then the event (usually) won't be signaled before the completing I/O call exits in the other thread, either before or after this patch (the second sketch below illustrates the pattern).
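For reference, a minimal sketch of the alternative rejected in note 1, based on the async_terminate() call site in server/async.c (illustration only, not part of this patch):

```c
/* In server/async.c, async_terminate() currently targets the async's own thread: */
thread_queue_apc( async->thread->process, async->thread, &async->obj, &data );

/* The trivial variant would pass NULL and let queue_apc() pick a thread itself: */
thread_queue_apc( async->thread->process, NULL, &async->obj, &data );

/* But when no thread is in APC wait, queue_apc()'s fallback picks the first
 * thread that accepts a signal, possibly a time-critical thread doing no I/O. */
```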
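To make the timing in note 2 concrete, here is a simplified illustration of the affected test pattern (handle and buffer names are placeholders, not the exact test code):

```c
/* hserver/hclient: the two ends of a named pipe; the server end was
 * created with FILE_FLAG_OVERLAPPED. */
OVERLAPPED ov = {0};
char buf[16] = "data", recvbuf[16];
DWORD bytes;

ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
WriteFile(hserver, buf, sizeof(buf), NULL, &ov);           /* queues an async write */
ReadFile(hclient, recvbuf, sizeof(recvbuf), &bytes, NULL); /* satisfies the async */

/* On Windows, ov.hEvent is already signaled at this point.  On Wine, the
 * write async is completed via a system APC in the client process plus a
 * server reply, so the event may become signaled only shortly after
 * ReadFile() returns; hence the wait timeout added to _test_signaled(). */
test_signaled(ov.hEvent);
```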
Jinoh Kang (@iamahuman) commented about server/thread.c:
```diff
     /* send signal for system APCs if needed */
     if (queue == &thread->system_apc && list_empty( queue ) && !is_in_apc_wait( thread ))
     {
-        if (!send_thread_signal( thread, SIGUSR1 )) return 0;
+        if (process && (candidate = find_thread_in_apc_wait( process )))
```
I think this will keep the test intact. Do you have any specific use case against this?
```suggestion:-0+0
        if (thread != current && process && (candidate = find_thread_in_apc_wait( process )))
```
On Fri Feb 16 17:18:37 2024 +0000, Jinoh Kang wrote:
I think this will keep the test intact. Do you have any specific use case against this?
if (thread != current && process && (candidate = find_thread_in_apc_wait( process )))
I don't immediately see how this helps. It will remove the effect of the patch when the async I/O is queued by the same thread that queued the original async op, which will undesirably limit the optimization's scope. Also, in view of what I described, it would "fix" only a limited and not the most practically interesting aspect of the problem. I suppose the choice here is between:
- going without such a limitation;
- even if it looks like the practically least important case, not touching it and marking the tests todo until we have some way to solve the problem in general (FWIW I tried to think of how to solve that minor sync problem but had no immediate good ideas for how it could be synchronized without introducing convoluted and performance-impacting synchronization).
On Fri Feb 16 17:32:32 2024 +0000, Paul Gofman wrote:
I don't immediately see how this helps. It will remove the effect of the patch when the async I/O is queued by the same thread that queued the original async op, which will undesirably limit the optimization's scope. Also, in view of what I described, it would "fix" only a limited and not the most practically interesting aspect of the problem. I suppose the choice here is between:
- going without such a limitation;
- even if it looks like the practically least important case, not touching it and marking the tests todo until we have some way to solve the problem in general (FWIW I tried to think of how to solve that minor sync problem but had no immediate good ideas for how it could be synchronized without introducing convoluted and performance-impacting synchronization).
Maybe a more interesting idea in this direction would be to make the calling thread which issues the server request always process the async queue itself (one way or another, that should be technically possible). I think this would have the positive effect of not depending on the presence of threads in APC wait, but unfortunately it can only work if the async-queuing thread is from the same process as the one which should process the async. For the same-process case that would solve the races, though (and make this optimization more reliable by not depending on threads waiting in server_select).
On Fri Feb 16 17:46:56 2024 +0000, Paul Gofman wrote:
Maybe a more interesting idea in this direction would be to make the calling thread which issues the server request always process the async queue itself (one way or another, that should be technically possible). I think this would have the positive effect of not depending on the presence of threads in APC wait, but unfortunately it can only work if the async-queuing thread is from the same process as the one which should process the async. For the same-process case that would solve the races, though (and make this optimization more reliable by not depending on threads waiting in server_select).
"process async queue itself always": I mean without signal, directly upon returning from the server call.
On Sat Feb 17 00:09:49 2024 +0000, Paul Gofman wrote:
"process async queue itself always": I mean without signal, directly upon returning from the server call.
As long as there's no regression in real apps it should be fine.
Jinoh Kang (@iamahuman) commented about dlls/kernel32/tests/pipe.c:
```diff
 #define test_signaled(h) _test_signaled(__LINE__,h)
 static void _test_signaled(unsigned line, HANDLE handle)
 {
-    DWORD res = WaitForSingleObject(handle, 0);
+    DWORD res = WaitForSingleObject(handle, 50);
     ok_(__FILE__,line)(res == WAIT_OBJECT_0, "WaitForSingleObject returned %lu\n", res);
 }
```
This relaxes unaffected tests (e.g., testing if `CreateNamedPipeA()` returns a handle that is initially signaled). I think we should either add a timeout param or introduce a new "wait signaled" helper.
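For illustration, one possible shape of such a helper (the name and signature are hypothetical, not part of the patch); `test_signaled()` would then keep its zero timeout for the strict cases:

```c
#define test_signaled_timeout(h,t) _test_signaled_timeout(__LINE__,h,t)
static void _test_signaled_timeout(unsigned line, HANDLE handle, DWORD timeout)
{
    /* wait up to the given timeout for async completions that may lag
     * behind the I/O call on Wine */
    DWORD res = WaitForSingleObject(handle, timeout);
    ok_(__FILE__,line)(res == WAIT_OBJECT_0, "WaitForSingleObject returned %lu\n", res);
}
```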
Jinoh Kang (@iamahuman) commented about dlls/kernel32/tests/pipe.c:
```diff
 #define test_signaled(h) _test_signaled(__LINE__,h)
 static void _test_signaled(unsigned line, HANDLE handle)
 {
-    DWORD res = WaitForSingleObject(handle, 0);
+    DWORD res = WaitForSingleObject(handle, 50);
```
This timeout is too short to avoid flakiness. How about 1000ms?
On Sat Feb 17 00:15:27 2024 +0000, Jinoh Kang wrote:
This timeout is too short to avoid flakiness. How about 1000ms?
I agree with both comments (while I don't see how 50ms is risky in such a case, making it 1000ms also won't hurt).
However, before adjusting and resending the patches for nits, I'd wait for general comments on whether this is going forward at all.
On Sat Feb 17 00:19:25 2024 +0000, Paul Gofman wrote:
I agree with both comments (while I don't see how 50ms is risky in such a case, making it 1000ms also won't hurt). However, before adjusting and resending the patches for nits, I'd wait for general comments on whether this is going forward at all.
FWIW my own major concern with the patch is that the (hopefully) upcoming ntsync changes should render it useless (as it essentially already is with the out-of-tree esync/fsync patches). In fact, adding a facility to request APC processing from the calling thread upon server call exit (as I suggested in another comment) also won't help much, as in many practically interesting cases the async completion is triggered by a server fd poll without any calling thread present (named pipes, with I/O performed fully through the server, are an exception, but probably less interesting perf-wise than sockets).
On Sat Feb 17 00:24:56 2024 +0000, Paul Gofman wrote:
FWIW my own major concern with the patch is that the (hopefully) upcoming ntsync changes should render it useless (as it essentially already is with the out-of-tree esync/fsync patches). In fact, adding a facility to request APC processing from the calling thread upon server call exit (as I suggested in another comment) also won't help much, as in many practically interesting cases the async completion is triggered by a server fd poll without any calling thread present (named pipes, with I/O performed fully through the server, are an exception, but probably less interesting perf-wise than sockets).
A faster sync will speed up the wait operation for sure, but I don't yet understand how the faster sync will make this patch unnecessary.
On Sat Feb 17 00:51:03 2024 +0000, Jinoh Kang wrote:
A faster sync will speed up the wait operation for sure, but I don't yet understand how the faster sync will make this patch unnecessary.
It won't make it unnecessary, it will make it useless with working ntsync. Even with fsync already (which leaves some object waits unsupported), I observe in some real games that we seriously lack any threads sitting in APC wait (waiting in server_select) which could be candidates to process those asyncs. I think normally we won't have any such threads unless some suspended thread exists, which is a rare case.
Jinoh Kang (@iamahuman) commented about dlls/kernel32/tests/pipe.c:
```diff
 #define test_signaled(h) _test_signaled(__LINE__,h)
 static void _test_signaled(unsigned line, HANDLE handle)
 {
-    DWORD res = WaitForSingleObject(handle, 0);
+    DWORD res = WaitForSingleObject(handle, 50);
```
(while I don't see how 50ms is risky in such a case, making it 1000ms also won't hurt).
Right, it isn't risky when run standalone. My main concern is that with testbot batch/concurrent runs, (k)vm threads randomly starve for an arbitrary amount of time, resulting in spurious failures. Scheduler issue or not, this is probably not going to be fixed anytime soon.
On Sat Feb 17 01:00:37 2024 +0000, Paul Gofman wrote:
It won't make it unnecessary, it will make it useless with working ntsync. Even with fsync already (which leaves some object waits unsupported), I observe in some real games that we seriously lack any threads sitting in APC wait (waiting in server_select) which could be candidates to process those asyncs. I think normally we won't have any such threads unless some suspended thread exists, which is a rare case.
If I understood your comments correctly, this patch
1. doesn't help Proton *with esync/fsync enabled*, nor upstream Wine with ntsync, and
2. may end up causing a regression if some app actually turns out to rely on the "least practically interesting case" you have described.
Of course, Proton works with esync/fsync disabled. For completeness, would you mind sharing what (or how many) games are affected by SIGUSR1 perf?[^1]
Also, if this fix isn't urgent, I wonder if we should instead spend some time to find a more general mechanism that matches Windows behavior more closely (instead of deviating from it further and causing potential regression).
[^1]: I think this applies to other MRs that touch SIGUSR1 behaviour as well.