Currently usr1_handler always delivers the thread context to the server. With AVX enabled, a single context is around 1k in size (and there are two of those for wow64). I am now working on more generic xstate support in contexts (mostly motivated by AVX512); with AVX512 the single context transferred to the server is ~1k bigger.
The context only needs to be passed to the server from usr1_handler for NtGetContextThread, NtSetContextThread and NtSuspendThread handling (i.e., when stop_thread() is called on the server). The vast majority of usr1_handler invocations are for kernel APCs (e.g., APC_ASYNC_IO, involved in every async operation), which don't need the thread context transferred to the server and back.
My measurements of a single SERVER_START_REQ( select ) in server_select() show that the turnaround time of the request with the context (on native x64 without a wow context) is almost twice as long on average when the currently supported AVX context is involved (that is, on every usr1_handler invocation on a machine supporting AVX). So this patch is expected to increase the time of the relatively rare calls which actually need contexts by roughly 50% (they now take an extra round trip: a context-less select that fails with STATUS_MORE_PROCESSING_REQUIRED, followed by a repeated select with the context), but to decrease the turnaround time of the frequent calls involving system APCs by 50%. The difference will be even more in favour of this patch once the huge AVX512 contexts are added.
-- v2: ntdll: Avoid sending context in wait_suspend() when not required.
From: Paul Gofman <pgofman@codeweavers.com>
---
 dlls/ntdll/unix/server.c        |  7 ++++---
 dlls/ntdll/unix/signal_arm.c    |  2 ++
 dlls/ntdll/unix/signal_arm64.c  |  2 ++
 dlls/ntdll/unix/signal_i386.c   |  2 ++
 dlls/ntdll/unix/signal_x86_64.c |  2 ++
 dlls/ntdll/unix/thread.c        | 12 +++++++-----
 dlls/ntdll/unix/unix_private.h  |  2 +-
 server/protocol.def             |  1 +
 server/thread.c                 |  8 ++++++++
 9 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/dlls/ntdll/unix/server.c b/dlls/ntdll/unix/server.c
index 69e6567c911..f96a08e40b2 100644
--- a/dlls/ntdll/unix/server.c
+++ b/dlls/ntdll/unix/server.c
@@ -693,7 +693,6 @@ unsigned int server_select( const select_op_t *select_op, data_size_t size, UINT
     unsigned int ret;
     int cookie;
     obj_handle_t apc_handle = 0;
-    BOOL suspend_context = !!context;
     apc_result_t result;
     sigset_t old_set;
     int signaled;
@@ -704,6 +703,8 @@ unsigned int server_select( const select_op_t *select_op, data_size_t size, UINT
         context_t context[2];
     } reply_data;
 
+    assert( !context || (flags & SELECT_SUSPEND) );
+
     memset( &result, 0, sizeof(result) );
 
     do
@@ -720,11 +721,11 @@ unsigned int server_select( const select_op_t *select_op, data_size_t size, UINT
            req->size = size;
            wine_server_add_data( req, &result, sizeof(result) );
            wine_server_add_data( req, select_op, size );
-           if (suspend_context)
+           if (context && flags & SELECT_SUSPEND)
            {
                data_size_t ctx_size = (context[1].machine ? 2 : 1) * sizeof(*context);
                wine_server_add_data( req, context, ctx_size );
-               suspend_context = FALSE; /* server owns the context now */
+               flags &= ~SELECT_SUSPEND; /* server owns the context now */
            }
            wine_server_set_reply( req, &reply_data,
                                   context ? sizeof(reply_data) : sizeof(reply_data.call) );
diff --git a/dlls/ntdll/unix/signal_arm.c b/dlls/ntdll/unix/signal_arm.c
index 5115fa7ec3a..46f8d93cdb4 100644
--- a/dlls/ntdll/unix/signal_arm.c
+++ b/dlls/ntdll/unix/signal_arm.c
@@ -1518,6 +1518,8 @@ static void usr1_handler( int signal, siginfo_t *siginfo, void *sigcontext )
 {
     CONTEXT context;
 
+    if (wait_suspend( NULL ) != STATUS_MORE_PROCESSING_REQUIRED) return;
+
     if (is_inside_syscall( sigcontext ))
     {
         context.ContextFlags = CONTEXT_FULL;
diff --git a/dlls/ntdll/unix/signal_arm64.c b/dlls/ntdll/unix/signal_arm64.c
index f96ec330796..ac8d23f918e 100644
--- a/dlls/ntdll/unix/signal_arm64.c
+++ b/dlls/ntdll/unix/signal_arm64.c
@@ -1485,6 +1485,8 @@ static void usr1_handler( int signal, siginfo_t *siginfo, void *sigcontext )
 {
     CONTEXT context;
 
+    if (wait_suspend( NULL ) != STATUS_MORE_PROCESSING_REQUIRED) return;
+
     if (is_inside_syscall( sigcontext ))
     {
         context.ContextFlags = CONTEXT_FULL;
diff --git a/dlls/ntdll/unix/signal_i386.c b/dlls/ntdll/unix/signal_i386.c
index 751b0081534..f7ee66b926c 100644
--- a/dlls/ntdll/unix/signal_i386.c
+++ b/dlls/ntdll/unix/signal_i386.c
@@ -2129,6 +2129,8 @@ static void usr1_handler( int signal, siginfo_t *siginfo, void *sigcontext )
     struct xcontext xcontext;
 
     init_handler( sigcontext );
+    if (wait_suspend( NULL ) != STATUS_MORE_PROCESSING_REQUIRED) return;
+
     if (is_inside_syscall( sigcontext ))
     {
         DECLSPEC_ALIGN(64) XSTATE xs;
diff --git a/dlls/ntdll/unix/signal_x86_64.c b/dlls/ntdll/unix/signal_x86_64.c
index 53827629af4..e8f8cda53d4 100644
--- a/dlls/ntdll/unix/signal_x86_64.c
+++ b/dlls/ntdll/unix/signal_x86_64.c
@@ -2151,6 +2151,8 @@ static void usr1_handler( int signal, siginfo_t *siginfo, void *sigcontext )
     ucontext_t *ucontext = init_handler( sigcontext );
     struct xcontext context;
 
+    if (wait_suspend( NULL ) != STATUS_MORE_PROCESSING_REQUIRED) return;
+
     if (is_inside_syscall( ucontext ))
     {
         DECLSPEC_ALIGN(64) XSTATE xs;
diff --git a/dlls/ntdll/unix/thread.c b/dlls/ntdll/unix/thread.c
index 9e84ec3cc96..c67e29e47fd 100644
--- a/dlls/ntdll/unix/thread.c
+++ b/dlls/ntdll/unix/thread.c
@@ -1458,16 +1458,18 @@ void exit_process( int status )
 *
 *           Wait until the thread is no longer suspended.
 */
-void wait_suspend( CONTEXT *context )
+NTSTATUS wait_suspend( CONTEXT *context )
 {
     int saved_errno = errno;
     context_t server_contexts[2];
+    NTSTATUS status;
 
-    contexts_to_server( server_contexts, context );
+    if (context) contexts_to_server( server_contexts, context );
     /* wait with 0 timeout, will only return once the thread is no longer suspended */
-    server_select( NULL, 0, SELECT_INTERRUPTIBLE, 0, server_contexts, NULL );
-    contexts_from_server( context, server_contexts );
+    status = server_select( NULL, 0, SELECT_INTERRUPTIBLE | SELECT_SUSPEND, 0, context ? server_contexts : NULL, NULL );
+    if (context) contexts_from_server( context, server_contexts );
     errno = saved_errno;
+    return status;
 }
 
 
@@ -1513,7 +1515,7 @@ NTSTATUS send_debug_event( EXCEPTION_RECORD *rec, CONTEXT *context, BOOL first_c
         select_op.wait.handles[0] = handle;
 
         contexts_to_server( server_contexts, context );
-        server_select( &select_op, offsetof( select_op_t, wait.handles[1] ), SELECT_INTERRUPTIBLE,
+        server_select( &select_op, offsetof( select_op_t, wait.handles[1] ), SELECT_INTERRUPTIBLE | SELECT_SUSPEND,
                        TIMEOUT_INFINITE, server_contexts, NULL );
 
         SERVER_START_REQ( get_exception_status )
diff --git a/dlls/ntdll/unix/unix_private.h b/dlls/ntdll/unix/unix_private.h
index 07f2724eac7..1c8016e6471 100644
--- a/dlls/ntdll/unix/unix_private.h
+++ b/dlls/ntdll/unix/unix_private.h
@@ -218,7 +218,7 @@ extern NTSTATUS init_thread_stack( TEB *teb, ULONG_PTR limit, SIZE_T reserve_siz
 extern void DECLSPEC_NORETURN abort_thread( int status );
 extern void DECLSPEC_NORETURN abort_process( int status );
 extern void DECLSPEC_NORETURN exit_process( int status );
-extern void wait_suspend( CONTEXT *context );
+extern NTSTATUS wait_suspend( CONTEXT *context );
 extern NTSTATUS send_debug_event( EXCEPTION_RECORD *rec, CONTEXT *context, BOOL first_chance );
 extern NTSTATUS set_thread_context( HANDLE handle, const void *context, BOOL *self, USHORT machine );
 extern NTSTATUS get_thread_context( HANDLE handle, void *context, BOOL *self, USHORT machine );
diff --git a/server/protocol.def b/server/protocol.def
index 5d60e7fcda3..86019a1ad32 100644
--- a/server/protocol.def
+++ b/server/protocol.def
@@ -1237,6 +1237,7 @@ typedef struct
 @END
 #define SELECT_ALERTABLE     1
 #define SELECT_INTERRUPTIBLE 2
+#define SELECT_SUSPEND       4
 
 
 /* Create an event */
diff --git a/server/thread.c b/server/thread.c
index 56f57cefd8f..99dda8f7287 100644
--- a/server/thread.c
+++ b/server/thread.c
@@ -1656,6 +1656,14 @@ DECL_HANDLER(select)
         release_object( apc );
     }
 
+    if (!ctx_count && current->context && req->flags & SELECT_SUSPEND)
+    {
+        /* select is called from signal and thread context is required. */
+        set_error( STATUS_MORE_PROCESSING_REQUIRED );
+        reply->signaled = 1;
+        return;
+    }
+
     reply->signaled = select_on( &select_op, op_size, req->cookie, req->flags, req->timeout );
 
     if (get_error() == STATUS_USER_APC && get_reply_max_size() >= sizeof(apc_call_t))
Is it just data copying that makes the difference? I guess we could try to use some context compaction similar to XSAVEC and not transfer zeroed extended context areas at all, which would hopefully cover the majority of cases.
Yes, the only difference I see in this measurement is transferring a lot of data in the server request (capturing the context in sig_usr1 is additional and not included in the time I measured).
I thought of varying the data size in the context request, but the context size won't be smaller if the state is actually present. It is probably less of an issue with AVX, where it is normal to clean up the state after use. But as far as I can preliminarily see, clearing avx512 regs after use is not a thing (I could not find an analogue of vzeroupper even to clear all the avx512 state without xrstor); probably once any code has used it, the state is going to stay dirty. Also, it seems to me introducing variable context size in the server context is more complicated than this patch, or am I wrong?
Jinoh Kang (@iamahuman) commented about server/thread.c:
         release_object( apc );
     }
 
+    if (!ctx_count && current->context && req->flags & SELECT_SUSPEND)
+    {
+        /* select is called from signal and thread context is required. */
+        set_error( STATUS_MORE_PROCESSING_REQUIRED );
+        reply->signaled = 1;
+        return;
+    }
Can we move this to `check_wait()`, ideally just after `return STATUS_KERNEL_APC;`?
This introduces a new "failure" mode of the `select` request, which will cause the `server_select` loop to be interrupted in a way that I find hard to reason about.

I don't see how the more-processing-required condition is fundamentally different from other signalling conditions that satisfy the wait, either.
On Wed Jan 24 16:23:35 2024 +0000, Paul Gofman wrote:
> Yes, the only difference I see in this measurement is transferring a lot of data in the server request (capturing the context in sig_usr1 is additional and not included in the time I measured).
>
> I thought of varying the data size in the context request, but the context size won't be smaller if the state is actually present. It is probably less of an issue with AVX, where it is normal to clean up the state after use. But as far as I can preliminarily see, clearing avx512 regs after use is not a thing (I could not find an analogue of vzeroupper even to clear all the avx512 state without xrstor); probably once any code has used it, the state is going to stay dirty. Also, it seems to me introducing variable context size in the server context is more complicated than this patch, or am I wrong?
Isn't an analogous price for dirty AVX512 paid on Windows then? I guess XSAVEOPT may partially mitigate that, but still.
> Also, it seems to me introducing variable context size in the server context is more complicated than this patch, or am I wrong?
It'd likely be more complicated. I don't know if it'd be worth it, but trading performance of one code path for another does not look ideal, so I wonder if there is a better way.
Jinoh Kang (@iamahuman) commented about server/protocol.def:
 @END
 #define SELECT_ALERTABLE     1
 #define SELECT_INTERRUPTIBLE 2
+#define SELECT_SUSPEND       4
Would one of the following names be more descriptive?
```suggestion:-0+0
#define SELECT_CLIENT_OWNS_CTX 4
```

```suggestion:-0+0
#define SELECT_CLIENT_HAS_CTX 4
```
On Wed Jan 24 16:29:10 2024 +0000, Jinoh Kang wrote:
> Can we move this to `check_wait()`, ideally just after `return STATUS_KERNEL_APC;`?
>
> I don't see how the more-processing-required condition is fundamentally different from other signalling conditions that satisfy the wait.
>
> Abruptly returning introduces a new "error" mode of the `select` request, which will cause the `server_select` loop to be interrupted in a way that I find hard to reason about. For example, it'll interrupt a stream of incoming kernel APCs.[^1] This in particular is probably not an issue, but I'm afraid there are other undiscovered failure modes introduced by this special casing.
>
> [^1]: Another thread may call `NtGetContextThread()` while the current thread is still in the `server_select()` loop processing kernel APCs.
New "failure" mode is probably unavoidable, as in the logic of the patch server_select() should fail in this case and indicate that it should be rerun with the context provided and I don't see how that can be done by normal server_select status. This failure should happen before any actual wait because if the thread is being suspended it should provide the context first. It is probably only a matter of where an when the check is performed. Doing that in (the depths of) actual select looks counter intuitive to me and very hard to follow, IMO explicit handling in (select) handler is more straightforward. I think that interrupting stream of system APCs is the desired effect, it may continue with those after providing the context.
> Another thread may call `NtGetContextThread()` while the current thread is still in the `server_select()` loop processing kernel APCs.
This is currently problematic exactly the same way before and after this patch. See kernel32/tests:sync.c test_apc_deadlock().
On Wed Jan 24 16:23:35 2024 +0000, Jacek Caban wrote:
> Isn't an analogous price for dirty AVX512 paid on Windows then? I guess XSAVEOPT may partially mitigate that, but still.
>
> > Also, it seems to me introducing variable context size in the server context is more complicated than this patch, or am I wrong?
>
> It'd likely be more complicated. I don't know if it'd be worth it, but trading performance of one code path for another does not look ideal, so I wonder if there is a better way.
I am not sure if Windows actually saves the full state on every syscall. In theory that can be avoided there if the kernel part doesn't clobber the avx512 state (and only saves parts of the state when needed); I am not sure I can establish that for sure (though I guess it more likely saves it in full at once?). What I tested so far is that some arbitrary system calls do not reset avx512 state. Maybe in theory we could go for some optimization here and reset the state on some syscalls (as the state is volatile and the syscall ABI doesn't demand it be preserved), even if that doesn't match Windows. But there is the risk that we will encounter something depending on that behaving like Windows.
I am not sure how xsaveopt can help? It is an older analogue of xsavec, which we already use. xsavec is like xsaveopt in terms of not saving state components which are in the INIT state, it just additionally uses the compacted save area layout. Either of those helps when we have a cleared state, but if the avx512 regs are not zero they will have to be saved anyway.
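For reference, a minimal sketch of the xsavec behaviour being discussed (not Wine code; the buffer size and the AVX-512 mask value are illustrative, and it needs a CPU and compiler with XSAVEC support, e.g. gcc -mxsave -mxsavec). The XSTATE_BV bitmap in the XSAVE header records which components were actually written; components in their INIT state are skipped, which is exactly why compaction only helps when the registers have been cleared:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* AVX-512 component bits in XSTATE_BV: 5 = opmask, 6 = ZMM_Hi256, 7 = Hi16_ZMM */
#define XSTATE_MASK_AVX512 (7ull << 5)

static _Alignas(64) char xsave_buf[8192];  /* illustrative size, large enough for all components */

/* Returns nonzero if the AVX-512 state is dirty and would have to be transferred in full. */
static int avx512_state_in_use(void)
{
    uint64_t xstate_bv;

    _xsavec( xsave_buf, ~0ull );  /* compacted save: INIT-state components are not written */
    memcpy( &xstate_bv, xsave_buf + 512, sizeof(xstate_bv) );  /* XSTATE_BV lives in the XSAVE header */
    return (xstate_bv & XSTATE_MASK_AVX512) != 0;
}
```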
> but trading performance of one code path for another does not look ideal, so I wonder if there is a better way.
Sure, but in the absence of those, can't we trade a smaller perf drop in rare calls for a bigger gain in very frequent ones? As the only alternative approach I can see so far: do you think we could find some other signal to deliver the request to do server_select with the context?
On Wed Jan 24 16:32:16 2024 +0000, Jinoh Kang wrote:
> Would one of the following names be more descriptive?
>
> #define SELECT_CLIENT_OWNS_CTX 4
>
> #define SELECT_CLIENT_HAS_CTX 4
I guess `SELECT_CLIENT_HAS_CTX` is clearer, yeah. I'd wait for the general decision about the patch to be established before sending such changes.
On Wed Jan 24 16:54:00 2024 +0000, Paul Gofman wrote:
> I am not sure if Windows actually saves the full state on every syscall. In theory that can be avoided there if the kernel part doesn't clobber the avx512 state (and only saves parts of the state when needed); I am not sure I can establish that for sure (though I guess it more likely saves it in full at once?). What I tested so far is that some arbitrary system calls do not reset avx512 state. Maybe in theory we could go for some optimization here and reset the state on some syscalls (as the state is volatile and the syscall ABI doesn't demand it be preserved), even if that doesn't match Windows. But there is the risk that we will encounter something depending on that behaving like Windows.
>
> I am not sure how xsaveopt can help? It is an older analogue of xsavec, which we already use. xsavec is like xsaveopt in terms of not saving state components which are in the INIT state, it just additionally uses the compacted save area layout. Either of those helps when we have a cleared state, but if the avx512 regs are not zero they will have to be saved anyway.
>
> > but trading performance of one code path for another does not look ideal, so I wonder if there is a better way.
>
> Sure, but in the absence of those, can't we trade a smaller perf drop in rare calls for a bigger gain in very frequent ones? As the only alternative approach I can see so far: do you think we could find some other signal to deliver the request to do server_select with the context?
How about `sigqueue(3)` from POSIX.1-2001?
On Wed Jan 24 18:18:03 2024 +0000, Jinoh Kang wrote:
> How about `sigqueue(3)` from POSIX.1-2001?
We can use `sigqueue(3)`, if available, to send a hint about whether to send the context the first time.
Note that the signal argument (`sival`) will work purely as a hint. Even if the server didn't initially request the client context via SIGUSR1, another thread might call NtGetContextThread() in the meantime and make `select()` return `STATUS_MORE_PROCESSING_REQUIRED`.
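For illustration, a self-contained sketch of the payload mechanism (the flag value is hypothetical, and plain sigqueue() is process-directed, which is the limitation discussed below). The handler checks si_code to see whether a payload was queued at all, and treats its absence as "context needed" so it can always fall back to the full transfer:

```c
#define _POSIX_C_SOURCE 199309L
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define HINT_NEED_CONTEXT 1  /* hypothetical payload flag */

static void usr1_handler( int sig, siginfo_t *siginfo, void *ucontext )
{
    /* the payload is only meaningful for queued signals (si_code == SI_QUEUE);
       otherwise assume the worst case and send the context anyway */
    int need_context = (siginfo->si_code == SI_QUEUE)
                       ? siginfo->si_value.sival_int : HINT_NEED_CONTEXT;
    (void)need_context;  /* decide here whether to capture and transfer the context */
}

int main(void)
{
    struct sigaction sa;
    union sigval value = { .sival_int = 0 };  /* hint: no context needed this time */

    memset( &sa, 0, sizeof(sa) );
    sa.sa_sigaction = usr1_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction( SIGUSR1, &sa, NULL );

    sigqueue( getpid(), SIGUSR1, value );  /* process-directed: cannot pick a thread */
    return 0;
}
```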
> Even if the server didn't initially request the client context via SIGUSR1, another thread might call NtGetContextThread() in the meantime and make `select()` return `STATUS_MORE_PROCESSING_REQUIRED`.
I don't think it is a problem with the current patch. select() can't return STATUS_MORE_PROCESSING_REQUIRED from a select called without the possibility to get a context; that is why the added flag is needed. Without the patch, if a thread is in a select called without a context, it doesn't process NtGetContextThread or suspend: for those, SIGUSR1 is always sent (stop_thread() is used on the server). SIGUSR1 will issue a new select "on top" of the current one, then return to the interrupted one once complete. If we are already in a select from sigusr1 with the patch, it may happen that it processes some system APCs first and then gets STATUS_MORE_PROCESSING_REQUIRED; that is fine, and implicitly tested, e.g., in kernel32/tests:sync.c test_apc_deadlock().
> We can use `sigqueue(3)`, if available, to send a hint about whether to send the context the first time.
Could you please elaborate on how you see it being used? First, as far as I can see it sends the signal to a process, while we need a specific thread. Then, what is the difference for our case from the current way of sending the signal?
Meanwhile, if it were possible to send a distinct signal for contexts, that would solve the major concern (worsening the performance of some paths, even if not the most frequent ones), and would also probably simplify things. Technically SIGUSR2 looks like a good candidate, but maybe there are considerations against it which I am not aware of, like it being known to be handled by libraries we depend upon.
@jacek @julliard do you think introducing SIGUSR2 for "stop_thread" handling is a go?
> I am not sure how xsaveopt can help?
xsaveopt may skip storing even dirty parts if they didn't change since the last restore. It's not possible to use it reliably from user space, so that's a possibility for Windows, but not for us.
More generally, the change on its own, without the avx512 context, does not look all that profitable. It's optimizing only a chunk of an overall slow operation. E.g., the interesting metric is how long it takes for an APC to be delivered to the client, and I'd expect the patch to have a minor impact on that. I understand that the impact will be bigger with avx512, but it's hard for me to comment on that without seeing the rest of it. For example, how will context_t look? Will it just grow? If the transfer cost is high, do we want all platforms to pay it?
BTW, while suspending calls are rarer and indeed less time critical, their performance is not meaningless; see bug 45546 for example.
On Wed Jan 24 22:01:20 2024 +0000, Jacek Caban wrote:
> > I am not sure how xsaveopt can help?
>
> xsaveopt may skip storing even dirty parts if they didn't change since the last restore. It's not possible to use it reliably from user space, so that's a possibility for Windows, but not for us.
>
> More generally, the change on its own, without the avx512 context, does not look all that profitable. It's optimizing only a chunk of an overall slow operation. E.g., the interesting metric is how long it takes for an APC to be delivered to the client, and I'd expect the patch to have a minor impact on that. I understand that the impact will be bigger with avx512, but it's hard for me to comment on that without seeing the rest of it. For example, how will context_t look? Will it just grow? If the transfer cost is high, do we want all platforms to pay it?
>
> BTW, while suspending calls are rarer and indeed less time critical, their performance is not meaningless; see bug 45546 for example.
Yes, probably a variable size server context (at least for the transfer, possibly keeping the structure size constant?) is unavoidable; it doesn't make sense for all the archs to pay the cost of the huge x86 xstate.
What do you think about using a separate signal? If that is possible, it would optimize APCs without increasing the cost for any other cases.
> Could you please elaborate on how you see it being used? First, as far as I can see it sends the signal to a process, while we need a specific thread.
Sorry. I confused it with `pthread_sigqueue(3)`.
> Then, what is the difference for our case from the current way of sending the signal?
It enables us to attach a payload to `SIGUSR1`. Of course, if we can allocate a separate signal for ourselves, it seems better to just use that.
> do you think introducing SIGUSR2 for "stop_thread" handling is a go?
FWIW, ARM64 uses SIGUSR2 for its own purpose.
> Sorry. I confused it with `pthread_sigqueue(3)`.
But this is in-process; we need to send from the server.
> It enables us to attach a payload to `SIGUSR1`. Of course, if we can allocate a separate signal for ourselves, it seems better to just use that.
sigqueue doesn't seem to allow specifying a thread for the signal.
If you still think this can be used for our purpose somehow, may I ask you to elaborate a bit more on how exactly?
On Thu Jan 25 01:02:39 2024 +0000, Paul Gofman wrote:
> > Sorry. I confused it with `pthread_sigqueue(3)`.
>
> But this is in-process; we need to send from the server.
>
> > It enables us to attach a payload to `SIGUSR1`. Of course, if we can allocate a separate signal for ourselves, it seems better to just use that.
>
> sigqueue doesn't seem to allow specifying a thread for the signal.
>
> If you still think this can be used for our purpose somehow, may I ask you to elaborate a bit more on how exactly?
There's `SYS_rt_tgsigqueueinfo` on Linux, but that's of course not at all portable.
On Thu Jan 25 17:17:55 2024 +0000, Paul Gofman wrote:
> > Another thread may call `NtGetContextThread()` while the current thread is still in the `server_select()` loop processing kernel APCs.
>
> This is currently problematic exactly the same way before and after this patch. See kernel32/tests:sync.c test_apc_deadlock().
Thanks!
On Thu Jan 25 01:55:19 2024 +0000, Torge Matthies wrote:
> There's `SYS_rt_tgsigqueueinfo` on Linux, but that's of course not at all portable.
It seems like Linux and FreeBSD are the only platforms where Wine can direct signals to specific threads (`send_thread_signal`).
If I'm (hopefully) not mistaken again, we can use rt_tgsigqueueinfo(2) on Linux to send flags while keeping the old path (always send context) for FreeBSD.
Furthermore, if rt_tgsigqueueinfo(2) is for any reason not available (e.g., seccomp, ancient kernel), we can always fall back to full context transmission.
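A rough sketch of that scheme on the sending side (illustrative only, not the actual wineserver code; the payload flag and function name are hypothetical): queue SIGUSR1 to the target thread with the hint attached, and fall back to a plain thread-directed signal, meaning the client always sends the context, when the queueing syscall isn't usable:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Queue SIGUSR1 to thread 'tid' in process 'tgid' with a "context needed" hint. */
static int send_stop_signal( pid_t tgid, pid_t tid, int need_context )
{
#ifdef SYS_rt_tgsigqueueinfo
    siginfo_t si;

    memset( &si, 0, sizeof(si) );
    si.si_signo = SIGUSR1;
    si.si_code  = SI_QUEUE;  /* must be negative when queueing to another process */
    si.si_pid   = getpid();
    si.si_uid   = getuid();
    si.si_value.sival_int = need_context;  /* read back as siginfo->si_value in the handler */
    if (!syscall( SYS_rt_tgsigqueueinfo, tgid, tid, SIGUSR1, &si )) return 0;
    /* ENOSYS: ancient kernel; EPERM may come from a seccomp filter: fall back */
    if (errno != ENOSYS && errno != EPERM) return -1;
#endif
#ifdef SYS_tgkill
    return syscall( SYS_tgkill, tgid, tid, SIGUSR1 );  /* no payload: client sends the full context */
#else
    return kill( tgid, SIGUSR1 );  /* placeholder for platforms without thread-directed signals */
#endif
}
```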
On Thu Jan 25 17:31:14 2024 +0000, Jinoh Kang wrote:
> It seems like Linux and FreeBSD are the only platforms where Wine can direct signals to specific threads (`send_thread_signal`).
>
> If I'm (hopefully) not mistaken again, we can use rt_tgsigqueueinfo(2) on Linux to send flags while keeping the old path (always send context) for FreeBSD.
>
> Furthermore, if rt_tgsigqueueinfo(2) is for any reason not available (e.g., seccomp, ancient kernel), we can always fall back to full context transmission.
rt_tgsigqueueinfo looks interesting to me, thanks.
Anyway, in view of the more generic discussion it looks like it makes more sense to introduce variable sized context transfers (and maybe some other bits of generic xstate support) first.
On Thu Jan 25 18:39:04 2024 +0000, Paul Gofman wrote:
> rt_tgsigqueueinfo looks interesting to me, thanks.
>
> Anyway, in view of the more generic discussion it looks like it makes more sense to introduce variable sized context transfers (and maybe some other bits of generic xstate support) first.
Yes, in the context of avx512 I think we don't need to worry about `SIGUSR1` performance at this point; it would be good to have more of the infrastructure first. I'd expect the syscall dispatcher to be much more sensitive, for example.
For the optimization itself, messing with a separate signal does not seem right. Maybe instead of worrying about signal performance, we could avoid the signal more often in the first place? It seems to me that it would be possible to use any available suspended thread instead of a specific thread to deliver async results much more often. This could potentially make uses of `SIGUSR1` for purposes other than the actual suspend very rare.
Another possibility would be to decouple context availability from thread suspension in the server. For example, we could have a pseudo-APC that the server could use to retrieve the context. With that, the server could immediately consider a waiting thread as suspended and retrieve the context only when it's actually needed. I'm not sure if it's worth the complexity without trying.
> Yes, in the context of avx512 I think we don't need to worry about `SIGUSR1` performance at this point; it would be good to have more of the infrastructure first. I'd expect the syscall dispatcher to be much more sensitive, for example.
Yes, it is a problem. The core issue here is that nowadays everything on Linux uses avx[512] when available, starting from glibc memcpy. And avx512 is never cleared by user code, leaving the huge context in the INUSE state. So avx/avx512 gets clobbered in our syscalls and unix calls, and the avx512 state gets polluted. We currently don't restore (avx) xstate on syscall dispatcher exit, probably purely as an optimization, even though those syscalls are not supposed to clobber xstate while currently they do. My idea here, since we essentially clobber xstate inside syscalls anyway, is that maybe we can stop saving xstate on syscall dispatcher entry altogether, and instead save it only at the start of the functions where it makes sense (I can currently think of only NtGetContextThread and NtSuspendThread). I think it will functionally change nothing, as xstate can already be clobbered anytime during syscall execution and we don't restore it. NtGetContextThread will still be able to return the full correct context as of function entry. I have a WIP patch for that on top of my current generic xstate work.
> For the optimization itself, messing with a separate signal does not seem right. Maybe instead of worrying about signal performance, we could avoid the signal more often in the first place? It seems to me that it would be possible to use any available suspended thread instead of a specific thread to deliver async results much more often. This could potentially make uses of `SIGUSR1` for purposes other than the actual suspend very rare.
It seems like that, per se, probably makes sense to do in any case? The problem, though, is that with the broadly used downstream sync implementations (esync / fsync / msync), and hopefully soon upstream ntsync, server waits will become very rare (present only in some very corner cases), so we still won't be finding a thread waiting in select.
> Another possibility would be to decouple context availability from thread suspension in the server. For example, we could have a pseudo-APC that the server could use to retrieve the context. With that, the server could immediately consider a waiting thread as suspended and retrieve the context only when it's actually needed. I'm not sure if it's worth the complexity without trying.
I don't yet quite understand how that could work (or, maybe, how it is much different from this patch). If retrieving the context always goes through a system APC, then for suspended threads waiting in server_select() already called from usr1_handler, the only way is probably to deliver that request through the select result and make the caller provide the context, which looks very similar to what this patch is doing. Or am I perhaps missing something in the suggestion?
If we are still thinking about something in the direction of this patch, maybe it can be made a bit simpler by not handling the no-context return at the wait_suspend() caller, but just having server_select() not send the context at once, and only doing so when the context has a big xstate, so that other archs and cases without xstate won't require another server call.