Yes, in the context of avx512 I don't think we need to worry about `SIGUSR1` performance at this point; it would be good to have more of the infrastructure first. I'd expect the syscall dispatcher to be much more sensitive, for example.
Yes, it is a problem. The core problem here is that now everything on Linux uses avx[512] when available, starting from glibc memcpy. And avx512 is never cleared by the user, leaving a huge context in INUSE state. So avx/avx512 gets clobbered in our syscalls and unix calls, and avx512 gets polluted. We currently don't restore (avx) xstate on syscall dispatcher exit, probably purely as an optimization? Those syscalls are not supposed to clobber xstate, while currently they do. My idea here is that, since we essentially clobber xstate inside syscalls anyway, maybe we can stop saving xstate on syscall dispatcher entry altogether, and save it only at the start of the functions where it makes sense (I can currently think of only NtGetContextThread and NtSuspendThread). I think it will change nothing functionally, as xstate can already be clobbered at any time during syscall execution and we don't restore it. NtGetContextThread will still be able to return the full correct context as of function entry. I have a WIP patch for that on top of my current generic xstate work.
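To make the idea concrete, here is a minimal model of "capture xstate only where it is needed", not the WIP patch itself. Everything in it is hypothetical: `syscall_frame_model`, `live_xstate` (a plain buffer standing in for the vector registers), the helper names, and the `XSTATE_SIZE` value, which is only illustrative. The point it shows is that a capture done as the very first step of the handler still reflects the state at function entry, even if later code in the same handler clobbers the registers.

```c
#include <stdbool.h>
#include <string.h>

#define XSTATE_SIZE 2688  /* illustrative only; the real size is CPU dependent */

/* Stand-in for the live vector registers of the current thread (assumption). */
static unsigned char live_xstate[XSTATE_SIZE];

/* Hypothetical model of the per-thread syscall frame. */
struct syscall_frame_model
{
    bool          xstate_valid;          /* true once xstate has been captured */
    unsigned char xstate[XSTATE_SIZE];   /* captured copy */
};

/* Dispatcher entry: no xstate save here, only mark the stored copy as stale. */
static void dispatcher_enter( struct syscall_frame_model *frame )
{
    frame->xstate_valid = false;
}

/* Called first thing in a handler that needs the caller's xstate. */
static void capture_xstate( struct syscall_frame_model *frame )
{
    if (frame->xstate_valid) return;
    memcpy( frame->xstate, live_xstate, XSTATE_SIZE );
    frame->xstate_valid = true;
}

/* Models later code in a handler clobbering the registers (e.g. libc memcpy). */
static void clobber_xstate(void)
{
    memset( live_xstate, 0xcc, XSTATE_SIZE );
}

/* Model of a context-reading handler: capture first, then do the actual work. */
static void get_context_model( struct syscall_frame_model *frame, unsigned char *out )
{
    capture_xstate( frame );
    clobber_xstate();
    memcpy( out, frame->xstate, XSTATE_SIZE );
}
```

In this model, handlers that never call `capture_xstate()` pay nothing for the large state, which is the whole motivation.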
For the optimization itself, messing with a separate signal does not seem right. Maybe instead of worrying about signal performance, we could avoid the signal more often in the first place? It seems to me that it would be possible, much more often, to use any available suspended thread instead of a specific thread to deliver the async result. This could potentially make the use of `SIGUSR1` for purposes other than the actual suspend very rare.
That seems to make sense per se, and is probably worth doing in any case? The problem, though, is that with the broadly used downstream sync implementations (esync / fsync / msync) and, hopefully soon, upstream ntsync, server waits will become very rare (present only in some very corner cases), so we still won't be finding a thread waiting in select.
Another possibility would be to decouple context availability from thread suspension in the server. For example, we could have a pseudo-APC that the server could use to retrieve the context. With that, the server could immediately consider a waiting thread as suspended and retrieve the context only when it's actually needed. I'm not sure if it's worth the complexity without trying.
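As a rough illustration of that decoupling, a toy server-side model could look like the following. This is only a sketch under assumptions: `thread_model`, `fetch_context_apc()` and the other names are made up for the example, the "context" is a single int, and the pseudo-APC round trip is reduced to a direct copy with a counter.

```c
#include <stdbool.h>

/* Hypothetical server-side view of a thread. */
struct thread_model
{
    bool suspended;        /* server considers the thread suspended */
    bool context_valid;    /* context was already fetched from the client */
    int  context_fetches;  /* number of pseudo-APC round trips so far */
    int  context;          /* stand-in for the full context */
    int  client_regs;      /* what the client would report when asked */
};

/* Hypothetical pseudo-APC: ask the client for its context. */
static void fetch_context_apc( struct thread_model *t )
{
    t->context = t->client_regs;
    t->context_valid = true;
    t->context_fetches++;
}

/* A thread waiting in the server is marked suspended at once, without context. */
static void suspend_waiting_thread( struct thread_model *t )
{
    t->suspended = true;
    t->context_valid = false;
}

/* The context is retrieved only when somebody actually asks for it. */
static int get_thread_context( struct thread_model *t )
{
    if (!t->context_valid) fetch_context_apc( t );
    return t->context;
}
```

Suspending alone costs no context transfer here, and repeated reads reuse the first fetched copy.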
I don't quite understand yet how that can work (or, maybe, how it is much different from this patch). If retrieving the context always goes through a system APC, then for suspended threads waiting in server_select() already called from usr1_handler the only way is probably to deliver that through the select result and make the caller provide the context, which looks very similar to what this patch is doing. Or am I missing something in the suggestion?
If we still consider something in the direction of the patch, maybe it can be made a bit simpler by not handling the no-context return in the wait_suspend caller, and just having server_select not send the context at once. And do that only if the context has a big xstate, so that other archs, or cases without xstate, won't require another server call.
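The size-gated variant can be sketched as a simple decision function. The threshold `INLINE_XSTATE_LIMIT`, the `context_model` struct and both function names are assumptions made up for this example; the real cutoff would need measuring.

```c
#include <stdbool.h>
#include <stddef.h>

#define INLINE_XSTATE_LIMIT 1024  /* hypothetical cutoff, in bytes */

/* Hypothetical stand-in for a thread context; only the xstate size matters here. */
struct context_model
{
    size_t xstate_size;  /* 0 on archs or CPUs without xstate */
};

/* Small contexts go along with the select request, as they do today. */
static bool send_context_with_select( const struct context_model *ctx )
{
    return ctx->xstate_size <= INLINE_XSTATE_LIMIT;
}

/* Number of server calls a suspend needs in this model. */
static int server_calls_for_suspend( const struct context_model *ctx, bool context_requested )
{
    int calls = 1;  /* the select itself */
    /* Only a big, deferred context that is actually requested costs an extra call. */
    if (!send_context_with_select( ctx ) && context_requested) calls++;
    return calls;
}
```

With this shape, only threads carrying a big xstate whose context is really requested pay for a second server call; everything else keeps the single-call path.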