Currently usr1_handler always delivers the thread context to the server. With AVX enabled, a single context is around 1k in size (and there are two of them for wow64). I am now working on more generic xstate support in contexts (mostly motivated by AVX512); with AVX512, a single context transferred to the server is ~1k bigger.
The context only needs to be passed to the server from usr1_handler for NtGetContextThread, NtSetContextThread and NtSuspendThread handling (e.g., when stop_thread is called on the server). The vast majority of usr1_handler invocations are for kernel APCs (e.g., the APC_ASYNC_IO involved in every async operation), which don't need the thread context transferred to the server and back.
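As a rough illustration of the split described above, the sketch below classifies a wakeup by whether it needs the context and computes the payload that would be attached. All names here (wake_reason, reason_needs_context, request_payload_size) are hypothetical and are not Wine's real types or protocol; how the handler learns the reason for the wakeup is not covered by this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wakeup reasons; illustrative stand-ins only. */
enum wake_reason
{
    WAKE_SYSTEM_APC,   /* e.g. APC_ASYNC_IO: no context needed */
    WAKE_SUSPEND,      /* NtSuspendThread */
    WAKE_GET_CONTEXT,  /* NtGetContextThread */
    WAKE_SET_CONTEXT   /* NtSetContextThread */
};

/* Returns true when the server side needs the thread context. */
static bool reason_needs_context( enum wake_reason reason )
{
    switch (reason)
    {
    case WAKE_SUSPEND:
    case WAKE_GET_CONTEXT:
    case WAKE_SET_CONTEXT:
        return true;
    default:
        return false;
    }
}

/* Context bytes attached to one request: nothing unless the context is needed.
 * context_count would be 2 in the wow64 case mentioned above, 1 otherwise. */
static size_t request_payload_size( enum wake_reason reason, size_t context_size,
                                    unsigned int context_count )
{
    return reason_needs_context( reason ) ? context_size * context_count : 0;
}
```

With ~1k contexts, a system APC wakeup would carry no context bytes in this model, while a suspend on wow64 would carry two contexts.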
My measurements of a single SERVER_START_REQ( select ) in server_select() show that the turnaround time of the request with the context (on native x64, without a wow context) is almost two times longer on average when the currently supported AVX context is involved (that is, on every usr1_handler invocation on a machine supporting AVX). So this patch is expected to increase the time of the relatively rare calls that actually need contexts by roughly 50%, but decrease the turnaround time of the frequent calls involving system APCs by 50%. The difference will be even more in favour of this patch once the huge AVX512 contexts are added.
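To put the trade-off in numbers, here is a small back-of-the-envelope model. It takes the rounded figures above at face value and measures time in units of one context-free request: a request with the AVX context costs about 2 units today, and a call that needs the context is assumed to cost about 3 units with the patch (2 units plus 50%). The function names are made up for this illustration.

```c
/* Mean turnaround per usr1_handler wakeup today: the context is always sent. */
static double mean_turnaround_before( void )
{
    return 2.0;
}

/* Mean turnaround with the patch, where context_fraction is the share of
 * wakeups that really need the context (0.0 .. 1.0). */
static double mean_turnaround_after( double context_fraction )
{
    const double with_context = 3.0;     /* 2 units + roughly 50% */
    const double without_context = 1.0;  /* 2 units - roughly 50% */

    return context_fraction * with_context
           + (1.0 - context_fraction) * without_context;
}
```

Under these assumptions the two costs are equal when half of the wakeups need the context, so the patch comes out ahead whenever such wakeups are the minority, which is the case when system APCs dominate.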