I think something like that could work, although there are a few things I'm still not completely sure about:
Should we actually do it this way, with a -nofpu flag enabled selectively, or do it the other way around, a bit more like the Linux kernel does it, with a -fpusave flag only for the syscalls where it matters?
I also used the high bits of the syscall table number to store some information; it's maybe not ideal, but it is very convenient. Another possibility would be to have a dedicated service table for fast calls.
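To make the high-bits idea concrete, here is a minimal sketch of how a flag could be packed next to the table and index fields. It assumes the usual NT-style layout (low 12 bits for the index, bits 12-13 for the table); the `SYSCALL_FLAG_NOFPU` bit and all helper names are hypothetical and not taken from the actual patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 0-11 = index into the table, bits 12-13 = table id.
 * The flag bit is hypothetical; any bit above the table field would do. */
#define SYSCALL_INDEX_MASK  0x0fffu
#define SYSCALL_TABLE_SHIFT 12
#define SYSCALL_TABLE_MASK  0x3u
#define SYSCALL_FLAG_NOFPU  0x8000u

/* Build a syscall id from its parts. */
static inline uint32_t syscall_make_id(uint32_t table, uint32_t index, uint32_t flags)
{
    return (index & SYSCALL_INDEX_MASK)
         | ((table & SYSCALL_TABLE_MASK) << SYSCALL_TABLE_SHIFT)
         | flags;
}

/* Extract the index; the flag bits are masked off before the table lookup. */
static inline uint32_t syscall_index(uint32_t id)
{
    return id & SYSCALL_INDEX_MASK;
}

/* Extract the table id. */
static inline uint32_t syscall_table(uint32_t id)
{
    return (id >> SYSCALL_TABLE_SHIFT) & SYSCALL_TABLE_MASK;
}

/* Nonzero when the dispatcher may skip the FPU save for this call. */
static inline int syscall_skips_fpu(uint32_t id)
{
    return (id & SYSCALL_FLAG_NOFPU) != 0;
}
```

The convenience is that the dispatcher already has the number in a register, so testing the flag costs one instruction; the downside is that every consumer of the number has to mask the flag bits off, which a dedicated table would avoid.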
Is the suspend usr1_handler the only place where we might be missing some context bits? And is it okay to use the signal context for the missing pieces, or should we rather zero the state?
Could we extend the mechanism and use the signal context to recover more registers that are unlikely to change, like some of the segment registers?
I replaced `mov %gs:0x30,%rcx; mov 0x328(%rcx),%rcx` with `mov %gs:0x328,%rcx`, and it works fine, though I don't know why there was this double indirection in the first place. It costs an extra dependent load on every entry.
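As a sanity check on why the two forms should be equivalent, here is a small C model. It assumes a 64-bit target where the `%gs` base points at the TEB and the slot at offset 0x30 holds the TEB's own address; the struct and field names are made up for illustration, and the model says nothing about platforms where `%gs` is set up differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TEB on a 64-bit target: only the two
 * offsets used by the entry code are modeled. */
struct fake_teb {
    uint8_t          pad0[0x30];
    struct fake_teb *self;               /* offset 0x30: self pointer */
    uint8_t          pad1[0x328 - 0x38];
    void            *frame;              /* offset 0x328: hypothetical field name */
};

_Static_assert(offsetof(struct fake_teb, self) == 0x30, "self must be at 0x30");
_Static_assert(offsetof(struct fake_teb, frame) == 0x328, "frame must be at 0x328");

/* Models: mov %gs:0x30,%rcx; mov 0x328(%rcx),%rcx */
static void *load_double(struct fake_teb *gs_base)
{
    struct fake_teb *teb = gs_base->self; /* first load */
    return teb->frame;                    /* second load depends on the first */
}

/* Models: mov %gs:0x328,%rcx */
static void *load_single(struct fake_teb *gs_base)
{
    return gs_base->frame;                /* single load */
}
```

Both functions return the same value whenever `self` points back at the segment base, so the single-load form is only safe under that assumption.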
What can we do about rflags? Skipping it entirely helps a bit more, but it doesn't seem right.