`%gs` is not set until `call_init_thunk()`; this caused a crash on macOS when starting the second thread in a process.
(Paul was curious why this didn't fail on Linux and found that `%fs` is inherited from the creating thread.)
Also adjust TEB accesses in other assembly functions to be consistent with the surrounding code (using registers, not `%fs`/`%gs`).
Fixes a crash on macOS introduced by 7ae488a2bb58501684c6475d4942277b852475fc ("ntdll: Don't hardcode xstate size in syscall frame.")