I replaced `mov %gs:0x30,%rcx; mov 0x328(%rcx),%rcx` with `mov %gs:0x328,%rcx` and it works fine. Why was the double indirection there in the first place? The extra dependent load slows down every entry.
That's needed for macOS. It will be removed when we implement %gs switching on syscall entry/exit.
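
A rough C model of why the two sequences are not always interchangeable. It rests on an assumption not stated above: that on macOS `%gs` does not point at the thread's own block, and only the slot at offset `0x30` holds a pointer to it. All names here (`load_direct`, `load_via_self`, `init_teb`, `init_foreign`, `AREA_SIZE`) are hypothetical, and `gs_base` is just a byte buffer standing in for whatever `%gs` addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets taken from the two instruction sequences; AREA_SIZE is arbitrary. */
enum { SELF_OFFSET = 0x30, FIELD_OFFSET = 0x328, AREA_SIZE = 0x400 };

/* Models `mov %gs:0x328,%rcx`: one load, straight from the %gs area. */
static uint64_t load_direct(const unsigned char *gs_base)
{
    uint64_t value;
    memcpy(&value, gs_base + FIELD_OFFSET, sizeof(value));
    return value;
}

/* Models `mov %gs:0x30,%rcx; mov 0x328(%rcx),%rcx`:
   first fetch the pointer stored at 0x30, then load the field through it. */
static uint64_t load_via_self(const unsigned char *gs_base)
{
    uintptr_t block;
    uint64_t value;
    memcpy(&block, gs_base + SELF_OFFSET, sizeof(block));
    memcpy(&value, (const unsigned char *)block + FIELD_OFFSET, sizeof(value));
    return value;
}

/* Case 1: %gs points at the block itself, which stores its own address at 0x30. */
static void init_teb(unsigned char *teb, uint64_t field)
{
    uintptr_t self = (uintptr_t)teb;
    memset(teb, 0, AREA_SIZE);
    memcpy(teb + SELF_OFFSET, &self, sizeof(self));
    memcpy(teb + FIELD_OFFSET, &field, sizeof(field));
}

/* Case 2 (assumed macOS-like layout): %gs points at an unrelated area
   where only slot 0x30 has been filled with the block's address. */
static void init_foreign(unsigned char *area, const unsigned char *teb)
{
    uintptr_t target = (uintptr_t)teb;
    memset(area, 0, AREA_SIZE);
    memcpy(area + SELF_OFFSET, &target, sizeof(target));
}
```

In this model both loads agree when `%gs` points at the block itself, which would explain why the single load "works fine" there. When `%gs` points at the foreign area, only `load_via_self` still finds the field.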