Today, NtCurrentTeb() has a single asm statement when compiling for the i386, x86-64, or ARM architecture with GCC. This single asm statement has neither the "volatile" qualifier nor the "memory" clobber.
This can provoke undefined behavior if the current TEB changes between NtCurrentTeb() calls. This is because GCC assumes that the asm statement does not depend on memory or the thread-local register. In fact, given the same address of the "teb" variable, GCC assumes that the asm statement produces exactly the same value on every invocaton.
This primarily causes issues when switching to another fiber from a thread that is different from the thread on which the fiber was last executed. Theoretically, however, this may also cause issues when the optimizer aliases the "teb" variable to another variable that is shared between threads, and perform global optimization that can work across multiple threads in runtime (if any).
Fix this by adding the "memory" clobber to the asm statements that computes the current TEB address.
From: Jinoh Kang jinoh.kang.kr@gmail.com
Today, NtCurrentTeb() has a single asm statement when compiling for the i386, x86-64, or ARM architecture with GCC. This single asm statement has neither the "volatile" qualifier nor the "memory" clobber.
This can provoke undefined behavior if the current TEB changes between NtCurrentTeb() calls. This is because GCC assumes that the asm statement does not depend on memory or the thread-local register. In fact, given the same address of the "teb" variable, GCC assumes that the asm statement produces exactly the same value on every invocaton.
This primarily causes issues when switching to another fiber from a thread that is different from the thread on which the fiber was last executed. Theoretically, however, this may also cause issues when the optimizer aliases the "teb" variable to another variable that is shared between threads, and perform global optimization that can work across multiple threads in runtime (if any).
Fix this by adding the "memory" clobber to the asm statements that computes the current TEB address. --- include/winnt.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/winnt.h b/include/winnt.h index 66f6b7ad809..97a9deac21d 100644 --- a/include/winnt.h +++ b/include/winnt.h @@ -2238,7 +2238,7 @@ struct _TEB; static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { struct _TEB *teb; - __asm__(".byte 0x64\n\tmovl (0x18),%0" : "=r" (teb)); + __asm__(".byte 0x64\n\tmovl (0x18),%0" : "=r" (teb) :: "memory"); return teb; } #elif defined(__i386__) && defined(_MSC_VER) && !defined(WINE_UNIX_LIB) @@ -2253,7 +2253,7 @@ static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { struct _TEB *teb; - __asm__(".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb)); + __asm__(".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb) :: "memory"); return teb; } #elif defined(__x86_64__) && defined(_MSC_VER) && !defined(WINE_UNIX_LIB) @@ -2267,7 +2267,7 @@ static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { struct _TEB *teb; - __asm__("mrc p15, 0, %0, c13, c0, 2" : "=r" (teb)); + __asm__("mrc p15, 0, %0, c13, c0, 2" : "=r" (teb) :: "memory"); return teb; } #elif defined(__arm__) && defined(_MSC_VER) && !defined(WINE_UNIX_LIB)
Wouldn't `volatile` be sufficient in this case?
**EDIT**: IMHO Adding a reference to a global variable as a memory input operand should solve the mentioned problem without preventing optimization too much. ``` static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { extern int dummy; struct _TEB *teb; __asm__(".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb) : "m" (dummy)); return teb; } ``` GCC won't optimize away the asm invocations if there is a function call that could potentially modify the `dummy` variable (any function GCC can't prove it doesn't modify the `dummy` variable) between the asm invocations.
Wouldn't `volatile` be sufficient in this case?
Indeed, it looks like `volatile` is more suitable since it's friendlier to GCC's `__builtin_ia32_wrgsbase64` intrinsic, which is defined in [`gcc/config/i386/i386.md`] (GCC 12.2) as having an [`unspec_volatile`] side-effect but not a memory clobber. A short test at https://godbolt.org/z/G5fWPTjr5 seems to confirm this.
[`gcc/config/i386/i386.md`]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/i386.md;h=be07be... [`unspec_volatile`]: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gccint/Side-Effects.html#index-uns...
**EDIT**: IMHO Adding a reference to a global variable as a memory input operand should solve the mentioned problem without preventing optimization too much.
static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { extern int dummy; struct _TEB *teb; __asm__(".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb) : "m" (dummy)); return teb; }
GCC won't optimize away the asm invocations if there is a function call that could potentially modify the `dummy` variable (any function GCC can't prove it doesn't modify the `dummy` variable) between the asm invocations.
I'm not sure if this works reliably with whole-program link-time optimization.
Closing until I find a better solution for this.
This merge request was closed by Jinoh Kang.
On Mon Mar 20 13:34:56 2023 +0000, Jinoh Kang wrote:
Wouldn't `volatile` be sufficient in this case?
Indeed, it looks like `volatile` is more suitable since it's friendlier to GCC's `__builtin_ia32_wrgsbase64` intrinsic, which is defined in [`gcc/config/i386/i386.md`] (GCC 12.2) as having an [`unspec_volatile`] side-effect but not a memory clobber. A short test at https://godbolt.org/z/G5fWPTjr5 seems to confirm this. [`gcc/config/i386/i386.md`]: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/i386.md;h=be07be... [`unspec_volatile`]: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gccint/Side-Effects.html#index-uns...
**EDIT**: IMHO Adding a reference to a global variable as a memory
input operand should solve the mentioned problem without preventing optimization too much.
static FORCEINLINE struct _TEB * WINAPI NtCurrentTeb(void) { extern int dummy; struct _TEB *teb; __asm__(".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb) : "m" (dummy)); return teb; }
GCC won't optimize away the asm invocations if there is a function
call that could potentially modify the `dummy` variable (any function GCC can't prove it doesn't modify the `dummy` variable) between the asm invocations. I'm not sure if this works reliably with whole-program link-time optimization.
I don't see how can `NtCurrentTeb()` change between calls within the same function ("function" here means what is resulted after inlining)? Do you have a (hypothetical) example of such code?
I don't see how can `NtCurrentTeb()` change between calls within the same function ("function" here means what is resulted after inlining)? Do you have a (hypothetical) example of such code?
TEB changes within `fiberProc` function: ``` #include <windows.h> #include <stdio.h>
static inline __attribute__((always_inline)) void* getTEB() { void* teb; __asm__ (".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb)); return teb; }
static inline __attribute__((always_inline)) void* getTEB_volatile() { void* teb; __asm__ volatile (".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb)); return teb; }
void* testFiber; void* mainThrFiber; void* secondThrFiber;
static void WINAPI fiberProc(void* arg); static DWORD WINAPI threadProc(void* arg);
int main() { mainThrFiber = ConvertThreadToFiber(NULL); testFiber = CreateFiber(0, fiberProc, NULL);
printf("main TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile());
puts("main -> fiber"); SwitchToFiber(testFiber); puts("main <- fiber");
HANDLE hThr = CreateThread(NULL, 0, threadProc, NULL, 0, NULL); WaitForSingleObject(hThr, INFINITE);
return 0; }
DWORD WINAPI threadProc(void* arg) { secondThrFiber = ConvertThreadToFiber(NULL);
printf("thread TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile());
puts("thread -> fiber"); SwitchToFiber(testFiber); puts("thread <- fiber");
return 0; }
void WINAPI fiberProc(void* arg) { printf("fiber TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile()); SwitchToFiber(mainThrFiber); printf("fiber TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile());
SwitchToFiber(secondThrFiber); }
```
On Mon Mar 20 21:06:13 2023 +0000, Mateusz Wajchęprzełóż wrote:
I don't see how can `NtCurrentTeb()` change between calls within the
same function ("function" here means what is resulted after inlining)? Do you have a (hypothetical) example of such code? TEB changes within `fiberProc` function:
#include <windows.h> #include <stdio.h> static inline __attribute__((always_inline)) void* getTEB() { void* teb; __asm__ (".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb)); return teb; } static inline __attribute__((always_inline)) void* getTEB_volatile() { void* teb; __asm__ volatile (".byte 0x65\n\tmovq (0x30),%0" : "=r" (teb)); return teb; } void* testFiber; void* mainThrFiber; void* secondThrFiber; static void WINAPI fiberProc(void* arg); static DWORD WINAPI threadProc(void* arg); int main() { mainThrFiber = ConvertThreadToFiber(NULL); testFiber = CreateFiber(0, fiberProc, NULL); printf("main TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile()); puts("main -> fiber"); SwitchToFiber(testFiber); puts("main <- fiber"); HANDLE hThr = CreateThread(NULL, 0, threadProc, NULL, 0, NULL); WaitForSingleObject(hThr, INFINITE); return 0; } DWORD WINAPI threadProc(void* arg) { secondThrFiber = ConvertThreadToFiber(NULL); printf("thread TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile()); puts("thread -> fiber"); SwitchToFiber(testFiber); puts("thread <- fiber"); return 0; } void WINAPI fiberProc(void* arg) { printf("fiber TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile()); SwitchToFiber(mainThrFiber); printf("fiber TEB=%p; volatile TEB=%p\n", getTEB(), getTEB_volatile()); SwitchToFiber(secondThrFiber); }
I see, thanks for the example with fibers, although this seems purely theoretical as Wine code doesn't use them…
although this seems purely
theoretical as Wine code doesn't use them
FWIW, kernel32:fiber tests can use `NtCurrentTeb()` to test if thread switch has actually happened. A little less useful in other places though.