Roland McGrath wrote:
These people clearly haven't read all of the TLS paper, or looked at the GCC implementation of __thread long enough to notice -ftls-model and __attribute__ ((tls_model)).
This is what I was talking about. I've read the entire document several times, and still can't see a way that a dynamically loadable shared library can be guaranteed to use the single-instruction Local Exec access model. If I'm wrong, please explain why.
I think the TLS document intends to explain what the models mean in practical terms on each architecture, but I can believe it's not all that clear. The GCC manual doesn't explain the access models and code sequences, just tells you how to tell the compiler what you want in the terms that the TLS document defines.
If you want maximal flexibility, i.e. to always work with dlopen, then indeed you must use the "dynamic" TLS access models (GD or LD). You can use the Initial Exec model if you want faster accesses at the cost of some flexibility.
libGL.so simply has to work with dlopen -- if for no other reason than essentially all major 3D games (Quake3, Doom3, UT2003 etc) dlopen libGL.so rather than linking with it. This is not going to change.
When compiling PIC, IE-model accesses have one additional indirection, i.e. loading the offset from the GOT just as the address of a global variable is loaded in PIC. See the instruction sequences in the TLS spec.
I'm pretty sure all implementations of OpenGL are not compiled as PIC at this point in time. That's a whole other discussion, however.
If you use static linking, these instruction sequences reduce to constants at link time (i.e. direct "%gs:NNN" accesses on x86).
Can you describe how I could use static linking here? As I said, libGL.so must be a dynamically loadable shared library. What we want is the single-instruction Local Exec access model. At this point in time, my understanding of the situation is that these are mutually exclusive requirements.
If you link a shared object containing IE-model access relocs, the object will have the DF_STATIC_TLS flag set. By the spec, this means that dlopen might refuse to load it.
As I said, not being able to dlopen libGL.so is unacceptable.
In glibc, we actually allocate some excess space in the thread-local storage area layout determined at startup time. This lets a dynamically loaded module use static TLS if its PT_TLS segment fits in the available surplus. (In sysdeps/generic/dl-tls.c, see TLS_STATIC_SURPLUS.) If there is insufficient space preallocated, then loading the module will fail. In fact, we put this feature there with GL in mind and can adjust the preallocated surplus for what is most useful in practice.
I think the set of performance critical thread-local variables is something like two or three (depending on the implementation). The libGL.so API dispatcher needs fast access to one or two of these (dispatch table pointers), while the driver backend needs fast access to all of them (context pointer and dispatch table pointers). The other thread-local variables are generally not accessed in performance-critical situations.
Another issue I forgot to mention, or forgot to make clear, is that we need to be able to access these thread-local variables in runtime generated code. A driver's top-level API functions are often generated at runtime, and need to be able to do things like switch dispatch tables (obviously, they'd have direct access to the context they were associated with, and so wouldn't need to go through the pointer in TLS). Are we guaranteed that the __thread variables aren't going to move around? How would we work out what code to generate to access a given __thread variable?
(I've included both phil-list and wine-devel; if you'd like this discussion kept to one or the other of these lists, please say so.)
-- Gareth Hughes (gareth@nvidia.com) OpenGL Developer, NVIDIA Corporation
On Sat, Feb 22, 2003 at 09:51:26AM -0800, Gareth Hughes wrote:
Roland McGrath wrote:
These people clearly haven't read all of the TLS paper, or looked at the GCC implementation of __thread long enough to notice -ftls-model and __attribute__ ((tls_model)).
This is what I was talking about. I've read the entire document several times, and still can't see a way that a dynamically loadable shared library can be guaranteed to use the single-instruction Local Exec access model. If I'm wrong, please explain why.
I think the TLS document intends to explain what the models mean in practical terms on each architecture, but I can believe it's not all that clear. The GCC manual doesn't explain the access models and code sequences, just tells you how to tell the compiler what you want in the terms that the TLS document defines.
If you want maximal flexibility, i.e. to always work with dlopen, then indeed you must use the "dynamic" TLS access models (GD or LD). You can use the Initial Exec model if you want faster accesses at the cost of some flexibility.
libGL.so simply has to work with dlopen -- if for no other reason than essentially all major 3D games (Quake3, Doom3, UT2003 etc) dlopen libGL.so rather than linking with it. This is not going to change.
Note the "always" in Roland's paragraph.
In glibc, we actually allocate some excess space in the thread-local storage area layout determined at startup time. This lets a dynamically loaded module use static TLS if its PT_TLS segment fits in the available surplus. (In sysdeps/generic/dl-tls.c, see TLS_STATIC_SURPLUS.) If there is insufficient space preallocated, then loading the module will fail. In fact, we put this feature there with GL in mind and can adjust the preallocated surplus for what is most useful in practice.
I think the set of performance critical thread-local variables is something like two or three (depending on the implementation). The libGL.so API dispatcher needs fast access to one or two of these (dispatch table pointers), while the driver backend needs fast access to all of them (context pointer and dispatch table pointers). The other thread-local variables are generally not accessed in performance-critical situations.
When you say two or three, are these two or three pointers or two or three large tables?
In any case, it sounds like you could:
- Select the thread-local variables that you need fast access to.
- Arrange for those variables to be tagged with __attribute__((tls_model("initial-exec"))), or something similar.
- Make sure the TLS_STATIC_SURPLUS is big enough to hold them.
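As a rough illustration of tagging only the hot variables, here is a minimal sketch in GNU C. The struct, variable, and function names are hypothetical and not taken from any real libGL; the only parts that come from this thread are the __thread keyword and the tls_model attribute. Whether the resulting shared object can be dlopened still depends on the static TLS surplus discussed above.

```c
#include <stddef.h>

/* Hypothetical dispatch table; a real one would have many entries. */
struct gl_dispatch {
    void (*Begin)(int mode);
};

/* Performance-critical pointer: ask for the initial-exec model so the
 * access is a thread-pointer-relative load (plus a GOT load under PIC). */
static __thread struct gl_dispatch *current_dispatch
    __attribute__((tls_model("initial-exec")));

/* Rarely used variable: leave it on the compiler's default model
 * (whatever -ftls-model selects), which always works with dlopen. */
static __thread int last_error;

void set_dispatch(struct gl_dispatch *d) { current_dispatch = d; }
struct gl_dispatch *get_dispatch(void)   { return current_dispatch; }

void record_error(int code) { last_error = code; }
int  fetch_error(void)      { return last_error; }
```

Only the single pointer is placed in static TLS here, so the amount of surplus it consumes is one pointer per thread.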
Another issue I forgot to mention, or forgot to make clear, is that we need to be able to access these thread-local variables in runtime generated code. A driver's top-level API functions are often generated at runtime, and need to be able to do things like switch dispatch tables (obviously, they'd have direct access to the context they were associated with, and so wouldn't need to go through the pointer in TLS). Are we guaranteed that the __thread variables aren't going to move around? How would we work out what code to generate to access a given __thread variable?
I don't see a problem, but you'd have to do some serious reading of the TLS ABI documents.... they're quite thorough.
On Sat, Feb 22, 2003 at 09:51:26AM -0800, Gareth Hughes wrote:
This is what I was talking about. I've read the entire document several times, and still can't see a way that a dynamically loadable shared library can be guaranteed to use the single-instruction Local Exec access model. If I'm wrong, please explain why.
There are no guarantees, but glibc reserves a few bytes for dlopened PT_TLS segments in each thread's initial TLS block (this was added with GL in mind).
When compiling PIC, IE-model accesses have one additional indirection, i.e. loading the offset from the GOT just as the address of a global variable is loaded in PIC. See the instruction sequences in the TLS spec.
I'm pretty sure all implementations of OpenGL are not compiled as PIC at this point in time.
AFAIK on x86 only, but it is wrong everywhere.
If you use static linking, these instruction sequences reduce to constants at link time (i.e. direct "%gs:NNN" accesses on x86).
Can you describe how I could use static linking here? As I said, libGL.so must be a dynamically loadable shared library. What we want is the single-instruction Local Exec access model. At this point in time, my understanding of the situation is that these are mutually exclusive requirements.
On x86, ld supports the Local Exec model in shared libraries (for most other targets it does not). During -shared linking, the R_386_TLS_LE relocation is simply changed into an R_386_TLS_TPOFF dynamic relocation (the same one used for the IE model, except that there the reloc is against the .got section, while for LE it is against the text section). So if you don't use -fpic anyway, you can just use the LE model on IA-32. If you eventually change things so that -fpic is used for the whole library, those functions (or assembly stubs) can be put into some SHF_ALLOC|SHF_WRITE|SHF_EXECINSTR section.
I think the set of performance critical thread-local variables is something like two or three (depending on the implementation). The libGL.so API dispatcher needs fast access to one or two of these (dispatch table pointers), while the driver backend needs fast access to all of them (context pointer and dispatch table pointers). The other thread-local variables are generally not accessed in performance-critical situations.
Which means you should use the default -ftls-model and use __attribute__((tls_model("initial-exec"))) or __attribute__((tls_model("local-exec"))) for the variables which are really performance critical.
Another issue I forgot to mention, or forgot to make clear, is that we need to be able to access these thread-local variables in runtime generated code. A driver's top-level API functions are often generated at runtime, and need to be able to do things like switch dispatch tables (obviously, they'd have direct access to the context they were associated with, and so wouldn't need to go through the pointer in TLS). Are we guaranteed that the __thread variables aren't going to move around? How would we work out what code to generate to access a given __thread variable?
On IA-32, you can use

    __asm ("jmp 1f; .section writetext, \"awx\"; 1: movl $foo@ntpoff, %0; jmp 2f; .previous; 2:" : "=r" (foo_offset));

to query a variable's offset, which you can later use with

    __asm ("movl %%gs:0(%1), %0" : "=r" (foo_value) : "r" (foo_offset));

Please do something like this only for runtime generated code, not for anything else.
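For the runtime code generator itself, once the offset is known it can be baked into the emitted instruction as a constant. The following is a minimal sketch, assuming 32-bit x86 and a variable that lives in static TLS (IE or LE model), so that its offset from %gs is fixed for the lifetime of the process. The function name is hypothetical; the byte values are the standard IA-32 encoding of `movl %gs:disp32, %eax` (GS override prefix 0x65, then opcode 0xA1 with a 32-bit absolute offset).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the 6-byte IA-32 instruction
 *     movl %gs:tp_offset, %eax
 * into buf. Returns the number of bytes written, or 0 if cap < 6.
 * tp_offset is the thread-pointer-relative offset of the variable,
 * typically negative on IA-32. */
size_t emit_load_tls_eax(uint8_t *buf, size_t cap, int32_t tp_offset)
{
    uint32_t disp = (uint32_t)tp_offset;   /* two's-complement bits */

    if (cap < 6)
        return 0;

    buf[0] = 0x65;                         /* GS segment override   */
    buf[1] = 0xA1;                         /* MOV EAX, moffs32      */
    buf[2] = (uint8_t)(disp & 0xff);       /* little-endian disp32  */
    buf[3] = (uint8_t)((disp >> 8) & 0xff);
    buf[4] = (uint8_t)((disp >> 16) & 0xff);
    buf[5] = (uint8_t)((disp >> 24) & 0xff);
    return 6;
}
```

The emitted bytes only make sense when executed in a 32-bit process on the thread whose data is wanted; the helper merely builds the encoding and can run anywhere.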
Jakub