Re: Fast thread-local storage for OpenGL drivers

22 Feb 2003


On Sat, Feb 22, 2003 at 09:51:26AM -0800, Gareth Hughes wrote:
...
This is what I was talking about.  I've read the entire document several
times, and still can't see a way that a dynamically loadable shared library
can be guaranteed to use the single-instruction Local Exec access model.  If
I'm wrong, please explain why.
There are no guarantees, but glibc reserves a few bytes for dlopened PT_TLS
segments in each thread's initial TLS block (this was added with GL in
mind).
...
...
When compiling PIC, IE-model accesses have one additional indirection,
i.e. loading the offset from the GOT just as the address of a global
variable is loaded in PIC.  See the instruction sequences in the TLS spec.
I'm pretty sure all implementations of OpenGL are not compiled as PIC at
this point in time.
AFAIK on x86 only, but it is wrong everywhere.
...
...
If you use static linking, these instruction sequences reduce to constants
at link time (i.e. direct "%gs:NNN" accesses on x86).
Can you describe how I could use static linking here?  As I said, libGL.so
must be a dynamically loadable shared library.  What we want is the
single-instruction Local Exec access model.  At this point in time, my
understanding of the situation is that these are mutually exclusive
requirements.
On x86, ld supports the Local Exec model in shared libraries (for most
other targets it does not). During -shared linking, the R_386_TLS_LE
relocation is simply changed into an R_386_TLS_TPOFF dynamic relocation (the
same one used for the IE model, except that there the reloc is against the
.got section, while for LE it is against the text section).
So, if you don't use -fpic anyway, you can just use the LE model on IA-32.
If you eventually change things so that -fpic is used for the whole library,
then those functions (or assembly stubs) can be put into
some SHF_ALLOC|SHF_WRITE|SHF_EXECINSTR section.
...
I think the set of performance critical thread-local variables is something
like two or three (depending on the implementation).  The libGL.so API
dispatcher needs fast access to one or two of these (dispatch table
pointers), while the driver backend needs fast access to all of them
(context pointer and dispatch table pointers).  The other thread-local
variables are generally not accessed in performance-critical situations.
Which means you should use the default -ftls-model and use
__attribute__((tls_model("initial-exec"))) or
__attribute__((tls_model("local-exec")))
for the variables which are really performance critical.
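A minimal sketch of that split, assuming GCC's `__thread` keyword. The type names, variable names, and helper functions below are hypothetical and not taken from any real driver; only the two hot pointers are pinned to a specific TLS model, and the rarely used variable is left to the default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; a real driver would define its own. */
struct gl_context  { int id; };
struct gl_dispatch { void (*Begin)(unsigned mode); };

/* Hot variables: pinned to initial-exec, so an access is a
   thread-pointer-relative load (plus a GOT load of the offset
   when the code is compiled with -fpic). */
__thread struct gl_context *current_ctx
    __attribute__((tls_model("initial-exec")));
__thread struct gl_dispatch *current_dispatch
    __attribute__((tls_model("initial-exec")));

/* Cold variable: uses whatever the default -ftls-model selects. */
__thread int last_error;

/* Bind a context and dispatch table to the calling thread. */
void make_current(struct gl_context *ctx, struct gl_dispatch *table)
{
    assert(table != NULL);
    current_ctx = ctx;
    current_dispatch = table;
}

/* What an API dispatcher stub would read on every call. */
struct gl_dispatch *get_dispatch(void)
{
    return current_dispatch;
}
```

Swapping "initial-exec" for "local-exec" in the attribute is the only source change needed to try the other model; whether the linker accepts local-exec in a shared library depends on the target, as described above.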
...
Another issue I forgot to mention, or forgot to make clear, is that we need
to be able to access these thread-local variables in runtime generated code.
A driver's top-level API functions are often generated at runtime, and need
to be able to do things like switch dispatch tables (obviously, they'd have
direct access to the context they were associated with, and so wouldn't need
to go through the pointer in TLS).  Are we guaranteed that the __thread
variables aren't going to move around?  How would we work out what code to
generate to access a given __thread variable?
On IA-32, you can use
__asm ("jmp 1f; .section writetext, \"awx\"; 1: movl $foo@ntpoff, %0; jmp 2f; .previous; 2:" : "=r" (foo_offset));
to query a variable's offset, which you can later use with:
__asm ("movl %%gs:0(%1), %0" : "=r" (foo_value) : "r" (foo_offset));
Please do something like this only for runtime generated code, not for
anything else.
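As a rough cross-check of the offset idea, the sketch below does not use the @ntpoff relocation. It swaps in a different technique: it subtracts the thread pointer from the variable's address at run time. It assumes a glibc-style layout in which the word at %gs:0 (IA-32) or %fs:0 (x86-64) holds the thread pointer itself, and that `foo` lives in the static TLS block so the offset is the same in every thread. The function names are hypothetical, and the non-x86 branch relies on a compiler builtin that not every compiler provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The variable whose offset runtime-generated code would embed. */
__thread int foo __attribute__((tls_model("initial-exec")));

/* Read the calling thread's thread pointer.
   Assumption: glibc stores the thread pointer at offset 0 of the
   segment (%gs on IA-32, %fs on x86-64). */
uintptr_t thread_pointer(void)
{
    uintptr_t tp;
#if defined(__x86_64__)
    __asm__ ("movq %%fs:0, %0" : "=r" (tp));
#elif defined(__i386__)
    __asm__ ("movl %%gs:0, %0" : "=r" (tp));
#else
    /* Assumption: the compiler provides this builtin on the target. */
    tp = (uintptr_t) __builtin_thread_pointer();
#endif
    return tp;
}

/* Offset of foo relative to the thread pointer. */
ptrdiff_t foo_tp_offset(void)
{
    return (ptrdiff_t) ((uintptr_t) &foo - thread_pointer());
}

/* What generated code would do: thread pointer + constant offset. */
int read_foo_at(ptrdiff_t off)
{
    return *(int *) (thread_pointer() + (uintptr_t) off);
}

/* Check that the offset-based read sees the same value as foo. */
void check_roundtrip(void)
{
    foo = 42;
    assert(read_foo_at(foo_tp_offset()) == 42);
}
```

A code generator would call `foo_tp_offset()` once and bake the result into the emitted instruction as a constant displacement.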
Jakub
