Re: Fast thread-local storage for OpenGL drivers

22 Feb 2003


On Sat, Feb 22, 2003 at 09:51:26AM -0800, Gareth Hughes wrote:
...
Roland McGrath wrote:
...
These people clearly haven't read all of the TLS paper, or looked at the
GCC implementation of __thread long enough to notice -ftls-model and
__attribute__ ((tls_model)).
This is what I was talking about.  I've read the entire document several
times, and still can't see a way that a dynamically loadable shared library
can be guaranteed to use the single-instruction Local Exec access model.  If
I'm wrong, please explain why.
...
I think the TLS document intends to explain what the models mean in
practical terms on each architecture, but I can believe it's not all
that clear.  The GCC manual doesn't explain the access models and code
sequences, just tells you how to tell the compiler what you want in the
terms that the TLS document defines.
If you want maximal flexibility, i.e. to always work with dlopen, then
indeed you must use the "dynamic" TLS access models (GD or LD).  You can
use the Initial Exec model if you want faster accesses at the cost of some
flexibility.
libGL.so simply has to work with dlopen -- if for no other reason than
essentially all major 3D games (Quake3, Doom3, UT2003, etc.) dlopen libGL.so
rather than linking with it.  This is not going to change.
Note the "always" in Roland's paragraph.
...
...
In glibc, we actually allocate some excess space in the thread-local
storage area layout determined at startup time.  This lets a dynamically
loaded module use static TLS if its PT_TLS segment fits in the available
surplus.  (In sysdeps/generic/dl-tls.c, see TLS_STATIC_SURPLUS.)  If there
is insufficient space preallocated, then loading the module will fail.  In
fact, we put this feature there with GL in mind and can adjust the
preallocated surplus for what is most useful in practice.
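From the application's side, running out of static TLS surplus simply shows
up as a failed dlopen.  A minimal sketch of how a caller might surface that,
assuming the standard dlopen/dlerror interface; the helper name load_module
is made up for illustration, and the exact error text is up to the loader
(glibc typically says something like "cannot allocate memory in static TLS
block"):

```c
#include <dlfcn.h>
#include <stdio.h>

/* Try to load a shared object.  Returns the handle, or NULL after printing
 * the loader's own explanation.  load_module is a hypothetical helper name,
 * not part of any library. */
void *load_module(const char *path)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        /* dlerror() describes the most recent failure; a module whose
         * initial-exec TLS does not fit in the surplus fails here. */
        const char *msg = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path, msg ? msg : "unknown error");
    }
    return handle;
}
```

On older glibc versions this needs -ldl at link time.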
I think the set of performance critical thread-local variables is something
like two or three (depending on the implementation).  The libGL.so API
dispatcher needs fast access to one or two of these (dispatch table
pointers), while the driver backend needs fast access to all of them
(context pointer and dispatch table pointers).  The other thread-local
variables are generally not accessed in performance-critical situations.
When you say two or three, are these two or three pointers or two or
three large tables?
In any case, it sounds like you could:
 - Select the thread-local variables that you need fast access to.
 - Arrange for those variables to be tagged with
   __attribute__((tls_model("initial-exec"))), or something similar.
 - Make sure TLS_STATIC_SURPLUS is big enough to hold them.
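The steps above might look roughly like this in a driver source file.  This
is only a sketch, assuming GCC's __thread and tls_model attribute; every
name in it (gl_context, gl_dispatch, cur_ctx, cur_disp, last_error,
set_current, get_dispatch) is hypothetical and not taken from any real
libGL:

```c
#include <stddef.h>

struct gl_context;    /* hypothetical opaque driver types */
struct gl_dispatch;

/* Hot variables: initial-exec, so each access is a short thread-pointer
 * relative load.  If the library is dlopen'ed, these must fit in the
 * static TLS surplus. */
__thread struct gl_context  *cur_ctx
    __attribute__((tls_model("initial-exec")));
__thread struct gl_dispatch *cur_disp
    __attribute__((tls_model("initial-exec")));

/* Cold variable: left at the compiler's default model, which is
 * global-dynamic when building with -fPIC and always works with dlopen. */
__thread int last_error;

/* Bind a context and its dispatch table to the calling thread. */
void set_current(struct gl_context *ctx, struct gl_dispatch *disp)
{
    cur_ctx = ctx;
    cur_disp = disp;
}

/* What an API dispatcher would call on every entry point. */
struct gl_dispatch *get_dispatch(void)
{
    return cur_disp;
}
```

If the hot set really is two or three pointers, it adds only a couple of
dozen bytes to the module's PT_TLS segment, which is the quantity that has
to fit in the surplus.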
...
Another issue I forgot to mention, or forgot to make clear, is that we need
to be able to access these thread-local variables in runtime generated code.
A driver's top-level API functions are often generated at runtime, and need
to be able to do things like switch dispatch tables (obviously, they'd have
direct access to the context they were associated with, and so wouldn't need
to go through the pointer in TLS).  Are we guaranteed that the __thread
variables aren't going to move around?  How would we work out what code to
generate to access a given __thread variable?
I don't see a problem, but you'd have to do some serious reading of the
TLS ABI documents; they're quite thorough.
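As a rough illustration of the idea: for a variable in the static TLS block
(Initial Exec or Local Exec), the offset from the thread pointer is fixed
once the module is loaded and is the same in every thread, so a code
generator could compute it once and bake it into the emitted instructions.
The sketch below assumes GCC on Linux; the x86-64 branch relies on %fs:0
holding the thread pointer, the other branch on __builtin_thread_pointer()
being available for the target.  The names current_dispatch,
thread_pointer, dispatch_tp_offset and read_tls_at are invented for this
example:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hot variable, forced into the static TLS block. */
__thread void *current_dispatch
    __attribute__((tls_model("initial-exec")));

/* Return the calling thread's thread pointer. */
uintptr_t thread_pointer(void)
{
#if defined(__x86_64__) && defined(__linux__)
    uintptr_t tp;
    /* On x86-64 Linux the word at %fs:0 is the thread pointer itself. */
    __asm__ ("movq %%fs:0, %0" : "=r"(tp));
    return tp;
#else
    /* Assumption: the compiler provides this builtin for the target. */
    return (uintptr_t)__builtin_thread_pointer();
#endif
}

/* Offset of current_dispatch from the thread pointer (negative on x86-64).
 * A runtime code generator would compute this once and embed it. */
ptrdiff_t dispatch_tp_offset(void)
{
    return (ptrdiff_t)((uintptr_t)&current_dispatch - thread_pointer());
}

/* What the generated code would do: load through thread pointer + offset. */
void *read_tls_at(ptrdiff_t offset)
{
    return *(void **)(thread_pointer() + (uintptr_t)offset);
}
```

On x86-64 the generated code would not call a helper at all; it would emit
a single %fs-relative load or store using the precomputed offset.  None of
this applies to variables accessed through the dynamic models, whose blocks
may be allocated lazily per thread.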
-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer
