It is critically important for OpenGL drivers to have fast (single-instruction) access to thread local variables. I'd be happy to provide more information to anyone who's interested, but a typical case where TLS access can severely hurt performance is at the very front-end of an OpenGL library. Ideally, you'd like something like the following:
libGL.so:

    // This function loads a dispatch table pointer from
    // thread-local storage and jumps through to the
    // backend function (which typically resides in a
    // different shared library).
    glTexCoord2f:
        mov %fs:DISPATCH_TABLE_OFFSET, %eax
        jmp *__glapi_TexCoord2f(%eax)    // Points to __my_TexCoord2f

libGLcore.so:

    // This function copies some data into the OpenGL
    // context, sets some magic flags to record what data
    // was copied, and returns.
    __my_TexCoord2f:
        mov %fs:CONTEXT_OFFSET, %eax
        // Copy 2 floats into the context
        // Set a flag
        ret
All in all, you have two TLS accesses in fewer than ten instructions or so. Even if you don't understand exactly what's going on here, you can see that it is important to have fast access to thread-local data.
While glibc's new thread library implementation has many benefits, particularly to application programmers (with support for the new keyword '__thread', and so on), it basically forces a function call per thread local variable access for situations like the one I described above. This is clearly unacceptable for a high-performance OpenGL driver. Furthermore, the glibc developers have been completely unwilling to work with OpenGL driver developers (Open Source or otherwise) to provide a mechanism to access thread-local data in a way that meets our performance requirements.
Therefore, I'd like to propose a solution where Wine and the OpenGL driver cooperate to provide such a TLS access mechanism (at least on x86 platforms). Wine currently uses %fs to access the Windows Thread Environment Block (TEB), while glibc uses %gs to access its per-thread data. With the following patch to Wine's TEB structure:
--- include/thread.h 2002-12-17 16:06:25.000000000 -0500
+++ include/thread.h.new 2003-02-21 14:27:50.000000000 -0500
@@ -116,10 +116,12 @@
     DWORD        alarms;                /* --3 22c Data for vm86 mode */
     DWORD        vm86_pending;          /* --3 230 Data for vm86 mode */
     void        *vm86_ptr;              /* --3 234 Data for vm86 mode */
+    /* here is plenty space for wine specific fields (don't forget to change pad6!!) */
+    DWORD        pad6[608];             /* --n 238 */
+    DWORD        ogl_data[16];          /* --n bb8 OpenGL driver private data */

     /* the following are nt specific fields */
-    DWORD        pad6[624];             /* --n 238 */
     UNICODE_STRING StaticUnicodeString; /* -2- bf8 used by advapi32 */
     USHORT       StaticUnicodeBuffer[261]; /* -2- c00 used by advapi32 */
     void        *stack_base;            /* -2- e0c Base of the stack */
we reserve %fs:0xbb8 to %fs:0xbf8 for use by the OpenGL driver. Any and all OpenGL implementations can use this area, and we agree that when Wine is present, it leaves this area untouched. The question of who allocates the TEB should be pretty straightforward: when an OpenGL driver is first loaded, if the TEB is missing it is allocated as expected. I would imagine that when Wine is running, it would have the chance to allocate the TEB before the OpenGL driver is loaded, and thus the OpenGL driver wouldn't have to do anything. The size of the reserved area should be sufficient, although we can debate that if required.
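The offsets in the patch can be checked with a few lines of C. This is only a sketch, assuming a 4-byte DWORD; the struct name teb_sketch, the opaque head array, and the helper ogl_area_size are hypothetical stand-ins, and only pad6, ogl_data, and their offsets come from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patched TEB; NOT Wine's real definition.
 * Everything before pad6 is collapsed into one opaque byte array. */
struct teb_sketch {
    uint8_t  head[0x238];   /* fields up to and including vm86_ptr */
    uint32_t pad6[608];     /* 0x238: padding, shrunk from 624 entries */
    uint32_t ogl_data[16];  /* 0xbb8: OpenGL driver private data */
    /* 0xbf8: StaticUnicodeString would follow here in the real TEB */
};

/* Compile-time checks that the layout matches the offsets in the patch. */
_Static_assert(offsetof(struct teb_sketch, pad6) == 0x238, "pad6 offset");
_Static_assert(offsetof(struct teb_sketch, ogl_data) == 0xbb8, "ogl_data offset");
_Static_assert(sizeof(struct teb_sketch) == 0xbf8, "reserved area ends at 0xbf8");

/* Size in bytes of the area reserved for the OpenGL driver. */
size_t ogl_area_size(void)
{
    return sizeof(((struct teb_sketch *)0)->ogl_data);
}
```

Shrinking pad6 by 16 entries frees exactly the 64 bytes taken by ogl_data, so the NT-specific fields that follow keep their existing offsets.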
Comments, questions are welcome. I've CC'ed Brian Paul and Keith Whitwell of Mesa/DRI fame, as I know they are interested in this issue. Please CC us on any replies, as we are not subscribed to the list.
-- Gareth Hughes (gareth@nvidia.com) OpenGL Developer, NVIDIA Corporation
Gareth Hughes wrote:
> It is critically important for OpenGL drivers to have fast (single-instruction) access to thread local variables.
> ...
> While glibc's new thread library implementation has many benefits, particularly to application programmers (with support for the new keyword '__thread', and so on), it basically forces a function call per thread local variable access for situations like the one I described above.
> ...
> Comments, questions are welcome.
Hi Gareth, I forwarded your note to the NPTL mailing list. Roland McGrath replied, and suggests that you might want to reread the TLS paper (I think he's referring to http://people.redhat.com/drepper/nptl-design.pdf, which appears to be offline at the moment); also look at the GCC implementation of __thread, and note -ftls-model and __attribute__ ((tls_model)).
If that didn't make any sense, it's probably because I mangled Roland's words; you might want to ask him what he meant. - Dan
Dan Kegel wrote:
> Gareth Hughes wrote:
>> It is critically important for OpenGL drivers to have fast (single-instruction) access to thread local variables.
>> ...
>> While glibc's new thread library implementation has many benefits, particularly to application programmers (with support for the new keyword '__thread', and so on), it basically forces a function call per thread local variable access for situations like the one I described above.
>> ...
>> Comments, questions are welcome.
>
> Hi Gareth, I forwarded your note to the NPTL mailing list. Roland McGrath replied, and suggests that you might want to reread the TLS paper (I think he's referring to http://people.redhat.com/drepper/nptl-design.pdf, which appears to be offline at the moment) ...
I've been corrected. The two TLS documents are http://www.imodulo.com/gnu/gcc/Thread-Local.html and http://people.redhat.com/drepper/tls.pdf
There have been a few good replies on the NPTL list; see e.g. https://listman.redhat.com/pipermail/phil-list/2003-February/000615.html in which Roland expands a bit on his first post. - Dan
Dan Kegel wrote:
> There have been a few good replies on the NPTL list; see e.g. https://listman.redhat.com/pipermail/phil-list/2003-February/000615.html in which Roland expands a bit on his first post.
Here's the upshot of Roland's post:
OpenGL apps *can* avoid the function call Gareth was worried about. See section 4.3, "Initial Exec TLS Model", in http://people.redhat.com/drepper/tls.pdf and the code sequences on pages 34-37. As Roland points out, this requires glibc to preallocate a little extra space, but they planned for this -- in fact, they had OpenGL in mind when they did it.
It does look like the TLS model does what you want it to, and no new methods are needed. Can you explain in more detail why your new proposal is needed, if you still think it is?
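For readers unfamiliar with the GCC side of this, here is a minimal sketch of what the initial-exec approach might look like in C. It assumes GCC's __thread keyword and tls_model variable attribute; the names dispatch_table, current_dispatch, set_dispatch, sketch_TexCoord2f, and my_TexCoord2f are hypothetical and are not taken from any real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dispatch table with a single entry, for illustration only. */
struct dispatch_table {
    void (*TexCoord2f)(float s, float t);
};

/* Per-thread dispatch pointer. With the initial-exec model the compiler can
 * address it at an offset from the thread pointer instead of emitting a
 * call to __tls_get_addr(). */
__thread struct dispatch_table *current_dispatch
    __attribute__((tls_model("initial-exec")));

/* Stand-in for per-thread context state written by the backend. */
static __thread float last_texcoord[2];

/* Hypothetical backend: copy two floats into the per-thread context. */
static void my_TexCoord2f(float s, float t)
{
    last_texcoord[0] = s;
    last_texcoord[1] = t;
}

static struct dispatch_table my_table = { my_TexCoord2f };

/* Install a dispatch table for the calling thread. */
void set_dispatch(struct dispatch_table *table)
{
    current_dispatch = table;
}

/* Front-end entry point: load the per-thread table and jump through it. */
void sketch_TexCoord2f(float s, float t)
{
    current_dispatch->TexCoord2f(s, t);
}
```

A thread would call set_dispatch(&my_table) once and then sketch_TexCoord2f(0.25f, 0.5f). Whether the generated code matches the two-instruction sequence from the original post depends on the compiler, the flags, and how the library is loaded, so it is worth inspecting the assembly output rather than assuming it.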
Thanks, Dan