Hi Ken,
I've attached the hack, but let me explain the theory first. ;)
After checking a couple of 64-bit applications (VLC, Teamspeak 3, ...), I found that they mostly use a single TEB field, %gs:0x30 (Teb->Tib.Self). My theory is that they use the NtCurrentTeb() inline function from the MS header files. As you are aware, simply changing the %gs segment is not possible, but to get these applications working we only need to set %gs:0x30 correctly.
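To make the offset concrete, here is a minimal sketch of why Teb->Tib.Self ends up at 0x30 on 64-bit. The struct is a hypothetical stand-in that mirrors the field order of the 64-bit NT_TIB (it is not the MS header definition), and demo_read_gs_0x30() only emulates the segment-relative load against an ordinary buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the 64-bit NT_TIB field order (names are illustrative). */
struct demo_tib64 {
    uint64_t exception_list;         /* 0x00 */
    uint64_t stack_base;             /* 0x08 */
    uint64_t stack_limit;            /* 0x10 */
    uint64_t sub_system_tib;         /* 0x18 */
    uint64_t fiber_data;             /* 0x20 */
    uint64_t arbitrary_user_pointer; /* 0x28 */
    uint64_t self;                   /* 0x30: pointer back to the TIB/TEB itself */
};

/* Six 8-byte fields precede Self, so it must land at offset 0x30. */
static_assert(offsetof(struct demo_tib64, self) == 0x30,
              "Self is expected at offset 0x30");

/* Emulates a load from %gs:0x30: read 8 bytes at offset 0x30 from a given base. */
static uint64_t demo_read_gs_0x30(const void *segment_base)
{
    uint64_t value;
    memcpy(&value, (const unsigned char *)segment_base + 0x30, sizeof value);
    return value;
}
```

An application that only ever performs this one load gets a usable TEB pointer as long as the 8 bytes at that offset hold the right value, whatever the rest of the segment contains.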
I looked through the last public pthread implementation on OS X and saw that %gs:0x30 collides with TLS field 6 (0x30 / 8 bytes per field). In fact the first 256 fields are reserved for system use, but there is some documentation available on how those fields are used (http://opensource.apple.com/source/Libc/Libc-825.40.1/pthreads/pthread_machd...). As you can see, fields 1-9 are reserved for dyld, but only fields 1, 2, 3 and 8 are currently in use. Fields 4-7 are reserved for future usage. I therefore looked through the dyld source code to see if those fields might be in use now. To my surprise, current versions of dyld don't seem to use hardcoded TLS indices any more. For example, field 8 was reserved for Unwind_SjLj, and when you take a look at the current source code (http://opensource.apple.com/source/dyld/dyld-360.18/src/dyldAPIs.cpp - function at line 1261) you see that the key is created dynamically using pthread_key_create, which only returns fields > 256. Moreover, all the code uses pthread_setspecific, which does not allow setting fields < 10. There is one exception: when a library requests TLS storage, the values are changed directly through the gs segment, but even in this case the keys are allocated dynamically using pthread_key_create. So far I have found no evidence (neither in source code nor in tests) that those fields are actually used or that the pthread implementation changed in this regard. I therefore assume that overwriting this single field should cause no harm, and I gave the hack a try.
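The slot arithmetic and the reservation ranges described above can be sketched as follows. All names are hypothetical, the 8-byte slot size assumes x86_64, and the ranges simply encode what I read in the header (with "first 256 fields reserved" taken to mean slots 0-255), not anything queried from the system:

```c
#include <assert.h>

enum {
    DEMO_SLOT_SIZE = 8,          /* one pointer-sized TLS slot on x86_64 */
    DEMO_FIRST_DYNAMIC = 256     /* assumption: slots 0-255 are reserved for system use */
};

enum demo_slot_kind {
    DEMO_SLOT_DYLD_IN_USE,       /* fields 1, 2, 3 and 8 */
    DEMO_SLOT_DYLD_RESERVED,     /* remaining fields in the dyld range 1-9 */
    DEMO_SLOT_SYSTEM,            /* other reserved system slots */
    DEMO_SLOT_DYNAMIC            /* handed out by pthread_key_create */
};

/* Convert a %gs-relative byte offset into a TLS slot index. */
static unsigned demo_slot_for_offset(unsigned offset)
{
    assert(offset % DEMO_SLOT_SIZE == 0);
    return offset / DEMO_SLOT_SIZE;
}

/* Classify a slot according to the ranges described in the mail. */
static enum demo_slot_kind demo_classify_slot(unsigned slot)
{
    if (slot == 1 || slot == 2 || slot == 3 || slot == 8)
        return DEMO_SLOT_DYLD_IN_USE;
    if (slot >= 1 && slot <= 9)
        return DEMO_SLOT_DYLD_RESERVED;
    if (slot < DEMO_FIRST_DYNAMIC)
        return DEMO_SLOT_SYSTEM;
    return DEMO_SLOT_DYNAMIC;
}
```

With these definitions, demo_slot_for_offset(0x30) gives slot 6, which falls into the reserved-but-unused dyld range rather than one of the slots dyld is documented to use.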
Anyway, I only spent about an hour yesterday investigating this and I may have overlooked something. You are of course welcome to do your own investigation. If my theory holds, we might also be able to optimize Wine's NtCurrentTeb() again by reading %gs:0x30 directly instead of calling the pthread_getspecific function.
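A rough sketch of what that optimization could look like, purely for illustration. This is not Wine's actual code: struct demo_teb, teb_key, the demo_* functions and the DEMO_GS_SELF_HACK macro are all hypothetical, and the fast path is only valid under the assumption that something has already stored the TEB pointer at %gs:0x30 for the current thread:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the TEB; only the self pointer matters here. */
struct demo_teb { void *self; };

static pthread_key_t teb_key;                        /* hypothetical key */
static pthread_once_t teb_key_once = PTHREAD_ONCE_INIT;

static void demo_create_teb_key(void)
{
    int rc = pthread_key_create(&teb_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Register the TEB for the calling thread through the pthread API. */
static int demo_set_teb(struct demo_teb *teb)
{
    pthread_once(&teb_key_once, demo_create_teb_key);
    teb->self = teb;
    return pthread_setspecific(teb_key, teb);
}

static inline struct demo_teb *demo_current_teb(void)
{
#if defined(DEMO_GS_SELF_HACK) && defined(__APPLE__) && defined(__x86_64__)
    /* Fast path: a single load. Assumes the hack has filled in %gs:0x30. */
    struct demo_teb *teb;
    __asm__("movq %%gs:0x30, %0" : "=r"(teb));
    return teb;
#else
    /* Portable path: look the pointer up through the pthread library. */
    pthread_once(&teb_key_once, demo_create_teb_key);
    return pthread_getspecific(teb_key);
#endif
}
```

Without DEMO_GS_SELF_HACK defined, the sketch always takes the pthread_getspecific path, so it behaves the same on any POSIX system; the asm branch only shows what the single-instruction variant would replace.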
Regards, Michael