What I would try to do (and maybe one day I'll have time to actually try it), instead of sidestepping pthreads calling cleanup handlers, is to arrange things so that pthreads can actually call the cleanup handlers and they do the right thing. I still have to think about this more, but I guess it should be attainable if we arrange the syscall stack so that each userspace entry/exit pair looks like a usual frame (maybe the task of cleaning it up could itself be delegated to a cleanup handler, if that is beneficial).
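To make the "let pthreads run the cleanup handlers" idea concrete, here is a minimal sketch of the plain pthreads mechanism, independent of our code. All names (`demo_frame`, `demo_frame_cleanup`, `demo_cancel_runs_cleanup`) are hypothetical and only stand in for whatever per-frame state we would really need to tear down. It assumes Linux/glibc with the default deferred cancellation.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for per-frame state; not an existing structure. */
struct demo_frame {
    int torn_down;              /* set to 1 by the cleanup handler */
};

static sem_t demo_ready;

/* Called by pthreads while it unwinds the cancelled thread. */
static void demo_frame_cleanup(void *arg)
{
    struct demo_frame *frame = arg;
    frame->torn_down = 1;
}

static void *demo_thread(void *arg)
{
    pthread_cleanup_push(demo_frame_cleanup, arg);
    sem_post(&demo_ready);      /* handler is registered from here on */
    for (;;)
        pause();                /* cancellation point */
    pthread_cleanup_pop(0);     /* never reached; must lexically pair with push */
    return NULL;
}

/* Returns 1 if the handler ran and the thread was reported as cancelled. */
int demo_cancel_runs_cleanup(void)
{
    struct demo_frame frame = { 0 };
    pthread_t thread;
    void *result = NULL;
    int rc;

    rc = sem_init(&demo_ready, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&thread, NULL, demo_thread, &frame);
    assert(rc == 0);

    while (sem_wait(&demo_ready) != 0)
        ;                       /* retry if interrupted by a signal */

    rc = pthread_cancel(thread);
    assert(rc == 0);
    rc = pthread_join(thread, &result);
    assert(rc == 0);
    sem_destroy(&demo_ready);

    return frame.torn_down && result == PTHREAD_CANCELED;
}
```

The point is only that pthreads itself walks the handler list on cancellation (and on `pthread_exit`), so if each userspace entry/exit pair had a handler registered for it, the teardown would happen in the right order without us sidestepping anything. Whether our frames can be made to look "usual" enough for the unwinder is the part I have not checked.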
I think your proposal to avoid using the initial thread and only run Windows code in threads created by us would make everything easier. For one thing, all threads would have the same stack layout (i.e., each would have a stack allocated and chosen by us even from the point of view of pthreads; right now, instead, for the main thread we pivot to a different stack without telling pthreads, which I think is why we need `exit_frame`).
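As a rough illustration of what "a stack allocated and chosen by us even from the point of view of pthreads" could look like, here is a small sketch using `pthread_attr_setstack`. The names (`demo_stack_range`, `demo_entry`, `demo_thread_on_own_stack`) and the 1 MiB size are made up for the example, and it assumes Linux with anonymous `mmap`. It is not how our thread creation is currently written.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define DEMO_STACK_SIZE (1024 * 1024)   /* arbitrary size for the example */

/* Hypothetical bookkeeping: the range we allocated, plus the check result. */
struct demo_stack_range {
    char  *base;
    size_t size;
    int    local_inside;        /* 1 if a thread local lives in [base, base+size) */
};

static void *demo_entry(void *arg)
{
    struct demo_stack_range *range = arg;
    char local;
    uintptr_t addr = (uintptr_t)&local;
    uintptr_t low  = (uintptr_t)range->base;

    range->local_inside = addr >= low && addr < low + range->size;
    return NULL;
}

/* Returns 1 if the new thread really ran on the stack we handed to pthreads. */
int demo_thread_on_own_stack(void)
{
    struct demo_stack_range range = { 0 };
    pthread_attr_t attr;
    pthread_t thread;
    int rc;

    range.size = DEMO_STACK_SIZE;
    range.base = mmap(NULL, range.size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (range.base == MAP_FAILED)
        return -1;

    rc = pthread_attr_init(&attr);
    assert(rc == 0);
    /* Tell pthreads about the stack up front instead of pivoting later. */
    rc = pthread_attr_setstack(&attr, range.base, range.size);
    assert(rc == 0);
    rc = pthread_create(&thread, &attr, demo_entry, &range);
    assert(rc == 0);
    rc = pthread_join(thread, NULL);
    assert(rc == 0);

    pthread_attr_destroy(&attr);
    munmap(range.base, range.size);
    return range.local_inside;
}
```

Since pthreads is told about the stack at creation time, its idea of the stack bounds matches ours for every thread, which is exactly what we cannot get for the initial thread today.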