Technically, we don't need any custom PE CSPs to load the host root store. We might still need custom CSPs for the CRYPT_RegReadSerializedFromReg() part, but I'm not immediately sure that we do. If we don't, it would be interesting to find out where exactly we end up loading those custom CSPs, and maybe we can stop doing that. If we do need those custom CSPs for some certs serialized in the registry, this gets complicated, and I can't theorize about it without a deeper look.
Regarding the solution that addresses exactly the deadlock by avoiding recursion: the named semaphore is there for a reason, it synchronizes things across processes. Adding a critical section on top is weird, and I'd also advise against using sync object internals. This could probably be done by using CreateMutex instead of the semaphore (those mutexes are re-entrant) and adding a variable that holds the id of the thread which took the lock. After taking the mutex, check that variable; if it is the same thread, just release the mutex and exit the function.
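To make the mutex idea concrete, here is a minimal sketch of what I mean, not a patch. All names (`sync_root_store`, `store_mutex`, the mutex name string, the helper wrappers) are hypothetical and not taken from the existing code, and the `depth` parameter only simulates the nested call. The non-Windows branch is a stand-in that uses a recursive pthread mutex (process-local, unlike the named mutex) so the logic can also be built outside Windows.

```c
#ifdef _WIN32
#include <windows.h>
typedef DWORD thread_id;
static HANDLE store_mutex;
static void lock_init(void)
{
    /* Hypothetical name; a named mutex is shared across processes,
     * like the named semaphore it would replace. */
    store_mutex = CreateMutexW(NULL, FALSE, L"example_root_store_sync");
}
/* The return value (e.g. WAIT_ABANDONED) is ignored in this sketch. */
static void lock_acquire(void) { WaitForSingleObject(store_mutex, INFINITE); }
static void lock_release(void) { ReleaseMutex(store_mutex); }
static thread_id current_thread(void) { return GetCurrentThreadId(); }
static int same_thread(thread_id a, thread_id b) { return a == b; }
#else
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <pthread.h>
typedef pthread_t thread_id;
static pthread_mutex_t store_mutex;
static void lock_init(void)
{
    /* Stand-in for the Win32 mutex: re-entrant for the owning thread. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&store_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}
static void lock_acquire(void) { pthread_mutex_lock(&store_mutex); }
static void lock_release(void) { pthread_mutex_unlock(&store_mutex); }
static thread_id current_thread(void) { return pthread_self(); }
static int same_thread(thread_id a, thread_id b) { return pthread_equal(a, b); }
#endif

/* Per-process state, only read or written while store_mutex is held. */
static int have_owner;
static thread_id owner;
static int sync_count; /* how many times the real work ran */

static void sync_root_store(int depth)
{
    lock_acquire(); /* succeeds immediately if this thread already owns it */
    if (have_owner && same_thread(owner, current_thread()))
    {
        /* Re-entered from the same thread: drop the extra recursion
         * count taken above and return without doing the work again. */
        lock_release();
        return;
    }
    owner = current_thread();
    have_owner = 1;

    sync_count++; /* placeholder for the actual store synchronization */
    if (depth > 0)
        sync_root_store(depth - 1); /* simulates the recursive entry */

    have_owner = 0;
    lock_release();
}
```

Calling `lock_init()` once and then `sync_root_store(2)` runs the placeholder work a single time: the nested call sees its own thread recorded as owner and returns right after releasing the extra recursion count.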
With this ad-hoc solution I at least won't be whining that it is going to break everything, but I still don't see why we wouldn't at least explore a proper way. Perhaps we should hear another opinion; it would help if @hans could comment on whether a simplified solution for this specific use case looks reasonable to him.