Alexandre,
First off, I apologize for the size of this email; I'm trying to keep it as concise as possible.
I've been experimenting with ways to optimize synchronization objects and have implemented a promising proof of concept for semaphores using glibc's NPTL (POSIX) semaphore implementation. I posted revision 3 of this today, although I appear to have used the wrong message id in the --in-reply-to header. :( My goal is to eventually make similar optimizations for all synchronization objects, or at least those that have demonstrable performance problems.
The basic theory of operation is that when a client sends a create_semaphore request, the server creates a POSIX semaphore with a unique name, which it passes back to the client process so that the client can open it locally. This allows the client to perform ReleaseSemaphore without a server call, as well as WaitFor(Multiple|Single)Object(s) in the cases where the wait condition can be determined to be satisfied without a server call (i.e., either bWaitAll = FALSE and a signalled semaphore is found in the handle list before any non-semaphore object, or bWaitAll = TRUE and all handles are signalled semaphores). In all other cases, it falls back to a traditional server call.
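To make that concrete, here is roughly what the client-side fast path looks like. This is a simplified sketch, not the actual patch code: the helper names are made up, and maximum-count enforcement and error handling are omitted.

#include <semaphore.h>

/* Sketch of the client-side fast paths, assuming the server already
 * did sem_open(name, O_CREAT | O_EXCL, 0600, initial_count) and
 * returned the unique name in the create_semaphore reply. */
static sem_t *open_client_semaphore( const char *name )
{
    return sem_open( name, 0 );   /* map the server's semaphore locally */
}

/* ReleaseSemaphore fast path: post locally, no server round trip
 * (lMaximumCount enforcement omitted for brevity) */
static int release_semaphore_fast( sem_t *sem, long count )
{
    while (count--)
        if (sem_post( sem ) == -1) return 0;
    return 1;
}

/* Wait fast path: only succeeds if the semaphore is already
 * signalled; otherwise the caller falls back to a server call */
static int try_wait_fast( sem_t *sem )
{
    return sem_trywait( sem ) == 0;
}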
However, it has two problems:
1. It uses glibc's implementation of POSIX semaphores, which uses shared memory to share them with other processes, and
2. it uses glibc's implementation of POSIX semaphores, which are incompatible between 32- and 64-bit ABI processes.
I have not been able to find any more flaws in the case where both the program and the wineserver are the same ABI. All tests pass and I've added one more (although more tests are clearly needed). Since this implementation only uses sem_trywait (and never sem_wait or sem_timedwait), we don't really even need a full-featured semaphore -- a simple 32- or 16-bit number that's accessed atomically would suffice as a replacement. Although I had planned to eventually explore having a client program block without calling the server, the benefit of that is minimal compared to the benefit of being able to avoid the server call when releasing a semaphore and when "wait"ing on a semaphore that is already available.
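For illustration, a trywait over a bare shared counter could be as simple as this C11-atomics sketch (untested, and the names are mine):

#include <stdatomic.h>

/* Sketch of a trywait on a bare 32-bit counter in shared memory,
 * replacing sem_trywait: atomically decrement iff the count is > 0 */
static int shared_sem_trywait( atomic_int *count )
{
    int val = atomic_load( count );
    while (val > 0)
    {
        if (atomic_compare_exchange_weak( count, &val, val - 1 ))
            return 0;   /* got it */
        /* val was reloaded by the failed CAS; loop and retry */
    }
    return -1;  /* would block; fall back to a server call */
}

/* The matching release is a plain atomic increment */
static void shared_sem_release( atomic_int *count )
{
    atomic_fetch_add( count, 1 );
}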
So now I want to understand the minimum threshold of acceptability in wine for such a mechanism. We discussed this in chat and quite a bit and can I see many possibilities, each with its own particular issues. I'm listing them in order of my personal preference (most preferred first).
Option 1: Simple shared memory & roll our own semaphore

Similar to what glibc's NPTL semaphores are doing, except that we would only need a single integral value and not even a futex. The obvious downside is that a process can corrupt this memory and break other processes that also have semaphores in that page. This could be mitigated by giving every process its own page, shared only between the server and that process, unless a semaphore in that process is shared with another program, at which point the page would be shared with that process as well. Thus, the scope of possible corruption is limited to how far the object is shared.
In the worst case of memory corruption, the wineserver would either leave a thread of one of these processes hung, release one when it shouldn't be released, or determine that the memory is corrupted, issue an error message, set the last error to something appropriate and return WAIT_FAILED.
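For what it's worth, here is roughly how I picture the server setting up such a per-process page. This is only a sketch (shm_open/mmap; the function name is mine), and it assumes the fd is then passed to the client over the existing server socket, as wineserver already does for other fds, and mmap'ed there with the same flags.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: server-side creation of one page of semaphore counters
 * for a given client process */
static int create_shared_sem_page( unsigned int id, void **ptr )
{
    char name[64];
    long pagesize = sysconf( _SC_PAGESIZE );
    int fd;

    snprintf( name, sizeof(name), "/wine-sem-%d-%u", getpid(), id );
    fd = shm_open( name, O_RDWR | O_CREAT | O_EXCL, 0600 );
    if (fd == -1) return -1;
    shm_unlink( name );                 /* nothing persists if we crash */
    if (ftruncate( fd, pagesize ) == -1 ||
        (*ptr = mmap( NULL, pagesize, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0 )) == MAP_FAILED)
    {
        close( fd );
        return -1;
    }
    return fd;  /* send to the client, which mmaps the same page */
}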
Option 2: System V semaphores

On Linux, these are hosted in the kernel, so you can't just accidentally overwrite them. They will be slightly slower than shared memory due to the system call overhead. You probably know them better than I do, but at the risk of stating the obvious, the following are their limitations. Their maximum value on Linux is SHRT_MAX (SEMVMX), so any request for a higher lMaximumCount would have to be clipped. There are also limits on Linux that root can adjust if needed for some application, but the defaults are a maximum of 32000 (SEMMNS) total semaphores on the system, 128 (SEMMNI) semaphore sets and a maximum of 250 (SEMMSL) semaphores per set. They are also persistent, so if the wineserver crashes, they can leave behind clutter.
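If it helps, the trywait/release pair maps onto semop like this (a sketch; error handling and SEM_UNDO considerations are omitted). IPC_NOWAIT makes semop fail with EAGAIN instead of blocking, which matches the "trywait, else fall back to the server" model above.

#include <sys/ipc.h>
#include <sys/sem.h>

/* Non-blocking acquire: decrement, or fail with EAGAIN if not signalled */
static int sysv_sem_trywait( int semid, unsigned short idx )
{
    struct sembuf op = { .sem_num = idx, .sem_op = -1, .sem_flg = IPC_NOWAIT };
    return semop( semid, &op, 1 );
}

/* Release: increment by count, never blocks */
static int sysv_sem_release( int semid, unsigned short idx, short count )
{
    struct sembuf op = { .sem_num = idx, .sem_op = count, .sem_flg = 0 };
    return semop( semid, &op, 1 );
}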
Option 3: Move semaphores completely into the client

In this scenario, the wineserver can never be exposed to corrupted data. It is very fast when locking can be performed in the client, but very complicated and potentially slower for mixed locks. Calls to WaitForMultipleObjectsEx containing both semaphores and other objects (especially with bWaitAll = TRUE) may require multiple request/reply cycles to complete, because the client must successfully lock the semaphores before the server calls satisfied on the server-side objects.
Here is an optimistic use case that only requires a single request/reply cycle:

1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of semaphores and other objects.
2. Client calls trywait on all semaphores, which succeeds.
3. Client passes the request to the server (with semaphore states) and blocks on the pipe.
4. Server gets the value of all server-side objects and determines that the condition can be satisfied, so it calls satisfied on all objects.
5. Server sends the response to the client.
6. Client wakes up and completes the wait call.
Here is a slightly less optimistic case:

1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of semaphores and other objects.
2. Client calls trywait on all semaphores, which fails on one semaphore.
3. Client rolls back the locks on all semaphores that had succeeded.
4. Client passes the request to the server (with semaphore states).
5. Client blocks on the semaphore that failed and on the server pipe.
6. Server updates the thread status (blocking on a native object).
7. The semaphore is signalled and the client wakes up.
8. The lock is obtained on the semaphore that previously failed.
9. Client now calls trywait on the remaining semaphores, which this time succeeds.
10. Client sends an update to the server and blocks on the pipe.
11. Server checks all server-side objects, which are all signalled, so it calls satisfied on all objects.
12. Server updates the thread status and notifies the client.
13. Client wakes up and completes the wait call.
As you can see, this can get more complicated. If the server discovers that a server object isn't signalled, it will have to notify the client to roll back the locks and wait for the server objects to become ready.
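To give a feel for the client-side control flow I have in mind, here is a very rough sketch. Every type and helper name in it is hypothetical, it collapses steps 8-9 of the second walkthrough into a plain retry, and it ignores timeouts, APCs and abandoned waits.

struct sem_list;      /* the client-side semaphores in the wait */
struct obj_list;      /* the server-side objects in the wait */

extern struct sem_list *trywait_all( struct sem_list *sems );  /* NULL on success, else first failure */
extern void rollback_locks( struct sem_list *sems );
extern void block_on_sem_and_pipe( struct sem_list *failed );
extern int  server_check_and_satisfy( struct obj_list *objs ); /* one request/reply cycle */
extern void block_on_server_notification( void );

static int wait_all_mixed( struct sem_list *sems, struct obj_list *objs )
{
    for (;;)
    {
        struct sem_list *failed = trywait_all( sems );
        if (failed)                         /* steps 2-8 of the walkthrough */
        {
            rollback_locks( sems );
            block_on_sem_and_pipe( failed );
            continue;                       /* retry the whole semaphore set */
        }
        /* all semaphores locked: one request/reply for the server objects */
        if (!server_check_and_satisfy( objs )) return 0;  /* done */
        rollback_locks( sems );             /* a server object wasn't signalled */
        block_on_server_notification();     /* wait, then start over */
    }
}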
So, which of these solutions is most appealing to you?