Alexandre,
First off, I apologize for the size of this email; I'm trying to keep it as concise as possible.
I've been experimenting with ways to optimize synchronization objects and have implemented a promising proof of concept for semaphores using glibc's NPTL (POSIX) semaphore implementation. I posted revision 3 of this today, although I appear to have used the wrong msg id in the --in-reply-to header. :( My goal is to eventually make similar optimizations for all synchronization objects, or at least those that have demonstrable performance problems.
The basic theory of operation is that when a client sends a create_semaphore, the server creates a POSIX semaphore with a unique name, which it passes to the client process so that it can open it locally. This allows the client to perform ReleaseSemaphore without a server call, as well as WaitFor(Multiple|Single)Object(s) in cases where the wait condition can be determined to be satisfied without a server call (i.e., either bWaitAll = FALSE and a signalled semaphore is found in the handle list prior to any non-semaphore object, or bWaitAll = TRUE and all handles are signalled semaphores). For all other conditions, it uses a traditional server call.
However, it has two problems:
1. It uses glibc's implementation of POSIX semaphores, which uses shared memory to share them with other processes, and
2. glibc's POSIX semaphores are incompatible across 32- and 64-bit ABI processes.
I have not been able to find any more flaws in the case where both the program and wineserver are the same ABI. All tests pass, and I've added one more (although more tests are clearly needed). Since this implementation only uses sem_trywait (and never sem_wait or sem_timedwait), we don't really even need a full-featured semaphore -- a simple 32- or 16-bit number that's accessed atomically would suffice as a replacement. Although I did plan to eventually explore having a client program block w/o calling the server, the benefit of that is minimal compared to the benefit of avoiding the server call when releasing a semaphore and when "wait"ing on a semaphore that is already available.
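To make that concrete, the atomically-accessed counter could look something like the following minimal C11 sketch. The wine_sem type and function names are hypothetical (not the actual patch code); it only provides the trywait/release pair, which is all the current implementation uses:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical replacement for a full POSIX semaphore: a single 32-bit
 * value accessed atomically, since only trywait/release semantics are
 * needed (never a blocking sem_wait). */
typedef struct { atomic_int count; } wine_sem;

/* Equivalent of sem_trywait(): returns true on success. */
static bool wine_sem_trywait(wine_sem *s)
{
    int c = atomic_load_explicit(&s->count, memory_order_acquire);
    while (c > 0)
    {
        if (atomic_compare_exchange_weak_explicit(&s->count, &c, c - 1,
                memory_order_acquire, memory_order_acquire))
            return true;   /* decremented without a server call */
    }
    return false;          /* count is zero: fall back to a server wait */
}

/* Equivalent of sem_post(), bounded by the object's maximum count. */
static bool wine_sem_release(wine_sem *s, int max)
{
    int c = atomic_load_explicit(&s->count, memory_order_relaxed);
    while (c < max)
    {
        if (atomic_compare_exchange_weak_explicit(&s->count, &c, c + 1,
                memory_order_release, memory_order_relaxed))
            return true;
    }
    return false;          /* would exceed lMaximumCount */
}
```

The compare-and-swap loop is what lets the trywait fail cleanly (and defer to the server) when the count is zero, instead of ever blocking in the client.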
So now I want to understand the minimum threshold of acceptability in wine for such a mechanism. We discussed this quite a bit in chat, and I can see many possibilities, each with its own particular issues. I'm listing them in order of my personal preference (most preferred first).
Option 1: Simple shared memory & roll our own semaphore
Similar to what glibc's NPTL semaphores are doing, except that we would only need a single integral value and not even a futex. The obvious downside is that a process can corrupt this memory and cause dysfunction in other processes that also have semaphores in that page. This could be minimized by giving every process its own page that is only shared between the server and the process, unless a semaphore in that process is shared with another program, at which time the memory page could be shared with that process as well. Thus, the scope of possible corruption is limited to how far you share the object.
In the worst case of memory corruption, the wineserver would either leave a thread of one of these processes hung, release one when it shouldn't be released, or determine that the memory is corrupted, issue an error message, set the last error to something appropriate and return WAIT_FAILED.
Option 2: System V semaphores
On Linux, these are hosted in the kernel, so you can't just accidentally overwrite them. They will be slightly slower than shared memory due to the system call overhead. You probably know them better than I, but at the risk of stating the obvious, the following are their limitations. Their max value on Linux is SHRT_MAX, so any request for a higher lMaximumCount would have to be clipped. There are also limits on Linux that can be adjusted by root if needed for some application, but the defaults are a maximum of 32000 (SEMMNS) total semaphores on the system, 128 (SEMMNI) semaphore sets and a max of 250 (SEMMSL) semaphores per set. They are also persistent, so if the wineserver crashes, they can leave behind clutter.
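For reference, here is roughly how a semaphore would map onto the SysV API. This is an illustrative sketch (the function names are mine, not Wine code); note that the semaphore must be explicitly removed with IPC_RMID, which is exactly the persistence/clutter problem mentioned above:

```c
#include <sys/ipc.h>
#include <sys/sem.h>

/* glibc requires the caller to define union semun (see semctl(2)). */
union semun { int val; };

/* Create a one-element SysV semaphore set with an initial count. */
static int sysv_sem_create(int initial)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1) return -1;
    union semun arg = { .val = initial };
    if (semctl(id, 0, SETVAL, arg) == -1) return -1;
    return id;
}

/* Non-blocking decrement: 0 on success, -1/EAGAIN if the count is 0. */
static int sysv_sem_trywait(int id)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = IPC_NOWAIT };
    return semop(id, &op, 1);
}

/* Increment; the kernel enforces the SEMVMX ceiling. */
static int sysv_sem_release(int id)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    return semop(id, &op, 1);
}
```

Unlike the shared-memory option, the kernel validates every operation here, but each trywait/release is a system call, and the ids outlive a crashed server unless something cleans them up with semctl(id, 0, IPC_RMID).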
Option 3: Move semaphores completely into the client
In this scenario, the wineserver can never be exposed to corrupted data. It is very fast when locking can be performed in the client, but very complicated and potentially slower for mixed locks. Calls to WaitForMultipleObjectsEx containing both semaphores and other objects (especially with bWaitAll = TRUE) may require multiple request/reply cycles to complete. The client must successfully lock the semaphores before the server calls satisfied on the server-side objects.
Here is an optimistic use case that only requires a single request/reply cycle:
1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of semaphores and other objects.
2. Client calls trywait on all semaphores, which succeeds.
3. Client passes request to server (with semaphore states) and blocks on pipe.
4. Server gets the value of all server-side objects and determines that the condition can be satisfied, so calls satisfied on all objects.
5. Server sends response to client.
6. Client wakes up and completes the wait call.
Here is a slightly less optimistic case:
1. WaitForMultipleObjectsEx is called with bWaitAll = TRUE and a mix of semaphores and other objects.
2. Client calls trywait on all semaphores, which fails on one semaphore.
3. Client rolls back locks on all that had succeeded.
4. Client passes request to server (with semaphore states).
5. Client blocks on the unavailable semaphore and on the server pipe.
6. Server updates thread status (blocking on native object).
7. The semaphore is signaled and the client wakes up.
8. Lock is obtained on the previously unavailable semaphore.
9. Client now calls trywait on the remaining semaphores, which again succeeds.
10. Client sends update to server and blocks on pipe.
11. Server checks all server-side objects, which are all signaled, so calls satisfied on all objects.
12. Server updates thread status and notifies client.
13. Client wakes up and completes the wait call.
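The all-or-nothing client-side step (2-3 in the lists above) could be sketched like this, with a plain counter array standing in for the client-visible semaphore values (hypothetical code, just to show the rollback):

```c
#include <stdbool.h>

/* Try to take every semaphore in the wait list; if any is unavailable,
 * roll back the ones already taken so the client can fall back to
 * blocking via the server. counts[] stands in for the client-visible
 * semaphore values (names are illustrative, not Wine code). */
static bool trywait_all(int *counts, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (counts[i] > 0) { counts[i]--; continue; }
        /* step 3: one semaphore unavailable, undo the locks we took */
        while (i-- > 0) counts[i]++;
        return false;
    }
    return true;   /* all locked: send states to the server (step 4) */
}
```

In the real implementation each decrement would be an atomic trywait, and a release racing with the rollback is exactly what makes the multi-cycle case complicated.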
As you can see, this can get more complicated. If the server discovers that a server object isn't signaled, it will have to notify the client to roll back the locks and wait for the server objects to be ready.
So which of these solutions is most appealing to you?
Daniel Santos daniel.santos@pobox.com writes:
As you can see, this can get more complicated. If the server discovers that a server object isn't signaled, it will have to notify the client to roll back the locks and wait for the server objects to be ready.
So which of these solutions is most appealing to you?
None of them, really.
Using shared memory means you'll need one page per object, because you can't allow a process to corrupt the state of unrelated objects. This doesn't scale.
SysV synchronization objects are broken in various ways and should be avoided.
I think the most promising approach would be to add a process-local cache for objects that don't need inter-process synchronization. As soon as the object needs complex work that involves the server or another process (DuplicateHandle, WaitForMultipleObjects, named objects, etc.) it graduates to a full server object, and from then on goes through the normal path. This should cover the most performance-critical cases.
On 09/14/2015 02:18 AM, Alexandre Julliard wrote:
Daniel Santos daniel.santos@pobox.com writes:
As you can see, this can get more complicated. If the server discovers that a server object isn't signaled, it will have to notify the client to roll back the locks and wait for the server objects to be ready.
So which of these solutions is most appealing to you?
None of them, really.
Using shared memory means you'll need one page per object, because you can't allow a process to corrupt the state of unrelated objects. This doesn't scale.
I have considered this as well, and my idea was to have a shared page (or set of pages if needed) for each unique combination of processes that are sharing objects. Then these pages will only contain objects that are shared by all of those processes. The synchronization object in private memory is then "moved" (or just redirects) to the new object in shared memory and any threads waiting on the old one are signaled and a flag in the old object informs them of the move, so they can start over on the new object. Actually, the original object doesn't go away, it just becomes a proxy. This can happen again if the sharing scheme changes to involve another process, except that the now-intermediary object can actually be "deleted" once there are no more references to it (not really a call to free() of course).
SysV synchronization objects are broken in various ways and should be avoided.
That's good to know!
I think the most promising approach would be to add a process-local cache for objects that don't need inter-process synchronization. As soon as the object needs complex work that involves the server or another process (DuplicateHandle, WaitForMultipleObjects, named objects, etc.) it graduates to a full server object, and from then on goes through the normal path. This should cover the most performance-critical cases.
That is actually a wonderful idea! The struct ntdll_object & associated rbtree and list added to om.c in the patch set is exactly for this process-level cache. Whether the object is inter-process or not can just be data in that cache object. This would seem to add the most for the least complexity and exposure to security and corruption issues. This approach should fix most performance issues and still be very simple! :) Thanks!
I was going to have to ask about the message queues needing to have their signaled() functions called during a select call (which was a complication I hadn't mentioned), but this avoids that issue now.
Daniel
Daniel Santos daniel.santos@pobox.com writes:
That is actually a wonderful idea! The struct ntdll_object & associated rbtree and list added to om.c in the patch set is exactly for this process-level cache. Whether the object is inter-process or not can just be data in that cache object. This would seem to add the most for the least complexity and exposure to security and corruption issues. This approach should fix most performance issues and still be very simple! :) Thanks!
If you think that it's very simple then probably you are missing something ;-)
I think it's doable, at least for some object types, but it's far from simple, there's still a lot of complexity involved. Consider for instance a DuplicateHandle call from a different process.
On 09/14/2015 03:11 AM, Alexandre Julliard wrote:
If you think that it's very simple then probably you are missing something ;-)
I think it's doable, at least for some object types, but it's far from simple, there's still a lot of complexity involved. Consider for instance a DuplicateHandle call from a different process.
Sorry for my delayed response. Yes, for instance, waiting threads still have to be signaled and informed of the change when the object graduates to a server-side object, and all of this must be completed before the call that shares the object (or the first use of it in the second process) returns. I still think it's simpler than some of the other alternatives. Off the top of my head (for semaphores), the ways an object can become shared are OpenSemaphore, DuplicateHandle and CreateProcess with bInheritHandles = TRUE, but I can hunt that down later.
The part I'm not sure about yet is how we notify the first process that its object is now shared. My most sane thought for doing this (I have plenty of others :) is adding a more generic mechanism to tell the client "I have a message for you", but SIGUSR1 and SIGUSR2 are both already in use. The handler for SIGUSR1 could be modified so that it first asks the server what the message is, which may be a "suspend thread" or "migrate object to server." So before proceeding, I would like to know your preference on how this should be handled.
Also, I'm wondering if migration should occur the moment the handle is shared, or should it be delayed until the handle is actually used? I'm almost certain that it's the former, but I can see some performance benefits in the case of CreateProcess(bInheritHandles = TRUE) where the new process doesn't actually use the now "shared" synchronization object. The first solution keeps the server code a bit cleaner as well, so that all of the request handlers for the various functions that use a shared object don't have to check whether it needs to be migrated.
I wish I could make it to wineconf! :)
Daniel
The part I'm not sure about yet is how we notify the first process that its object is now shared. My most sane thought for doing this (I have plenty of others :) is adding a more generic mechanism to tell the client "I have a message for you", but SIGUSR1 and SIGUSR2 are both already in use. The handler for SIGUSR1 could be modified so that it first asks the server what the message is, which may be a "suspend thread" or "migrate object to server." So before proceeding, I would like to know your preference on how this should be handled.
FWIW, in our discussion on #winehackers, I was assuming you'd do the equivalent of InterlockedExchange in shared memory (between wineserver and the client) that contains the object state. The client would notice as soon as it tried to do anything with the object that the object was moved to the server, and the server would know the previous state from the result of InterlockedExchange. (This assumes every valid object state, including a state indicating that it's in the server, can fit in 64 bits.)
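A minimal sketch of that handoff, using C11 atomics in place of InterlockedExchange (all names and the sentinel choice are hypothetical):

```c
#include <stdatomic.h>
#include <stdint.h>

/* One reserved 64-bit state value means "the object now lives in the
 * server; ask it". Everything else is a valid client-local state. */
#define OBJ_IN_SERVER UINT64_MAX

typedef struct { _Atomic uint64_t state; } shared_obj;

/* Server side: claim the object atomically. The return value is the
 * last client-side state, so no cooperation from the client is needed. */
static uint64_t migrate_to_server(shared_obj *o)
{
    return atomic_exchange(&o->state, OBJ_IN_SERVER);
}

/* Client side: every operation first checks for the sentinel and falls
 * back to a normal server request if the object has moved. */
static int client_op_is_local(const shared_obj *o)
{
    return atomic_load(&o->state) != OBJ_IN_SERVER;
}
```

The key property is that the server never waits on client code: the exchange both publishes the migration and retrieves the authoritative prior state in one step.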
I'm not sure it's a good idea for the wineserver to have to wait on code in a windows process (which, if the process is really hosed, might never respond) before it can reply to a DuplicateHandle request.
The only time you'd need to signal the client is when a thread is waiting on the object being moved, so that the thread can restart the wait using the server. I don't know how you plan to wait on process-local objects, but perhaps you can perform the wait in such a way that the server can interrupt it easily, without disturbing the thread if it's doing anything else?
On 09/15/2015 08:26 PM, Vincent Povirk wrote:
The part I'm not sure about yet is how we notify the first process that its object is now shared. My most sane thought for doing this (I have plenty of others :) is adding a more generic mechanism to tell the client "I have a message for you", but SIGUSR1 and SIGUSR2 are both already in use. The handler for SIGUSR1 could be modified so that it first asks the server what the message is, which may be a "suspend thread" or "migrate object to server." So before proceeding, I would like to know your preference on how this should be handled.
FWIW, in our discussion on #winehackers, I was assuming you'd do the equivalent of InterlockedExchange in shared memory (between wineserver and the client) that contains the object state.
Well, Alexandre doesn't like that approach. I still think it's doable to manage such a thing while having the wineserver know when something isn't right and respond accordingly, and I would love to be able to convince Alexandre of that! :)
The client would notice as soon as it tried to do anything with the object that the object was moved to the server, and the server would know the previous state from the result of InterlockedExchange. (This assumes every valid object state, including a state indicating that it's in the server, can fit in 64 bits.)
I think that the idea Alexandre proposed is that the object will not use shared memory, so the "moving to the server" scenario occurs when a complex WaitForMultipleObjects (i.e., with mixed object types) is requested or another process gains access to the object.
I'm not sure it's a good idea for the wineserver to have to wait on code in a windows process (which, if the process is really hosed, might never respond) before it can reply to a DuplicateHandle request.
Well, the wineserver can't actually block, but somewhere we have to cause the calling thread of the second process to block until the object is migrated to the server. I think that there are three ways to do this.
1. When the second process makes a call to get a handle to the (previously private) object, block the calling thread (delay the response) until the object is migrated to the server, or
2. When the second process makes a call to get a handle to the (previously private) object, trigger a migration but don't block the calling thread (pending a completed migration) until it makes a call to actually *use* that object (ReleaseSemaphore, QuerySemaphore, server_select, etc.), or
3. Don't do anything until the second process makes a call to actually use the object, then trigger the migration and wait for it to finish.
For simplicity, I was heading for the first option, but greater throughput could be obtained with the second option (if this is happening a lot), and a process has a greater chance of getting to use an optimized (local) object with the third option (not that I think it's a good option, just trying to be complete).
The only time you'd need to signal the client is when a thread is waiting on the object being moved, so that the thread can restart the wait using the server.
Ah, another good reason to share memory between client & server! :) Also, it would enable wait multiple with mixed object types w/o forcing the object to be migrated to the server too.
I don't know how you plan to wait on process-local objects, but perhaps you can perform the wait in such a way that the server can interrupt it easily, without disturbing the thread if it's doing anything else?
I don't have a complete design for this yet. I'm thinking that a custom synchronization mechanism using futexes would be appropriate (not that I've done that before, or even used futexes), so we would just notify a futex. Without the shared memory, I was thinking of having the server send SIGUSR1 (or some such) to the client, and that handler would send a request to the server to see what it wanted. The server would reply by telling it to migrate some object, and then the client would manage this from the handler. With just kill(), I don't think you get a choice over which thread responds, so the handler would just need to be aware of whether it was actually waiting on a futex when it was signaled and manage that appropriately. Again, I haven't really worked that part out just yet.
Using shared memory is certainly another option -- then we just have the server mark the object as needing migration and signal a futex. When a wait is completed on the client, it should just always read that value to know why it was signaled.
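Since glibc exposes no futex() wrapper, this would go through the raw syscall; a minimal Linux-only sketch (the wrapper names are mine):

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block until *addr changes, but only if it still equals 'expected'
 * when the kernel checks -- this closes the race between reading the
 * migration flag and going to sleep. Returns -1/EAGAIN immediately
 * if the value already changed. */
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to 'nwaiters' threads waiting on addr; returns the number
 * actually woken. The server would call this after setting a
 * "needs migration" flag in the shared word. */
static long futex_wake(uint32_t *addr, int nwaiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}
```

On wakeup the client rereads the shared word, which tells it whether it was signaled for a release or because the object is being migrated.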
Well, Alexandre doesn't like that approach. I still think that it's doable to manage such a thing and still have the wineserver know when something isn't right and respond accordingly and I would still love to be able to convince Alexandre of that! :)
Well, I didn't read what he said that way:
On Mon, Sep 14, 2015 at 2:18 AM, Alexandre Julliard julliard@winehq.org wrote:
Using shared memory means you'll need one page per object, because you can't allow a process to corrupt the state of unrelated objects. This doesn't scale.
This is true only if you share memory between multiple clients. If you share memory only between wineserver and individual clients, you need at least one page per client, but you can use the entire page (and most of the overhead can remain in the individual processes, ideally you only need 32 or 64 bits of shared memory per object).
Maybe there's another objection to this approach, but I didn't see it in this thread.
On 09/16/2015 10:49 AM, Vincent Povirk wrote:
Well, I didn't read what he said that way:
On Mon, Sep 14, 2015 at 2:18 AM, Alexandre Julliard julliard@winehq.org wrote:
Using shared memory means you'll need one page per object, because you can't allow a process to corrupt the state of unrelated objects. This doesn't scale.
This is true only if you share memory between multiple clients. If you share memory only between wineserver and individual clients, you need at least one page per client, but you can use the entire page (and most of the overhead can remain in the individual processes, ideally you only need 32 or 64 bits of shared memory per object).
Maybe there's another objection to this approach, but I didn't see it in this thread.
Actually, this does not need to be true if you share memory between multiple clients. Every unique combination of processes involved in sharing objects can have its own "process group" where those objects, and only those objects, reside. Further, to identify corruption, each process can keep accounting of its interaction with each object, which can be used to validate that the memory has not been altered outside of proper use of the API. For a semaphore, I think that 32 bits will be enough and will leave plenty of room left over for various flags, like one for "object being moved" and another for "server wants a notification."
If an additional process gains access to the object it can just be moved again to the new process group. This design would allow complex waits to occur on the server without requiring the object to be migrated. Of course the server would never block, it will rely upon the client to signal it in some form when a "signaling" function is performed (ReleaseMutex, ReleaseSemaphore, WakeConditionVariable, etc.).
What are your thoughts, Alexandre? If I implement it and provide tests to reasonably prove that memory corruption cannot affect the wineserver's function nor the stability of other programs that are not sharing objects, will you consider it at some point? I know that something like this would have to sit in staging for a long time, but I just don't want to build something that has no chance of being merged.
Thanks, Daniel
On 09/17/2015 09:36 AM, Daniel Santos wrote:
On 09/16/2015 10:49 AM, Vincent Povirk wrote:
Well, I didn't read what he said that way:
On Mon, Sep 14, 2015 at 2:18 AM, Alexandre Julliard julliard@winehq.org wrote:
Using shared memory means you'll need one page per object, because you can't allow a process to corrupt the state of unrelated objects. This doesn't scale.
This is true only if you share memory between multiple clients. If you share memory only between wineserver and individual clients, you need at least one page per client, but you can use the entire page (and most of the overhead can remain in the individual processes, ideally you only need 32 or 64 bits of shared memory per object).
Maybe there's another objection to this approach, but I didn't see it in this thread.
Actually, this does not need to be true if you share memory between multiple clients. Every unique combination of processes involved in sharing objects can have its own "process group" where those objects, and only those objects, reside. Further, to identify corruption, each process can keep accounting of its interaction with each object, which can be used to validate that the memory has not been altered outside of proper use of the API. For a semaphore, I think that 32 bits will be enough and will leave plenty of room left over for various flags, like one for "object being moved" and another for "server wants a notification."
Did you look at memfd as opposed to plain shared memory and how that can simplify your memory corruption problem? I know that Michael Müller worked on that, and proof of concept code is available, see https://github.com/wine-compholio/wine-staging/tree/master/patches/server-Sh...
If an additional process gains access to the object it can just be moved again to the new process group. This design would allow complex waits to occur on the server without requiring the object to be migrated. Of course the server would never block, it will rely upon the client to signal it in some form when a "signaling" function is performed (ReleaseMutex, ReleaseSemaphore, WakeConditionVariable, etc.).
What are your thoughts, Alexandre? If I implement it and provide tests to reasonably prove that memory corruption cannot affect the wineserver's function nor the stability of other programs that are not sharing objects, will you consider it at some point? I know that something like this would have to sit in staging for a long time, but I just don't want to build something that has no chance of being merged.
bye michael
On 09/17/2015 07:08 AM, Michael Stefaniuc wrote:
Did you look at memfd as opposed to plain shared memory and how that can simplify your memory corruption problem? I know that Michael Müller worked on that, and proof of concept code is available, see https://github.com/wine-compholio/wine-staging/tree/master/patches/server-Sh...
I've gotten a bit behind and wasn't aware of create_memfd, so I've had to study up on this -- very interesting (found a nice LWN article: https://lwn.net/Articles/593918). Also, I wish that each patch set in staging had a README so that I wouldn't have to analyze the whole patch set to figure out its purpose and theory of operation. This patch set is actually very exciting to me because it addresses some of the biggest problems with excessive wineserver calls that I've seen in profiling (e.g., PeekMessage & GetForegroundWindow). However, I still don't quite understand what memfd is buying in this case as opposed to just an mmap with MAP_ANONYMOUS. Please enlighten me if I'm missing something, but do create_memfd and sealing with SEAL_SHRINK & SEAL_GROW just prevent clients from being able to change the size of the underlying file? It looks like it's still writable and still up to the client to mmap it as read-only. I actually haven't gotten to play with the low-level bits of shared memory too much yet. (Any good reading material on this topic is welcomed! :)
Anyway, in this case it looks like we're using the TEB's Reserved5[0] to point to a global shared memory region and Reserved5[1] for a thread-specific one, but they are both mmapped read-only. I know that in this thread we've discussed a lot of different design ideas, but what I'm specifically discussing in *this* (most recent) proposal is read/write memory that both wineserver and clients can perform atomic operations on, and that client threads and processes can futex each other with. As such, a rogue process can corrupt it. The concept here is a.) limiting the scope of possible corruption to other processes that are already sharing objects anyway (and would very likely be susceptible to the same corruption problems on Windows as well) and b.) assuring that the wineserver is able to positively identify misuse of objects in this shared memory.
The basic concept is that each client will keep track of the changes it has made to an object with a simple (32- or 64-bit) audit value -- e.g., each time it releases a semaphore, it will add one to it; each time it completes a wait, it will subtract one. The wineserver will periodically flip a bit in the shared portion of a synchronization object using an atomic fetch-and-add operation. Clients performing subsequent operations will then begin to log to an alternate buffer. The server will then request each process's total change count for the last audit period. By adding them all together with the value of the object in the previous audit cycle, the server can verify that the memory has not been corrupted. If it has, it can simply emit an error message to the console, mark the object as "bad" and fail all subsequent and in-progress operations on it (waits return WAIT_FAILED with an appropriate last error).
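The verification step at the end of an audit cycle could be as simple as this sketch (illustrative only; the names are mine, not proposed code):

```c
#include <stdint.h>

/* End-of-cycle audit: the shared value must equal the previous cycle's
 * value plus the net change each client reports having applied. Any
 * mismatch means the shared word was modified outside the API. */
static int audit_ok(uint32_t prev_value, uint32_t current_value,
                    const int32_t *net_changes, int nclients)
{
    int64_t expected = prev_value;
    for (int i = 0; i < nclients; i++)
        expected += net_changes[i];
    return expected == (int64_t)current_value;  /* 0 => mark object "bad" */
}
```

A lying client can of course report a change count that matches its own corruption, so this detects accidents and processes that scribble on memory they don't understand, not deliberate attacks -- which matches the stated goal of protecting the wineserver and unrelated processes rather than the misbehaving process itself.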
Daniel