Hi everyone,
We've recently been working on getting American McGee's Alice (a visually stunning game, if you haven't seen it before) running well under Wine, and we've run into a serious speed issue with synchronization objects like Mutexes.
Currently, Alice runs at about 50% of the framerate it gets in Windows with the same graphics driver (NVidia). When we started investigating, it turned out that the reason for this is that it's spending half of its time in the wineserver. At first we assumed that this was due to the fact that the GL thunks need to grab the X11 lock. We realized that this wasn't necessary for most GL calls if we're using a direct rendering GL implementation, and turned off the locks. There was no effect, because there really wasn't much contention for the X11 lock.
After going through a number of similar Wine-internal possibilities and getting nowhere, we finally realized that the problem was the app itself. It's grabbing and releasing a mutex of its own bazillions of times each frame. Since there's nothing much we can do about that, we started thinking about the proposed Linux kernel module approach. After re-reading the thread and looking over the prototype, I have to concur with Alexandre's judgement: the prototype that exists is trying to do too much work.
After some more thinking, Ove and I have come up with a mechanism that should eliminate most of the wineserver overhead for mutexes and semaphores, without the need to resort to a kernel module. We're probably going to give this a try over the next few days, so any feedback will be very much appreciated.
Here's what we've been discussing in private email:
============================================================================================
Ove writes:
Gav writes:
Alternatively, I wonder if there's some way to speed up synchronization stuff through the use of some kind of shared memory area that all Wine processes know about. The shared memory area could be used to do mutexes with atomic test-and-set operations.
Maybe. But we probably don't want extensive busy waits, so we'd need to call the wineserver when we need to wait. And the wineserver isn't really designed to do bus-locked atomic access to such shared areas itself. But perhaps with some client cooperation... in Win32, a mutex is just a different (and slower) kind of critical section, anyway (but since it's handle-based it can work across address spaces).
If each mutex had a wcount field shared among all clients, we could do...
ReleaseMutex:
    wc = InterlockedDecrement(&wcount)
    if wc > 0
        call wineserver's ReleaseMutex

WaitForSingleObject:
    wc = InterlockedIncrement(&wcount)
    if wc < 0
        return WAIT_OBJECT_0
    call wineserver's WaitForSingleObject
which would at least do something about the ReleaseMutex/WaitForSingleObject pairs in the same thread...
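As a rough illustration of the counter idea above, here is a minimal C sketch. It is not Wine code: shared_mutex, server_wait and server_release are hypothetical names, the "server" calls are stubs that only count round trips, and C11 atomics stand in for InterlockedIncrement/InterlockedDecrement. It also assumes the counter starts at -1 (the convention classic Win32 critical sections use for LockCount), so the uncontended test becomes wc == 0 on acquire and wc < 0 on release. Recursion, ownership tracking and abandoned mutexes are all ignored.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-mutex record that would live in memory shared by all clients.
 * Assumed convention: -1 = free, 0 = held, n > 0 = held with n waiters. */
typedef struct {
    atomic_long wcount;
} shared_mutex;

/* Stand-ins for the wineserver round trips; here they only count calls. */
static int server_wait_calls;
static int server_release_calls;

static void server_wait(shared_mutex *m)    { (void)m; server_wait_calls++; }
static void server_release(shared_mutex *m) { (void)m; server_release_calls++; }

static void shared_mutex_init(shared_mutex *m)
{
    atomic_init(&m->wcount, -1);
}

/* Acquire. Returns 1 if the fast path was taken, 0 if the server was called. */
static int shared_mutex_wait(shared_mutex *m)
{
    /* fetch_add returns the old value; +1 gives the new one,
     * matching what InterlockedIncrement would return. */
    long wc = atomic_fetch_add(&m->wcount, 1) + 1;
    if (wc == 0)
        return 1;            /* mutex was free: no server round trip */
    server_wait(m);          /* contended: let the server block us */
    return 0;
}

/* Release. Returns 1 if the fast path was taken, 0 if the server was called. */
static int shared_mutex_release(shared_mutex *m)
{
    assert(atomic_load(&m->wcount) >= 0);   /* must currently be held */
    long wc = atomic_fetch_sub(&m->wcount, 1) - 1;
    if (wc < 0)
        return 1;            /* nobody was waiting: no server round trip */
    server_release(m);       /* waiters exist: ask the server to wake one */
    return 0;
}
```

In the uncontended case a wait/release pair touches only the shared counter; the stubbed server functions are reached only once a second waiter shows up.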
That's exactly the kind of thing I was thinking about. We can probably do the same for the CriticalSection semaphores as well. I don't think that we can do anything to speed up Events though.
So the next question is: what's the best way to manage the shared area for each mutex/semaphore? We could just expose the wineserver handle table directly in the shared memory area, expanding the handle_entry struct in the server with a DWORD to serve as the count field. Theoretically it brings up security concerns, but I don't think that we care that much at this point.
============================================================================================
Thoughts, anyone?
-Gav