Gavriel State gav@transgaming.com wrote:
> As it stands, your approach won't be useful for general Wine usage until you've got *everything* done.
True. But I think there are valid reasons for doing it this way.
One major problem is handles. Either the kernel must allocate all handles, or userspace must allocate all handles. Take something like the implementation of DuplicateHandle() - dead easy really in kernel space, including duplication to another process.
If userspace tells the kernel to use handle X for a mutex, say, and handle Y for a semaphore, then userspace still has to go through atomic management code to (a) allocate a handle, and (b) access a handle. And to implement cross-process duplication, you either need some sort of interrupt & IPC mechanism, or an external process (the wineserver). This being the case, you lose any gains by putting the stuff in-kernel.
Alternatively, if userspace is allowed to implement arbitrary handles that are allocated by the kernel, then the kernel waiting mechanism can get a little tricky. However, that said, for non-waitable objects, this might not be so bad. In fact, I'm planning on implementing the registry access handles this way and may well do some of the process and thread control this way too.
Furthermore, the kernel would have to be given object handles, not object pointers; otherwise you have a gaping security hole.
And another major problem is context switches... they are horribly expensive really. They currently give Wine at least 20 times the system call latency that the kernel method is capable of.
> And then there will be boatloads of debugging to do.
Perhaps not as much as you think... I'm doing as much debugging as I can as I go along, just making sure the kernel objects work as I'd expect (not necessarily the same thing, true, as having "compatible" behaviour).
> If there's any way that you can implement your kernel module more within the context of the existing server architecture - replacing objects in piece-by-piece fashion rather than all at once - that might make it easier to adopt.
See above for why the piece-by-piece method is difficult.
One alternative would be to invent a new network protocol (say AF_WINE), but that again requires a complete implementation before it is really useful.
> For example, you might try implementing the core object/handle management and waiting code in the kernel module, and have the wine server rely on the kernel module for that low-level functionality.
By this, do you mean actually having the wineserver process talk to the kernel module on behalf of the Wine application?
> Waiting for objects would be implemented on the client side through a call to the kernel module.
That's what I'm currently doing, though it's not fully implemented yet.
> When something needs to be done with an object, we would call either the wineserver or the kernel module, depending on how that object is implemented.
> For example, you could do mutexes entirely within the kernel module, but leave file objects on the wine-server side initially.
My main gripe is the slow speed of access to files... Every Read/WriteFile goes to the wineserver to convert the handle into a file descriptor and to check for locking. The FD is then passed back over a UNIX domain socket, used once and then closed.
I suspect it can't really be done otherwise, particularly if ZwCloseObject (or whatever it is called) is implemented, since this allows handles in another process to be closed.
Actually, I've done a fair amount of the file object stuff... Most of it involves mapping down to a "struct file *", which is how the kernel views files, and then invoking the appropriate kernel method.
> In my experience Alexandre far prefers incremental change to the kind of approach you're taking. Using an incremental approach will improve the chances that your code makes it into Wine at some point.
Hmmm... It's difficult to determine how to do it incrementally without making for even more work, but I think I know what you mean.
> One thing I've been wondering that you might be able to answer is this: exactly why is the current Wine Server architecture so slow? Is it just the context switching?
Context switching is the main element of it. Going to the wineserver and back again just for a ReadFile() call or a Wait*() function incurs a fairly serious penalty (particularly on an X86, I think). Plus there's no requirement for the kernel to pass the remains of your timeslice to the wineserver and back again. Also, you have to bundle lots of data through AF_UNIX network packets and/or copy lots of data into and out of _shared_ memory without killing other threads.
One of the problems with the context switch is that you have to flush all the CPU caches, muck around with the MMU and execute scheduling algorithms.
> Is it that the kernel isn't giving the wineserver a high enough priority once the client blocks after having written to the socket?
I don't think priority has anything much to do with it. A more convenient scheduling algorithm might help a little, though.
> Is it other socket overhead (routing, perhaps)? Simply speeding up the communications path between the clients and the server would remove the need for most of the kernel level services.
I don't think so... To communicate with the wineserver you have to use some sort of waitable UNIX object (ie: a socket, a pipe, or a SYSV semaphore or message), do busy waiting (& kill your CPU), or send signals (context switch _and_ signal overhead).
To do it without a wineserver (ie: using shared memory) is also tricky... You have to be able to, for instance, recover from processes going away without releasing mutexes.
> Also, just FYI, the PE image mapping work is nice, but isn't likely to affect speed all that much, since Wine can just mmap in most PE images created with recent compilers. WordPerfect, for example, launches in around 10-12 seconds on my machine (down from 45-60 seconds before the mmapping). I can't imagine that doing the fixups when paging instead would do too much better.
Look at the VM size of something like MS Word, and think of not having to allocate buffers to store all those DLLs; buffers that would otherwise eat a massive chunk out of your machine's total VM.
Plus, launching should be even quicker, because fixups only have to be done when they're actually needed. You certainly don't have to go through and fix up a few tens of megs of DLLs up front.
Cheers, David