Hi all,
I've been recently looking at user handle handling in win32u and how it should be split from user32 (and PE world in general). In general, having something similar to how we handle gdi handles now would be nice. We're able to expose DC_ATTR struct (and possibly others in the future), allowing simple functions like an attribute getter, to avoid any syscalls, while keeping internal 'kernel' part private to win32u.
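For illustration, that pattern boils down to something like this (the struct and field here are made up for the example, not the actual DC_ATTR layout):

/* illustrative sketch only: an attribute block mapped into the client
 * process lets simple getters avoid a syscall entirely */
struct dc_attr_sketch
{
    unsigned int map_mode;   /* hypothetical cached attribute */
};

static unsigned int sketch_get_map_mode( const struct dc_attr_sketch *attr )
{
    /* no call into win32u needed; the value is read directly from the
     * exposed attribute struct */
    return attr->map_mode;
}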
Unlike GDI handles, user handles need to be truly interprocess, which makes things trickier. Having it work only for the current process would not buy us much, because we'd need to expose the fallback part from win32u anyway. I also considered avoiding the problem and simply exposing additional Wine-specific functions from win32u. That could work, but such differences in handle capabilities from Windows quickly add up, and the number of cases where it would affect design decisions convinced me to just try to make shared memory work. We were talking about extending shared memory usage for wineserver communication anyway.
I created a branch with a prototype implementation:
https://github.com/jacekcw/wine/tree/user-handles
I think this is enough to get a sense of how I imagine it working. Those patches still need some work and testing, but it's close enough.
It implements the user handle table using shared memory that is writable from the server but read-only for clients. Using shared memory for the handle table alone is enough to reimplement things like IsWindow() and GetWindowThreadProcessId(). Each handle table entry contains three objects (a rough sketch of how they fit together follows the list):
- the server object; not much changes here
- the client 'kernel' object. Using windows as an example, this is the WND struct. We will want to make it private to win32u (instead of user32), but other than that its mechanism can stay mostly unchanged, except that we will need to be more careful to match the server state closely
- a shared struct representing the current state of the object. This replaces some fields of the above types, but will also be accessible from PE code.
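To give an idea of how those three pieces could relate, here is a rough sketch (all names are hypothetical, not the actual structs from the branch):

/* shared handle table entry: written by wineserver, mapped read-only in
 * clients; 'uniq' carries the generation/type bits of the handle so a
 * stale or reused handle can be detected */
struct user_entry_sketch
{
    unsigned int pid;          /* owning process */
    unsigned int tid;          /* owning thread */
    unsigned int uniq;         /* handle validation data */
    unsigned int shared_off;   /* offset of the shared object state below */
};

/* shared object state, e.g. for a window: replaces some WND fields and
 * can be read from PE code without a server round trip */
struct shared_window_sketch
{
    unsigned int  style;
    unsigned int  ex_style;
    unsigned long id;
};

/* the client 'kernel' object (WND in the window case) stays private to
 * win32u and is not visible to PE code at all */
struct client_window_sketch
{
    void *win_proc;
    /* ... */
};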
The user-handles branch reimplements GetWindowLong* on top of shared memory (mostly; it's not complete yet, but the remaining cases will be similar). This also seems to match how it works on Windows: win32u exposes NtUserSetWindowLong, but no Get* equivalent. The getter can be lock-free, because the only race it needs to worry about is the handle being closed and the memory reused (in which case we'd read garbage). To avoid that, we simply need to recheck the handle after reading the memory. Also note that the guarantees of the user_section lock still hold: although this memory is modified on the server side, the modification is always initiated by a server call from the proper thread while SetWindowLong* holds the lock.
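Roughly, the lock-free getter looks like this, reusing the hypothetical structs sketched above (helper names are made up too):

/* sketch of a lock-free read validated by re-checking the handle: if the
 * entry still maps to the same handle after the read, the data we read
 * belonged to that window and not to a reused slot */
extern const struct user_entry_sketch *sketch_get_entry( void *hwnd );
extern const struct shared_window_sketch *sketch_get_shared( const struct user_entry_sketch *entry );
extern int sketch_entry_matches( const struct user_entry_sketch *entry, void *hwnd );

static unsigned int sketch_get_window_style( void *hwnd )
{
    const struct user_entry_sketch *entry;
    unsigned int style;

    if (!(entry = sketch_get_entry( hwnd ))) return 0;    /* invalid handle */
    style = sketch_get_shared( entry )->style;            /* read the shared data */
    /* a read barrier belongs here in a real implementation */
    if (!sketch_entry_matches( entry, hwnd )) return 0;   /* handle was reused, treat as invalid */
    return style;
}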
There are some similarities to Huw's and Rémi's shared memory patches used in Proton, but this patchset touches different areas. The user-handles branch also doesn't need locking mechanisms, because thread safety is guaranteed by other means. We may need something like serial-number-based synchronization once we need a way to read multiple fields whose integrity cannot be guaranteed otherwise, but that doesn't seem necessary for the cases I've needed so far. Anyway, I believe that those patches can be adapted to use mechanisms from my series, but the overlap is small at this point.
Any feedback is appreciated.
Thanks,
Jacek
Hi Jacek,
I had a quick look, although I didn't really look into the details. I really like the idea of introducing shared memory with a mechanism similar to the one in the Proton patches; it makes it possible to think about having them upstream too one day.
Then, I think I would rather use the same locking mechanism for both if possible. I like the way Huw did it with SHM read/write blocks, as it maps quite closely to the SERVER_START_REQ / SERVER_END_REQ pattern, making shared memory communication very visible and explicit (although it could maybe be even better by requiring a scope block, as for server requests).
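Something along these lines, just to show the shape of the pattern (macro and field names invented for the example):

/* hypothetical sequence-counter read block; the writer increments 'seq'
 * before and after modifying the data, so it is odd while a write is in
 * progress (memory barriers omitted for brevity) */
struct shm_block_sketch
{
    volatile unsigned int seq;
    unsigned int style;
    unsigned int ex_style;
};

#define SHARED_READ_BEGIN( block ) \
    do { \
        unsigned int __seq; \
        do { \
            while ((__seq = (block)->seq) & 1) ; /* writer active, spin */

#define SHARED_READ_END( block ) \
        } while ((block)->seq != __seq); /* a write raced with us, retry */ \
    } while (0)

/* usage, similar in spirit to SERVER_START_REQ / SERVER_END_REQ:
 *
 *   SHARED_READ_BEGIN( win );
 *       style    = win->style;
 *       ex_style = win->ex_style;
 *   SHARED_READ_END( win );
 */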
I also think that having atomic_store_ulong spread a bit everywhere doesn't make the code very readable. It's also not as efficient, I believe, as potentially every store needs to lock, whereas the SHM block only requires it for the atomic operations on the sequence counter.
Also note that the atomic_store helpers' name is now misleading, as Jinoh Kang noted in their patch the other day. They aren't atomic at all, just strongly ordered with respect to other volatile-qualified variables.
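For what it's worth, the difference boils down to something like this (generic snippet, not the actual Wine helpers):

/* a 'volatile' store is just a plain store that the compiler won't
 * reorder against other volatile accesses; it is not an atomic
 * read-modify-write and implies no hardware barrier by itself */
static inline void volatile_store_ulong( volatile unsigned long *ptr, unsigned long val )
{
    *ptr = val;
}

/* an explicitly ordered store using the GCC/Clang built-in */
static inline void release_store_ulong( unsigned long *ptr, unsigned long val )
{
    __atomic_store_n( ptr, val, __ATOMIC_RELEASE );
}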
On the same note (and it's also a potential issue with the Proton SHM patches), I think we're using and relying on volatile a bit too much, and that may not be ideal. For instance, every shared memory store is done through a volatile-qualified struct, so they are all strongly ordered together, even though the SHM block is already atomic.
Regarding the session shared memory, I'm a bit worried about the allocator. I think it would be much simpler to use fixed-size blocks and a free list, especially as it looks like we only need to store window shared data there, for the moment at least. If we need very differently sized blocks, then adding pools is also probably simpler.
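For instance, roughly this kind of thing (hypothetical names, assuming the allocator runs on the server side so it needs no locking):

/* fixed-size block allocator over the session mapping, with free blocks
 * kept in a singly linked free list */
struct session_block_sketch
{
    struct session_block_sketch *next_free;   /* valid only while the block is free */
};

static struct session_block_sketch *free_blocks;

static void init_session_blocks( void *base, unsigned int count, unsigned int block_size )
{
    char *ptr = base;
    unsigned int i;

    for (i = 0; i < count; i++, ptr += block_size)
    {
        struct session_block_sketch *block = (struct session_block_sketch *)ptr;
        block->next_free = free_blocks;
        free_blocks = block;
    }
}

static void *alloc_session_block( void )
{
    struct session_block_sketch *block = free_blocks;
    if (block) free_blocks = block->next_free;
    return block;
}

static void free_session_block( void *ptr )
{
    struct session_block_sketch *block = ptr;
    block->next_free = free_blocks;
    free_blocks = block;
}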
Cheers,
Rémi
Hi Rémi,
On 2/14/22 22:46, Rémi Bernon wrote:
> Hi Jacek,
> I had a quick look, although I didn't really look into the details. I really like the idea of introducing shared memory with a mechanism similar to the one in the Proton patches; it makes it possible to think about having them upstream too one day.
Thanks for looking at it.
> Then, I think I would rather use the same locking mechanism for both if possible. I like the way Huw did it with SHM read/write blocks, as it maps quite closely to the SERVER_START_REQ / SERVER_END_REQ pattern, making shared memory communication very visible and explicit (although it could maybe be even better by requiring a scope block, as for server requests).
I agree that we will need a mechanism like that for some cases, but it doesn't seem like something we'd want to use for everything. For example, the handle table already has its own Windows-compatible mechanism (the uniq field) and doesn't need any more locking. I believe that an extra lock (global to the handle table?) would be redundant and would complicate the code.
On the client side, there are a number of cases where we don't need any extra shared memory lock. For example, when we own the user lock anyway, we don't need any additional locking to access the shared data (at least not for most things).
Some locking of user data will be needed in cases where we don't hold the user lock and need more than one field. I'm not sure yet, but it seems likely that we may want a per-window lock. In that case, we'd need a separate mechanism (like handle validation) to make sure that the window owning the lock is not destroyed. That would require a somewhat more complicated equivalent of SHARED_READ_BEGIN.
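Very roughly, such a read block would combine the sequence check with handle revalidation, something like this (hypothetical names again):

/* sketch: reading several fields consistently without holding the user
 * lock, retrying on concurrent writes and failing if the window is
 * destroyed or its handle reused in the meantime */
struct shared_window_sketch2
{
    volatile unsigned int seq;   /* odd while the server is writing */
    unsigned int style;
    unsigned int ex_style;
};

extern struct shared_window_sketch2 *sketch_get_shared_window( void *hwnd );
extern int sketch_handle_still_valid( void *hwnd );

static int sketch_read_styles( void *hwnd, unsigned int *style, unsigned int *ex_style )
{
    struct shared_window_sketch2 *win;
    unsigned int seq;

    for (;;)
    {
        if (!(win = sketch_get_shared_window( hwnd ))) return 0;   /* invalid handle */
        while ((seq = win->seq) & 1) ;                             /* writer active */

        *style    = win->style;
        *ex_style = win->ex_style;

        if (win->seq != seq) continue;                     /* raced with a write, retry */
        if (!sketch_handle_still_valid( hwnd )) return 0;  /* window went away meanwhile */
        return 1;
    }
}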
> I also think that having atomic_store_ulong spread a bit everywhere doesn't make the code very readable. It's also not as efficient, I believe, as potentially every store needs to lock, whereas the SHM block only requires it for the atomic operations on the sequence counter.
In many cases, like SetWindowLong, we just need one store, so a sequence counter adds an extra store. Anyway, I would be surprised if you could measure any difference, especially on x86, where it's just a regular store.
> On the same note (and it's also a potential issue with the Proton SHM patches), I think we're using and relying on volatile a bit too much, and that may not be ideal. For instance, every shared memory store is done through a volatile-qualified struct, so they are all strongly ordered together, even though the SHM block is already atomic.
Same as above, I don't think it's a problem. (FWIW, in my patches I already do non-"atomic" stores in some places where strong ordering is not needed.)
> Regarding the session shared memory, I'm a bit worried about the allocator. I think it would be much simpler to use fixed-size blocks and a free list, especially as it looks like we only need to store window shared data there, for the moment at least. If we need very differently sized blocks, then adding pools is also probably simpler.
I didn't include that in the branch, but the window data also contains the window extra bytes.
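So the per-window shared block is variable-sized, roughly along these lines (hypothetical layout):

/* the per-window shared data is followed by the class's extra window
 * bytes (cbWndExtra), so the blocks are not all the same size */
struct shared_window_data_sketch
{
    unsigned int style;
    unsigned int ex_style;
    unsigned int extra_size;   /* cbWndExtra for the window's class */
    char         extra[];      /* data accessed via Get/SetWindowLongPtr( hwnd, n >= 0 ) */
};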
Thanks,
Jacek