This patch series introduces a new char misc driver, /dev/ntsync, which is used to implement Windows NT synchronization primitives.
== Background ==
The Wine project emulates the Windows API in user space. One particular part of that API, namely the NT synchronization primitives, have historically been implemented via RPC to a dedicated "kernel" process. However, more recent applications use these APIs more strenuously, and the overhead of RPC has become a bottleneck.
The NT synchronization APIs are too complex to implement on top of existing primitives without sacrificing correctness. Certain operations, such as NtPulseEvent() or the "wait-for-all" mode of NtWaitForMultipleObjects(), require direct control over the underlying wait queue, and implementing a wait queue sufficiently robust for Wine in user space is not possible. This proposed driver, therefore, implements the problematic interfaces directly in the Linux kernel.
This driver was presented at Linux Plumbers Conference 2023. For those further interested in the history of synchronization in Wine and past attempts to solve this problem in user space, a recording of the presentation can be viewed here:
https://www.youtube.com/watch?v=NjU4nyWyhU8
== Performance ==
The gain in performance varies wildly depending on the application in question and the user's hardware. For some games NT synchronization is not a bottleneck and no change can be observed, but for others frame rate improvements of 50 to 150 percent are not atypical. The following table lists frame rate measurements from a variety of games on a variety of hardware, taken by users Dmitry Skvortsov, FuzzyQuills, OnMars, and myself:
Game Upstream ntsync improvement =========================================================================== Anger Foot 69 99 43% Call of Juarez 99.8 224.1 125% Dirt 3 110.6 860.7 678% Forza Horizon 5 108 160 48% Lara Croft: Temple of Osiris 141 326 131% Metro 2033 164.4 199.2 21% Resident Evil 2 26 77 196% The Crew 26 51 96% Tiny Tina's Wonderlands 130 360 177% Total War Saga: Troy 109 146 34% ===========================================================================
== Patches ==
This is the first part of a 32-patch series. The series comprises 17 patches which contain the actual implementation, 13 which provide self-tests, 1 to update the MAINTAINERS file, and 1 to add API documentation.
The intended semantics of the patches are broadly intended to match those of the corresponding Windows functions. Since I do not expect familiarity with Windows syscalls, however, and especially not with some of the more subtle or unspecified behaviour that they provide, the documentation patch included in the series also describes the intended behaviour in detail, and can be used as a specification for the rest of the series.
The entire series can be retrieved or browsed here:
https://repo.or.cz/linux/zf.git/shortlog/refs/heads/ntsync4
The patches making use of this driver in Wine can be retrieved or browsed here:
https://repo.or.cz/wine/zf.git/shortlog/refs/heads/ntsync4
== Implementation ==
Some aspects of the implementation may deserve particular comment:
* In the interest of performance, each object is governed only by a single spinlock. However, NTSYNC_IOC_WAIT_ALL requires that the state of multiple objects be changed as a single atomic operation. In order to achieve this, we first take a device-wide lock ("wait_all_lock") any time we are going to lock more than one object at a time.
The maximum number of objects that can be used in a vectored wait, and therefore the maximum that can be locked simultaneously, is 64. This number is NT's own limit.
The acquisition of multiple spinlocks will degrade performance. This is a conscious choice, however. Wait-for-all is known to be a very rare operation in practice, especially with counts that approach the maximum, and it is the intent of the ntsync driver to optimize the wait-for-any pattern at the expense of the wait-for-all pattern as much as possible.
* NT mutexes are tied to their threads on an OS level, and the kernel includes builtin support for "robust" mutexes. In order to keep the ntsync driver self-contained and avoid touching more code than necessary, it does not hook into task exit nor use pids.
Instead, the user space emulator is expected to manage thread IDs and pass them as an argument to any relevant functions; this is the "owner" field of ntsync_wait_args and ntsync_mutex_args.
When the emulator detects that a thread dies, it should therefore call NTSYNC_IOC_KILL_OWNER, which will mark mutexes owned by that thread (if any) as abandoned.
* This implementation uses a misc device mostly because it seemed like the simplest and least obtrusive option.
Besides simplicitly of implementation, the only particularly interesting advantage is the ability to create an arbitrary number of "contexts" (corresponding to Windows virtual machines) which are self-contained and shareable across multiple processes; this maps nicely to file descriptions (i.e. struct file). This is not impossible with syscalls of course but would require an extra argument.
On the other hand, there is no reason to forbid using ntsync by default from user-mode processes, and (as far as I understand) to do so with a char device requires explicit configuration by e.g. udev or init. Since this is done with e.g. fuse, I assume this is the model to follow, but I may have chosen something deprecated.
* ntsync is module-capable mostly because there was nothing preventing it, and because it aided development. I am not aware of any reason why being a module is required, though.
* The misc minor number has not been reserved with LANANA. I am not sure at what point in the process this makes the most sense, but since this is still only an RFC I've abstained from doing so yet.
Elizabeth Figura (9): ntsync: Introduce the ntsync driver and character device. ntsync: Reserve a minor device number and ioctl range. ntsync: Introduce NTSYNC_IOC_CREATE_SEM and NTSYNC_IOC_DELETE. ntsync: Introduce NTSYNC_IOC_PUT_SEM. ntsync: Introduce NTSYNC_IOC_WAIT_ANY. ntsync: Introduce NTSYNC_IOC_WAIT_ALL. ntsync: Introduce NTSYNC_IOC_CREATE_MUTEX. ntsync: Introduce NTSYNC_IOC_PUT_MUTEX. ntsync: Introduce NTSYNC_IOC_KILL_OWNER.
Documentation/admin-guide/devices.txt | 3 +- .../userspace-api/ioctl/ioctl-number.rst | 2 + drivers/misc/Kconfig | 9 + drivers/misc/Makefile | 1 + drivers/misc/ntsync.c | 916 ++++++++++++++++++ include/linux/miscdevice.h | 1 + include/uapi/linux/ntsync.h | 53 + 7 files changed, 984 insertions(+), 1 deletion(-) create mode 100644 drivers/misc/ntsync.c create mode 100644 include/uapi/linux/ntsync.h
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/Kconfig | 9 ++++++++ drivers/misc/Makefile | 1 + drivers/misc/ntsync.c | 53 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 63 insertions(+) create mode 100644 drivers/misc/ntsync.c
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig index 4fb291f0bf7c..bdd8a71bd853 100644 --- a/drivers/misc/Kconfig +++ b/drivers/misc/Kconfig @@ -504,6 +504,15 @@ config OPEN_DICE measured boot flow. Userspace can use CDIs for remote attestation and sealing.
+config NTSYNC + tristate "NT synchronization primitive emulation" + help + This module provides kernel support for emulation of Windows NT + synchronization primitives. It is not a hardware driver. + + To compile this driver as a module, choose M here: the + module will be called ntsync. + If unsure, say N.
config VCPU_STALL_DETECTOR diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile index ea6ea5bbbc9c..153a3f4837e8 100644 --- a/drivers/misc/Makefile +++ b/drivers/misc/Makefile @@ -59,6 +59,7 @@ obj-$(CONFIG_PVPANIC) += pvpanic/ obj-$(CONFIG_UACCE) += uacce/ obj-$(CONFIG_XILINX_SDFEC) += xilinx_sdfec.o obj-$(CONFIG_HISI_HIKEY_USB) += hisi_hikey_usb.o +obj-$(CONFIG_NTSYNC) += ntsync.o obj-$(CONFIG_HI6421V600_IRQ) += hi6421v600-irq.o obj-$(CONFIG_OPEN_DICE) += open-dice.o obj-$(CONFIG_GP_PCI1XXXX) += mchp_pci1xxxx/ diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c new file mode 100644 index 000000000000..9424c6210e51 --- /dev/null +++ b/drivers/misc/ntsync.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * ntsync.c - Kernel driver for NT synchronization primitives + * + * Copyright (C) 2021-2022 Elizabeth Figura + */ + +#include <linux/fs.h> +#include <linux/miscdevice.h> +#include <linux/module.h> + +#define NTSYNC_NAME "ntsync" + +static int ntsync_char_open(struct inode *inode, struct file *file) +{ + return nonseekable_open(inode, file); +} + +static int ntsync_char_release(struct inode *inode, struct file *file) +{ + return 0; +} + +static long ntsync_char_ioctl(struct file *file, unsigned int cmd, + unsigned long parm) +{ + switch (cmd) { + default: + return -ENOIOCTLCMD; + } +} + +static const struct file_operations ntsync_fops = { + .owner = THIS_MODULE, + .open = ntsync_char_open, + .release = ntsync_char_release, + .unlocked_ioctl = ntsync_char_ioctl, + .compat_ioctl = ntsync_char_ioctl, + .llseek = no_llseek, +}; + +static struct miscdevice ntsync_misc = { + .minor = MISC_DYNAMIC_MINOR, + .name = NTSYNC_NAME, + .fops = &ntsync_fops, +}; + +module_misc_device(ntsync_misc); + +MODULE_AUTHOR("Elizabeth Figura"); +MODULE_DESCRIPTION("Kernel driver for NT synchronization primitives"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS("devname:" NTSYNC_NAME);
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
--Andy
On Wednesday, 24 January 2024 15:26:15 CST Andy Lutomirski wrote:
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
It's a sensible concern. I think when I discussed this with Alexandre Julliard (the Wine maintainer, CC'd) the conclusion was this wasn't something we were concerned about.
While the current model *does* allow for processes to arbitrarily mess with each other, accidentally or not, I think we're not concerned with the scope of that than we are about implementing a whole scheduler in user space.
For one, you can't corrupt the wineserver state this way—wineserver being sort of like a dedicated process that handles many of the things that a kernel would, and so sometimes needs to set or reset events, or perform NTSYNC_IOC_KILL_MUTEX, but never relies on ntsync object state. Whereas trying to implement a scheduler in user space would involve the wineserver taking locks, and hence other processes could deadlock.
For two, it's probably a lot harder to mess with that internal state accidentally.
[There is also a potential problem where some broken applications create a million (literally) sync objects. Making these into files runs into NOFILE. We did specifically push distributions and systemd to increase those limits because an older solution *did* use eventfds and *did* run into those limits. Since that push was successful I don't know if this is *actually* a concern anymore, but avoiding files is probably not a bad thing either.]
--Zeb
On Wednesday, 24 January 2024 16:56:23 CST Elizabeth Figura wrote:
On Wednesday, 24 January 2024 15:26:15 CST Andy Lutomirski wrote:
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
It's a sensible concern. I think when I discussed this with Alexandre Julliard (the Wine maintainer, CC'd) the conclusion was this wasn't something we were concerned about.
While the current model *does* allow for processes to arbitrarily mess with each other, accidentally or not, I think we're not concerned with the scope of that than we are about implementing a whole scheduler in user space.
For one, you can't corrupt the wineserver state this way—wineserver being sort of like a dedicated process that handles many of the things that a kernel would, and so sometimes needs to set or reset events, or perform NTSYNC_IOC_KILL_MUTEX, but never relies on ntsync object state. Whereas trying to implement a scheduler in user space would involve the wineserver taking locks, and hence other processes could deadlock.
For two, it's probably a lot harder to mess with that internal state accidentally.
[There is also a potential problem where some broken applications create a million (literally) sync objects. Making these into files runs into NOFILE. We did specifically push distributions and systemd to increase those limits because an older solution *did* use eventfds and *did* run into those limits. Since that push was successful I don't know if this is *actually* a concern anymore, but avoiding files is probably not a bad thing either.]
Of course, looking at it from a kernel maintainer's perspective, it wouldn't be insane to do this anyway. If we at some point do start to care about cross- process isolation in this way, or if another NT emulator wants to use this interface and does care about cross-process isolation, it'll be necessary. At least it'd make sense to make them separate files even if we don't implement granular permission handling just yet.
The main question is, is NOFILE a realistic concern, and what other problems might there be, in terms of making these heavier objects? Besides memory usage I can't think of any, but of course I don't have much knowledge of this area.
Alternatively, maybe there's another more lightweight way to store per-process data?
On Wed, Jan 24, 2024 at 7:42 PM Elizabeth Figura zfigura@codeweavers.com wrote:
On Wednesday, 24 January 2024 16:56:23 CST Elizabeth Figura wrote:
On Wednesday, 24 January 2024 15:26:15 CST Andy Lutomirski wrote:
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
It's a sensible concern. I think when I discussed this with Alexandre Julliard (the Wine maintainer, CC'd) the conclusion was this wasn't something we were concerned about.
While the current model *does* allow for processes to arbitrarily mess with each other, accidentally or not, I think we're not concerned with the scope of that than we are about implementing a whole scheduler in user space.
For one, you can't corrupt the wineserver state this way—wineserver being sort of like a dedicated process that handles many of the things that a kernel would, and so sometimes needs to set or reset events, or perform NTSYNC_IOC_KILL_MUTEX, but never relies on ntsync object state. Whereas trying to implement a scheduler in user space would involve the wineserver taking locks, and hence other processes could deadlock.
For two, it's probably a lot harder to mess with that internal state accidentally.
[There is also a potential problem where some broken applications create a million (literally) sync objects. Making these into files runs into NOFILE. We did specifically push distributions and systemd to increase those limits because an older solution *did* use eventfds and *did* run into those limits. Since that push was successful I don't know if this is *actually* a concern anymore, but avoiding files is probably not a bad thing either.]
Of course, looking at it from a kernel maintainer's perspective, it wouldn't be insane to do this anyway. If we at some point do start to care about cross- process isolation in this way, or if another NT emulator wants to use this interface and does care about cross-process isolation, it'll be necessary. At least it'd make sense to make them separate files even if we don't implement granular permission handling just yet.
I'm not convinced that any complexity at all beyond using individual files is needed for granular permission handling. Unless something actually needs permission bits on different files pointing at the same sync object (which I believe NT supports, but it's sort of an odd concept and I'm not immediately convinced that anything uses it), merely having individual files ought to do the trick. Handling of who has permission to open a given named object can live in a daemon, and I'd guess that Wine even already implements this.
And keeping everything together gives me flashbacks of Windows 95 and Mac OS pre-X. Sure, in principle the software wasn't malicious, but there was no shortage whatsoever of buggy crap out there, and systems were quite unstable. Even just:
CreateSemaphore(); fork(); sleep a few seconds; exit();
seems like it could corrupt the shared namespace world. (Obviously no one would ever do that, right?)
Also, handle leaks:
while(true) { make a subprocess, which creates a semaphore and crashes; }
The main question is, is NOFILE a realistic concern, and what other problems might there be, in terms of making these heavier objects? Besides memory usage I can't think of any, but of course I don't have much knowledge of this area.
Years ago there was some discussion of making struct file lighter-weight for light-weight things that aren't files. And, in any case, even the little integer indices in your code aren't free -- they just aren't accounted as files.
And struct file isn't *that* bad. I bet it's not dramatically bigger, or even smaller, than whatever the Windows kernel stores for a semaphore handle.
--Andy
On Thursday, 25 January 2024 12:55:04 CST Andy Lutomirski wrote:
On Wed, Jan 24, 2024 at 7:42 PM Elizabeth Figura zfigura@codeweavers.com wrote:
On Wednesday, 24 January 2024 16:56:23 CST Elizabeth Figura wrote:
On Wednesday, 24 January 2024 15:26:15 CST Andy Lutomirski wrote:
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
It's a sensible concern. I think when I discussed this with Alexandre Julliard (the Wine maintainer, CC'd) the conclusion was this wasn't something we were concerned about.
While the current model *does* allow for processes to arbitrarily mess with each other, accidentally or not, I think we're not concerned with the scope of that than we are about implementing a whole scheduler in user space.
For one, you can't corrupt the wineserver state this way—wineserver being sort of like a dedicated process that handles many of the things that a kernel would, and so sometimes needs to set or reset events, or perform NTSYNC_IOC_KILL_MUTEX, but never relies on ntsync object state. Whereas trying to implement a scheduler in user space would involve the wineserver taking locks, and hence other processes could deadlock.
For two, it's probably a lot harder to mess with that internal state accidentally.
[There is also a potential problem where some broken applications create a million (literally) sync objects. Making these into files runs into NOFILE. We did specifically push distributions and systemd to increase those limits because an older solution *did* use eventfds and *did* run into those limits. Since that push was successful I don't know if this is *actually* a concern anymore, but avoiding files is probably not a bad thing either.]
Of course, looking at it from a kernel maintainer's perspective, it wouldn't be insane to do this anyway. If we at some point do start to care about cross- process isolation in this way, or if another NT emulator wants to use this interface and does care about cross-process isolation, it'll be necessary. At least it'd make sense to make them separate files even if we don't implement granular permission handling just yet.
I'm not convinced that any complexity at all beyond using individual files is needed for granular permission handling. Unless something actually needs permission bits on different files pointing at the same sync object (which I believe NT supports, but it's sort of an odd concept and I'm not immediately convinced that anything uses it), merely having individual files ought to do the trick. Handling of who has permission to open a given named object can live in a daemon, and I'd guess that Wine even already implements this.
This is mostly correct. NT has file descriptors and descriptions (the former is called a "handle"), though unlike Unix access bits are specific to the *descriptor* (handle). I don't know if anything uses it, but we do currently implement that basic functionality, so I can't say that nothing does either.
So inasmuch as access to someone else's object is a concern, access to your object with bits you don't have permission for could be a concern along the same lines. However, from conversation with Alexandre I believe it'd be fine to just implement those checks in user space.
And keeping everything together gives me flashbacks of Windows 95 and Mac OS pre-X. Sure, in principle the software wasn't malicious, but there was no shortage whatsoever of buggy crap out there, and systems were quite unstable. Even just:
CreateSemaphore(); fork(); sleep a few seconds; exit();
seems like it could corrupt the shared namespace world. (Obviously no one would ever do that, right?)
Also, handle leaks:
while(true) { make a subprocess, which creates a semaphore and crashes; }
For whatever it's worth, this particular thing wouldn't be a concern; Wine's "kernel" daemon already has to detect when a process dies and close all its outstanding handles.
Elizabeth Figura zfigura@codeweavers.com writes:
On Wednesday, 24 January 2024 15:26:15 CST Andy Lutomirski wrote:
On Tue, Jan 23, 2024 at 4:59 PM Elizabeth Figura zfigura@codeweavers.com wrote:
ntsync uses a misc device as the simplest and least intrusive uAPI interface.
Each file description on the device represents an isolated NT instance, intended to correspond to a single NT virtual machine.
If I understand this text right, and if I understood the code right, you're saying that each open instance of the device represents an entire universe of NT synchronization objects, and no security or isolation is possible between those objects. For single-process use, this seems fine. But fork() will be a bit odd (although NT doesn't really believe in fork, so maybe this is fine).
Except that NT has *named* semaphores and such. And I'm pretty sure I've written GUI programs that use named synchronization objects (IIRC they were events, and this was a *very* common pattern, regularly discussed in MSDN, usenet, etc) to detect whether another instance of the program is running. And this all works on real Windows because sessions have sufficiently separated namespaces, and the security all works out about as any other security on Windows, etc. But implementing *that* on top of this file-description-plus-integer-equals-object will be fundamentally quite subject to one buggy program completely clobbering someone else's state.
Would it make sense and scale appropriately for an NT synchronization *object* to be a Linux open file description? Then SCM_RIGHTS could pass them around, an RPC server could manage *named* objects, and they'd generally work just like other "Object Manager" objects like, say, files.
It's a sensible concern. I think when I discussed this with Alexandre Julliard (the Wine maintainer, CC'd) the conclusion was this wasn't something we were concerned about.
While the current model *does* allow for processes to arbitrarily mess with each other, accidentally or not, I think we're not concerned with the scope of that than we are about implementing a whole scheduler in user space.
I may have misunderstood something in that dicussion then, because it would definitely be a concern. It's OK for a process to be able to mess up the state of any object that it has an NT handle to, but it shouldn't be possible to mess up the state of unrelated objects in other processes simply by passing the wrong integer id.
The concern is not so much about a malicious process going out of its way to corrupt others, because it could do that through the NT API just as well. But if a wayward pointer corrupts the client-side handle cache, that shouldn't take down the entire session.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- Documentation/admin-guide/devices.txt | 3 ++- Documentation/userspace-api/ioctl/ioctl-number.rst | 2 ++ drivers/misc/ntsync.c | 3 ++- include/linux/miscdevice.h | 1 + 4 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt index 94c98be1329a..041404397ee5 100644 --- a/Documentation/admin-guide/devices.txt +++ b/Documentation/admin-guide/devices.txt @@ -376,8 +376,9 @@ 240 = /dev/userio Serio driver testing device 241 = /dev/vhost-vsock Host kernel driver for virtio vsock 242 = /dev/rfkill Turning off radio transmissions (rfkill) + 243 = /dev/ntsync NT synchronization primitive device
- 243-254 Reserved for local use + 244-254 Reserved for local use 255 Reserved for MISC_DYNAMIC_MINOR
11 char Raw keyboard device (Linux/SPARC only) diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 457e16f06e04..a1326a5bc2e0 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -378,6 +378,8 @@ Code Seq# Include File Comments mailto:thomas@winischhofer.net 0xF6 all LTTng Linux Trace Toolkit Next Generation mailto:mathieu.desnoyers@efficios.com +0xF7 00-1F uapi/linux/ntsync.h NT synchronization primitives + mailto:wine-devel@winehq.org 0xF8 all arch/x86/include/uapi/asm/amd_hsmp.h AMD HSMP EPYC system management interface driver mailto:nchatrad@amd.com 0xFD all linux/dm-ioctl.h diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 9424c6210e51..84b498e2b2d5 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -40,7 +40,7 @@ static const struct file_operations ntsync_fops = { };
static struct miscdevice ntsync_misc = { - .minor = MISC_DYNAMIC_MINOR, + .minor = NTSYNC_MINOR, .name = NTSYNC_NAME, .fops = &ntsync_fops, }; @@ -51,3 +51,4 @@ MODULE_AUTHOR("Elizabeth Figura"); MODULE_DESCRIPTION("Kernel driver for NT synchronization primitives"); MODULE_LICENSE("GPL"); MODULE_ALIAS("devname:" NTSYNC_NAME); +MODULE_ALIAS_MISCDEV(NTSYNC_MINOR); diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h index c0fea6ca5076..fe5d9366fdf7 100644 --- a/include/linux/miscdevice.h +++ b/include/linux/miscdevice.h @@ -71,6 +71,7 @@ #define USERIO_MINOR 240 #define VHOST_VSOCK_MINOR 241 #define RFKILL_MINOR 242 +#define NTSYNC_MINOR 243 #define MISC_DYNAMIC_MINOR 255
struct device;
These correspond to the NT syscalls NtCreateSemaphore() and NtClose(). Unlike those functions, however, these ioctls do not handle object names, or lookup of existing objects, or handle reference counting, but simply create the underlying primitive. The user space emulator is expected to implement those functions if they are required.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 117 ++++++++++++++++++++++++++++++++++++ include/uapi/linux/ntsync.h | 25 ++++++++ 2 files changed, 142 insertions(+) create mode 100644 include/uapi/linux/ntsync.h
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 84b498e2b2d5..3287b94be351 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -8,23 +8,140 @@ #include <linux/fs.h> #include <linux/miscdevice.h> #include <linux/module.h> +#include <linux/slab.h> +#include <linux/xarray.h> +#include <uapi/linux/ntsync.h>
#define NTSYNC_NAME "ntsync"
+enum ntsync_type { + NTSYNC_TYPE_SEM, +}; + +struct ntsync_obj { + struct rcu_head rhead; + struct kref refcount; + + enum ntsync_type type; + + union { + struct { + __u32 count; + __u32 max; + } sem; + } u; +}; + +struct ntsync_device { + struct xarray objects; +}; + +static void destroy_obj(struct kref *ref) +{ + struct ntsync_obj *obj = container_of(ref, struct ntsync_obj, refcount); + + kfree_rcu(obj, rhead); +} + +static void put_obj(struct ntsync_obj *obj) +{ + kref_put(&obj->refcount, destroy_obj); +} + static int ntsync_char_open(struct inode *inode, struct file *file) { + struct ntsync_device *dev; + + dev = kzalloc(sizeof(*dev), GFP_KERNEL); + if (!dev) + return -ENOMEM; + + xa_init_flags(&dev->objects, XA_FLAGS_ALLOC); + + file->private_data = dev; return nonseekable_open(inode, file); }
static int ntsync_char_release(struct inode *inode, struct file *file) { + struct ntsync_device *dev = file->private_data; + struct ntsync_obj *obj; + unsigned long id; + + xa_for_each(&dev->objects, id, obj) + put_obj(obj); + + xa_destroy(&dev->objects); + + kfree(dev); + + return 0; +} + +static void init_obj(struct ntsync_obj *obj) +{ + kref_init(&obj->refcount); +} + +static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_sem_args __user *user_args = argp; + struct ntsync_sem_args args; + struct ntsync_obj *sem; + __u32 id; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + if (args.count > args.max) + return -EINVAL; + + sem = kzalloc(sizeof(*sem), GFP_KERNEL); + if (!sem) + return -ENOMEM; + + init_obj(sem); + sem->type = NTSYNC_TYPE_SEM; + sem->u.sem.count = args.count; + sem->u.sem.max = args.max; + + ret = xa_alloc(&dev->objects, &id, sem, xa_limit_32b, GFP_KERNEL); + if (ret < 0) { + kfree(sem); + return ret; + } + + return put_user(id, &user_args->sem); +} + +static int ntsync_delete(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_obj *obj; + __u32 id; + + if (get_user(id, (__u32 __user *)argp)) + return -EFAULT; + + obj = xa_erase(&dev->objects, id); + if (!obj) + return -EINVAL; + + put_obj(obj); return 0; }
static long ntsync_char_ioctl(struct file *file, unsigned int cmd, unsigned long parm) { + struct ntsync_device *dev = file->private_data; + void __user *argp = (void __user *)parm; + switch (cmd) { + case NTSYNC_IOC_CREATE_SEM: + return ntsync_create_sem(dev, argp); + case NTSYNC_IOC_DELETE: + return ntsync_delete(dev, argp); default: return -ENOIOCTLCMD; } diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h new file mode 100644 index 000000000000..d97afc138dcc --- /dev/null +++ b/include/uapi/linux/ntsync.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Kernel support for NT synchronization primitive emulation + * + * Copyright (C) 2021-2022 Elizabeth Figura + */ + +#ifndef __LINUX_NTSYNC_H +#define __LINUX_NTSYNC_H + +#include <linux/types.h> + +struct ntsync_sem_args { + __u32 sem; + __u32 count; + __u32 max; +}; + +#define NTSYNC_IOC_BASE 0xf7 + +#define NTSYNC_IOC_CREATE_SEM _IOWR(NTSYNC_IOC_BASE, 0, \ + struct ntsync_sem_args) +#define NTSYNC_IOC_DELETE _IOW (NTSYNC_IOC_BASE, 1, __u32) + +#endif
This corresponds to the NT syscall NtReleaseSemaphore().
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 76 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/ntsync.h | 2 + 2 files changed, 78 insertions(+)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 3287b94be351..d1c91c2a4f1a 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -21,9 +21,11 @@ enum ntsync_type { struct ntsync_obj { struct rcu_head rhead; struct kref refcount; + spinlock_t lock;
enum ntsync_type type;
+ /* The following fields are protected by the object lock. */ union { struct { __u32 count; @@ -36,6 +38,19 @@ struct ntsync_device { struct xarray objects; };
+static struct ntsync_obj *get_obj(struct ntsync_device *dev, __u32 id) +{ + struct ntsync_obj *obj; + + rcu_read_lock(); + obj = xa_load(&dev->objects, id); + if (obj && !kref_get_unless_zero(&obj->refcount)) + obj = NULL; + rcu_read_unlock(); + + return obj; +} + static void destroy_obj(struct kref *ref) { struct ntsync_obj *obj = container_of(ref, struct ntsync_obj, refcount); @@ -48,6 +63,18 @@ static void put_obj(struct ntsync_obj *obj) kref_put(&obj->refcount, destroy_obj); }
+static struct ntsync_obj *get_obj_typed(struct ntsync_device *dev, __u32 id, + enum ntsync_type type) +{ + struct ntsync_obj *obj = get_obj(dev, id); + + if (obj && obj->type != type) { + put_obj(obj); + return NULL; + } + return obj; +} + static int ntsync_char_open(struct inode *inode, struct file *file) { struct ntsync_device *dev; @@ -81,6 +108,7 @@ static int ntsync_char_release(struct inode *inode, struct file *file) static void init_obj(struct ntsync_obj *obj) { kref_init(&obj->refcount); + spin_lock_init(&obj->lock); }
static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp) @@ -131,6 +159,52 @@ static int ntsync_delete(struct ntsync_device *dev, void __user *argp) return 0; }
+/* + * Actually change the semaphore state, returning -EOVERFLOW if it is made + * invalid. + */ +static int put_sem_state(struct ntsync_obj *sem, __u32 count) +{ + lockdep_assert_held(&sem->lock); + + if (sem->u.sem.count + count < sem->u.sem.count || + sem->u.sem.count + count > sem->u.sem.max) + return -EOVERFLOW; + + sem->u.sem.count += count; + return 0; +} + +static int ntsync_put_sem(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_sem_args __user *user_args = argp; + struct ntsync_sem_args args; + struct ntsync_obj *sem; + __u32 prev_count; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + sem = get_obj_typed(dev, args.sem, NTSYNC_TYPE_SEM); + if (!sem) + return -EINVAL; + + spin_lock(&sem->lock); + + prev_count = sem->u.sem.count; + ret = put_sem_state(sem, args.count); + + spin_unlock(&sem->lock); + + put_obj(sem); + + if (!ret && put_user(prev_count, &user_args->count)) + ret = -EFAULT; + + return ret; +} + static long ntsync_char_ioctl(struct file *file, unsigned int cmd, unsigned long parm) { @@ -142,6 +216,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, return ntsync_create_sem(dev, argp); case NTSYNC_IOC_DELETE: return ntsync_delete(dev, argp); + case NTSYNC_IOC_PUT_SEM: + return ntsync_put_sem(dev, argp); default: return -ENOIOCTLCMD; } diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index d97afc138dcc..8c610d65f8ef 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -21,5 +21,7 @@ struct ntsync_sem_args { #define NTSYNC_IOC_CREATE_SEM _IOWR(NTSYNC_IOC_BASE, 0, \ struct ntsync_sem_args) #define NTSYNC_IOC_DELETE _IOW (NTSYNC_IOC_BASE, 1, __u32) +#define NTSYNC_IOC_PUT_SEM _IOWR(NTSYNC_IOC_BASE, 2, \ + struct ntsync_sem_args)
#endif
This corresponds to part of the functionality of the NT syscall NtWaitForMultipleObjects(). Specifically, it implements the behaviour where the third argument (wait_any) is TRUE, and it does not handle alertable waits. Those features have been split out into separate patches to ease review.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 229 ++++++++++++++++++++++++++++++++++++ include/uapi/linux/ntsync.h | 13 ++ 2 files changed, 242 insertions(+)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index d1c91c2a4f1a..2e8d3c2d51a4 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -23,6 +23,8 @@ struct ntsync_obj { struct kref refcount; spinlock_t lock;
+ struct list_head any_waiters; + enum ntsync_type type;
/* The following fields are protected by the object lock. */ @@ -34,6 +36,28 @@ struct ntsync_obj { } u; };
+struct ntsync_q_entry { + struct list_head node; + struct ntsync_q *q; + struct ntsync_obj *obj; + __u32 index; +}; + +struct ntsync_q { + struct task_struct *task; + __u32 owner; + + /* + * Protected via atomic_cmpxchg(). Only the thread that wins the + * compare-and-swap may actually change object states and wake this + * task. + */ + atomic_t signaled; + + __u32 count; + struct ntsync_q_entry entries[]; +}; + struct ntsync_device { struct xarray objects; }; @@ -109,6 +133,26 @@ static void init_obj(struct ntsync_obj *obj) { kref_init(&obj->refcount); spin_lock_init(&obj->lock); + INIT_LIST_HEAD(&obj->any_waiters); +} + +static void try_wake_any_sem(struct ntsync_obj *sem) +{ + struct ntsync_q_entry *entry; + + lockdep_assert_held(&sem->lock); + + list_for_each_entry(entry, &sem->any_waiters, node) { + struct ntsync_q *q = entry->q; + + if (!sem->u.sem.count) + break; + + if (atomic_cmpxchg(&q->signaled, -1, entry->index) == -1) { + sem->u.sem.count--; + wake_up_process(q->task); + } + } }
static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp) @@ -194,6 +238,8 @@ static int ntsync_put_sem(struct ntsync_device *dev, void __user *argp)
prev_count = sem->u.sem.count; ret = put_sem_state(sem, args.count); + if (!ret) + try_wake_any_sem(sem);
spin_unlock(&sem->lock);
@@ -205,6 +251,187 @@ static int ntsync_put_sem(struct ntsync_device *dev, void __user *argp) return ret; }
+static int ntsync_schedule(const struct ntsync_q *q, ktime_t *timeout) +{ + int ret = 0; + + do { + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + + set_current_state(TASK_INTERRUPTIBLE); + if (atomic_read(&q->signaled) != -1) { + ret = 0; + break; + } + ret = schedule_hrtimeout(timeout, HRTIMER_MODE_ABS); + } while (ret < 0); + __set_current_state(TASK_RUNNING); + + return ret; +} + +/* + * Allocate and initialize the ntsync_q structure, but do not queue us yet. + * Also, calculate the relative timeout. + */ +static int setup_wait(struct ntsync_device *dev, + const struct ntsync_wait_args *args, + ktime_t *ret_timeout, struct ntsync_q **ret_q) +{ + const __u32 count = args->count; + struct ntsync_q *q; + ktime_t timeout = 0; + __u32 *ids; + __u32 i, j; + + if (!args->owner || args->pad) + return -EINVAL; + + if (args->count > NTSYNC_MAX_WAIT_COUNT) + return -EINVAL; + + if (args->timeout) { + struct timespec64 to; + + if (get_timespec64(&to, u64_to_user_ptr(args->timeout))) + return -EFAULT; + if (!timespec64_valid(&to)) + return -EINVAL; + + timeout = timespec64_to_ns(&to); + } + + ids = kmalloc_array(count, sizeof(*ids), GFP_KERNEL); + if (!ids) + return -ENOMEM; + if (copy_from_user(ids, u64_to_user_ptr(args->objs), + array_size(count, sizeof(*ids)))) { + kfree(ids); + return -EFAULT; + } + + q = kmalloc(struct_size(q, entries, count), GFP_KERNEL); + if (!q) { + kfree(ids); + return -ENOMEM; + } + q->task = current; + q->owner = args->owner; + atomic_set(&q->signaled, -1); + q->count = count; + + for (i = 0; i < count; i++) { + struct ntsync_q_entry *entry = &q->entries[i]; + struct ntsync_obj *obj = get_obj(dev, ids[i]); + + if (!obj) + goto err; + + entry->obj = obj; + entry->q = q; + entry->index = i; + } + + kfree(ids); + + *ret_q = q; + *ret_timeout = timeout; + return 0; + +err: + for (j = 0; j < i; j++) + put_obj(q->entries[j].obj); + kfree(ids); + kfree(q); + return -EINVAL; +} + +static void try_wake_any_obj(struct ntsync_obj *obj) +{ + switch (obj->type) { + case NTSYNC_TYPE_SEM: + try_wake_any_sem(obj); + break; + } +} + +static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_wait_args args; + struct ntsync_q *q; + ktime_t timeout; + int signaled; + __u32 i; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + ret = setup_wait(dev, &args, &timeout, &q); + if (ret < 0) + return ret; + + /* queue ourselves */ + + for (i = 0; i < args.count; i++) { + struct ntsync_q_entry *entry = &q->entries[i]; + struct ntsync_obj *obj = entry->obj; + + spin_lock(&obj->lock); + list_add_tail(&entry->node, &obj->any_waiters); + spin_unlock(&obj->lock); + } + + /* check if we are already signaled */ + + for (i = 0; i < args.count; i++) { + struct ntsync_obj *obj = q->entries[i].obj; + + if (atomic_read(&q->signaled) != -1) + break; + + spin_lock(&obj->lock); + try_wake_any_obj(obj); + spin_unlock(&obj->lock); + } + + /* sleep */ + + ret = ntsync_schedule(q, args.timeout ? &timeout : NULL); + + /* and finally, unqueue */ + + for (i = 0; i < args.count; i++) { + struct ntsync_q_entry *entry = &q->entries[i]; + struct ntsync_obj *obj = entry->obj; + + spin_lock(&obj->lock); + list_del(&entry->node); + spin_unlock(&obj->lock); + + put_obj(obj); + } + + signaled = atomic_read(&q->signaled); + if (signaled != -1) { + struct ntsync_wait_args __user *user_args = argp; + + /* even if we caught a signal, we need to communicate success */ + ret = 0; + + if (put_user(signaled, &user_args->index)) + ret = -EFAULT; + } else if (!ret) { + ret = -ETIMEDOUT; + } + + kfree(q); + return ret; +} + static long ntsync_char_ioctl(struct file *file, unsigned int cmd, unsigned long parm) { @@ -218,6 +445,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, return ntsync_delete(dev, argp); case NTSYNC_IOC_PUT_SEM: return ntsync_put_sem(dev, argp); + case NTSYNC_IOC_WAIT_ANY: + return ntsync_wait_any(dev, argp); default: return -ENOIOCTLCMD; } diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index 8c610d65f8ef..10f07da7864e 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -16,6 +16,17 @@ struct ntsync_sem_args { __u32 max; };
+struct ntsync_wait_args { + __u64 timeout; + __u64 objs; + __u32 count; + __u32 owner; + __u32 index; + __u32 pad; +}; + +#define NTSYNC_MAX_WAIT_COUNT 64 + #define NTSYNC_IOC_BASE 0xf7
#define NTSYNC_IOC_CREATE_SEM _IOWR(NTSYNC_IOC_BASE, 0, \ @@ -23,5 +34,7 @@ struct ntsync_sem_args { #define NTSYNC_IOC_DELETE _IOW (NTSYNC_IOC_BASE, 1, __u32) #define NTSYNC_IOC_PUT_SEM _IOWR(NTSYNC_IOC_BASE, 2, \ struct ntsync_sem_args) +#define NTSYNC_IOC_WAIT_ANY _IOWR(NTSYNC_IOC_BASE, 3, \ + struct ntsync_wait_args)
#endif
This corresponds to part of the functionality of the NT syscall NtWaitForMultipleObjects(). Specifically, it implements the behaviour where the third argument (wait_any) is FALSE, and it does not yet handle alertable waits.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 241 ++++++++++++++++++++++++++++++++++-- include/uapi/linux/ntsync.h | 2 + 2 files changed, 235 insertions(+), 8 deletions(-)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 2e8d3c2d51a4..2685363fae9e 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -23,7 +23,34 @@ struct ntsync_obj { struct kref refcount; spinlock_t lock;
+ /* + * any_waiters is protected by the object lock, but all_waiters is + * protected by the device wait_all_lock. + */ struct list_head any_waiters; + struct list_head all_waiters; + + /* + * Hint describing how many tasks are queued on this object in a + * wait-all operation. + * + * Any time we do a wake, we may need to wake "all" waiters as well as + * "any" waiters. In order to atomically wake "all" waiters, we must + * lock all of the objects, and that means grabbing the wait_all_lock + * below (and, due to lock ordering rules, before locking this object). + * However, wait-all is a rare operation, and grabbing the wait-all + * lock for every wake would create unnecessary contention. Therefore we + * first check whether all_hint is zero, and, if it is, we skip trying + * to wake "all" waiters. + * + * This hint isn't protected by any lock. It might change during the + * course of a wake, but there's no meaningful race there; it's only a + * hint. + * + * Since wait requests must originate from user-space threads, we're + * limited here by PID_MAX_LIMIT, so there's no risk of saturation. + */ + atomic_t all_hint;
enum ntsync_type type;
@@ -54,11 +81,25 @@ struct ntsync_q { */ atomic_t signaled;
+ bool all; __u32 count; struct ntsync_q_entry entries[]; };
struct ntsync_device { + /* + * Wait-all operations must atomically grab all objects, and be totally + * ordered with respect to each other and wait-any operations. If one + * thread is trying to acquire several objects, another thread cannot + * touch the object at the same time. + * + * We achieve this by grabbing multiple object locks at the same time. + * However, this creates a lock ordering problem. To solve that problem, + * wait_all_lock is taken first whenever multiple objects must be locked + * at the same time. + */ + spinlock_t wait_all_lock; + struct xarray objects; };
@@ -107,6 +148,8 @@ static int ntsync_char_open(struct inode *inode, struct file *file) if (!dev) return -ENOMEM;
+ spin_lock_init(&dev->wait_all_lock); + xa_init_flags(&dev->objects, XA_FLAGS_ALLOC);
file->private_data = dev; @@ -132,8 +175,81 @@ static int ntsync_char_release(struct inode *inode, struct file *file) static void init_obj(struct ntsync_obj *obj) { kref_init(&obj->refcount); + atomic_set(&obj->all_hint, 0); spin_lock_init(&obj->lock); INIT_LIST_HEAD(&obj->any_waiters); + INIT_LIST_HEAD(&obj->all_waiters); +} + +static bool is_signaled(struct ntsync_obj *obj, __u32 owner) +{ + lockdep_assert_held(&obj->lock); + + switch (obj->type) { + case NTSYNC_TYPE_SEM: + return !!obj->u.sem.count; + } + + WARN(1, "bad object type %#x\n", obj->type); + return false; +} + +/* + * "locked_obj" is an optional pointer to an object which is already locked and + * should not be locked again. This is necessary so that changing an object's + * state and waking it can be a single atomic operation. + */ +static void try_wake_all(struct ntsync_device *dev, struct ntsync_q *q, + struct ntsync_obj *locked_obj) +{ + __u32 count = q->count; + bool can_wake = true; + __u32 i; + + lockdep_assert_held(&dev->wait_all_lock); + if (locked_obj) + lockdep_assert_held(&locked_obj->lock); + + for (i = 0; i < count; i++) { + if (q->entries[i].obj != locked_obj) + spin_lock_nest_lock(&q->entries[i].obj->lock, &dev->wait_all_lock); + } + + for (i = 0; i < count; i++) { + if (!is_signaled(q->entries[i].obj, q->owner)) { + can_wake = false; + break; + } + } + + if (can_wake && atomic_cmpxchg(&q->signaled, -1, 0) == -1) { + for (i = 0; i < count; i++) { + struct ntsync_obj *obj = q->entries[i].obj; + + switch (obj->type) { + case NTSYNC_TYPE_SEM: + obj->u.sem.count--; + break; + } + } + wake_up_process(q->task); + } + + for (i = 0; i < count; i++) { + if (q->entries[i].obj != locked_obj) + spin_unlock(&q->entries[i].obj->lock); + } +} + +static void try_wake_all_obj(struct ntsync_device *dev, struct ntsync_obj *obj) +{ + struct ntsync_q_entry *entry; + + lockdep_assert_held(&dev->wait_all_lock); + lockdep_assert_held(&obj->lock); + + list_for_each_entry(entry, &obj->all_waiters, node) + try_wake_all(dev, entry->q, obj); }
static void try_wake_any_sem(struct ntsync_obj *sem) @@ -234,14 +350,29 @@ static int ntsync_put_sem(struct ntsync_device *dev, void __user *argp) if (!sem) return -EINVAL;
- spin_lock(&sem->lock); + if (atomic_read(&sem->all_hint) > 0) { + spin_lock(&dev->wait_all_lock); + spin_lock_nest_lock(&sem->lock, &dev->wait_all_lock);
- prev_count = sem->u.sem.count; - ret = put_sem_state(sem, args.count); - if (!ret) - try_wake_any_sem(sem); + prev_count = sem->u.sem.count; + ret = put_sem_state(sem, args.count); + if (!ret) { + try_wake_all_obj(dev, sem); + try_wake_any_sem(sem); + }
- spin_unlock(&sem->lock); + spin_unlock(&sem->lock); + spin_unlock(&dev->wait_all_lock); + } else { + spin_lock(&sem->lock); + + prev_count = sem->u.sem.count; + ret = put_sem_state(sem, args.count); + if (!ret) + try_wake_any_sem(sem); + + spin_unlock(&sem->lock); + }
put_obj(sem);
@@ -278,7 +409,7 @@ static int ntsync_schedule(const struct ntsync_q *q, ktime_t *timeout) * Also, calculate the relative timeout. */ static int setup_wait(struct ntsync_device *dev, - const struct ntsync_wait_args *args, + const struct ntsync_wait_args *args, bool all, ktime_t *ret_timeout, struct ntsync_q **ret_q) { const __u32 count = args->count; @@ -321,6 +452,7 @@ static int setup_wait(struct ntsync_device *dev, q->task = current; q->owner = args->owner; atomic_set(&q->signaled, -1); + q->all = all; q->count = count;
for (i = 0; i < count; i++) { @@ -330,6 +462,16 @@ static int setup_wait(struct ntsync_device *dev, if (!obj) goto err;
+ if (all) { + /* Check that the objects are all distinct. */ + for (j = 0; j < i; j++) { + if (obj == q->entries[j].obj) { + put_obj(obj); + goto err; + } + } + } + entry->obj = obj; entry->q = q; entry->index = i; @@ -370,7 +512,7 @@ static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp) if (copy_from_user(&args, argp, sizeof(args))) return -EFAULT;
- ret = setup_wait(dev, &args, &timeout, &q); + ret = setup_wait(dev, &args, false, &timeout, &q); if (ret < 0) return ret;
@@ -432,6 +574,87 @@ static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp) return ret; }
+static int ntsync_wait_all(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_wait_args args; + struct ntsync_q *q; + ktime_t timeout; + int signaled; + __u32 i; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + ret = setup_wait(dev, &args, true, &timeout, &q); + if (ret < 0) + return ret; + + /* queue ourselves */ + + spin_lock(&dev->wait_all_lock); + + for (i = 0; i < args.count; i++) { + struct ntsync_q_entry *entry = &q->entries[i]; + struct ntsync_obj *obj = entry->obj; + + atomic_inc(&obj->all_hint); + + /* + * obj->all_waiters is protected by dev->wait_all_lock rather + * than obj->lock, so there is no need to acquire it here. + */ + list_add_tail(&entry->node, &obj->all_waiters); + } + + /* check if we are already signaled */ + + try_wake_all(dev, q, NULL); + + spin_unlock(&dev->wait_all_lock); + + /* sleep */ + + ret = ntsync_schedule(q, args.timeout ? &timeout : NULL); + + /* and finally, unqueue */ + + spin_lock(&dev->wait_all_lock); + + for (i = 0; i < args.count; i++) { + struct ntsync_q_entry *entry = &q->entries[i]; + struct ntsync_obj *obj = entry->obj; + + /* + * obj->all_waiters is protected by dev->wait_all_lock rather + * than obj->lock, so there is no need to acquire it here. + */ + list_del(&entry->node); + + atomic_dec(&obj->all_hint); + + put_obj(obj); + } + + spin_unlock(&dev->wait_all_lock); + + signaled = atomic_read(&q->signaled); + if (signaled != -1) { + struct ntsync_wait_args __user *user_args = argp; + + /* even if we caught a signal, we need to communicate success */ + ret = 0; + + if (put_user(signaled, &user_args->index)) + ret = -EFAULT; + } else if (!ret) { + ret = -ETIMEDOUT; + } + + kfree(q); + return ret; +} + static long ntsync_char_ioctl(struct file *file, unsigned int cmd, unsigned long parm) { @@ -445,6 +668,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, return ntsync_delete(dev, argp); case NTSYNC_IOC_PUT_SEM: return ntsync_put_sem(dev, argp); + case NTSYNC_IOC_WAIT_ALL: + return ntsync_wait_all(dev, argp); case NTSYNC_IOC_WAIT_ANY: return ntsync_wait_any(dev, argp); default: diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index 10f07da7864e..a5bed5a39b21 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -36,5 +36,7 @@ struct ntsync_wait_args { struct ntsync_sem_args) #define NTSYNC_IOC_WAIT_ANY _IOWR(NTSYNC_IOC_BASE, 3, \ struct ntsync_wait_args) +#define NTSYNC_IOC_WAIT_ALL _IOWR(NTSYNC_IOC_BASE, 4, \ + struct ntsync_wait_args)
#endif
This corresponds to the NT syscall NtCreateMutant().
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 72 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/ntsync.h | 8 +++++ 2 files changed, 80 insertions(+)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 2685363fae9e..d48f2ef41341 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -16,6 +16,7 @@
enum ntsync_type { NTSYNC_TYPE_SEM, + NTSYNC_TYPE_MUTEX, };
struct ntsync_obj { @@ -60,6 +61,10 @@ struct ntsync_obj { __u32 count; __u32 max; } sem; + struct { + __u32 count; + __u32 owner; + } mutex; } u; };
@@ -188,6 +193,10 @@ static bool is_signaled(struct ntsync_obj *obj, __u32 owner) switch (obj->type) { case NTSYNC_TYPE_SEM: return !!obj->u.sem.count; + case NTSYNC_TYPE_MUTEX: + if (obj->u.mutex.owner && obj->u.mutex.owner != owner) + return false; + return obj->u.mutex.count < UINT_MAX; }
WARN(1, "bad object type %#x\n", obj->type); @@ -230,6 +239,10 @@ static void try_wake_all(struct ntsync_device *dev, struct ntsync_q *q, case NTSYNC_TYPE_SEM: obj->u.sem.count--; break; + case NTSYNC_TYPE_MUTEX: + obj->u.mutex.count++; + obj->u.mutex.owner = q->owner; + break; } } wake_up_process(q->task); @@ -271,6 +284,28 @@ static void try_wake_any_sem(struct ntsync_obj *sem) } }
+static void try_wake_any_mutex(struct ntsync_obj *mutex) +{ + struct ntsync_q_entry *entry; + + lockdep_assert_held(&mutex->lock); + + list_for_each_entry(entry, &mutex->any_waiters, node) { + struct ntsync_q *q = entry->q; + + if (mutex->u.mutex.count == UINT_MAX) + break; + if (mutex->u.mutex.owner && mutex->u.mutex.owner != q->owner) + continue; + + if (atomic_cmpxchg(&q->signaled, -1, entry->index) == -1) { + mutex->u.mutex.count++; + mutex->u.mutex.owner = q->owner; + wake_up_process(q->task); + } + } +} + static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp) { struct ntsync_sem_args __user *user_args = argp; @@ -303,6 +338,38 @@ static int ntsync_create_sem(struct ntsync_device *dev, void __user *argp) return put_user(id, &user_args->sem); }
+static int ntsync_create_mutex(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_mutex_args __user *user_args = argp; + struct ntsync_mutex_args args; + struct ntsync_obj *mutex; + __u32 id; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + + if (!args.owner != !args.count) + return -EINVAL; + + mutex = kzalloc(sizeof(*mutex), GFP_KERNEL); + if (!mutex) + return -ENOMEM; + + init_obj(mutex); + mutex->type = NTSYNC_TYPE_MUTEX; + mutex->u.mutex.count = args.count; + mutex->u.mutex.owner = args.owner; + + ret = xa_alloc(&dev->objects, &id, mutex, xa_limit_32b, GFP_KERNEL); + if (ret < 0) { + kfree(mutex); + return ret; + } + + return put_user(id, &user_args->mutex); +} + static int ntsync_delete(struct ntsync_device *dev, void __user *argp) { struct ntsync_obj *obj; @@ -497,6 +564,9 @@ static void try_wake_any_obj(struct ntsync_obj *obj) case NTSYNC_TYPE_SEM: try_wake_any_sem(obj); break; + case NTSYNC_TYPE_MUTEX: + try_wake_any_mutex(obj); + break; } }
@@ -662,6 +732,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, void __user *argp = (void __user *)parm;
switch (cmd) { + case NTSYNC_IOC_CREATE_MUTEX: + return ntsync_create_mutex(dev, argp); case NTSYNC_IOC_CREATE_SEM: return ntsync_create_sem(dev, argp); case NTSYNC_IOC_DELETE: diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index a5bed5a39b21..26d1b3d4847f 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -16,6 +16,12 @@ struct ntsync_sem_args { __u32 max; };
+struct ntsync_mutex_args { + __u32 mutex; + __u32 owner; + __u32 count; +}; + struct ntsync_wait_args { __u64 timeout; __u64 objs; @@ -38,5 +44,7 @@ struct ntsync_wait_args { struct ntsync_wait_args) #define NTSYNC_IOC_WAIT_ALL _IOWR(NTSYNC_IOC_BASE, 4, \ struct ntsync_wait_args) +#define NTSYNC_IOC_CREATE_MUTEX _IOWR(NTSYNC_IOC_BASE, 5, \ + struct ntsync_mutex_args)
#endif
This corresponds to the NT syscall NtReleaseMutant().
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 67 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/ntsync.h | 2 ++ 2 files changed, 69 insertions(+)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index d48f2ef41341..28f43768d1c3 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -449,6 +449,71 @@ static int ntsync_put_sem(struct ntsync_device *dev, void __user *argp) return ret; }
+/* + * Actually change the mutex state, returning -EPERM if not the owner. + */ +static int put_mutex_state(struct ntsync_obj *mutex, + const struct ntsync_mutex_args *args) +{ + lockdep_assert_held(&mutex->lock); + + if (mutex->u.mutex.owner != args->owner) + return -EPERM; + + if (!--mutex->u.mutex.count) + mutex->u.mutex.owner = 0; + return 0; +} + +static int ntsync_put_mutex(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_mutex_args __user *user_args = argp; + struct ntsync_mutex_args args; + struct ntsync_obj *mutex; + __u32 prev_count; + int ret; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + if (!args.owner) + return -EINVAL; + + mutex = get_obj_typed(dev, args.mutex, NTSYNC_TYPE_MUTEX); + if (!mutex) + return -EINVAL; + + if (atomic_read(&mutex->all_hint) > 0) { + spin_lock(&dev->wait_all_lock); + spin_lock_nest_lock(&mutex->lock, &dev->wait_all_lock); + + prev_count = mutex->u.mutex.count; + ret = put_mutex_state(mutex, &args); + if (!ret) { + try_wake_all_obj(dev, mutex); + try_wake_any_mutex(mutex); + } + + spin_unlock(&mutex->lock); + spin_unlock(&dev->wait_all_lock); + } else { + spin_lock(&mutex->lock); + + prev_count = mutex->u.mutex.count; + ret = put_mutex_state(mutex, &args); + if (!ret) + try_wake_any_mutex(mutex); + + spin_unlock(&mutex->lock); + } + + put_obj(mutex); + + if (!ret && put_user(prev_count, &user_args->count)) + ret = -EFAULT; + + return ret; +} + static int ntsync_schedule(const struct ntsync_q *q, ktime_t *timeout) { int ret = 0; @@ -738,6 +803,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, return ntsync_create_sem(dev, argp); case NTSYNC_IOC_DELETE: return ntsync_delete(dev, argp); + case NTSYNC_IOC_PUT_MUTEX: + return ntsync_put_mutex(dev, argp); case NTSYNC_IOC_PUT_SEM: return ntsync_put_sem(dev, argp); case NTSYNC_IOC_WAIT_ALL: diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index 26d1b3d4847f..2e44e7e77776 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -46,5 +46,7 @@ struct ntsync_wait_args { struct ntsync_wait_args) #define NTSYNC_IOC_CREATE_MUTEX _IOWR(NTSYNC_IOC_BASE, 5, \ struct ntsync_mutex_args) +#define NTSYNC_IOC_PUT_MUTEX _IOWR(NTSYNC_IOC_BASE, 6, \ + struct ntsync_mutex_args)
#endif
This does not correspond to any NT syscall, but rather should be called by the user-space NT emulator when a thread dies. It is responsible for marking any mutexes owned by that thread as abandoned.
Signed-off-by: Elizabeth Figura zfigura@codeweavers.com --- drivers/misc/ntsync.c | 80 ++++++++++++++++++++++++++++++++++++- include/uapi/linux/ntsync.h | 1 + 2 files changed, 79 insertions(+), 2 deletions(-)
diff --git a/drivers/misc/ntsync.c b/drivers/misc/ntsync.c index 28f43768d1c3..1173c750c106 100644 --- a/drivers/misc/ntsync.c +++ b/drivers/misc/ntsync.c @@ -64,6 +64,7 @@ struct ntsync_obj { struct { __u32 count; __u32 owner; + bool ownerdead; } mutex; } u; }; @@ -87,6 +88,7 @@ struct ntsync_q { atomic_t signaled;
bool all; + bool ownerdead; __u32 count; struct ntsync_q_entry entries[]; }; @@ -240,6 +242,9 @@ static void try_wake_all(struct ntsync_device *dev, struct ntsync_q *q, obj->u.sem.count--; break; case NTSYNC_TYPE_MUTEX: + if (obj->u.mutex.ownerdead) + q->ownerdead = true; + obj->u.mutex.ownerdead = false; obj->u.mutex.count++; obj->u.mutex.owner = q->owner; break; @@ -299,6 +304,9 @@ static void try_wake_any_mutex(struct ntsync_obj *mutex) continue;
if (atomic_cmpxchg(&q->signaled, -1, entry->index) == -1) { + if (mutex->u.mutex.ownerdead) + q->ownerdead = true; + mutex->u.mutex.ownerdead = false; mutex->u.mutex.count++; mutex->u.mutex.owner = q->owner; wake_up_process(q->task); @@ -514,6 +522,71 @@ static int ntsync_put_mutex(struct ntsync_device *dev, void __user *argp) return ret; }
+/* + * Actually change the mutex state to mark its owner as dead. + */ +static void put_mutex_ownerdead_state(struct ntsync_obj *mutex) +{ + lockdep_assert_held(&mutex->lock); + + mutex->u.mutex.ownerdead = true; + mutex->u.mutex.owner = 0; + mutex->u.mutex.count = 0; +} + +static int ntsync_kill_owner(struct ntsync_device *dev, void __user *argp) +{ + struct ntsync_obj *obj; + unsigned long id; + __u32 owner; + + if (get_user(owner, (__u32 __user *)argp)) + return -EFAULT; + if (!owner) + return -EINVAL; + + rcu_read_lock(); + + xa_for_each(&dev->objects, id, obj) { + if (!kref_get_unless_zero(&obj->refcount)) + continue; + + if (obj->type != NTSYNC_TYPE_MUTEX) { + put_obj(obj); + continue; + } + + if (atomic_read(&obj->all_hint) > 0) { + spin_lock(&dev->wait_all_lock); + spin_lock_nest_lock(&obj->lock, &dev->wait_all_lock); + + if (obj->u.mutex.owner == owner) { + put_mutex_ownerdead_state(obj); + try_wake_all_obj(dev, obj); + try_wake_any_mutex(obj); + } + + spin_unlock(&obj->lock); + spin_unlock(&dev->wait_all_lock); + } else { + spin_lock(&obj->lock); + + if (obj->u.mutex.owner == owner) { + put_mutex_ownerdead_state(obj); + try_wake_any_mutex(obj); + } + + spin_unlock(&obj->lock); + } + + put_obj(obj); + } + + rcu_read_unlock(); + + return 0; +} + static int ntsync_schedule(const struct ntsync_q *q, ktime_t *timeout) { int ret = 0; @@ -585,6 +658,7 @@ static int setup_wait(struct ntsync_device *dev, q->owner = args->owner; atomic_set(&q->signaled, -1); q->all = all; + q->ownerdead = false; q->count = count;
for (i = 0; i < count; i++) { @@ -697,7 +771,7 @@ static int ntsync_wait_any(struct ntsync_device *dev, void __user *argp) struct ntsync_wait_args __user *user_args = argp;
/* even if we caught a signal, we need to communicate success */ - ret = 0; + ret = q->ownerdead ? -EOWNERDEAD : 0;
if (put_user(signaled, &user_args->index)) ret = -EFAULT; @@ -778,7 +852,7 @@ static int ntsync_wait_all(struct ntsync_device *dev, void __user *argp) struct ntsync_wait_args __user *user_args = argp;
/* even if we caught a signal, we need to communicate success */ - ret = 0; + ret = q->ownerdead ? -EOWNERDEAD : 0;
if (put_user(signaled, &user_args->index)) ret = -EFAULT; @@ -803,6 +877,8 @@ static long ntsync_char_ioctl(struct file *file, unsigned int cmd, return ntsync_create_sem(dev, argp); case NTSYNC_IOC_DELETE: return ntsync_delete(dev, argp); + case NTSYNC_IOC_KILL_OWNER: + return ntsync_kill_owner(dev, argp); case NTSYNC_IOC_PUT_MUTEX: return ntsync_put_mutex(dev, argp); case NTSYNC_IOC_PUT_SEM: diff --git a/include/uapi/linux/ntsync.h b/include/uapi/linux/ntsync.h index 2e44e7e77776..fec9a3993322 100644 --- a/include/uapi/linux/ntsync.h +++ b/include/uapi/linux/ntsync.h @@ -48,5 +48,6 @@ struct ntsync_wait_args { struct ntsync_mutex_args) #define NTSYNC_IOC_PUT_MUTEX _IOWR(NTSYNC_IOC_BASE, 6, \ struct ntsync_mutex_args) +#define NTSYNC_IOC_KILL_OWNER _IOW (NTSYNC_IOC_BASE, 7, __u32)
#endif