KeCancelTimer (which is called via KeSetTimerEx) would block indefinitely in the DPC, while waiting for the timer to complete. Which never happens, since the internal thread pool implementation never gets the chance to wake up the "finished_event" condition variable on the DPC exit in tp_object_execute().
From: Ivo Ivanov logos128@gmail.com
--- dlls/ntdll/threadpool.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/dlls/ntdll/threadpool.c b/dlls/ntdll/threadpool.c index 99525f831e1..0d88ebd34a0 100644 --- a/dlls/ntdll/threadpool.c +++ b/dlls/ntdll/threadpool.c @@ -2975,6 +2975,9 @@ VOID WINAPI TpSetTimer( TP_TIMER *timer, LARGE_INTEGER *timeout, LONG period, LO
TRACE( "%p %p %lu %lu\n", timer, timeout, period, window_length );
+ /* If the timer has any pending callbacks, cancel them first. */ + tp_object_cancel( this ); + RtlEnterCriticalSection( &timerqueue.cs );
assert( this->u.timer.timer_initialized );
From: Ivo Ivanov logos128@gmail.com
Fixes NaturalPoint's TrackIR5 app hanging indefinitely on exit, while waiting synchronously on an IRP to complete. The blocking happens in the app's npusbio_x64 driver, while it tries to reset a timer in its DPC. Probably fixes other drivers/apps in such situations. --- dlls/ntoskrnl.exe/sync.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/dlls/ntoskrnl.exe/sync.c b/dlls/ntoskrnl.exe/sync.c index d9b5726b920..0b101158eb6 100644 --- a/dlls/ntoskrnl.exe/sync.c +++ b/dlls/ntoskrnl.exe/sync.c @@ -493,10 +493,14 @@ BOOLEAN WINAPI KeSetTimerEx( KTIMER *timer, LARGE_INTEGER duetime, LONG period,
EnterCriticalSection( &sync_cs );
- if ((ret = timer->Header.Inserted)) - KeCancelTimer(timer); - + ret = timer->Header.Inserted; timer->Header.Inserted = TRUE; + timer->Header.SignalState = FALSE; + if (timer->Header.WaitListHead.Blink && !*((ULONG_PTR *)&timer->Header.WaitListHead.Flink)) + { + CloseHandle(timer->Header.WaitListHead.Blink); + timer->Header.WaitListHead.Blink = NULL; + }
if (!timer->TimerListEntry.Blink) timer->TimerListEntry.Blink = (void *)CreateThreadpoolTimer(ke_timer_complete_proc, timer, NULL);
From: Ivo Ivanov logos128@gmail.com
--- dlls/ntoskrnl.exe/tests/driver.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/dlls/ntoskrnl.exe/tests/driver.c b/dlls/ntoskrnl.exe/tests/driver.c index d293889c823..dfdca2e24ca 100644 --- a/dlls/ntoskrnl.exe/tests/driver.c +++ b/dlls/ntoskrnl.exe/tests/driver.c @@ -578,6 +578,14 @@ static void WINAPI test_sync_dpc(KDPC *dpc, void *context, void *system_argument c->called = TRUE; }
+static void WINAPI test_sync_dpc_2(KDPC *dpc, void *context, void *system_argument1, void *system_argument2) +{ + KTIMER *timer = context; + LARGE_INTEGER timeout = {.QuadPart = -20 * 10000}; + + KeSetTimerEx(timer, timeout, 0, dpc); +} + static void test_sync(void) { static const ULONG wine_tag = 0x454e4957; /* WINE */ @@ -591,7 +599,7 @@ static void test_sync(void) void *objs[2]; KTIMER timer; NTSTATUS ret; - KDPC dpc; + KDPC dpc, dpc_2; int i;
KeInitializeEvent(&manual_event, NotificationEvent, FALSE); @@ -925,6 +933,17 @@ static void test_sync(void) ret = wait_single(&timer, -40 * 10000); ok(ret == WAIT_TIMEOUT, "got %#lx\n", ret);
+ KeCancelTimer(&timer); + /* Test re-setting timer in dpc */ + KeInitializeDpc(&dpc_2, test_sync_dpc_2, &timer); + KeSetTimerEx(&timer, timeout, 0, &dpc_2); + ret = wait_single(&timer, -40 * 10000); + ok(ret == 0, "got %#lx\n", ret); + ret = wait_single(&timer, -40 * 10000); + ok(ret == 0, "got %#lx\n", ret); + ret = wait_single(&timer, -40 * 10000); + ok(ret == 0, "got %#lx\n", ret); + KeCancelTimer(&timer); /* Test reinitializing timer. */ KeSetTimerEx(&timer, timeout, 0, &dpc);
Hi,
It looks like your patch introduced the new failures shown below. Please investigate and fix them before resubmitting your patch. If they are not new, fixing them anyway would help a lot. Otherwise please ask for the known failures list to be updated.
The tests also ran into some preexisting test failures. If you know how to fix them that would be helpful. See the TestBot job for the details:
The full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=126832
Your paranoid android.
=== w8 (32 bit report) ===
ntoskrnl.exe: ntoskrnl.c:1218: Test failed: Got unexpected ret -1. ntoskrnl.c:1219: Test failed: Received unexpected data.
Zebediah Figura (@zfigura) commented about dlls/ntdll/threadpool.c:
TRACE( "%p %p %lu %lu\n", timer, timeout, period, window_length );
- /* If the timer has any pending callbacks, cancel them first. */
- tp_object_cancel( this );
Why do we need this?
Zebediah Figura (@zfigura) commented about dlls/ntoskrnl.exe/sync.c:
EnterCriticalSection( &sync_cs );
- if ((ret = timer->Header.Inserted))
KeCancelTimer(timer);
ret = timer->Header.Inserted; timer->Header.Inserted = TRUE;
timer->Header.SignalState = FALSE;
if (timer->Header.WaitListHead.Blink && !*((ULONG_PTR *)&timer->Header.WaitListHead.Flink))
{
CloseHandle(timer->Header.WaitListHead.Blink);
timer->Header.WaitListHead.Blink = NULL;
}
if (!timer->TimerListEntry.Blink) timer->TimerListEntry.Blink = (void *)CreateThreadpoolTimer(ke_timer_complete_proc, timer, NULL);
I'm a bit concerned about this change, because my assumption—although I never tried to validate it—is that KeSetTimerEx() should probably act as a fence, i.e. it shouldn't be possible for an earlier callback to execute simultaneously or after it's returned. In this case that could result in a non-periodic DPC executing twice, which seems dangerous.
On Mon Nov 28 21:55:48 2022 +0000, Zebediah Figura wrote:
I'm a bit concerned about this change, because my assumption—although I never tried to validate it—is that KeSetTimerEx() should probably act as a fence, i.e. it shouldn't be possible for an earlier callback to execute simultaneously or after it's returned. In this case that could result in a non-periodic DPC executing twice, which seems dangerous.
Note that 1/3 won't actually help you, because we could already be in ke_timer_complete_proc() but not executing the callback yet.
On Mon Nov 28 21:35:09 2022 +0000, Zebediah Figura wrote:
Why do we need this?
This cancels the thread pool timer on reset, so it won't call any pending callbacks. Otherwise it only removes the pending timers, which is not enough to stop the pending callbacks. This also allows us to remove the blocking KeCancelTimer() call from KeSetTimerEx(), which uses the thread pool implementation.
On Mon Nov 28 21:57:01 2022 +0000, Zebediah Figura wrote:
Note that 1/3 won't actually help you, because we could already be in ke_timer_complete_proc() but not executing the callback yet.
Yes, that won't be enough to stop the pending callbacks. The change in TpSetTimer handles this on thread pool level. If there is ke_timer_complete_proc() running at this time, it will complete in parallel. I have to dig the MS documentation again, but I'm pretty sure they don't guarantee immediate cancellation of pending DPCs after canceling an object. Since the DPCs are queued and run on another IRQ level, I assume they can't be dequeued and stopped easily in time.
On Tue Nov 29 05:00:35 2022 +0000, Ivo Ivanov wrote:
This cancels the thread pool timer on reset, so it won't call any pending callbacks. Otherwise it only removes the pending timers, which is not enough to stop the pending callbacks. This also allows us to remove the blocking KeCancelTimer() call from KeSetTimerEx(), which uses the thread pool implementation.
If callbacks are pending at this point, then they've been triggered and are just waiting for thread space to be executed. On the other hand, this isn't going to stop callbacks that are already in process. So this doesn't guarantee anything, and in light of that I don't see why cancelling queued callbacks is necessary or even helpful.
On Tue Nov 29 06:05:51 2022 +0000, Ivo Ivanov wrote:
Yes, that won't be enough to stop the pending callbacks. The change in TpSetTimer handles this on thread pool level. If there is ke_timer_complete_proc() running at this time, it will complete in parallel. I have to dig the MS documentation again, but I'm pretty sure they don't guarantee immediate cancellation of pending DPCs after canceling an object. Since the DPCs are queued and run on another IRQ level, I assume they can't be dequeued and stopped easily in time.
That sounds believable, although it'd be nice to be sure. More concerning to me, though, is that currently we can execute the "wrong" DPC if the callback was triggered during/after a concurrent KeSetTimer().
If callbacks are pending at this point, then they've been triggered and are just waiting for thread space to be executed.
The purpose of the call here is exactly that - to cancel any pending callbacks that are about to be executed.
On the other hand, this isn't going to stop callbacks that are already in process.
Yes. According to the MS documentation for SetThreadpoolTimer(): _"In some cases, callback functions might run after an application closes the threadpool timer."_ (Remarks section) https://learn.microsoft.com/en-us/windows/win32/api/threadpoolapiset/nf-thre...
Although they seem to imply that this includes also the queued callbacks, not just those in the execute phase: (pftDueTime parameter) _"If this parameter is NULL, the timer object will cease to queue new callbacks (but callbacks already queued will still occur)."_
And also in the Remarks section of CloseThreadpoolTimer(): _"The timer object is freed immediately if there are no outstanding callbacks; otherwise, the timer object is freed asynchronously after the outstanding callback functions complete."_ https://learn.microsoft.com/en-us/windows/win32/api/threadpoolapiset/nf-thre...
Since the tp_object_cancel() removes pending callbacks, it can probably be removed. It won't make much of a difference though, and I think it makes more sense to cancel the pending callbacks there.
On Tue Nov 29 17:01:12 2022 +0000, Zebediah Figura wrote:
That sounds believable, although it'd be nice to be sure. More concerning to me, though, is that currently we can execute the "wrong" DPC if the callback was triggered during/after a concurrent KeSetTimer().
Here is a quote from the MS documentation: _"However, if the timer is about to expire when KeCancelTimer is called, expiration might occur before KeCancelTimer has a chance to access the time queue, in which case signaling and DPC queuing will occur."_ https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/using-a-cu...
And also: _"If a second driver routine calls KeSetTimer or KeSetTimerEx to run the same CustomTimerDpc routine before the interval specified by the first caller expires, the CustomTimerDpc routine is run only after the interval specified by the second caller expires. In these circumstances, the CustomTimerDpc does none of the work for which the first routine called KeSetTimer or KeSetTimerEx."_
In general they warn of many race conditions that can occur when using DPCs with kernel timers, and that it is responsibility of the developer to avoid them.
On Wed Nov 30 15:37:41 2022 +0000, Ivo Ivanov wrote:
If callbacks are pending at this point, then they've been triggered
and are just waiting for thread space to be executed. The purpose of the call here is exactly that - to cancel any pending callbacks that are about to be executed.
On the other hand, this isn't going to stop callbacks that are already
in process. Yes. According to the MS documentation for SetThreadpoolTimer(): _"In some cases, callback functions might run after an application closes the threadpool timer."_ (Remarks section) https://learn.microsoft.com/en-us/windows/win32/api/threadpoolapiset/nf-thre... Although they seem to imply that this includes also the queued callbacks, not just those in the execute phase: (pftDueTime parameter) _"If this parameter is NULL, the timer object will cease to queue new callbacks (but callbacks already queued will still occur)."_ And also in the Remarks section of CloseThreadpoolTimer(): _"The timer object is freed immediately if there are no outstanding callbacks; otherwise, the timer object is freed asynchronously after the outstanding callback functions complete."_ https://learn.microsoft.com/en-us/windows/win32/api/threadpoolapiset/nf-thre... Since the tp_object_cancel() removes pending callbacks, it can probably be removed. It won't make much of a difference though, and I think it makes more sense to cancel the pending callbacks there.
What I'm trying to say is I don't see why cancelling pending callbacks is going to *help* anything. In general I can see two reasons, in a situation like this, to cancel or flush some queued event:
* in order to avoid letting that event execute after the cancel
* in order to prevent anything from taking extra time waiting for those queued events to be processed
Neither of these seem to apply here. The former can't apply because the race is still present, as has been described. The latter doesn't seem like it should apply either, since there's nothing actually waiting for those callbacks to execute.
On Wed Nov 30 16:12:36 2022 +0000, Ivo Ivanov wrote:
Here is a quote from the MS documentation: _"However, if the timer is about to expire when KeCancelTimer is called, expiration might occur before KeCancelTimer has a chance to access the time queue, in which case signaling and DPC queuing will occur."_ https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/using-a-cu... And also: _"If a second driver routine calls KeSetTimer or KeSetTimerEx to run the same CustomTimerDpc routine before the interval specified by the first caller expires, the CustomTimerDpc routine is run only after the interval specified by the second caller expires. In these circumstances, the CustomTimerDpc does none of the work for which the first routine called KeSetTimer or KeSetTimerEx."_ In general they warn of many race conditions that can occur when using DPCs with kernel timers, and that it is responsibility of the developer to avoid them.
Sure, but those are specific races, and driver developers might expect and be able to handle them. On the other hand, I could easily see a driver crashing if a DPC that it thought was only going to execute at most once ends up executing twice. Unless Windows actually does that, we should probably try to avoid it. Which shouldn't be too difficult, I think.
(Note also that the Microsoft documentation can often be inaccurate, or incomplete. Programs may also depend on guarantees weaker or stronger than the documentation gives.)
On Wed Nov 30 16:59:15 2022 +0000, Zebediah Figura wrote:
What I'm trying to say is I don't see why cancelling pending callbacks is going to *help* anything. In general I can see two reasons, in a situation like this, to cancel or flush some queued event:
- in order to avoid letting that event execute after the cancel
- in order to prevent anything from taking extra time waiting for those
queued events to be processed Neither of these seem to apply here. The former can't apply because the race is still present, as has been described. The latter doesn't seem like it should apply either, since there's nothing actually waiting for those callbacks to execute.
I'm trying to eliminate those callbacks that are still in the submission phase (num_pending_callbacks). Those that have entered the execution phase (num_associated_callbacks) in tp_object_execute() cannot be stopped and will complete. Anyway, more than one pending callback per timer would be extremely rare, so no problem reverting that change.
On the other hand, I could easily see a driver crashing if a DPC that it thought was only going to execute at most once ends up executing twice.
Yeah, that could be a problem. Callbacks can finish executing later than timer reset/cancel, as shown by the test, so at least that is expected behavior.