Commit 15483b1a126 (server: Allow calling async_handoff() with status code STATUS_ALERTED., 2022-02-10) introduced the set_async_direct_result handler which calls async_set_initial_status().
However, the async_set_initial_status() call does nothing since async->terminated is set, leaving the async in a confusing state (unknown_status = 1 but pending/completed).
So far, this issue is unlikely to have been a problem in practice for the following reasons:
1. async_set_initial_status() would have unset unknown_status, but it remains set instead. This is usually not a problem, since unknown_status is usually ever read by code paths effectively unreachable for non-device (e.g. socket) asyncs.
It would still potentially allow set_async_direct_result to be called multiple times, but it wouldn't actually happen in practice unless something goes wrong.
2. async_set_initial_status() would have set initial_status; however, it is left with the default value STATUS_PENDING. If the actual status is something other than that, the handler closes the wait handle and async_satisfied (the only realconsumer of initial_status) would never be called anyway.
For reasons above, this issue is not effectively observable or testable. Nonetheless, the current code does leave the async object in an inconsistent state.
Fix this by removing the !async->terminated check in async_set_initial_status().
Signed-off-by: Jinoh Kang jinoh.kang.kr@gmail.com ---
Notes: v1 -> v2: no changes v2 -> v3: no changes v3 -> v4: reuse async_set_initial_status() per feedback
server/async.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/server/async.c b/server/async.c index 7d0ff10f7a4..e92d4583c1e 100644 --- a/server/async.c +++ b/server/async.c @@ -303,11 +303,8 @@ struct async *create_async( struct fd *fd, struct thread *thread, const async_da void async_set_initial_status( struct async *async, unsigned int status ) { assert( async->unknown_status ); - if (!async->terminated) - { - async->initial_status = status; - async->unknown_status = 0; - } + async->initial_status = status; + async->unknown_status = 0; }
void set_async_pending( struct async *async )
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
(And as mentioned, we should probably be calling async_set_initial_status() in async_set_result, in which case this could be triggered much less pathologically by the controlling process dying.)
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Further investigation revealed some other bugs, which is tangentially related:
1. An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored. 2. There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns. Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs). - Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
(And as mentioned, we should probably be calling async_set_initial_status() in async_set_result, in which case this could be triggered much less pathologically by the controlling process dying.)
That has been split off as patch 2/8.
On Sat, Mar 5, 2022, 4:05 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a
rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result
to report back the status of initial I/O. In this case, the timeout is ignored. 2. There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns. Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs.
To make it clear, it should be "restartable" vs. "non-restartable" asyncs. See below...
This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device
async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
Actually it's possible for devive asyncs to be in alerted state, but note that it is *not* supposed to be restartable.
- Timeout and cancellation works as expected: set_async_direct_result will
treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
(And as mentioned, we should probably be calling
async_set_initial_status() in async_set_result, in which case this could be triggered much less pathologically by the controlling process dying.)
That has been split off as patch 2/8.
-- Sincerely, Jinoh Kang
On Sat, Mar 5, 2022, 4:47 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
On Sat, Mar 5, 2022, 4:05 AM Jinoh Kang jinoh.kang.kr@gmail.com wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a
rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result
to report back the status of initial I/O. In this case, the timeout is ignored. 2. There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns. Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs.
To make it clear, it should be "restartable" vs. "non-restartable" asyncs. See below...
This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device
async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
Actually it's possible for devive asyncs to be in alerted state, but note that it is *not* supposed to be restartable.
On the second thought, the output copy APC call gets fake STATUS_ALERTED, so I think it doesn't apply here.
- Timeout and cancellation works as expected: set_async_direct_result will
treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
(And as mentioned, we should probably be calling
async_set_initial_status() in async_set_result, in which case this could be triggered much less pathologically by the controlling process dying.)
That has been split off as patch 2/8.
-- Sincerely, Jinoh Kang
On 3/4/22 13:04, Jinoh Kang wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
Okay, I don't think that's possible for a device async. I'm still more comfortable removing the assert in this patch, though; I'm not sure that there isn't a way to wrangle a server crash...
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Sure, but the client is basically considered free to sabotage itself. We just can't let it crash the server, or mess with an object that it doesn't have the right access bits to.
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored.
I don't think this is a problem /a priori/, though. In fact, I'm somewhat concerned that if we let synchronous I/O time out before it even had a chance to start, we would break applications that expect it to work.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns. Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
Yes, this has been a bug for a while. I don't remember if this was a problem before the async rework in 7.0. We probably need to "buffer" the cancellation request somehow, and process it once the async is restarted. (As above, I don't think we want to cancel the async if the client is going to report success to us.)
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
- Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
Maybe I'm missing something, but isn't that what the "alerted" field is for?
On 3/5/22 09:13, Zebediah Figura wrote:
On 3/4/22 13:04, Jinoh Kang wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
Okay, I don't think that's possible for a device async. I'm still more comfortable removing the assert in this patch, though; I'm not sure that there isn't a way to wrangle a server crash...
Understood.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Sure, but the client is basically considered free to sabotage itself.
My concern was that it might interfere with the driver's assumptions; looks like it shouldn't happen, though.
We just can't let it crash the server, or mess with an object that it doesn't have the right access bits to.
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored.
I don't think this is a problem /a priori/, though.
The problem is that a timeout ignored *once* is ignored *forever* throughout the async's lifetime. We don't restore the timeout at all once the async enters the pending state.
An example scenario is as follows:
1. Client calls recv_socket. 2. Server sets async timeout and returns. 3. The timeout is triggered, but the async is already in terminated state; thus, it is ignored. 4. Client calls set_async_direct_result with status = STATUS_PENDING. 5. Now, the async will potentially linger forever now that the timeout has already been expired.
In fact, I'm somewhat concerned that if we let synchronous I/O time out before it even had a chance to start, we would break applications that expect it to work.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns.
Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
Yes, this has been a bug for a while. I don't remember if this was a problem before the async rework in 7.0. We probably need to "buffer" the cancellation request somehow, and process it once the async is restarted. (As above, I don't think we want to cancel the async if the client is going to report success to us.)
Sounds like a great idea. Not sure which one is the best way to implement it, though. A few options come to my mind:
- Allow async_terminate() to override status if async->alerted is set, even if it is already terminated. - Add a separate status field for buffering cancellation requests.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
- Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
Maybe I'm missing something, but isn't that what the "alerted" field is for?
You're right. Sorry for missing that. Also, the cancellation buffering approach you suggested makes this unnecessary anyway.
On 3/5/22 01:52, Jinoh Kang wrote:
On 3/5/22 09:13, Zebediah Figura wrote:
On 3/4/22 13:04, Jinoh Kang wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
Okay, I don't think that's possible for a device async. I'm still more comfortable removing the assert in this patch, though; I'm not sure that there isn't a way to wrangle a server crash...
Understood.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Sure, but the client is basically considered free to sabotage itself.
My concern was that it might interfere with the driver's assumptions; looks like it shouldn't happen, though.
We just can't let it crash the server, or mess with an object that it doesn't have the right access bits to.
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored.
I don't think this is a problem /a priori/, though.
The problem is that a timeout ignored *once* is ignored *forever* throughout the async's lifetime. We don't restore the timeout at all once the async enters the pending state.
An example scenario is as follows:
- Client calls recv_socket.
- Server sets async timeout and returns.
- The timeout is triggered, but the async is already in terminated state; thus, it is ignored.
- Client calls set_async_direct_result with status = STATUS_PENDING.
- Now, the async will potentially linger forever now that the timeout has already been expired.
Ah, right. Unless I'm mistaken that's also a preëxisting problem.
In fact, I'm somewhat concerned that if we let synchronous I/O time out before it even had a chance to start, we would break applications that expect it to work.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns.
Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
Yes, this has been a bug for a while. I don't remember if this was a problem before the async rework in 7.0. We probably need to "buffer" the cancellation request somehow, and process it once the async is restarted. (As above, I don't think we want to cancel the async if the client is going to report success to us.)
Sounds like a great idea. Not sure which one is the best way to implement it, though. A few options come to my mind:
- Allow async_terminate() to override status if async->alerted is set, even if it is already terminated.
We used to have this (before a5b6e90d48e), but I don't think this actually helps. If the client restarts we're back where we started. We could account for it there but I don't much like the added complexity.
- Add a separate status field for buffering cancellation requests.
This seems more desirable to me. We'll need to do the same thing with timeouts.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
- Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
Maybe I'm missing something, but isn't that what the "alerted" field is for?
You're right. Sorry for missing that. Also, the cancellation buffering approach you suggested makes this unnecessary anyway.
On 3/8/22 03:11, Zebediah Figura wrote:
On 3/5/22 01:52, Jinoh Kang wrote:
On 3/5/22 09:13, Zebediah Figura wrote:
On 3/4/22 13:04, Jinoh Kang wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
Okay, I don't think that's possible for a device async. I'm still more comfortable removing the assert in this patch, though; I'm not sure that there isn't a way to wrangle a server crash...
Understood.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Sure, but the client is basically considered free to sabotage itself.
My concern was that it might interfere with the driver's assumptions; looks like it shouldn't happen, though.
We just can't let it crash the server, or mess with an object that it doesn't have the right access bits to.
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored.
I don't think this is a problem /a priori/, though.
The problem is that a timeout ignored *once* is ignored *forever* throughout the async's lifetime. We don't restore the timeout at all once the async enters the pending state.
An example scenario is as follows:
- Client calls recv_socket.
- Server sets async timeout and returns.
- The timeout is triggered, but the async is already in terminated state; thus, it is ignored.
- Client calls set_async_direct_result with status = STATUS_PENDING.
- Now, the async will potentially linger forever now that the timeout has already been expired.
Ah, right. Unless I'm mistaken that's also a preëxisting problem.
Alright. I'm fixing this in separate patch series, then.
In fact, I'm somewhat concerned that if we let synchronous I/O time out before it even had a chance to start, we would break applications that expect it to work.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns.
Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
Yes, this has been a bug for a while. I don't remember if this was a problem before the async rework in 7.0. We probably need to "buffer" the cancellation request somehow, and process it once the async is restarted. (As above, I don't think we want to cancel the async if the client is going to report success to us.)
Sounds like a great idea. Not sure which one is the best way to implement it, though. A few options come to my mind:
- Allow async_terminate() to override status if async->alerted is set, even if it is already terminated.
We used to have this (before a5b6e90d48e), but I don't think this actually helps. If the client restarts we're back where we started. We could account for it there but I don't much like the added complexity.
- Add a separate status field for buffering cancellation requests.
This seems more desirable to me. We'll need to do the same thing with timeouts.
Perhaps add cancel_status in struct async? It would hold STATUS_TIMEOUT or STATUS_CANCELLED (or whatever).
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
- Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Any thoughts?
Maybe I'm missing something, but isn't that what the "alerted" field is for?
You're right. Sorry for missing that. Also, the cancellation buffering approach you suggested makes this unnecessary anyway.
On 3/8/22 09:33, Jinoh Kang wrote:
In fact, I'm somewhat concerned that if we let synchronous I/O time out before it even had a chance to start, we would break applications that expect it to work.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns.
Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
Yes, this has been a bug for a while. I don't remember if this was a problem before the async rework in 7.0. We probably need to "buffer" the cancellation request somehow, and process it once the async is restarted. (As above, I don't think we want to cancel the async if the client is going to report success to us.)
Sounds like a great idea. Not sure which one is the best way to implement it, though. A few options come to my mind:
- Allow async_terminate() to override status if async->alerted is set, even if it is already terminated.
We used to have this (before a5b6e90d48e), but I don't think this actually helps. If the client restarts we're back where we started. We could account for it there but I don't much like the added complexity.
- Add a separate status field for buffering cancellation requests.
This seems more desirable to me. We'll need to do the same thing with timeouts.
Perhaps add cancel_status in struct async? It would hold STATUS_TIMEOUT or STATUS_CANCELLED (or whatever).
Sure, that sounds reasonable to me.
On 3/5/22 04:04, Jinoh Kang wrote:
On 3/5/22 02:47, Zebediah Figura wrote:
Right, but I think it's possible to trigger this assert now. For a rather pathological case, consider a client which calls set_async_direct_result on an async which is currently being serviced by a device driver and has not yet reached get_next_device_request.
For that to be possible, something has to terminate the async so that set_async_direct_result does not reject it. The easiest way to do so is to cancel it, or make it timeout somehow.
In fact, I think we have discovered an _existing_ bug: the client can use set_async_direct_result to interfere with cancelled asyncs (device-backed or not).
Further investigation revealed some other bugs, which is tangentially related:
- An async may time out before the client calls set_async_direct_result to report back the status of initial I/O. In this case, the timeout is ignored.
- There is a time window where alerted asyncs ignore IoCancel/IoCancelEx requests. If async_terminate( async, STATUS_ALERTED ) is called, the async will ignore any cancellation requests until the APC_ASYNC_IO routine returns. Note that this bug has existed *before* set_async_direct_result, and I think this one may need to be addressed first.
So what I think we need here is a mechanism to tell alerted asyncs apart from irreversibly-terminated asyncs. This will solve all of the problems above:
- It's no longer possible to call set_async_direct_result on a device async in the first place, since it will now reject irreversibly-terminated asyncs while still accepting alerted asyncs (which is not possible for device asyncs).
- Timeout and cancellation works as expected: set_async_direct_result will treat timed out asyncs as irreversibly-terminated. (the client needs to be modified to handle this case explicitly, though.)
Would it still be worth working on this? I couldn't find the bug report for this, though.