foobar2000.exe's UPnP Media Renderer component (foo_out_upnp.dll) expects that, if a select() call completes successfully with a non-empty writefds set, any immediately following send() call on a socket in the writefds set never fails with WSAEWOULDBLOCK.
On Wine, the Winsock select() and send() implementations both call Unix poll(2) under the hood to test whether I/O is possible on the socket. As it turns out, it's entirely possible for Unix poll() to yield POLLOUT on the first call (for select) but *not* on the second (for send), even if no send() call has been made in the meantime.
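For reference, the pattern the application relies on looks roughly like the following sketch (illustrative only; the function name and arguments are hypothetical, not taken from foo_out_upnp.dll):

    #include <winsock2.h>

    /* Hypothetical sketch: wait until the socket is writable, then send.
     * The application assumes the send() cannot fail with WSAEWOULDBLOCK,
     * since select() just reported the socket as writable. */
    static int send_when_writable( SOCKET s, const char *buf, int len )
    {
        fd_set writefds;

        FD_ZERO( &writefds );
        FD_SET( s, &writefds );
        if (select( 0, NULL, &writefds, NULL, NULL ) <= 0) return SOCKET_ERROR;
        /* On Wine, this may still fail with WSAEWOULDBLOCK. */
        return send( s, buf, len, 0 );
    }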
On Linux (as of v5.19), a connected (ESTABLISHED) TCP socket that has not been shut down indicates the (E)POLLOUT condition only if the ratio of sk_wmem_queued (the number of bytes queued in the send buffer) to sk_sndbuf (the size of the send buffer itself, which can be retrieved via SO_SNDBUF) is below a certain threshold. Therefore, a falling edge in POLLOUT can be triggered for a number of reasons:
1. TCP retransmission and control packets (e.g. MTU probing). Such packets share the same buffer with application-initiated packets, and are thus counted via sk_wmem_queued_add() just like application data. See also: sk_wmem_queued_add() callers (Linux 5.19).
2. Memory pressure. This causes sk_sndbuf to shrink. See also: sk_stream_moderate_sndbuf() callers (Linux 5.19).
Fix this by always attempting synchronous I/O first if the nonblocking flag is set.
Note: for diagnosis, `getsockopt(fd, SOL_SOCKET, SO_MEMINFO, ...)` can be used to retrieve both sk_wmem_queued (the amount of bytes queued in the send buffer) and sk_sndbuf (the size of the send buffer itself, which can also be retrieved via SO_SNDBUF).
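A minimal diagnostic sketch using that call (assuming Linux >= 4.14 with matching kernel headers, where SO_MEMINFO is available; the helper name is illustrative):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/sock_diag.h>  /* SK_MEMINFO_* indices */

    /* Dump send-buffer accounting for a socket; the indices follow the
     * kernel's sk_meminfo layout (WMEM_QUEUED = sk_wmem_queued,
     * SNDBUF = sk_sndbuf). */
    static void dump_sndbuf_usage( int fd )
    {
        unsigned int skmem[SK_MEMINFO_VARS];
        socklen_t len = sizeof(skmem);

        if (!getsockopt( fd, SOL_SOCKET, SO_MEMINFO, skmem, &len ))
            printf( "sk_wmem_queued=%u sk_sndbuf=%u\n",
                    skmem[SK_MEMINFO_WMEM_QUEUED], skmem[SK_MEMINFO_SNDBUF] );
    }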
From: Jinoh Kang <jinoh.kang.kr@gmail.com>
---
 server/sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/server/sock.c b/server/sock.c
index 7d7e470be28..caa3724eb59 100644
--- a/server/sock.c
+++ b/server/sock.c
@@ -3470,7 +3470,7 @@ DECL_HANDLER(recv_socket)
          */
         struct pollfd pollfd;
         pollfd.fd = get_unix_fd( sock->fd );
-        pollfd.events = req->oob ? POLLPRI : POLLIN;
+        pollfd.events = req->oob && !is_oobinline( sock ) ? POLLPRI : POLLIN;
         pollfd.revents = 0;
         if (poll(&pollfd, 1, 0) >= 0 && pollfd.revents)
         {
From: Jinoh Kang <jinoh.kang.kr@gmail.com>
---
 server/sock.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/server/sock.c b/server/sock.c
index caa3724eb59..7b5bb187aa0 100644
--- a/server/sock.c
+++ b/server/sock.c
@@ -3468,11 +3468,7 @@ DECL_HANDLER(recv_socket)
          * asyncs will not consume all available data; if there's no data
          * available, the current request won't be immediately satiable.
          */
-        struct pollfd pollfd;
-        pollfd.fd = get_unix_fd( sock->fd );
-        pollfd.events = req->oob && !is_oobinline( sock ) ? POLLPRI : POLLIN;
-        pollfd.revents = 0;
-        if (poll(&pollfd, 1, 0) >= 0 && pollfd.revents)
+        if (check_fd_events( sock->fd, req->oob && !is_oobinline( sock ) ? POLLPRI : POLLIN ))
         {
             /* Give the client opportunity to complete synchronously.
              * If it turns out that the I/O request is not actually immediately satiable,
@@ -3568,11 +3564,7 @@ DECL_HANDLER(send_socket)
          * asyncs will not consume all available space; if there's no space
          * available, the current request won't be immediately satiable.
          */
-        struct pollfd pollfd;
-        pollfd.fd = get_unix_fd( sock->fd );
-        pollfd.events = POLLOUT;
-        pollfd.revents = 0;
-        if (poll(&pollfd, 1, 0) >= 0 && pollfd.revents)
+        if (check_fd_events( sock->fd, POLLOUT ))
         {
             /* Give the client opportunity to complete synchronously.
              * If it turns out that the I/O request is not actually immediately satiable,
From: Jinoh Kang <jinoh.kang.kr@gmail.com>
foobar2000.exe's UPnP Media Renderer component (foo_out_upnp.dll) expects that, if a select() call completes successfully with a non-empty writefds set, any immediately following send() call on a socket in the writefds set never fails with WSAEWOULDBLOCK.
On Wine, the Winsock select() and send() implementations both call Unix poll(2) under the hood to test whether I/O is possible on the socket. As it turns out, it's entirely possible for Unix poll() to yield POLLOUT on the first call (for select) but *not* on the second (for send), even if no send() call has been made in the meantime.
On Linux (as of v5.19), a connected (ESTABLISHED) TCP socket that has not been shut down indicates the (E)POLLOUT condition only if the ratio of sk_wmem_queued (the number of bytes queued in the send buffer) to sk_sndbuf (the size of the send buffer itself, which can be retrieved via SO_SNDBUF) is below a certain threshold. Therefore, a falling edge in POLLOUT can be triggered for a number of reasons:
1. TCP retransmission and control packets (e.g. MTU probing). Such packets share the same buffer with application-initiated packets, and are thus counted via sk_wmem_queued_add() just like application data. See also: sk_wmem_queued_add() callers (Linux 5.19).
2. Memory pressure. This causes sk_sndbuf to shrink. See also: sk_stream_moderate_sndbuf() callers (Linux 5.19).
Fix this by always attempting synchronous I/O first if the nonblocking flag is set.
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=53486
---
 server/sock.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/server/sock.c b/server/sock.c
index 7b5bb187aa0..b45c3cda510 100644
--- a/server/sock.c
+++ b/server/sock.c
@@ -3468,11 +3468,19 @@ DECL_HANDLER(recv_socket)
          * asyncs will not consume all available data; if there's no data
          * available, the current request won't be immediately satiable.
          */
-        if (check_fd_events( sock->fd, req->oob && !is_oobinline( sock ) ? POLLPRI : POLLIN ))
+        if (sock->nonblocking ||
+            check_fd_events( sock->fd, req->oob && !is_oobinline( sock ) ? POLLPRI : POLLIN ))
         {
             /* Give the client opportunity to complete synchronously.
              * If it turns out that the I/O request is not actually immediately satiable,
-             * the client may then choose to re-queue the async (with STATUS_PENDING). */
+             * the client may then choose to re-queue the async (with STATUS_PENDING).
+             *
+             * Note: If the nonblocking flag is set, we don't poll the socket
+             * and always opt for synchronous completion first. This is
+             * because the application has probably seen POLLIN already from a
+             * preceding select()/poll() call before it requested to receive
+             * data.
+             */
             status = STATUS_ALERTED;
         }
     }
@@ -3564,11 +3572,26 @@ DECL_HANDLER(send_socket)
          * asyncs will not consume all available space; if there's no space
          * available, the current request won't be immediately satiable.
          */
-        if (check_fd_events( sock->fd, POLLOUT ))
+        if (sock->nonblocking || check_fd_events( sock->fd, POLLOUT ))
         {
             /* Give the client opportunity to complete synchronously.
              * If it turns out that the I/O request is not actually immediately satiable,
-             * the client may then choose to re-queue the async (with STATUS_PENDING). */
+             * the client may then choose to re-queue the async (with STATUS_PENDING).
+             *
+             * Note: If the nonblocking flag is set, we don't poll the socket
+             * and always opt for synchronous completion first. This is
+             * because the application has probably seen POLLOUT already from a
+             * preceding select()/poll() call before it requested to send data.
+             *
+             * Furthermore, some applications expect that any send() call on a
+             * socket that has indicated a POLLOUT condition never fails with
+             * WSAEWOULDBLOCK. It's possible that Unix poll() may yield
+             * POLLOUT on the first call but not the second, even if no send()
+             * call has been made in the meanwhile. This can happen for a
+             * number of reasons; for example, the TCP/IP networking stack may
+             * decide to shrink the send buffer due to memory pressure, or fill
+             * up the send queue with TCP retransmission packets.
+             */
             status = STATUS_ALERTED;
         }
     }
Jinoh Kang (@iamahuman) commented about server/sock.c:
          * asyncs will not consume all available data; if there's no data
          * available, the current request won't be immediately satiable.
          */
-        struct pollfd pollfd;
-        pollfd.fd = get_unix_fd( sock->fd );
-        pollfd.events = req->oob ? POLLPRI : POLLIN;
-        pollfd.revents = 0;
-        if (poll(&pollfd, 1, 0) >= 0 && pollfd.revents)
+        if (sock->nonblocking ||
Should this actually be `(sock->nonblocking && !req->force_async)`?
On 8/9/22 09:53, Jinoh Kang (@iamahuman) wrote:
foobar2000.exe's UPnP Media Renderer component (foo_out_upnp.dll) expects that, if a select() call completes successfully with a non-empty writefds set, any immediately following send() call on a socket in the writefds set never fails with WSAEWOULDBLOCK.
On Wine, the Winsock select() and send() implementations both call Unix poll(2) under the hood to test whether I/O is possible on the socket. As it turns out, it's entirely possible for Unix poll() to yield POLLOUT on the first call (for select) but *not* on the second (for send), even if no send() call has been made in the meantime.
On Linux (as of v5.19), a connected (ESTABLISHED) TCP socket that has not been shut down indicates the (E)POLLOUT condition only if the ratio of sk_wmem_queued (the number of bytes queued in the send buffer) to sk_sndbuf (the size of the send buffer itself, which can be retrieved via SO_SNDBUF) is below a certain threshold. Therefore, a falling edge in POLLOUT can be triggered for a number of reasons:
1. TCP retransmission and control packets (e.g. MTU probing). Such packets share the same buffer with application-initiated packets, and are thus counted via sk_wmem_queued_add() just like application data. See also: sk_wmem_queued_add() callers (Linux 5.19).
2. Memory pressure. This causes sk_sndbuf to shrink. See also: sk_stream_moderate_sndbuf() callers (Linux 5.19).
Doesn't this mean that we can get POLLOUT from the host but then be unable to write data? That sounds like a spec violation.
On Thu, Aug 11, 2022, 3:02 AM Zebediah Figura (she/her) <zfigura@codeweavers.com> wrote:
Doesn't this mean that we can get POLLOUT from the host but then be unable to write data? That sounds like a spec violation.
Unset POLLOUT does not necessarily imply that sendmsg() will block. On Linux (as of v5.19), POLLOUT is signaled on a connected TCP socket (that has not shut down) only if sk_wmem_queued is at least two thirds of sk_sndbuf (see __sk_stream_is_writeable). In contrast, sendmsg() will happily accept however much buffer space is left.
This is seemingly related to why Linux will double whatever SO_SNDBUF value you set on the socket: Linux stores the bookkeeping data in the same buffer as the application data, so it needs to raise the "writability" threshold. Also, memory pressure does not occur often, so perhaps it's not a problem in practice. However, I agree that it's not ideal that sendmsg() could block even after POLLOUT has been signaled. I'll test again with (P)MTU discovery disabled.
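For illustration, the doubling is easy to observe; this sketch assumes a Linux host and any socket fd (the helper name is hypothetical):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Linux doubles the requested SO_SNDBUF value to leave room for its
     * bookkeeping overhead; getsockopt() returns the doubled value. */
    static void show_sndbuf_doubling( int fd )
    {
        int val = 65536;
        socklen_t len = sizeof(val);

        setsockopt( fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val) );
        getsockopt( fd, SOL_SOCKET, SO_SNDBUF, &val, &len );
        printf( "effective send buffer: %d\n", val );  /* 131072 on Linux */
    }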
That said, it looks like TCP retransmission does not actually result in an increase of `sk_wmem_queued`. I'll edit accordingly in the next revision.
On Fri, Aug 12, 2022, 12:49 AM Jin-oh Kang <jinoh.kang.kr@gmail.com> wrote:
On Thu, Aug 11, 2022, 3:02 AM Zebediah Figura (she/her) <zfigura@codeweavers.com> wrote:
Doesn't this mean that we can get POLLOUT from the host but then be unable to write data? That sounds like a spec violation.
Unset POLLOUT does not necessarily imply that sendmsg() will block. On Linux (as of v5.19), POLLOUT is signaled on a connected TCP socket (that has not shut down) only if sk_wmem_queued is at least two thirds of sk_sndbuf (see __sk_stream_is_writeable). In contrast, sendmsg() will happily accept however much buffer space is left.
s/at least/at most/g. Also modulo rounding error in sk_stream_min_wspace.
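In other words, the check is roughly the following (a paraphrase of the Linux 5.19 helpers in include/net/sock.h, written as a standalone function for clarity; not the kernel's literal code):

    /* Mirrors sk_stream_wspace() >= sk_stream_min_wspace(): the socket is
     * writable only while sk_wmem_queued stays at or below roughly two
     * thirds of sk_sndbuf. */
    static int stream_is_writeable( int sk_wmem_queued, int sk_sndbuf )
    {
        int wspace     = sk_sndbuf - sk_wmem_queued;  /* sk_stream_wspace()     */
        int min_wspace = sk_wmem_queued >> 1;         /* sk_stream_min_wspace() */
        return wspace >= min_wspace;
    }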
On Fri, Aug 12, 2022, 12:49 AM Jin-oh Kang <jinoh.kang.kr@gmail.com> wrote:
On Thu, Aug 11, 2022, 3:02 AM Zebediah Figura (she/her) <zfigura@codeweavers.com> wrote:
Doesn't this mean that we can get POLLOUT from the host but then be unable to write data? That sounds like a spec violation.
Unset POLLOUT does not necessarily imply that sendmsg() will block. On Linux (as of v5.19), POLLOUT is signaled on a connected TCP socket (that has not shut down) only if sk_wmem_queued is at least two thirds of sk_sndbuf (see __sk_stream_is_writeable). In contrast, sendmsg() will happily accept however much buffer space is left.
This is seemingly related to why Linux will double whatever SO_SNDBUF value you set on the socket: Linux stores the bookkeeping data in the same buffer as the application data, so it needs to raise the "writability" threshold. Also, memory pressure does not occur often, so perhaps it's not a problem in practice. However, I agree that it's not ideal that sendmsg() could block even after POLLOUT has been signaled. I'll test again with (P)MTU discovery disabled.
Ok, I found the culprit: TCP fragmentation.
When Linux fragments a sk_buff in the socket's TX queue, the header of the newly split-out sk_buff object is counted against `sk_wmem_queued`.
Since fragmentation can happen at any moment (e.g. during transmission, slicing to fit the send window, and loss recovery), the application can observe an increase in `sk_wmem_queued`, and thus a falling edge in POLLOUT, for no apparent reason.
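For illustration, such a falling edge can be observed with a loop like the following sketch (this is not the attached test program, just the essence of the check; the function name is hypothetical):

    #include <poll.h>
    #include <stdio.h>

    /* Poll an idle, connected TCP socket repeatedly without sending any
     * data, and report transitions from writable to non-writable. */
    static void watch_pollout_falling_edge( int fd )
    {
        int was_writable = -1;

        for (;;)
        {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int writable;

            if (poll( &pfd, 1, 100 ) < 0) break;
            writable = !!(pfd.revents & POLLOUT);
            if (was_writable == 1 && !writable)
                printf( "POLLOUT falling edge observed\n" );
            was_writable = writable;
        }
    }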
I've attached a test program (a TCP server) that demonstrates this scenario. To make sure that the program can observe fragmentation, it is recommended to connect to the server over a real network.
server $ gcc -O2 -o tcp-fragment-test tcp-fragment-test.c
If we carefully manipulate the buffer size and flags (e.g. MSG_EOR to prevent coalescing, which would later lead to re-fragmentation) so that no TCP fragmentation occurs, the POLLOUT falling edge never appears:
server $ ./tcp-fragment-test -p 1234 0.0.0.0 1234
client $ nc -v server 1234 > /dev/null
Connection to server 1234 port [tcp/*] succeeded!
server > Connection from <client>
^C
server $
However, as soon as we enable fragmentation, things start to go a little flaky:
server $ ./tcp-fragment-test -p 1234 -L 131072 -b
client $ nc -v server 1234 > /dev/null
server > (snip)
ticks 183502, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 58632,w 207608,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 57352,w 208888,o 0,bl 0,d 0)
ticks 187506, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 81800,w 184440,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 80520,w 185720,o 0,bl 0,d 0)
ticks 188587, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 90488,w 175752,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 89208,w 177032,o 0,bl 0,d 0)
^C
server $
To test on a virtual Ethernet network instead, simply prepend ./veth.sh to the command. tc-netem(8) can also be used to emulate real network conditions.
Still, I maintain that this does not cause any issues in practice on Linux, since the application can usually write some more data even if POLLOUT is unset: there's still enough room in the buffer to keep send() from blocking.
Granted, if the fragmentation becomes too fine-grained (e.g. the MSS or window size drops below the sk_buff overhead), the bookkeeping overhead prevails and the send() call may actually block. I'm not sure whether this is actually possible, but even if it were, the upstream Linux kernel might reject any fixes on the grounds that an application is expected to enable non-blocking mode when performing readiness-based I/O. Still, I'll try to raise this on LKML some time.
That said, it looks like TCP retransmission does not actually result in an increase of `sk_wmem_queued`. I'll edit accordingly in the next revision.
Correction: TCP retransmission does not actually increase `sk_wmem_queued` per se, but it may do so indirectly via fragmentation (e.g. reduced MSS).
On Sat, Aug 20, 2022, 7:35 AM Jin-oh Kang <jinoh.kang.kr@gmail.com> wrote:
Granted, if the fragmentation becomes too fine-grained (e.g. the MSS or window size drops below the sk_buff overhead), the bookkeeping overhead prevails and the send() call may actually block. I'm not sure whether this is actually possible, but even if it were, the upstream Linux kernel might reject any fixes on the grounds that an application is expected to enable non-blocking mode when performing readiness-based I/O. Still, I'll try to raise this on LKML some time.
Independent of upstream fixes, would it be worth merging this patch sooner? I'm concerned that there are probably far more broken apps than just fb2k, especially since the poll-and-I/O pattern is quite common.
On 8/19/22 17:59, Jin-oh Kang wrote:
On Sat, Aug 20, 2022, 7:35 AM Jin-oh Kang <jinoh.kang.kr@gmail.com> wrote:
Granted, if the fragmentation becomes too fine-grained (e.g. the MSS or window size drops below the sk_buff overhead), the bookkeeping overhead prevails and the send() call may actually block. I'm not sure whether this is actually possible, but even if it were, the upstream Linux kernel might reject any fixes on the grounds that an application is expected to enable non-blocking mode when performing readiness-based I/O. Still, I'll try to raise this on LKML some time.
Independent of upstream fixes, would it be worth merging this patch sooner? I'm concerned that there are probably far more broken apps than just fb2k, especially since the poll-and-I/O pattern is quite common.
I think this patch makes sense regardless—even if it is a Linux bug (which is not perfectly clear to me), it's reasonable enough to work around, and the workaround is simple and makes sense. I hadn't signed off on it yet since you had said you were going to update the comment for accuracy.
On Sat, Aug 20, 2022, 8:05 AM Zebediah Figura (she/her) <zfigura@codeweavers.com> wrote:
I think this patch makes sense regardless—even if it is a Linux bug (which is not perfectly clear to me), it's reasonable enough to work around, and the workaround is simple and makes sense. I hadn't signed off on it yet since you had said you were going to update the comment for accuracy.
Thanks for the reminder! I'll update it.
This merge request was approved by Zebediah Figura.