Re: [PATCH 0/3] MR617: server: Always prefer synchronous I/O in nonblocking mode. (#53486)

19 Aug 2022


      On Fri, Aug 12, 2022, 12:49 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
...
On Thu, Aug 11, 2022, 3:02 AM Zebediah Figura (she/her) zfigura@codeweavers.com wrote:
...
Doesn't this mean that we can get POLLOUT from the host but then be
unable to write data? That sounds like a spec violation.
Unset POLLOUT does not necessarily imply that sendmsg() will block. On Linux (as of v5.19), POLLOUT is signaled on a connected TCP socket (that has not been shut down) only if sk_wmem_queued is at most two thirds of sk_sndbuf, i.e. the free space is at least half of what is already queued (see __sk_stream_is_writeable). In contrast, sendmsg() will happily accept however much buffer space is left.
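To make the threshold concrete, here is a small user-space model of the buffer-space part of that check. It is only a sketch: it assumes the free space is sk_sndbuf minus sk_wmem_queued and the required minimum is half of sk_wmem_queued, and it leaves out the memory-pressure condition entirely. The function names are made up for illustration and are not kernel API.

```c
#include <stdbool.h>

/* Free space left in the send buffer. This can go negative when the
 * queue has grown past the nominal buffer size. */
static int model_wspace(int sndbuf, int wmem_queued)
{
    return sndbuf - wmem_queued;
}

/* Minimum free space required before POLLOUT is reported: half of
 * what is currently queued. */
static int model_min_wspace(int wmem_queued)
{
    return wmem_queued >> 1;
}

/* Buffer-space part of the writability test (memory pressure ignored). */
static bool model_is_writeable(int sndbuf, int wmem_queued)
{
    return model_wspace(sndbuf, wmem_queued) >= model_min_wspace(wmem_queued);
}
```

With a 300-byte buffer, the model reports writable at 200 bytes queued (100 free, 100 required) and not writable at 201 bytes queued, which is the two-thirds boundary.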
This is seemingly related to why Linux doubles whatever SO_SNDBUF value you set on the socket: Linux stores the bookkeeping data in the same buffer as the application data, so it needs to raise the "writability" threshold. Also, memory pressure does not occur often, so perhaps this is not a problem in practice. However, I agree that it's not ideal that sendmsg() could block even after POLLOUT has been signaled. I'll test again with (P)MTU discovery disabled.
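The SO_SNDBUF doubling can be observed directly from user space. The helper below is a minimal sketch (its name is hypothetical): it sets SO_SNDBUF on a fresh TCP socket and reads the value back. On Linux the reported value is normally twice the requested one, subject to the kernel's minimum and the net.core.wmem_max limit, so the exact number depends on the system.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Set SO_SNDBUF to `requested` on a fresh TCP socket and return the
 * value the kernel reports back, or -1 on error. */
static int sndbuf_after_set(int requested)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int effective = -1;
    socklen_t len = sizeof(effective);
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &effective, &len) < 0)
        effective = -1;

    close(fd);
    return effective;
}
```

Calling sndbuf_after_set(16384) and printing the result next to the requested value is enough to see the adjustment on a given machine.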
Ok, I found the culprit: TCP fragmentation.
When Linux fragments an sk_buff in the socket's TX queue, the header
of the newly split-out sk_buff object is counted against
`sk_wmem_queued`.
Since fragmentation can happen at any moment (e.g. during
transmission, slicing to window, and loss recovery), the application
can observe an increase in `sk_wmem_queued`, and thus a falling edge
in POLLOUT, without any apparent reason.
I've attached a test program (a TCP server) that demonstrates this
scenario.  To make sure that the program can observe fragmentation, it
is recommended to connect to the server over a real network.
server $ gcc -O2 -o tcp-fragment-test tcp-fragment-test.c
If we carefully manipulate the buffer size and flags (e.g. MSG_EOR to
prevent coalescing which later leads to re-fragmentation) so that it
doesn't cause TCP fragmentation, the POLLOUT falling edge never
occurs:
server $ ./tcp-fragment-test -p 1234
0.0.0.0 1234
client $ nc -v server 1234 > /dev/null
Connection to server 1234 port [tcp/*] succeeded!
server >
Connection from <client>
^C
server $
However, as soon as we enable fragmentation, things start to go a little flaky:
server $ ./tcp-fragment-test -p 1234 -L 131072 -b
client $ nc -v server 1234 > /dev/null
server >
(snip)
ticks     183502, seq     929617 wq anomaly                   _______v
    old-skmem:(r      0,rb 131072,t      0,tb 200192,f  58632,w 207608,o      0,bl      0,d      0)
    new-skmem:(r      0,rb 131072,t      0,tb 200192,f  57352,w 208888,o      0,bl      0,d      0)
ticks     187506, seq     929617 wq anomaly                   _______v
    old-skmem:(r      0,rb 131072,t      0,tb 200192,f  81800,w 184440,o      0,bl      0,d      0)
    new-skmem:(r      0,rb 131072,t      0,tb 200192,f  80520,w 185720,o      0,bl      0,d      0)
ticks     188587, seq     929617 wq anomaly                   _______v
    old-skmem:(r      0,rb 131072,t      0,tb 200192,f  90488,w 175752,o      0,bl      0,d      0)
    new-skmem:(r      0,rb 131072,t      0,tb 200192,f  89208,w 177032,o      0,bl      0,d      0)
^C
server $
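The three anomalies in the output above can be checked by hand. The sketch below assumes the skmem fields follow the layout printed by ss -m, so that "w" is `sk_wmem_queued` and "f" is the forward-allocated memory; the struct and helper names are hypothetical. In every old/new pair, "w" grows by 1280 bytes and "f" shrinks by the same amount, which is consistent with a fixed per-split overhead being charged to the write queue while no new application data is queued.

```c
/* One skmem sample, reduced to the two fields that change in the
 * excerpts. */
struct skmem_sample {
    int fwd_alloc;   /* "f" column */
    int wmem_queued; /* "w" column */
};

/* Growth of the write queue between two samples. */
static int wq_growth(struct skmem_sample before, struct skmem_sample after)
{
    return after.wmem_queued - before.wmem_queued;
}

/* Decrease of the forward-allocated memory between two samples. */
static int fwd_alloc_drop(struct skmem_sample before, struct skmem_sample after)
{
    return before.fwd_alloc - after.fwd_alloc;
}
```

For the first pair, wq_growth gives 208888 - 207608 = 1280 and fwd_alloc_drop gives 58632 - 57352 = 1280; the other two pairs yield the same values.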
In order to test on a virtual Ethernet network instead, simply prepend
./veth.sh in front of the command. tc-netem(8) can also be used to
emulate a real network condition.
Still, I maintain that this does not cause any issues in practice for
Linux since the application can usually write some more data even if
POLLOUT is unset--there's still enough room in buffer to keep send()
from blocking.
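As an illustration of that point, an application can simply attempt the write and fall back to waiting only when the kernel actually refuses it. The following is a rough sketch under that assumption, not the code from this patch series; send_try_first is a hypothetical helper name.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Attempt a non-blocking send first, regardless of the last observed
 * POLLOUT state. Wait for POLLOUT only if the kernel reports that the
 * call would block. Returns the number of bytes sent, or -1 on error. */
static ssize_t send_try_first(int fd, const void *buf, size_t len)
{
    for (;;) {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* The send buffer is really full: now it makes sense to wait. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
    }
}
```

The design choice here is that readiness is treated as a hint rather than a precondition, so a spurious falling edge in POLLOUT costs nothing as long as send() still has room.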
Granted, if the fragmentation becomes too fine-grained (e.g. MSS or
window size drops below the sk_buff overhead), the bookkeeping
overhead prevails and the send() call may actually block. I'm not sure
if this is actually possible, but even if this were the case, it's
possible that the upstream Linux kernel would reject any fixes on the
grounds that the application is expected to enable non-blocking mode
when performing readiness-based I/O. Still, I'll try to raise this on
LKML some time.
...
That said, it looks like TCP retransmission does not actually result in an increase in `sk_wmem_queued`. I'll edit accordingly in the next revision.
Correction: TCP retransmission does not actually increase
`sk_wmem_queued` per se, but it may indirectly do so via fragmentation
(e.g. reduced MSS).
