On Fri, Aug 12, 2022, 12:49 AM Jin-oh Kang jinoh.kang.kr@gmail.com wrote:
On Thu, Aug 11, 2022, 3:02 AM Zebediah Figura (she/her) zfigura@codeweavers.com wrote:
Doesn't this mean that we can get POLLOUT from the host but then be unable to write data? That sounds like a spec violation.
Unset POLLOUT does not necessarily imply that sendmsg() will block. On Linux (as of v5.19), POLLOUT is signaled on a connected TCP socket (that has not shut down) only if sk_wmem_queued is at most two thirds of sk_sndbuf, i.e. the free space is at least half of what is already queued (see __sk_stream_is_writeable). In contrast, sendmsg() will happily accept however much buffer space is left.
This is seemingly related to why Linux doubles whatever SO_SNDBUF value you set on the socket: Linux charges the bookkeeping overhead against the same buffer as the application data, so it needs to raise the "writability" threshold. Also, memory pressure does not occur often, so it is probably not a problem in practice. However, I agree that it's not ideal that sendmsg() could block even after POLLOUT has been signaled. I'll test again with (P)MTU discovery disabled.
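For reference, here is a minimal user-space sketch of the writability check as I read it in include/net/sock.h. The function names are my own shorthand, not the kernel's, and the additional TCP_NOTSENT_LOWAT condition that the kernel also applies is left out on purpose.

```c
#include <stdbool.h>

/* Free space left in the send buffer (sk_sndbuf - sk_wmem_queued). */
static int stream_wspace(int sndbuf, int wmem_queued)
{
    return sndbuf - wmem_queued;
}

/* Free space required before POLLOUT is reported: half of what is queued. */
static int stream_min_wspace(int wmem_queued)
{
    return wmem_queued >> 1;
}

/*
 * Simplified writability predicate. Assumption: the notsent_lowat check
 * that the real kernel helper also performs is ignored here.
 */
static bool stream_is_writeable(int sndbuf, int wmem_queued)
{
    return stream_wspace(sndbuf, wmem_queued) >= stream_min_wspace(wmem_queued);
}
```

With sndbuf = 300, the predicate turns false once wmem_queued exceeds 200, yet stream_wspace() is still positive (50 at wmem_queued = 250), which is why a send can succeed while POLLOUT is unset.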
Ok, I found the culprit: TCP fragmentation.
When Linux fragments a sk_buff in the socket's TX queue, the header of the newly split-off sk_buff object is counted against `sk_wmem_queued`.
Since fragmentation can happen at any moment (e.g. during transmission, slicing to window, and loss recovery), the application can observe an increase in `sk_wmem_queued`, and thus a falling edge in POLLOUT, without any apparent reason.
I've attached a test program (a TCP server) that demonstrates this scenario. To make sure that the program can observe fragmentation, it is recommended to connect to the server over a real network.
server $ gcc -O2 -o tcp-fragment-test tcp-fragment-test.c
If we carefully manipulate the buffer size and flags (e.g. MSG_EOR to prevent coalescing which later leads to re-fragmentation) so that it doesn't cause TCP fragmentation, the POLLOUT falling edge never occurs:
server $ ./tcp-fragment-test -p 1234 0.0.0.0 1234
client $ nc -v server 1234 > /dev/null
Connection to server 1234 port [tcp/*] succeeded!
server > Connection from <client>
^C
server $
However, as soon as we enable fragmentation, things start to go a little flaky:
server $ ./tcp-fragment-test -p 1234 -L 131072 -b
client $ nc -v server 1234 > /dev/null
server > (snip)
ticks 183502, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 58632,w 207608,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 57352,w 208888,o 0,bl 0,d 0)
ticks 187506, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 81800,w 184440,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 80520,w 185720,o 0,bl 0,d 0)
ticks 188587, seq 929617 wq anomaly _______v
old-skmem:(r 0,rb 131072,t 0,tb 200192,f 90488,w 175752,o 0,bl 0,d 0)
new-skmem:(r 0,rb 131072,t 0,tb 200192,f 89208,w 177032,o 0,bl 0,d 0)
^C
server $
To test on a virtual Ethernet network instead, simply prepend ./veth.sh to the command. tc-netem(8) can also be used to emulate real network conditions.
Still, I maintain that this does not cause any issues in practice on Linux, since the application can usually write some more data even if POLLOUT is unset--there's still enough room in the buffer to keep send() from blocking.
Granted, if the fragmentation becomes too fine-grained (e.g. MSS or window size drops below the sk_buff overhead), the bookkeeping overhead prevails and the send() call may actually block. I'm not sure if this is actually possible, but even if it were, the upstream Linux kernel might well reject any fix on the grounds that the application is expected to enable non-blocking mode when performing readiness-based I/O. Still, I'll try to raise this on LKML some time.
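To spell out what "enable non-blocking mode" means for the caller, here is a minimal sketch of a readiness-based sender that never relies on POLLOUT as a guarantee. set_nonblocking and try_send are hypothetical helper names, and the sketch assumes len > 0 so that a return value of 0 is unambiguous.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Put the descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Try to queue up to len bytes (len > 0 assumed).
 * Returns the number of bytes queued, 0 if the socket would block
 * (go back to poll() and wait for POLLOUT), or -1 on any other error.
 */
static ssize_t try_send(int fd, const void *buf, size_t len)
{
    ssize_t n;

    do {
        n = send(fd, buf, len, MSG_NOSIGNAL);
    } while (n < 0 && errno == EINTR);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return n;
}
```

With this pattern a spurious POLLOUT costs at most one extra system call instead of a stalled thread.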
That said, it looks like TCP retransmission does not actually result in an increase of `sk_wmem_queued`. I'll edit accordingly in the next revision.
Correction: TCP retransmission does not actually increase `sk_wmem_queued` per se, but it may indirectly do so via fragmentation (e.g. reduced MSS).