As you may see from the different email address, I am currently off work - I'll look deeper into your traces next week. Many thanks for generating them, anyway.
At first glance, it seems that the app doesn't do overlapped recv() (lpOverlapped and completion_func are always NULL), so there should actually be no difference. (Well, I'm now using recvmsg() instead of recv(), and the overhead for 16-bit Winsock apps is definitely larger, but I cannot see why that should slow everything down that much.) Perhaps I screwed up the blocking semantics of non-overlapped I/O.
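To illustrate why I'd expect no difference in the non-overlapped case, here is a minimal sketch of a plain recv() expressed through recvmsg() on a POSIX system. This is not the actual Wine code; the helper name recv_via_recvmsg is made up for this example, and it assumes a single buffer with no ancillary data.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper (not from the Wine sources): receives into one
 * buffer via recvmsg(), which should behave like recv(fd, buf, len, flags)
 * for a connected socket, including blocking until data arrives. */
ssize_t recv_via_recvmsg(int fd, void *buf, size_t len, int flags)
{
    struct iovec iov;
    struct msghdr msg;

    /* A single scatter/gather element covering the caller's buffer. */
    iov.iov_base = buf;
    iov.iov_len = len;

    /* No peer address and no control data requested. */
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return recvmsg(fd, &msg, flags);
}
```

With a blocking socket, both calls should wait in the kernel in the same way, so any large slowdown would more likely come from how the surrounding code handles blocking than from recvmsg() itself.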
Have you generated the traces under exactly the same conditions? Can you explain what exactly you did?
Martin