I took some time to look into this to see if there's extra overhead, and while I think there are some things we could do better in draw_primitive(), there's probably not very much to gain. Depending on what's in frame, the CS spends the majority of its time in draw_primitive(). Probably about 20% of that is spent acquiring the GL context, 40% loading the RTVs, and 20% in context_apply_draw_state(); the rest is difficult to measure. This is on a relatively powerful radeonsi machine, with the swap interval hacked to zero; the total frame time is probably about 9 ms in the scenes I'm testing.
I think we can potentially cut draw_primitive() down to 10% of its current overhead if none of the state changes, but when we're doing 5000 draw calls per frame, even that may be too much. We could potentially buffer in wined3d, perhaps making use of EXT_multi_draw_arrays, but as Henri pointed out on IRC, we'd have to do a fair amount of work to invalidate (less than in ddraw itself), and this sort of thing probably doesn't perform well in newer d3d versions on Windows anyway. So buffering in ddraw is probably the right way to go. I'll look at the patch itself anon.
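For reference, here's roughly what I mean by buffering, as a minimal sketch and not actual wined3d or ddraw code. Every name here (struct draw_batch, batch_draw(), batch_flush(), the state_id token, record_submit()) is hypothetical; the submit callback is just shaped like glMultiDrawArrays() from EXT_multi_draw_arrays, so that consecutive draws with identical state collapse into a single call.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_MAX 64

/* Hypothetical stand-in for whatever state would need to be compared.
 * state_id is an opaque token assumed to change whenever any state does. */
struct draw_state
{
    unsigned int primitive_type;
    unsigned int state_id;
};

/* Shaped like glMultiDrawArrays(mode, first, count, drawcount). */
typedef void (*multi_draw_fn)(unsigned int mode, const int *first,
        const int *count, int draw_count);

struct draw_batch
{
    struct draw_state state;
    int first[BATCH_MAX];
    int count[BATCH_MAX];
    int draw_count;
    multi_draw_fn submit;
};

static void batch_init(struct draw_batch *batch, multi_draw_fn submit)
{
    batch->draw_count = 0;
    batch->submit = submit;
}

/* Submit everything buffered so far as one multi-draw call. */
static void batch_flush(struct draw_batch *batch)
{
    if (!batch->draw_count)
        return;
    batch->submit(batch->state.primitive_type, batch->first,
            batch->count, batch->draw_count);
    batch->draw_count = 0;
}

/* Buffer one draw; flush first if the state differs or the batch is full. */
static void batch_draw(struct draw_batch *batch, const struct draw_state *state,
        int first, int count)
{
    if (batch->draw_count && (batch->draw_count == BATCH_MAX
            || batch->state.primitive_type != state->primitive_type
            || batch->state.state_id != state->state_id))
        batch_flush(batch);

    if (!batch->draw_count)
        batch->state = *state;

    assert(batch->draw_count < BATCH_MAX);
    batch->first[batch->draw_count] = first;
    batch->count[batch->draw_count] = count;
    ++batch->draw_count;
}

/* Example callback that only records what was submitted; a real one
 * would forward to the GL entry point instead. */
static int submitted_calls, submitted_draws;

static void record_submit(unsigned int mode, const int *first,
        const int *count, int draw_count)
{
    (void)mode; (void)first; (void)count;
    ++submitted_calls;
    submitted_draws += draw_count;
}
```

The hard part is everything this sketch leaves out: something has to call batch_flush() before any state change, present, or resource access that could observe the buffered draws, which is the invalidation work mentioned above.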