Sorry for the long reply. TL;DR: map behavior with D3DPOOL_SYSTEMMEM and D3DPOOL_MANAGED buffers is not uniform across vendors, except for DISCARD, which is always ignored. Mapping D3DPOOL_SYSTEMMEM textures never blocks.
Read further below for the gritty details...
On Sat, Oct 27, 2018 at 3:44 PM Stefan Dösinger <stefandoesinger@gmail.com> wrote:
> On 27.10.2018 at 14:48, Henri Verbeet <hverbeet@gmail.com> wrote:
>> As for the point you raise about map synchronisation, an implication of the above would be that mapping SYSTEMMEMORY buffers never blocks, beyond perhaps the draw-time upload.
That checks out, but interestingly only on Nvidia. I hacked a pair of QueryPerformanceCounter() calls around the second buffer map in the loop in test_map_synchronization() and augmented the test to also try D3DPOOL_SYSTEMMEM buffers.
On Nvidia, the whole Lock() / Unlock() dance is virtually instant for D3DPOOL_SYSTEMMEM, regardless of the map flags: ~2 μs according to QueryPerformanceCounter(), which is clearly at the limits of its resolution but nevertheless seems to give decently consistent and usable results.
D3DPOOL_DEFAULT buffers behave as you would expect: mapping the buffer without flags right after a "large" draw blocks (it takes ~100 ms for me), a NOOVERWRITE map is almost as fast as the D3DPOOL_SYSTEMMEM case (~2.5 μs), and a DISCARD map takes just slightly longer (~20 μs). Ah, updating the D3DPOOL_SYSTEMMEM buffer with NOOVERWRITE (or otherwise) won't update the data in use by the draw, so the map is "synchronized" as far as the test is concerned.
I also tested D3DPOOL_MANAGED and the results probably make sense too, although they aren't entirely what I expected. The no-flags map takes ~1 ms for me, while the other flag combinations usually take around 160 μs (although I have seen those sporadically take ~500 μs too). The no-flags case in particular takes way longer than the SYSTEMMEM case but still 2 orders of magnitude less than the D3DPOOL_DEFAULT case. I guess one possible explanation is that, for managed buffers, the driver needs to copy the buffer back to system memory but doesn't need to wait for the draw to complete (at least on the GPU; I guess it might need to complete "dispatching" the draw to the GPU, whatever that means).
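For reference, the timing hack boils down to something like the sketch below. ticks_to_us() is a made-up helper name for illustration, not something in the actual test, and the commented usage just assumes the usual QueryPerformanceFrequency() / QueryPerformanceCounter() pair wrapped around the Lock() / Unlock() calls.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the actual test): convert two
 * performance counter readings into elapsed microseconds, given the
 * counter frequency in ticks per second as reported by
 * QueryPerformanceFrequency(). */
static double ticks_to_us(int64_t start, int64_t end, int64_t freq)
{
    return (double)(end - start) * 1000000.0 / (double)freq;
}

/* Assumed usage around the map (Windows only, kept as a comment so the
 * helper itself stays portable):
 *
 *     LARGE_INTEGER freq, t0, t1;
 *
 *     QueryPerformanceFrequency(&freq);
 *     QueryPerformanceCounter(&t0);
 *     hr = IDirect3DVertexBuffer9_Lock(buffer, 0, size, (void **)&data, flags);
 *     hr = IDirect3DVertexBuffer9_Unlock(buffer);
 *     QueryPerformanceCounter(&t1);
 *     printf("map took %.1f us\n",
 *             ticks_to_us(t0.QuadPart, t1.QuadPart, freq.QuadPart));
 */
```

The measurable resolution is bounded by the counter frequency, so numbers in the low microsecond range are best read as rough magnitudes.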
AMD, on the other hand, doesn't behave like that. Mapping a D3DPOOL_SYSTEMMEM buffer without the NOOVERWRITE flag does block to some degree. OTOH, mapping a SYSTEMMEM buffer with NOOVERWRITE is unsynchronized, i.e. updating data used by the draw will affect the draw results. MANAGED buffers seem to have the same performance characteristics as SYSTEMMEM WRT maps, including NOOVERWRITE having a visible effect.
> One test I'd find interesting would be to compare the performance characteristics of draws of various sizes out of huge MANAGED and SYSTEMMEMORY buffers.
Just one more hack to test_map_synchronization() and there you are :) I added one more QueryPerformanceCounter() call before the draw and restructured the test to create increasingly large buffers, both drawing from just a portion of the buffer and drawing from the whole buffer.
On Nvidia, map time for D3DPOOL_MANAGED buffers is proportional to the size of the buffer and not affected by the triangle count. Draw time for D3DPOOL_SYSTEMMEM, on the other hand, is proportional to the triangle count and not affected by the buffer size. I think that also matches our understanding, with the driver only uploading the data strictly required by the draw. I haven't tested it yet, but I assume it works similarly for indexed draws, where d3d can exploit the min vertex index + vertex count to only upload the required subset of the vertex buffer. The only other significant change with larger buffers / draws is the map time for no-flags D3DPOOL_DEFAULT buffer maps, which is proportional to the triangle count. That makes perfect sense: the map has to wait for the draw to complete. No other draw or map duration changes significantly with larger buffer sizes / triangle counts.
On AMD, map duration is not measurably affected by buffer size or triangle count for any buffer pool / flag combination, aside from the D3DPOOL_DEFAULT no-flags case, which blocks until the previous draw has completed. Not much to see with draw duration either: draws are generally "instant" with DEFAULT and SYSTEMMEM pool buffers and take a bit longer (on the order of 100 μs) with MANAGED, with no significant changes across buffer sizes and triangle counts.
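To make the (untested) indexed draw guess above concrete, here is a rough sketch of how the required vertex buffer subset could be derived. The struct and function names are hypothetical. The parameters mirror the BaseVertexIndex, MinVertexIndex and NumVertices arguments of DrawIndexedPrimitive() plus the offset and stride passed to SetStreamSource(); whether any driver actually computes the upload range this way is an assumption.

```c
#include <stddef.h>

/* Byte range of a vertex buffer that an indexed draw can reference. */
struct upload_range
{
    size_t offset; /* first byte needed by the draw */
    size_t size;   /* number of bytes needed by the draw */
};

/* Hypothetical helper: MinVertexIndex is relative to BaseVertexIndex, so
 * the first referenced vertex is their sum. This sketch assumes that sum
 * is non-negative and ignores overflow. */
static struct upload_range indexed_draw_upload_range(size_t stream_offset, size_t stride,
        int base_vertex_index, unsigned int min_vertex_index, unsigned int num_vertices)
{
    struct upload_range range;
    long long first_vertex = (long long)base_vertex_index + (long long)min_vertex_index;

    range.offset = stream_offset + (size_t)first_vertex * stride;
    range.size = (size_t)num_vertices * stride;
    return range;
}
```

For example, with a 12 byte stride, MinVertexIndex 100 and NumVertices 50, only 600 bytes starting at byte 1200 would need to be uploaded, no matter how large the buffer is.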
I guess all of this means that applications need to cope with both behaviors (or, more likely, don't care) and we can probably get away with pretty much anything.
> From what I have seen in real games (e.g. World of Warcraft, Call of Duty Modern Warfare 2) textures are probably more interesting here than buffers. Both games use UpdateTexture with sysmem, D3DUSAGE_DYNAMIC source textures that they later map with DISCARD. When I worked on the command stream I honored that DISCARD flag, but I never wrote tests to show that it is correct to do so.
Good point. I wrote another quick test and it looks like D3DPOOL_SYSTEMMEM texture maps never block. Actually, the DISCARD flag seems to be ignored for D3DPOOL_SYSTEMMEM textures: the texture data is unchanged from the previous map. This seems to be the case for both Nvidia and AMD. Perhaps interestingly, the UpdateTexture() call also never blocks, as far as I can see. Nothing surprising otherwise, except that the readback after a draw apparently takes longer if the texture data was actually changed than if the texture was just mapped and not modified (e.g. on Nvidia it's pretty consistent at 2 - 2.5 ms when unmodified vs 3.7 ms when modified). I guess I shouldn't read too much into it.
To complete the test coverage of DISCARD, I also wrote a test for buffers. It turns out that DISCARD is ignored for SYSTEMMEM and MANAGED buffers: the map pointer and the buffer contents are unchanged after the DISCARD map.
If it's useful I can clean up those tests / hacks a bit and share them. Otherwise I'm probably going to make proper tests only for the DISCARD thing (i.e. what's not timing-related).