Regarding the concern about keeping the decoded strided data around after drawing has finished: this is intentional. The decoded vertex declaration remains valid after the draw is finished and the arrays are loaded, so future draws can reuse it as long as the state is not dirtied again.
This sounds like a good idea...
Regarding the upward references: once we have multithreading with multiple contexts, we will need per-context tracking of most of the state we currently keep in the device (last_was_rhw, ...). That structure does not exist yet, and I do not see the point of passing the device impl to every function when we can get it from the stateblock too.
I still think this caching stuff needs to go into its own structure (maybe device->cache or something like that). I can see how it's a bit different from the standard stateblock data, since it's just a cache rather than something the application sets and gets: cache state vs. configured state. It's not so clear to me that this is all per-thread data: the strided streams are directly tied to the vertex declaration and stream data, which are in turn shared across threads as part of the stateblock.
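
To make the "cache state vs. configured state" split concrete, here is a minimal sketch in C. Every name in it (struct device, struct stream_cache, device_get_decoded_streams, and so on) is hypothetical and chosen for illustration only; none of them are existing wined3d types or functions, and the "decode" step is a stand-in for the real strided-data decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical, for illustration only. */

/* Configured state: what the application sets and gets. */
struct configured_state {
    int vertex_declaration_id;   /* stand-in for the real declaration */
};

/* Cache state: derived data, only meaningful while 'dirty' is false. */
struct stream_cache {
    bool dirty;
    int decoded_from;            /* declaration the cache was built from */
    unsigned decode_count;       /* number of times a decode was needed */
};

struct device {
    struct configured_state state;
    struct stream_cache cache;   /* the "device->cache" idea */
};

void device_init(struct device *dev)
{
    assert(dev != NULL);
    dev->state.vertex_declaration_id = 0;
    dev->cache.dirty = true;     /* nothing has been decoded yet */
    dev->cache.decoded_from = 0;
    dev->cache.decode_count = 0;
}

/* Changing configured state invalidates the cache. */
void device_set_vertex_declaration(struct device *dev, int decl_id)
{
    assert(dev != NULL);
    dev->state.vertex_declaration_id = decl_id;
    dev->cache.dirty = true;
}

/* Decode only when dirty; later draws reuse the cached result. */
const struct stream_cache *device_get_decoded_streams(struct device *dev)
{
    assert(dev != NULL);
    if (dev->cache.dirty) {
        dev->cache.decoded_from = dev->state.vertex_declaration_id;
        dev->cache.decode_count++;
        dev->cache.dirty = false;
    }
    return &dev->cache;
}
```

With this layout, two draws in a row after a single declaration change decode once, and the cached data stays valid after the draw until the configured state is changed again. Whether the cache belongs to the device or to a future per-context structure is exactly the open question above; the sketch only shows it kept apart from the application-visible state.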
===========
Note: you just turned drawPrimitiveTraceDataLocations into dead code (it was previously called in the d3d8 fixed-function code path). I don't care much for this function, but please remove it if you are getting rid of its callers.