 
            Hi, (warning: long mail)
Currently the wined3d code does more or less synchronous rendering, which means that a Direct3D function call from the app results directly in the equivalent OpenGL call(s). There are a few issues with that:
* Multithreaded Direct3D: OpenGL calls can only be made from the thread that owns the glX context, while Direct3D calls can be made from any thread. Passing the context around is only possible with hacks (SetThreadContext or pthread_kill) and is prone to deadlocks.
* Performance: Applications expect the 3D calls to return immediately so they can do other things while the GPU is rendering. GL works in the same way, so our Direct3D rendering functions should return almost immediately too, but the state changes and drawStridedSlow seem to cause GL to wait until the pipeline is empty.
My suggestion is to create a per-device thread which does the rendering and owns the GL context, while the rendering calls only place some tokens into a queue and return immediately. This way the app gets control back immediately, and multithreaded Direct3D is only a matter of locking the queue correctly. The rendering thread and all rendering code would live in drawprim.c (and maybe a new file, e.g. opengl_utils.c). The other files would contain no GL code.
Here are some more concrete suggestions for implementing this:
The pipeline is a block of memory with a fixed size (e.g. 64k, whatever), and the work orders that are placed in it consist of an opcode and any number of arguments. A NULL opcode means that the place is empty. When a new operation doesn't fit at the bottom of the pipeline we start again at the top. An instruction pointer points at the opcode of the next instruction. If the next opcode is NULL, the pipeline is empty. After an instruction has been executed, its memory is zeroed and the instruction pointer is set to the first byte after the old instruction. A new instruction doesn't fit into the pipeline if it would overwrite nonzero memory; in that case we issue a warning and wait until some more space is free.
A small modification would be to fix the number of arguments per opcode. Checking for empty instructions would then be easier, because when placing a new instruction we only have to check whether the address holding the opcode is NULL instead of checking the whole memory range. On the other hand, this wastes memory.
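A minimal sketch of the fixed-argument variant, to make the slot handling concrete. All names here (cmd_queue, queue_push, queue_pop, the OP_* values, the slot and argument counts) are made up for illustration and are not existing wined3d identifiers. The sketch is single-threaded; the locking or memory barriers needed between the application thread and the render thread are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, kept tiny so the wrap-around is easy to follow. */
#define QUEUE_SLOTS 8
#define MAX_ARGS    4

/* Hypothetical opcodes; 0 is reserved to mark an empty slot. */
enum opcode { OP_EMPTY = 0, OP_SETRENDERSTATE, OP_DRAWPRIM, OP_FLIP };

struct instruction {
    unsigned int  opcode;           /* OP_EMPTY means the slot is free */
    unsigned long args[MAX_ARGS];   /* fixed number of arguments */
};

struct cmd_queue {
    struct instruction slots[QUEUE_SLOTS];
    size_t write_pos;   /* next slot the application side fills */
    size_t read_pos;    /* "instruction pointer" of the render side */
};

void queue_init(struct cmd_queue *q)
{
    memset(q, 0, sizeof(*q));
}

/* Returns 1 on success, 0 if the target slot is still occupied.
 * In the proposal the caller would warn and wait in that case. */
int queue_push(struct cmd_queue *q, unsigned int opcode,
               const unsigned long *args, size_t nargs)
{
    struct instruction *slot = &q->slots[q->write_pos];

    assert(opcode != OP_EMPTY && nargs <= MAX_ARGS);
    if (slot->opcode != OP_EMPTY)
        return 0;   /* only the opcode has to be checked */

    memset(slot->args, 0, sizeof(slot->args));
    if (nargs)
        memcpy(slot->args, args, nargs * sizeof(*args));
    slot->opcode = opcode;  /* written last, it marks the slot as used */
    q->write_pos = (q->write_pos + 1) % QUEUE_SLOTS;
    return 1;
}

/* Returns 1 and copies the next instruction to *out, or 0 if empty. */
int queue_pop(struct cmd_queue *q, struct instruction *out)
{
    struct instruction *slot = &q->slots[q->read_pos];

    if (slot->opcode == OP_EMPTY)
        return 0;   /* pipeline is empty */

    *out = *slot;
    memset(slot, 0, sizeof(*slot));  /* zero the executed instruction */
    q->read_pos = (q->read_pos + 1) % QUEUE_SLOTS;
    return 1;
}
```

With fixed slots the wrap-around is just a modulo on the slot index, which is the simplification bought with the wasted argument space.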
Of course we could also HeapAlloc the instructions and place pointers to the allocated memory in the pipeline. That doesn't waste memory (maybe, depending on how HeapAlloc works), but imposes the overhead of regular HeapAlloc / HeapFree calls.
So what instructions would we need? Everything that issues GL calls. Here are some I could think of and some implementation thoughts:
SetRenderState: IWineD3DDevice::SetRenderState sets the update stateblock and, if not recording to a stateblock, places a SETRENDERSTATE operation. Arguments are the render state to set and the value to set it to. When the instruction from the pipeline is executed, the value is set in the actual render stateblock and the GL state is updated with the code that is already in SetRenderState. IWineD3DDevice::GetRenderState returns the value from the update stateblock, so it is independent of the execution state of the pipe.
SetTextureStageState: Arguments are the stage, state and value; otherwise it is similar to SetRenderState.
SetStreamSource, SetTexture, Set*Shader: Update the update stateblock, update the refcounts and, if not recording, queue a setting operation for the stream/texture/shader. This operation updates the render stateblock, but does not necessarily change the GL setting (e.g. SetTexture requires texture coords in the vertex too). The getters return the values stored in the update stateblock.
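A rough sketch of the two stateblocks described for SetRenderState. The struct layout, the function names and the small pending array (a stand-in for the real pipeline) are assumptions for illustration only; the actual GL update is reduced to a comment.

```c
#include <assert.h>

#define NUM_RENDER_STATES 256   /* assumed table size for this sketch */
#define MAX_PENDING       16    /* stand-in for the real pipeline */

struct stateblock {
    unsigned long states[NUM_RENDER_STATES];
};

struct set_rs_cmd {
    unsigned int  state;
    unsigned long value;
};

/* Hypothetical device layout, not the real IWineD3DDeviceImpl. */
struct device {
    struct stateblock update_block;  /* what the getters report */
    struct stateblock render_block;  /* what the render thread applied */
    int recording;                   /* nonzero while recording a stateblock */
    struct set_rs_cmd pending[MAX_PENDING];
    unsigned int pending_count;
};

/* Application side: update the update stateblock, queue unless recording. */
void device_set_render_state(struct device *d, unsigned int state,
                             unsigned long value)
{
    assert(state < NUM_RENDER_STATES);
    d->update_block.states[state] = value;
    if (d->recording)
        return;

    assert(d->pending_count < MAX_PENDING);
    d->pending[d->pending_count].state = state;
    d->pending[d->pending_count].value = value;
    d->pending_count++;
}

/* Application side: answered from the update stateblock only, so the
 * result does not depend on how far the pipeline has been executed. */
unsigned long device_get_render_state(const struct device *d,
                                      unsigned int state)
{
    assert(state < NUM_RENDER_STATES);
    return d->update_block.states[state];
}

/* Render thread side: apply queued SETRENDERSTATE operations. */
void render_thread_execute_pending(struct device *d)
{
    unsigned int i;

    for (i = 0; i < d->pending_count; i++) {
        d->render_block.states[d->pending[i].state] = d->pending[i].value;
        /* the existing GL state update code would run here */
    }
    d->pending_count = 0;
}
```

The point of the split is that a getter never has to wait for the render thread, because it never reads the render stateblock.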
SetDisplayMode, GetDisplayMode: Not GL calls
SetClipPlane, SetMaterial, SetLight, SetLightEnable, SetTransform, MultiplyTransform, SetViewport: Pretty simple, update the update stateblock, ...
SetFVF, SetVertexDeclaration: Updates the update stateblock and queues a SetDeclaration operation. The declaration is stored in the render stateblock and referenced for rendering. I'd suggest that the render thread should not deal with FVFs
Set*ShaderConstant: No idea
UpdateTexture, UpdateSurface: No idea either. Maybe relay to DirectDraw Blits
ApplyStateBlock: Compare the stateblock contents against the updatestateblock, update the updatestateblock and queue Set* commands for different ones
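The ApplyStateBlock idea can be sketched like this: walk both blocks, and for every entry that differs, update the update stateblock and emit one Set* command. The flat array of states and the names apply_stateblock and set_cmd are hypothetical simplifications; a real stateblock holds many kinds of state, not a single array.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STATES 8   /* tiny, hypothetical state table */

struct set_cmd {
    unsigned int  state;
    unsigned long value;
};

/* Compares the stateblock being applied against the update stateblock.
 * Differing entries are copied into the update stateblock and one
 * command per difference is written to out (capacity NUM_STATES).
 * Returns the number of commands that would be queued. */
size_t apply_stateblock(const unsigned long *applied, unsigned long *update,
                        struct set_cmd *out)
{
    size_t n = 0;
    unsigned int i;

    assert(applied && update && out);
    for (i = 0; i < NUM_STATES; i++) {
        if (applied[i] == update[i])
            continue;           /* unchanged, nothing to queue */
        update[i] = applied[i];
        out[n].state = i;
        out[n].value = applied[i];
        n++;
    }
    return n;
}
```

Applying the same stateblock twice in a row queues nothing the second time, which keeps redundant state changes out of the pipeline.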
Surface locking: Set up the local memory for the surface and, if necessary, issue a command to read back the surface from GL. Wait for this command to be executed and wait until the last command referencing the surface is finished. If a surface is locked often, keep the local memory copy to avoid flushing the whole pipeline for the readback command. When the surface memory is ready, return.
Surface unlocking: If necessary, start converting the surface (e.g. for color keying) in a separate thread and return. If the surface is used for drawing before the conversion is complete, the rendering thread has to wait until the conversion is finished. Uploading the surface to GL is done during drawprim when the surface is used.
Vertex buffer locking: Similar to surface locking. If neither the NOOVERWRITE nor the DISCARD locking flag is provided, wait until all rendering with the buffer is done. Then return the buffer data. We may have to give up the idea of mapping GL memory via glMapBuffer, or we might have to wait for the whole pipeline to be executed to place a command for that.
Unlocking vertex buffers: If the semantics of the data are known, start fixing up vertices in a separate thread. When done fixing up the buffer, place a preload command into the pipeline to load the buffer as early as possible; some GL implementations seem to need that. Again, if the buffer is used for drawing before the conversion is done, the drawing thread has to wait. Also convert buffers if no VBOs are available, to get rid of drawStridedSlow completely.
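The wait decision for vertex buffer locks could look roughly like the following. The SKETCH_LOCK_* constants mirror the values of D3DLOCK_NOOVERWRITE and D3DLOCK_DISCARD from the Direct3D 9 headers; the struct, its queued_draws counter and the function name are assumptions made up for this sketch.

```c
#include <assert.h>

/* Same values as D3DLOCK_NOOVERWRITE / D3DLOCK_DISCARD in d3d9types.h,
 * renamed so the sketch does not clash with real headers. */
#define SKETCH_LOCK_NOOVERWRITE 0x00001000UL
#define SKETCH_LOCK_DISCARD     0x00002000UL

/* Hypothetical buffer bookkeeping. */
struct vertex_buffer {
    unsigned int queued_draws;  /* queued draws still using this buffer */
};

/* Returns 1 if Lock has to block until the render thread is done with
 * the buffer, 0 if the buffer data can be handed out right away. */
int buffer_lock_must_wait(const struct vertex_buffer *vb, unsigned long flags)
{
    assert(vb);
    /* With either flag the app promises not to disturb pending draws. */
    if (flags & (SKETCH_LOCK_NOOVERWRITE | SKETCH_LOCK_DISCARD))
        return 0;
    return vb->queued_draws != 0;
}
```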
Drawprim: This is the most complex thing. First, check if all bound textures and vertex buffers are unlocked (unit test!). Then increase the rendering reference counter of all textures and buffers (to count how often the object is used in the queue). Then queue a drawprim command and return. If drawing from a user pointer, we either have to wait until drawing is done or create a copy before placing the call (this is my favorite, we can fix up colors too while we're at it).
Blits: Find out if the blit can be handled in OpenGL and queue a blit call (which will draw a textured quad). If GL can't handle it, fall back to the GDI code, which performs everything in software; from a GL perspective this amounts to surface locks. This is slow, so we will want to handle everything in GL.
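A sketch of the bookkeeping around queueing a draw, under a few assumptions: the resource struct and function names are invented, rejecting a draw while a bound resource is locked is only one possible outcome (the unit test mentioned above would have to show what Direct3D really does), and the actual command queueing is reduced to a comment. The last helper shows how Release could tell whether an object is still needed in the pipeline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical common bookkeeping for textures and vertex buffers. */
struct resource {
    int          locked;      /* nonzero while the app holds a lock */
    unsigned int queue_refs;  /* how often the queue references it */
};

/* Application side. Returns 1 if the draw was queued, 0 if it was
 * rejected because a bound resource is locked (assumed behavior). */
int queue_draw(struct resource **bound, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (bound[i]->locked)
            return 0;

    for (i = 0; i < count; i++)
        bound[i]->queue_refs++;

    /* the DRAWPRIM command would be placed into the pipeline here */
    return 1;
}

/* Render thread side, called after the draw has been executed. */
void draw_finished(struct resource **bound, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        assert(bound[i]->queue_refs > 0);
        bound[i]->queue_refs--;
    }
}

/* Release would wait until this returns nonzero before destroying. */
int resource_can_destroy(const struct resource *r)
{
    return r->queue_refs == 0;
}
```

The lock check runs over all resources before any counter is touched, so a rejected draw leaves the reference counts unchanged.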
GetDC: At the moment this is a LockRect from the gl pov, we may want to write a gl gdi driver which queues commands on the pipeline
Present: Queue a FLIP command and wait for the pipeline to be emptied, then return. Ideally the rendering is done when Present is called and present returns immediately.
Destroying objects: Wait until they aren't needed in the pipeline anymore in Release.
Open issues:
SetRenderTarget: Afaik render targets have their own GL context. Should we have a different pipeline or request the worker to switch to a different GL context? Synchronisation is an issue.
Multiple swapchains: Similar issue.
Anything I forgot?
How do we reference objects in the pipeline? To start with I'd suggest using the implementation pointer; later we may want to replace it with handles (to avoid issues with the pointer size on 64 bit). See the roadmap.
My suggestion for the roadmap:
1) Start by protecting the ddraw, d3d8 and d3d9 objects with critical sections against race conditions (easy).
2) Move the code around in wined3d a bit, splitting COM from GL stuff without actually changing the way rendering is done.
3) Add a stub pipeline and add code queueing the commands.
4) Move the context into the worker thread and call the actual GL commands from there.
Additional stuff that can be done if we feel like it:
* Get rid of COM in wined3d
* Move non-rendering things like Private Data, Stateblocks and Getters into ddraw, d3d8 and d3d9, leave only rendering code in wined3d. The software ddraw code has to stay in wined3d though
* Adopt ddi, ddentry or whatever is used in windows xp / vista (potential legal issues as these interfaces aren't well documented).
Stefan
 
            I already more or less mentioned this on IRC, but I think this sounds like a lot of trouble to implement and maintain, while I'm not quite convinced it will be worth it.

