 
            Hi,
it is probably still a bit early, but I would nevertheless like to announce a feature I am currently working on and present the first results. As some of you have already noticed (http://bugs.winehq.org/show_bug.cgi?id=35868), I recently submitted some simple stub patches to add the dxva2 DLL. The original purpose was to get a browser plugin working that expects this library to be available and otherwise refuses to run. The library exports some functions which are used by several applications (like VLC, Flash, Silverlight, ...) for GPU decoding. I started to work on these functions and want to present a first result, which you can see here: https://dl.dropboxusercontent.com/u/61413222/dxva2.png
This is actually the Windows version of VLC playing an MPEG2 movie with GPU acceleration using DXVA2. My implementation of dxva2 uses VAAPI on Linux to do the actual GPU decoding and should support AMD, Intel and NVIDIA cards.
Currently only MPEG2 decoding is supported, as it is one of the easier codecs; others like H264 need a lot more buffers, which need to be translated from the DXVA format to VAAPI. The second easiest codec to implement would be MPEG4, but as none of my graphics cards support MPEG4, I will most probably continue with VC-1. Anyway, I need to clean up the patches a bit, as they add about 3000 new lines of code, and test them with some other graphics cards before I can provide them. There are also some problems, mostly d3d9 related, on which I would like to get your opinion.
The most difficult part is that DXVA2 is completely based on IDirect3DDevice9 and IDirect3DSurface9. DXVA2 places the output images into a surface, and the application locks the surface to get the output data or simply presents it to the screen. Although it would be much more efficient to blit the data directly on the graphics card, at least VLC reads it back into system memory, as the decoding and output pipelines are separated.
The problem is that I actually need to allocate twice the amount of memory for decoding, since I need to provide the Direct3D surfaces to the application and I also need to provide buffers to VAAPI. This is not a big problem for MPEG2, since it only uses 3 output images, as a B frame can only reference the previous and the next frame. For H264, however, this gets insane, as it requires storing up to 16 output images, so I would need to allocate 16 VAAPI buffers and 16 Direct3D surfaces.
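To put rough numbers on the doubling, here is a minimal sketch in C. It assumes 8-bit 4:2:0 frames (such as NV12) at 1920x1088 without row padding, and that the Direct3D surfaces and the VAAPI buffers have the same size. The function names are made up for illustration and are not part of the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one 8-bit 4:2:0 frame (e.g. NV12): a full-resolution
 * luma plane plus two chroma components at half resolution in both
 * directions. Assumes even dimensions and no row padding. */
size_t yuv420_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Total bytes when every output image exists twice: once as a
 * Direct3D surface and once as a VAAPI buffer (assumed equal in size). */
size_t mirrored_pool_bytes(size_t width, size_t height, size_t frame_count)
{
    return 2 * frame_count * yuv420_frame_bytes(width, height);
}
```

Under these assumptions, mirrored_pool_bytes(1920, 1088, 3) comes to about 18.8 MB for MPEG2, while mirrored_pool_bytes(1920, 1088, 16) comes to about 100.3 MB for H264.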
Currently I lock both kinds of buffers after rendering a frame and do the synchronization in system memory, which is kind of inefficient depending on the surface type. My original idea was to do the copy on the graphics card, as I can copy the image to a texture after decoding, but after Sebastian implemented this part we found out that VAAPI implies a format conversion to RGB when copying data to a texture. This is actually a no-go, since VLC will refuse to use hardware acceleration when the output format is RGB. I also think it is kind of stupid to convert the RGB data back to YUV, so that we end up with 3 color space conversions (YUV->RGB->YUV->RGB). An Intel developer wrote (see vaCopySurfaceGLX() at http://markmail.org/message/a3sav6q3dm5qvmat) that it would be possible to implement a copy in NV12 format for NVIDIA and Intel but not AMD. We could try to ask them to implement it, so that we can at least do it efficiently for these two vendors.
Anyway, if other applications also copy the data back to system memory, it might be better to instead wrap the VAAPI buffers as Direct3D9 surfaces, so that we can map the VAAPI buffers directly when LockRect() is called instead of copying the data. This would, however, cause problems when the application tries to pass this interface to Present().
So what do the wined3d guys think? Is it better to convince the Intel developers to allow a copy in YUV format and copy the data directly into the texture of a Direct3D9 surface, or to wrap the VAAPI buffers as Direct3D9 surfaces and add some glue code for when an application tries to render them? Or do you have any better ideas?
Regards, Michael
 
            If this mail only contains this line my android client screwed up and I'll resend it tomorrow.
On 31.03.2014 at 21:16, "Michael Müller" michael@fds-team.de wrote:
This is actually the Windows version of VLC playing an MPEG2 movie with GPU acceleration using DXVA2. My implementation of dxva2 uses VAAPI on Linux to do the actual GPU decoding and should support AMD, Intel and NVIDIA cards.
Cool :-)
The most difficult part is that DXVA2 is completely based on IDirect3DDevice9 and IDirect3DSurface9. DXVA2 places the output images into a surface, and the application locks the surface to get the output data or simply presents it to the screen.
I did some introductory interface reading. If I understand it correctly, the dxva implementation / driver can control the pool of the input surface. Not only that, it actually creates the surface. Is that correct?
Afaics the output surface is either a dxva-created surface or a render target, is that correct?
Currently I lock both kinds of buffers after rendering a frame and do the synchronization in system memory, which is kind of inefficient depending on the surface type.
If you are in system memory, is there an issue with using the d3d surface's memory as the vaapi input buffer? Also take note of user pointer surfaces / textures in d3d9ex.
My original idea was to do the copy on the graphics card, as I can copy the image to a texture after decoding, but after Sebastian implemented this part we found out that VAAPI implies a format conversion to RGB when copying data to a texture.
I do not know of any windows driver that supports YUV render targets (see above). Are dxva-created output surfaces video memory surfaces (or textures) or system memory surfaces? If they are sysmem surfaces you don't have a problem - the app either has to read back to sysmem or put up with an RGB surface / texture.
But even if you're copying to an RGB surface you have to get the GL texture from the IDirect3DSurface9 somehow. There may not even be one, if the surface is just the GL backbuffer. This is just a wine-internal problem though and should be solvable one way or another.
The vaapi-glx interface is also missing options for the mipmap level and cube map face. I guess you can ignore that until you find an application that wants a video decoded to the negative z face, mipmap level 2, of a rendertarget-capable d3d cube texture.
You may also want a way to make wined3d activate the device's WGL context. Right now that's not much of an issue if your code is called from the thread that created the device. The command stream will make this more difficult though.
Anyway, if other applications also copy the data back to system memory, it might be better to instead wrap the VAAPI buffers as Direct3D9 surfaces, so that we can map the VAAPI buffers directly when LockRect() is called instead of copying the data. This would, however, cause problems when the application tries to pass this interface to Present().
IDirect3DDevice9::Present does not accept surfaces, but the problem remains for IDirect3DDevice9::UpdateSurface.
If the vaapi buffer has a constant address you can create a user memory d3d surface. I wouldn't be surprised if dxva was a motivation for user memory surfaces.
On a related note, we don't want any GLX code in wined3d, and probably not in any dxva.dll. The vaapi-glx.h header seems simple enough to use through WGL as it just says a context needs to be active. If not, you'll have to export a WGL version of vaapi from winex11.drv.
At some point we should think about equivalent interfaces on OSX and how to abstract between that and vaapi, but not today.
 
            Hi Stefan,
I did some introductory interface reading. If I understand it correctly, the dxva implementation / driver can control the pool of the input surface. Not only that, it actually creates the surface. Is that correct?
Afaics the output surface is either a dxva-created surface or a render target, is that correct?
All surfaces which are used in conjunction with the DXVA API are created through the CreateSurface method of the IDirectXVideoAccelerationService interface.
If you are in system memory, is there an issue with using the d3d surface's memory as the vaapi input buffer? Also take note of user pointer surfaces / textures in d3d9ex.
The surfaces are only used for storing the output image, and they may have a different size than the buffers used in VAAPI. MPEG2, for example, uses macroblocks with a size of 16x16 pixels, so the size of a frame must be divisible by 16. I noticed that VLC creates the surfaces with the size of the actual video while it initializes the decoder with a multiple of 16. Moreover, I cannot specify the address to which the output data should be copied; I can only map the buffer at an address defined by VAAPI and copy it manually.
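The size mismatch and the manual copy can be sketched roughly as follows. This is a simplified illustration in C and not the code from the patches: it assumes NV12 data with the same pitch for both planes, and struct nv12_view and all function names are hypothetical. In practice the pitches and plane offsets would come from the mapped VAAPI image and from the locked Direct3D surface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a dimension up to the next multiple of the 16-pixel macroblock size. */
unsigned int align_to_macroblock(unsigned int pixels)
{
    return (pixels + 15u) & ~15u;
}

/* Hypothetical description of an NV12 image in memory. */
struct nv12_view {
    uint8_t *data;     /* start of the luma plane */
    size_t pitch;      /* bytes per row, assumed equal for both planes */
    size_t uv_offset;  /* byte offset of the interleaved UV plane */
};

/* Copy the first row_bytes of each row, skipping any padding. */
static void copy_plane(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_pitch,
                       size_t row_bytes, size_t rows)
{
    size_t y;

    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (y = 0; y < rows; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}

/* Copy only the visible width x height area from the (larger) decoder
 * image into the surface; the macroblock padding is dropped. */
void copy_nv12_visible(const struct nv12_view *dst, const struct nv12_view *src,
                       size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    /* Luma: one byte per pixel. */
    copy_plane(dst->data, dst->pitch, src->data, src->pitch, width, height);
    /* Chroma: half the rows, width bytes per row (U,V pairs). */
    copy_plane(dst->data + dst->uv_offset, dst->pitch,
               src->data + src->uv_offset, src->pitch, width, height / 2);
}
```

For a 1920x1080 video, align_to_macroblock(1080) gives 1088, so the decoder image has 8 more rows than the surface VLC creates, and only the first 1080 luma rows are copied.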
I do not know of any windows driver that supports YUV render targets (see above). Are dxva-created output surfaces video memory surfaces (or textures) or system memory surfaces? If they are sysmem surfaces you don't have a problem - the app either has to read back to sysmem or put up with an RGB surface / texture.
DXVA supports both: direct rendering (called native mode) and reading it back to system memory ( see http://en.wikipedia.org/wiki/DirectX_Video_Acceleration#DXVA2_implementation... )
But even if you're copying to an RGB surface you have to get the GL texture from the IDirect3DSurface9 somehow. There may not even be one, if the surface is just the GL backbuffer. This is just a wine-internal problem though and should be solvable one way or another.
The vaapi-glx interface is also missing options for the mipmap level and cube map face. I guess you can ignore that until you find an application that wants a video decoded to the negative z face, mipmap level 2, of a rendertarget-capable d3d cube texture.
You may also want a way to make wined3d activate the device's WGL context. Right now that's not much of an issue if your code is called from the thread that created the device. The command stream will make this more difficult though.
We implemented a hack to get the OpenGL texture id of a D3D9 surface and to make the OpenGL context current by calling acquire_context(). As mentioned in the first email, the screenshot was created using the copy-back approach.
If the vaapi buffer has a constant address you can create a user memory d3d surface. I wouldn't be surprised if dxva was a motivation for user memory surfaces.
On a related note, we don't want any GLX code in wined3d, and probably not in any dxva.dll. The vaapi-glx.h header seems simple enough to use through WGL as it just says a context needs to be active. If not, you'll have to export a WGL version of vaapi from winex11.drv.
At some point we should think about equivalent interfaces on OSX and how to abstract between that and vaapi, but not today.
We actually thought about a better way to get around these problems. We could introduce a new surface type which uses the VAAPI buffers as its backend. If the user wants to read the data back to system memory, we can simply use the map function of VAAPI, and if the user wants to actually present the surface, we could use the VAAPI commands to convert it into an RGB texture, with features like deinterlacing. This would allow us to implement native and copy-back mode without doing unnecessary conversions or memory copies.
Do you think it would be okay if we tried to add such a new type of surface? I think we would need to put the VAAPI commands into the X11 driver and export some functions which can be called from d3d.
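As a rough sketch of what such a surface type could look like, assuming a small table of backend hooks exported by the display driver: every name below is hypothetical and nothing here is an existing Wine or libva interface. The fake backend only exists so that the sketch can run without a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks; a real backend would wrap the VAAPI calls for
 * mapping a decoded image and for converting it into an RGB texture. */
struct video_backend {
    void *(*map)(void *buffer);
    void  (*unmap)(void *buffer);
    int   (*convert_to_rgb_texture)(void *buffer, unsigned int texture_id);
};

struct video_surface {
    const struct video_backend *backend;
    void *buffer;    /* opaque decoder buffer handle */
    void *mapping;   /* non-NULL while the surface is locked */
};

/* Copy-back path: hand out the decoder's own memory, no extra copy. */
void *video_surface_lock(struct video_surface *s)
{
    if (!s->mapping)
        s->mapping = s->backend->map(s->buffer);
    return s->mapping;
}

void video_surface_unlock(struct video_surface *s)
{
    if (s->mapping) {
        s->backend->unmap(s->buffer);
        s->mapping = NULL;
    }
}

/* Native path: convert on the GPU; refused while the surface is locked. */
int video_surface_to_texture(struct video_surface *s, unsigned int texture_id)
{
    if (s->mapping)
        return -1;
    return s->backend->convert_to_rgb_texture(s->buffer, texture_id);
}

/* Stand-in backend backed by plain memory, for illustration only. */
struct fake_buffer {
    unsigned char pixels[16];
    int map_count;
    int convert_count;
};

static void *fake_map(void *buffer)
{
    struct fake_buffer *b = buffer;
    b->map_count++;
    return b->pixels;
}

static void fake_unmap(void *buffer)
{
    (void)buffer;
}

static int fake_convert(void *buffer, unsigned int texture_id)
{
    struct fake_buffer *b = buffer;
    (void)texture_id;
    b->convert_count++;
    return 0;
}

const struct video_backend fake_backend = { fake_map, fake_unmap, fake_convert };
```

The point of the hook table is that the d3d side never calls VAAPI directly, so the same surface type could sit on top of a different decoding API later.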
I also uploaded the patches in their current state so that you can take a look at what is actually needed to implement dxva2, but they are not yet in a state in which they could go upstream (we use a separate X11 connection, link statically against libva, use inefficient algorithms for copying frames, ...).
You can find them here: https://github.com/compholio/wine-compholio-daily/tree/dxva2/patches/11-DXVA... on the dxva2 branch.
To test it with VLC you need:
1. 32 bit version of libva-dev 1.2.1
On Ubuntu you can get this version of libva-dev from my PPA: https://launchpad.net/~pipelight/+archive/libva (except for Trusty Tahr, which already provides this version)
2. Install the VAAPI driver; for NVIDIA you need vdpau-va-driver
Make sure that vainfo (apt-get install vainfo) shows the MPEG2 VLD decoder.
3. You also need to apply this nasty hack to get around a problem with VLC and Direct3D: http://ix.io/bo5
4. Set the Windows version of the wine prefix to Vista, as DXVA2 is only available in >= Vista
5. Install the current git version of VLC (the stable version has a bug in the DXVA2 code which breaks the decoding of P and B frames). You can grab it here: http://nightlies.videolan.org/build/win32/last/
( See https://trac.videolan.org/vlc/ticket/10868 for more information about the bug. It took me quite some time to figure out that this bug is in VLC, and not in my code... )
6. Start VLC and enable DXVA2 in the Input/Codecs options. Test it :-)
I have not tried it on anything other than NVIDIA yet, and there is some untested code in the patches for features which are not supported by the vdpau wrapper, so it may break on other graphics cards.
For other users who want to try out the patchset and expect a huge performance boost: I have to disappoint you! During my tests it was still slower than CPU decoding, but I expect better performance after all the copy overhead has been removed, and especially for other codecs like H264 the performance boost should be easier to notice. ;-)
Michael
 
            On 01.04.2014 at 01:48, Michael Müller michael@fds-team.de wrote:
All surfaces which are used in conjunction with the DXVA API are created through the CreateSurface method of the IDirectXVideoAccelerationService interface.
IDirectXVideoProcessor::VideoProcessBlt (http://msdn.microsoft.com/en-us/library/windows/desktop/ms697022(v=vs.85).as...) mentions that it also accepts user-created surfaces with D3DUSAGE_RENDERTARGET, but it seems that this is a different interface.
Moreover, I cannot specify the address to which the output data should be copied; I can only map the buffer at an address defined by VAAPI and copy it manually.
Does the va-api guarantee that the address is the same every time you map it?
DXVA supports both: direct rendering (called native mode) and reading it back to system memory ( see http://en.wikipedia.org/wiki/DirectX_Video_Acceleration#DXVA2_implementation... )
Can you call IDirectXVideoAccelerationService::CreateSurface to create a DXVA2_VideoDecoderRenderTarget surface with a format of NV12 (or another non-RGB format) that is in D3DPOOL_DEFAULT? It looks like the API allows it in theory, but I wonder if this works in practice. If DXVA2_VideoDecoderRenderTarget implies D3DUSAGE_RENDERTARGET, creating such a surface will not be possible.
Even if the create call succeeds, look at the details of the surface you get. Call IDirect3DSurface9::GetDesc to check its pool, format and usage. Check if you can lock it. See if it has a texture container. I guess you can also compare the vtable to that of a surface created with IDirect3DDevice9::CreateOffscreenPlainSurface to see if dxva has created some sort of wrapper surface. (I doubt it. It would be asking for strange bugs.)
 
            On 31 March 2014 21:14, Michael Müller michael@fds-team.de wrote:
This is actually the Windows version of VLC playing an MPEG2 movie with GPU acceleration using DXVA2. My implementation of dxva2 uses VAAPI on Linux to do the actual GPU decoding and should support AMD, Intel and NVIDIA cards.
Which APIs did you consider for this? What were the various issues? What made you choose VAAPI in the end? Do you have tests for how the dxva / d3d9 interface is supposed to work?
 
            Hi Henri,
Which APIs did you consider for this? What were the various issues? What made you choose VAAPI in the end? Do you have tests for how the dxva / d3d9 interface is supposed to work?
I mostly considered VAAPI and VDPAU for this as they both offer support for multiple vendors.
VDPAU has native support for NVIDIA, AMD (only the open source driver) and S3, but not for Intel. There is an OpenGL backend for VDPAU which can be used on Intel graphics cards, but I expect Intel's video decode engine to achieve better performance than an implementation which is completely based on OpenGL.
VAAPI has native support for Intel, the Crystal HD decoder and S3, and can also use VDPAU and AMD's proprietary XvBA interface. It simply supports everything VDPAU does plus Intel and the proprietary AMD driver. I think that a library which supports the most graphics cards is the best possible option for Wine, as we do not want to implement these decoders multiple times.
The only issue I have encountered so far with VAAPI is that not all backends support all commands, and you sometimes get an unimplemented error. This is not a big problem in most cases, as you can use other commands to achieve the same result. The only real issue I have found so far is that the vdpau wrapper does not allow setting or querying the native image format of a codec. I can force a YUV 4:2:0 format, but I cannot set or query whether it is stored as NV12 or YV12. When mapping the image it is possible to specify an image format, but I don't know whether a conversion is done or whether it is the raw decoded data. It may be necessary to take a look at the vdpau library and hard-code some values if the vdpau backend is used, to avoid conversions between formats.
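To illustrate what such a conversion between the two 4:2:0 layouts involves, here is a minimal sketch in C. It assumes tightly packed frames without row padding, and the function name is made up for illustration. YV12 stores the Y plane followed by a V plane and a U plane, while NV12 stores the Y plane followed by a single plane of interleaved U,V pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Repack a tightly packed YV12 frame (Y plane, V plane, U plane) into
 * NV12 (Y plane, then one plane of interleaved U,V pairs).
 * dst and src must each hold width * height * 3 / 2 bytes. */
void yv12_to_nv12(uint8_t *dst, const uint8_t *src, size_t width, size_t height)
{
    size_t luma = width * height;
    size_t chroma = (width / 2) * (height / 2);
    const uint8_t *v = src + luma;   /* V comes first in YV12 */
    const uint8_t *u = v + chroma;
    size_t i;

    assert(width % 2 == 0 && height % 2 == 0);
    memcpy(dst, src, luma);          /* the luma plane is identical */
    for (i = 0; i < chroma; i++) {
        dst[luma + 2 * i] = u[i];    /* NV12 stores U before V */
        dst[luma + 2 * i + 1] = v[i];
    }
}
```

The luma plane is copied unchanged, but every chroma byte has to be touched, which is the kind of extra pass over each frame that hard-coding the native format would avoid.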
There are no tests so far. The reason for this is that mingw does not support the dxva header files and you cannot use the hardware decoder in a VM. So I basically wrote some test code in MSVC and tested it on an old laptop running Windows 7 with an NVIDIA 9800 GTS.
VLC wrote some header files for mingw which they use to cross-compile VLC, but they do not offer everything we need ( http://www.videolan.org/developers/vlc/modules/codec/avcodec/dxva2.c ). Maybe we can use the Wine header file or ship a more complete version to allow cross-compiling.
If you want to know how dxva2 is used by applications, I would suggest taking a look at these 3 files used by VLC to decode MPEG2 using dxva2:
http://www.videolan.org/developers/vlc/modules/codec/avcodec/dxva2.c http://www.ffmpeg.org/doxygen/1.0/dxva2__mpeg2_8c-source.html http://www.ffmpeg.org/doxygen/1.0/dxva2_8c-source.html
VLC initializes DXVA2, creates the surfaces and passes them to avcodec for decoding and storing the resulting images. Since you need a complete MPEG2 bitstream decoder to gather the information for decoding, it is not very easy to create a small code example.
@Stefan Dösinger: I will try to do some tests this evening when I am at home.
Michael
 
            Hi Michael,
I had a quick look over your patches. As Henri pointed out, figuring out the interaction between dxva2 and d3d is important. I would add that you should also study the interface to the GPU driver (maybe that's through an awful ExtEscape). The current code uses the VAAPI X11 APIs from dxva2.dll. This code should be moved in some way or another to the display driver, winex11.
Interaction between the display driver and dxva2 could be inspired by the Windows way (if documented well), but it may be too complex and you may need your own mechanism. Depending on how dxva2 and d3d can interact, you may want to use OpenGL and will probably need your own WGL extension (e.g. something similar to the VDPAU OpenGL interop extension).
Which video decoding library is better is hard to say. Over the years vdpau felt like the better library to me (it felt like it took years to get libva into some usable shape and I'm not sure how stable it is now). You mention that vdpau on Intel uses OpenGL. From my understanding it does use the hardware decoder and then uses some form of GL/GLX buffer sharing, which is not bad. They also defined a reasonable GL extension, which could be ported to Wine. Some more testing may be needed. If worst came to worst, maybe allow multiple APIs to be used, though less than ideal, but we have multiple backends for other features (xrandr / xvidmode).
On a side note, OpenCL also allows sharing of buffers with the video decoders using DirectX. It may give some clues on how it works on Windows and on how to do it for Wine. Long-term we may also want to support this OpenCL extension, assuming it is widely used (I doubt that right now).
I would urge you to focus on the driver and d3d interaction.
Thanks, Roderick
 
            On 1 April 2014 16:45, Roderick Colenbrander thunderbird2k@gmail.com wrote:
to Wine. Some more testing may be needed. If worst came to worst, maybe allow multiple APIs to be used, though less than ideal, but we have multiple backends for other features (xrandr / xvidmode).
I expect that at some point CodeWeavers may be interested in having this work on MacOS as well, so in that regard it seems likely that we'll end up with multiple backends.
 
            On 1 April 2014 15:50, Michael Müller michael@fds-team.de wrote:
Which APIs did you consider for this? What were the various issues? What made you choose VAAPI in the end? Do you have tests for how the dxva / d3d9 interface is supposed to work?
I mostly considered VAAPI and VDPAU for this as they both offer support for multiple vendors.
VDPAU has native support for NVIDIA, AMD (only the open source driver) and S3, but not for Intel. There is an OpenGL backend for VDPAU which can be used on Intel graphics cards, but I expect Intel's video decode engine to achieve better performance than an implementation which is completely based on OpenGL.
VAAPI has native support for Intel, the Crystal HD decoder and S3, and can also use VDPAU and AMD's proprietary XvBA interface. It simply supports everything VDPAU does plus Intel and the proprietary AMD driver. I think that a library which supports the most graphics cards is the best possible option for Wine, as we do not want to implement these decoders multiple times.
I don't have a particularly strong preference for one or the other, provided it works with the free drivers, but I was hoping for more of an analysis of the actual APIs, and how well they fit with DXVA. That does require figuring out all the D3D / DXVA interactions first, of course. In terms of vendor support, if it comes down to it, I don't think fglrx support is a very convincing argument, and I'm sure Intel would happily accept patches for e.g. VDPAU support. Did you look into OpenMAX at all?
There are no tests so far. The reason for this is that mingw does not support the dxva header files and you cannot use the hardware decoder in a VM. So I basically wrote some test code in MSVC and tested it on an old laptop running Windows 7 with an NVIDIA 9800 GTS.
VLC wrote some header files for mingw which they use to cross-compile VLC, but they do not offer everything we need ( http://www.videolan.org/developers/vlc/modules/codec/avcodec/dxva2.c ). Maybe we can use the Wine header file or ship a more complete version to allow cross-compiling.
I'm not sure how you're currently writing tests, but unless you have a good reason not to, you probably should be using "make crosstest".



