Hi everyone, D3DXLoadSurfaceFromMemory is used to load a chunk of pixel data into a surface, performing any pixel format conversion from signed/unsigned ARGB, floating point or FourCC formats to any other format and also performs various types of filtering actions (one of copying, point filtering, linear filtering, triangular filtering and bov filtering and additionally dithering). I've partially implemented this function during my GSoC project, supporting only unsigned ARGB formats (with 1-4 bytes) and copying for now (supporting signed formats and point filtering is trivial to add to this version), but due to the huge functionality of the function it's a hard task to get the code right without either being somewhat slower than the native dll or having thousands of lines of code for each conversion/filtering scenario. My current approach is to implement various specific converters (which actually do converting AND filtering): - converting from and to "small" signed/unsigned ARGB formats - converting "large" ARGB formats to small ones - converting "large" ARGB formats to other large ARGB formats - converting DXT formats to other DXT formats - converting DXT formats to small ARGB formats - .... - additionally, I plan to a function for each filter which filters data when the source and destination formats are the same (which gives a huge performance boost which is e.g. necessary for mipmap generation) As you can see, that's still a great bunch of code, but we don't necessarily need to implement all of these (we could just implement the to/from small ARGB part for each format type and use a buffer if we need to convert from one "unusual" format to another unusual one). These converters will need to be duplicated across the various filter types. This architecture is the only way I could think of managing the implementation of this function without macros (although they would really improve the code in this scenario) and still keeping up to speed with the native dll.
I have attached a patch providing my current implementation, providing support for copying data from small unsigned ARGB data to small unsigned ARGB data. A few notes on these few lines of code: 1. if(DestFormat.rbits) { 2. if(SrcFormat.rbits) { 3. for(shift = DestShiftR; shift > DestFormat.rshift; shift -= SrcFormat.rbits) *pPixel |= r << shift; 4. *pPixel |= (r >> (DestFormat.rshift - shift)) << DestFormat.rshift; 5. } else *pPixel |= DestFormat.rmask; 6. }
1: There's no need to mess around with the red component if we don't need it in the destination color anyways 2/5: If the source color does not provide a red component, we assume the maximum red value (which is the red mask) 3/4: These lines convert the source color component to the destination color component; the for loop is needed to make sure that e.g. an R5 0b11111 maps to an 0b11111111 in R8 instead of 0b11111000 (the for loop is faster than using floating point calculation for this) The actual D3DXLoadSurfaceFromMemory implementation doesn't do anything special, the "critical" stuff happens in copy_simple_data.
I hope the code is understandable in the other parts, my few main questions are: Do you agree that the architecture is the best approach or can you think of a better one? would using macros be legal here since it could really help reduce code duplication when implementing the other filters? Is the code itself okay? Suggestions about coding style or other things are welcome. Is there any optimization I missed? - I spend a somewhat large amount of time getting the right balance between wide format support and good performance, but maybe things can be improved a bit more. Right now my implementation is about 10 % slower than the native one.
Any other comments which could be interesting to me at this point are also appreciated, of course.
Best regards, Tony
2009/8/26 Tony Wasserka tony.wasserka@freenet.de:
I hope the code is understandable in the other parts, my few main questions are: Do you agree that the architecture is the best approach or can you think of a better one? would using macros be legal here since it could really help reduce code duplication when implementing the other filters? Is the code itself okay? Suggestions about coding style or other things are welcome. Is there any optimization I missed? - I spend a somewhat large amount of time getting the right balance between wide format support and good performance, but maybe things can be improved a bit more. Right now my implementation is about 10 % slower than the native one.
I think the main thing is that you really don't want branches inside the inner loop. You can handle the cases where a component is unused by building a mask before entering the loop and just or'ing with that mask inside the loop. Perhaps to compiler is smart enough to handle that one, but I don't think you want to rely on it. As for the shifts, in the large majority of cases you will only need two shifts. The only time you need more than that is when the destination component is more than two times the size of the source component, so I don't think you want to bother with that in the common function.
I notice you store the already shifted masks in the table. You'll probably find that that's impractical for formats where the total bit count is larger than 32.
+typedef struct _StaticPixelFormatDesc {
- D3DFORMAT format;
- BYTE abits, rbits, gbits, bbits;
- BYTE ashift, rshift, gshift, bshift;
- DWORD amask, rmask, gmask, bmask;
- UINT bytes_per_pixel;
- StaticFormatType type;
+} StaticPixelFormatDesc;
Looks like you took this from wined3d. I think the typedef and naming are unfortunate. Note that you can calculate the mask and bit count from each other.
+void copy_simple_data(LPBYTE pSrc, UINT SrcPitch, POINT SrcSize, StaticPixelFormatDesc SrcFormat,
LPBYTE pDest, UINT DestPitch, POINT DestSize, StaticPixelFormatDesc DestFormat,
DWORD dwFilter)
The function should be static. "pSrc", "SrcFormat", and "DestFormat" should be const. I'm not a fan of hiding that something is a pointer behind a type like LPBYTE and using Hungarian notation. "BYTE *src" seems simpler and clearer. Same goes for pretending control statements are function calls. Caps should be reserved for enum elements and macros.
+void get_format_info(D3DFORMAT format, StaticPixelFormatDesc *desc) +{
- int i = 0;
- while(formats[i].format != format && formats[i].format != D3DFMT_UNKNOWN) i++;
- memcpy(desc, &formats[i].format, sizeof(StaticPixelFormatDesc));
+}
As long as you don't modify "desc", the memcpy is redundant. Just return a const pointer to the original data.
Hi, thanks for the feedback, I agree with most of what you said. Just a few comments:
I think the main thing is that you really don't want branches inside the inner loop. You can handle the cases where a component is unused by building a mask before entering the loop and just or'ing with that mask inside the loop.
Good point. Not sure whether that eliminates the need for both the 'if(DestFormat.bits[x])' AND 'if(SrcFormat.bits[x])', but I can merge them into one 'if' at least.
As for the shifts, in the large majority of cases you will only need two shifts. The only time you need more than that is when the destination component is more than two times the size of the source component, so I don't think you want to bother with that in the common function.
I don't think there's much overhead by doing the for loop since even in the cases that need to shifts there's at least one if(SrcFormat.bits < DestFormat.bits) necessary. The code inside the for loop will just add one assignment and an additional if. Perhaps lateron this could really be split into a seperate function, but I don't think it's worth to add another hundred lines of code for this small optimization :/
Apart from that, I'll be changing my code accordingly to your suggestions. Best regards, Tony
2009/8/30 Tony Wasserka tony.wasserka@freenet.de:
Good point. Not sure whether that eliminates the need for both the 'if(DestFormat.bits[x])' AND 'if(SrcFormat.bits[x])', but I can merge them into one 'if' at least.
You'd need an AND mask for the DestFormat bits, but it's the same basic principle.
As for the shifts, in the large majority of cases you will only need two shifts. The only time you need more than that is when the destination component is more than two times the size of the source component, so I don't think you want to bother with that in the common function.
I don't think there's much overhead by doing the for loop since even in the cases that need to shifts there's at least one if(SrcFormat.bits < DestFormat.bits) necessary. The code inside the for loop will just add one assignment and an additional if.
I suspect having branches at all inside the loop hurts, but just see what you can come up with.
Henri Verbeet wrote:
I suspect having branches at all inside the loop hurts, but just see what you can come up with.
I don't think that branching inside the loop is a bad thing, branching outside of the loop might cause some serious issues if not handled properly.
James McKenzie
On Wed, Aug 26, 2009 at 12:10:36PM +0200, Tony Wasserka wrote:
- BYTE abits, rbits, gbits, bbits;
- BYTE ashift, rshift, gshift, bshift;
- DWORD amask, rmask, gmask, bmask;
If you make those into arrays you will be able to cut down on the code size, and the compiler might like it better.
I think it would be better to store this in surface.c since I can't think of anything else that would need this.
Hi,
If you make those into arrays you will be able to cut down on the code size, and the compiler might like it better.
While I didn't notice any speedups in the code itself, it's a bit cleaner (especially the format table) now, thanks! (Come to think about it, it wouldn't have made much sense to talk about red masks for luminance formats)
I think it would be better to store this in surface.c since I can't think of anything else that would need this.
D3DXCheckTextureRequirements in texture.c will make use of the format table, since it also needs information about bits per component and component ordering.
On Sun, Aug 30, 2009 at 12:56:26PM +0200, Tony Wasserka wrote:
I think it would be better to store this in surface.c since I can't think of anything else that would need this.
D3DXCheckTextureRequirements in texture.c will make use of the format table, since it also needs information about bits per component and component ordering.
You are right, but it would be neat to have them all grouped like they are in d3dx9tex.h.
I seem to remember one additional thing from the MSDN: The alpha channel has a specific behavior when it is added (RGB -> RGBA), it is set to either 0.0 or 1.0. 1.0 makes sense. I think other channels are set to 0.0 if they are added.
Better test this though.
You are right, but it would be neat to have them all grouped like they are in d3dx9tex.h.
Not sure about that, assuming we ever support all possible filters and conversions in D3DXLoadSurfaceFromMemory, we probably don't want to have the texture stuff starting suddenly at line 3k or something... We won't be able to put the ID3DXBaseMesh, ID3DXMesh, ID3DXPatchMesh, ID3DXPMesh and ID3DXSPMesh interface implementations in one file either.
I seem to remember one additional thing from the MSDN: The alpha channel has a specific behavior when it is added (RGB -> RGBA), it is set to either 0.0 or 1.0. 1.0 makes sense. I think other channels are set to 0.0 if they are added.
Better test this though.
According to http://msdn.microsoft.com/en-us/library/bb206254%28VS.85%29.aspx (and to my tests), new channels are set to their maximum value with the exception of A8, which was not yet handled in my patch but will be trivial to add in the new version.