https://bugs.winehq.org/show_bug.cgi?id=50898
Bug ID: 50898
Summary: Improve performance for RGB lookups into color tables conversion
Product: Wine
Version: 6.5
Hardware: x86-64
OS: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: gdi32
Assignee: wine-bugs@winehq.org
Reporter: gabrielopcode@gmail.com
Distribution: ---
Created attachment 69710 --> https://bugs.winehq.org/attachment.cgi?id=69710 Generate a lookup cache for RGB into color table conversion
This significantly improves the performance of color conversion from RGB to a palette, mainly for older applications and games, by generating a lookup cache that maps 5-bit-per-channel (centered) RGB values to palette entries before doing the conversion. The comments in the patch describe the algorithm I came up with for large color tables.
Note that Windows also quantizes the lookup to 5 bits per channel, so it very likely does something similar (though it seems to do more caching, based on my tests). Wine already has this quantization in its code to match Windows, but we don't take advantage of it, so it's wasted.
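To make the idea concrete, here is a minimal sketch (not the actual patch code) of how such a cache could be indexed: a 32768-entry (32x32x32) table addressed by the top 5 bits of each channel, so the per-pixel conversion becomes a single table read. Names and types are illustrative only.

static inline unsigned int rgb_to_cache_index( unsigned char r, unsigned char g,
                                               unsigned char b )
{
    /* top 5 bits of each channel -> 15-bit index, 0..32767 */
    return ((unsigned int)(r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3);
}

static inline unsigned char lookup_pixel( const unsigned char cache[32768],
                                          unsigned char r, unsigned char g,
                                          unsigned char b )
{
    /* per-pixel conversion is now just one read from the cache */
    return cache[rgb_to_cache_index( r, g, b )];
}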
Generating the cache by scanning the color table for each of its entries wouldn't be a good idea if the color table is large. The lookup cache has 32768 entries, which is the equivalent of a 256x128 image. The algorithm Wine currently uses already works that way, just per pixel: it goes through each pixel in the image and, for each pixel, scans every color in the color table, which is extremely slow (see the sketch below). Filling the cache the same way would only pay off for images of at least 256x128 pixels; for anything smaller it would be a performance regression instead, which is not a reasonable expectation for older applications, and it wouldn't provide much benefit.
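A rough sketch of that existing per-pixel path, for reference. The types and names are illustrative, not Wine's actual gdi32 code.

struct rgb { unsigned char r, g, b; };

static unsigned int nearest_color( const struct rgb *table, unsigned int count,
                                   unsigned char r, unsigned char g, unsigned char b )
{
    unsigned int i, best = 0, best_dist = ~0u;

    for (i = 0; i < count; i++)
    {
        int dr = r - table[i].r, dg = g - table[i].g, db = b - table[i].b;
        unsigned int dist = dr * dr + dg * dg + db * db;
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;  /* O(count) work, repeated once per pixel */
}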
So the table generation algorithm is much more efficient than that for large numbers of colors. In my synthetic tests generating 4096 caches with various 256-color tables, it turned out to be at least 4x faster, sometimes even 4.5x depending on the color table (even a purely random one); for example, a random table took about 14406ms instead of 62056ms on my CPU.
In fact, the algorithm's cost is not far from constant in the number of colors in the table, taking only 50% more time going from 16 colors to 256. So increasing the color count from 16 to 256 doesn't make much difference, since each color usually fills its own part of the lookup cache, which is not filled again (unless the distance is shorter due to quantization, but that only happens once per generation anyway). In contrast, going through each color in the table for every cache entry (the original algorithm) would massively increase the time spent (from less than 10 seconds to more than a minute, roughly 6x slower).
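For illustration only, here is a very simplified sketch of that color-major direction, reusing struct rgb from the sketch above. This is NOT the algorithm in the attached patch (which lets each color claim its own region and avoid refilling it, so the cost barely grows with the color count); it only shows the inverted loop order, iterating over colors and letting each one claim the cache cells it is nearest to.

static void fill_cache( unsigned char cache[32768], const struct rgb *table,
                        unsigned int count )
{
    static unsigned int dist[32768];  /* best squared distance seen per cell */
    unsigned int i, r, g, b;

    for (i = 0; i < 32768; i++) dist[i] = ~0u;

    for (i = 0; i < count; i++)
        for (r = 0; r < 32; r++)
            for (g = 0; g < 32; g++)
                for (b = 0; b < 32; b++)
                {
                    /* distance from the center of the quantized 5-bit cell */
                    int dr = (int)((r << 3) | 4) - table[i].r;
                    int dg = (int)((g << 3) | 4) - table[i].g;
                    int db = (int)((b << 3) | 4) - table[i].b;
                    unsigned int d = dr * dr + dg * dg + db * db;
                    unsigned int idx = (r << 10) | (g << 5) | b;

                    if (d < dist[idx])
                    {
                        dist[idx] = d;
                        cache[idx] = (unsigned char)i;
                    }
                }
}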
For a few benchmark examples: with the new algorithm, a 256-color table takes about 14000ms for 4096 iterations on my CPU, and a 40-color table takes about 12000ms, so the larger table is only about 17% slower. 40 colors seems to be the cutoff on my CPU where the two algorithms start to differ in performance, with the straightforward scan being faster at lower color counts, presumably due to better cache locality and branch prediction.
A fast and easy way to see this in action is to download Winamp, launch it, right-click the top of the main window, click the "Nullsoft Winamp..." menu item and go to the Credits tab. Depending on your CPU you can clearly see the difference in FPS and/or CPU usage on one core. (Note that it still has some FPS drops; this is normal and happens on Windows as well, even when the CPU core is not maxed out, as it's inherent in the app.)
http://web.archive.org/web/20180902190237/http://winamparchive.org/dl/winamp...
Further improvements would require caching the table somehow between calls, but I believe that requires either more fragile changes to the gdi32 code or semi-hacks to deal with a typical application's expectations, so this should be a good start for now.