Re: [PATCH v2 3/3] gdi32: Use a global cache for frequently used color tables for RGB value lookups.

7 Apr 2021

On 07/04/2021 12:03, Huw Davies wrote:
...
On Fri, Apr 02, 2021 at 03:54:02PM +0300, Gabriel Ivăncescu wrote:
...
Usually, when there's performance issues with RGB into color table
conversions, the same color table is used. So instead of generating it over
and over, this caches it if it's frequently accessed (as should be the case
when performance is an issue in the first place).
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=50898
Signed-off-by: Gabriel Ivăncescu gabrielopcode@gmail.com

dlls/gdi32/dibdrv/primitives.c | 89 +++++++++++++++++++++++++++++++++-
  1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/dlls/gdi32/dibdrv/primitives.c b/dlls/gdi32/dibdrv/primitives.c
index 8947c60..6071280 100644
--- a/dlls/gdi32/dibdrv/primitives.c
+++ b/dlls/gdi32/dibdrv/primitives.c
@@ -3511,12 +3511,99 @@ struct rgb_lookup_colortable_ctx
static void rgb_lookup_colortable_init(const dib_info *dib, struct rgb_lookup_colortable_ctx *ctx)
  {

unsigned r, g, b;


/*

  Globally cache frequently generated maps, as most apps usually only use a few color tables often.


  We keep the variables separated for better cache locality when looking them up (tables are large).


*/

static SRWLOCK global_cache_lock = SRWLOCK_INIT;

static volatile DWORD global_cache_sequence;


/* ticks when the cached map was last accessed or generated; we discard the oldest first */

static DWORD global_cache_ticks[4];


/* number of colors in the given cached color table */

static unsigned global_cache_color_table_size[ARRAY_SIZE(global_cache_ticks)];


/* the actual cached color tables to lookup and see if they're cached */

static RGBQUAD global_cache_color_table[ARRAY_SIZE(global_cache_ticks)][256];


/* the associated generated maps for the color tables above */

static BYTE global_cache_map[ARRAY_SIZE(global_cache_ticks)][ARRAY_SIZE(ctx->map)];



unsigned color_table_size = dib->color_table ? dib->color_table_size : 1 << dib->bit_count;

const RGBQUAD *color_table = get_dib_color_table(dib);

DWORD newest, oldest_entry, sequence;

unsigned i, r, g, b;


/* check the global cache first */

if (TryAcquireSRWLockShared(&global_cache_lock))

{

   newest = global_cache_ticks[0];


   for (i = 0; i < ARRAY_SIZE(global_cache_color_table_size); i++)


   {


       if (color_table_size == global_cache_color_table_size[i] &&


           !memcmp(color_table, global_cache_color_table[i], color_table_size * sizeof(color_table[0])))


       {


           memcpy(ctx->map, global_cache_map[i], sizeof(ctx->map));



           /* it's a shared lock, but the exact timestamp is unimportant, so it's OK */


           global_cache_ticks[i] = GetTickCount();



           ReleaseSRWLockShared(&global_cache_lock);


           return;


       }


       if ((LONG)(newest - global_cache_ticks[i]) < 0)


           newest = global_cache_ticks[i];


   }


   sequence = global_cache_sequence;


   ReleaseSRWLockShared(&global_cache_lock);


}

else

{

   newest = global_cache_ticks[0];


   sequence = global_cache_sequence - 1;  /* make sure we re-check the cache */


}

for (b = 4; b < 256; b += 1 << 3)
    for (g = 4; g < 256; g += 1 << 3)
        for (r = 4; r < 256; r += 1 << 3)
            ctx->map[r >> 3 | (g & ~7) << 2 | (b & ~7) << 7] = rgb_to_pixel_colortable(dib, r, g, b);




/* update the global cache if needed */

AcquireSRWLockExclusive(&global_cache_lock);


/* if the cache was updated underneath us, check it again */

if (sequence != global_cache_sequence)

{

   for (i = 0; i < ARRAY_SIZE(global_cache_color_table_size); i++)


   {


       if ((LONG)(newest - global_cache_ticks[i]) <= 0 &&


           color_table_size == global_cache_color_table_size[i] &&


           !memcmp(color_table, global_cache_color_table[i], color_table_size * sizeof(color_table[0])))


       {


           global_cache_ticks[i] = GetTickCount();


           ReleaseSRWLockExclusive(&global_cache_lock);


           return;


       }


   }


}

global_cache_sequence++;


oldest_entry = 0;

for (i = 1; i < ARRAY_SIZE(global_cache_color_table_size); i++)

   if ((LONG)(global_cache_ticks[oldest_entry] - global_cache_ticks[i]) > 0)


       oldest_entry = i;



memcpy(global_cache_color_table[oldest_entry], color_table, color_table_size * sizeof(color_table[0]));

memcpy(global_cache_map[oldest_entry], ctx->map, sizeof(ctx->map));


global_cache_color_table_size[oldest_entry] = color_table_size;

global_cache_ticks[oldest_entry] = GetTickCount();


ReleaseSRWLockExclusive(&global_cache_lock);
}


Yeah, I'm not convinced, not even close.  You're going to have to
provide some pretty good evidence, with real world apps, that this is
necessary over a local, lazy-init solution.
Huw.
Well a simple real world test is in the bug report with Winamp's 
Credits, while it's not that important, it's easy enough to test 
compared to, say, games.
When I tested on Windows it does seem to cache color tables between 
calls because it was just way too fast, unless they have some voodoo 
algorithm to generate the color table extremely quickly on each call.
Of course it depends on your CPU's performance, but this patch makes it 
use very little CPU so definitely it's massively faster than other methods.
By the way, I revised the table generation algorithm, it's correct now 
and simplified, and only about ~32% slower on average with 256 colors. I 
attached the diff here on top of these patches so you can have a look if 
you prefer that.
What it does is basically it simply goes through all 6 directions at an 
entire offset and checks distance on all. If none were filled in a 
particular direction, it drops it from further searches (that's the main 
benefit).
It doesn't mean it will go along with these patches of course, it's just 
to give an idea and why it can be extended easily. If you want I can 
easily drop the global cache for now and just send this patch 
instead—which is still a good improvement.
While it does provide a massive performance benefit by itself and allows 
Winamp to show the Credits at full speed (just like Windows), it still 
uses ~80% of my CPU Core to do it, while Windows uses only ~10-20% (and 
the global cache does as well, that's why I suspect that of Windows).
However, I'm perfectly fine with the idea of dropping the global cache 
patch, if you think just this patch would be enough of an improvement. I 
mean, we could always extend the global cache later, maybe in better ways.
But at least it doesn't "lock us" into it like the lazy init approach 
would, since then the cache would become way more complicated no matter 
what shape it would have. For example, we'd have to store the 
initialized bits, copy them from the cache (so we don't keep the lock 
while we process the image), do the lazy init processing, then reacquire 
the lock and merge the results (the underlying cache could've changed as 
well, and we have to merge new lazy inited entries in the cache). It 
sounds way too complicated and not worth it to go with it, that's why I 
dislike the lazy init approach.
What do you think?

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

Re: [PATCH v2 3/3] gdi32: Use a global cache for frequently used color tables for RGB value lookups.