Re: [PATCH 1/5] kernelbase/locale: Implement sortkey generation on official tables

30 Apr 2020


      On 4/30/20 12:50 PM, Fabian Maurer wrote:
...
Hello Zebediah,
thanks for your reply!
...
So as far as I understand, the sort key algorithm writes the level 0
weights (script and alphabetic weight) for the whole string to the sort
key, then the level 1 weights (diacritic), and so on, right?
Yes.
...
In that case, what seems potentially simpler to me is to calculate those
weights one level at a time, rather than one character at a time.
But that would mean that we would iterate over the string multiple times, and
in the function have a branch as to which weight to write.
To me, my current approach seemed easier. I've thought quite a while about
abandoning my lists, but the results ended up being a lot more complex.
Yes, you'd have to iterate over the string multiple times. However,
that's a fixed, small number. I wouldn't be concerned about performance
until we find evidence that performance is a concern.
If you're concerned about readability, I would submit that from an
outside perspective, the current implementation is not very readable.
I don't think you need to write it such that you switch over which
weight to write, either. It's probably easier just to make them separate
functions, like in my example below.
...
...
That is, you'd end up doing something like
static int get_sortkey( DWORD flags, const WCHAR *src, int srclen, char
*dst, int dstlen )
{
     int used = 0;
     for (i = 0; i < srclen; ++i)
     {
         used += get_main_weights(src[i], dst + used, dstlen - used);
         if (!(flags & NORM_IGNORENONSPACE))
             used += get_diacritic_weights(src[i], dst + used, dstlen -
used);
         ...
     }
}
This won't work, since we first need all main weights, then the others.
Er, yes. I misthought. What I meant to write was
for (i = 0; i < srclen; ++i)
    used += get_main_weights(src[i], dst + used, dstlen - used);
if (!(flags & NORM_IGNORENONSPACE))
{
    for (i = 0; i < srclen; ++i)
        used += get_diacritic_weights(src[i], dst + used, dstlen - used);
}
...
...
Also keep in mind, that we need to have some temporary buffer anyways, since
weights that are very small only get added when there is a bigger one
following it. Either that, or we need to backtrack and remove weights again.
Due to those issues, I opted for the current approach.
Can you please give an example of this? I'm not sure I see it in any of
your patches.
...
...
As that example shows, I also think it's probably simpler to just pass
the buffer directly to whatever functions are writing sortkey bytes into it.
As noted, it's not that easy.
Especially when you need to deal with buffers that are too short - we'd need
to mix in a stop condition but still return the needed length. Sometimes, we
just need to continue iterating despite knowing we can't copy it into the
sortkey.
Generally, you can continue to calculate size but only write when
there's space in the buffer. When I've written functions like this I've
found it to be relatively simple, though maybe there's some details here
that I'm missing...
...
...
Is there a reason to use LCMapStringEx() here rather than LCMapString()?
LCMapStringA/W completely ignore the locale, and use a LCID. We'd need to fix
those functions to properly convert the locale. To start simple, I decided to
use the more low-level function and once that works, built the rest on top.
It seems a simple enough fix; just call LCIDToLocaleName().
It's not terribly important, though.
...
...
The fact that this test is commented out never struck me as great. I'm
pretty sure that with todo_wine added as appropriate, it could pass. A
first patch in this series could be to do that.
Probably. I decided to leave it alone for now, and comment it in once it
works, but sure, I can change that.
...
Are these comments useful?
Eh, I used them for a quick visual separation. I can remove them if they're
too much noise though.
...
I get the impression that typedefs have largely fallen out of favour.
I always use them when I can, since they make the code shorter and I don't
think having to add the struct keyword adds much. I've seen both in wine code,
but given a policy I'll stick to that.
Generally I believe that explicitly specifying "struct" is preferred as
it makes clear what kind of object you're dealing with.
...
...
The fact that exactly one of these integer cases has a symbolic constant
attached seems less than ideal.
True, I'll add constants for the others as well.
Regards,
Fabian Maurer

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

Re: [PATCH 1/5] kernelbase/locale: Implement sortkey generation on official tables