On Sun, 5 May 2002, Uwe Bonnes wrote:
> looking at LCMapStringW, I think we need some table like the LCM_Unicode_LUT[] table. However
> - I don't understand where the values come from. Odd values seem to be a collation of flags, even values to be some character weight and LCM_Diacritic_LUT[] is some weight for the diacritic.
Pretty much. The first value in each Unicode_LUT pair seems to be what I've previously identified (in my reverse engineering of cp_xxx.nls files) as the sort class, the second is the sort weight, and the Diacritic_LUT holds the diacritic weight. (Case weights also exist; they are not in those tables, but the case weight is pretty much isupper(x) ? 18 : 2, so no table is needed there.)
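
As a minimal sketch of that case-weight rule for wide characters (the function name case_weight is made up for illustration, and iswupper() stands in for whatever uppercase test the real code would use):

```c
#include <wctype.h>

/* Hypothetical helper: case weight as described above.
 * Uppercase characters get 18, everything else gets 2,
 * so no lookup table is required. */
unsigned char case_weight(wint_t ch)
{
    return iswupper(ch) ? 18 : 2;
}
```

With this, case_weight(L'A') yields 18 and case_weight(L'a') yields 2.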
Sort classes I've identified before:
   2 = decomposed sort (e.g. "ß" is sorted as "ss", "þ" is sorted as "th"); the sort weight is used as an index into the decomposition table in cp_xxx.nls
   6 = control characters, hyphens (stuff that's ignored if SORT_STRINGSORT is not specified)
   7 = separators
   8 = math symbols
  10 = symbols
  12 = numbers
  14 = letters
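
The classes above could be written down in C roughly as follows. This is only a sketch: the enum and function names are hypothetical, and MY_SORT_STRINGSORT is a local stand-in for the SORT_STRINGSORT flag from winnls.h (0x00001000) so the fragment compiles without the Windows headers.

```c
/* Hypothetical names for the sort classes listed above. */
enum sort_class {
    SC_DECOMPOSED = 2,   /* sort weight indexes the decomposition table */
    SC_IGNORABLE  = 6,   /* control characters, hyphens */
    SC_SEPARATOR  = 7,
    SC_MATH       = 8,
    SC_SYMBOL     = 10,
    SC_NUMBER     = 12,
    SC_LETTER     = 14
};

/* Local stand-in for SORT_STRINGSORT from winnls.h. */
#define MY_SORT_STRINGSORT 0x00001000u

/* Returns 1 if a character of this class is ignored for the given flags:
 * class 6 is skipped unless SORT_STRINGSORT was requested. */
int class_is_skipped(int sort_class, unsigned flags)
{
    return sort_class == SC_IGNORABLE && !(flags & MY_SORT_STRINGSORT);
}
```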
All weights and classes start at 2 simply because they're used in the sort keys generated by LCMapString. A sort key is a string in which 0 is the null terminator and 1 is the field separator, so neither value can appear as a weight.
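
To illustrate why 0 and 1 are reserved, here is a rough sketch of assembling such a key. The struct, the function name, and the field order (class/weight pairs, then diacritic weights, then case weights) are assumptions for the example, not a description of what Windows actually emits.

```c
#include <stddef.h>

/* Hypothetical per-character weights; every value is expected to be >= 2. */
struct char_weights {
    unsigned char sort_class;
    unsigned char sort_weight;
    unsigned char diacritic_weight;
    unsigned char case_weight;
};

/* Writes the key into out and returns the number of bytes written,
 * including the terminating 0, or 0 if the buffer is too small. */
size_t build_sort_key(const struct char_weights *w, size_t n,
                      unsigned char *out, size_t out_size)
{
    size_t needed = 4 * n + 3;   /* 2n + n + n weights, 2 separators, 1 terminator */
    size_t pos = 0, i;

    if (out_size < needed)
        return 0;

    for (i = 0; i < n; i++) {    /* field 1: sort class and sort weight */
        out[pos++] = w[i].sort_class;
        out[pos++] = w[i].sort_weight;
    }
    out[pos++] = 1;              /* field separator */
    for (i = 0; i < n; i++)      /* field 2: diacritic weights */
        out[pos++] = w[i].diacritic_weight;
    out[pos++] = 1;              /* field separator */
    for (i = 0; i < n; i++)      /* field 3: case weights */
        out[pos++] = w[i].case_weight;
    out[pos++] = 0;              /* null terminator */

    return pos;
}
```

Because no weight is ever 0 or 1, two keys built this way can be compared with a plain byte-wise string comparison.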
> - Do the tables in ../wine/unicode somehow contain enough information to generate these tables?
The UnicodeData.txt you can get from ftp.unicode.org contains data that you can use for the sort class, case weight, and maybe diacritic weight, but not sort weight, since that's locale-dependent; you need a sort table for each locale. (I think Windows deals with it by having a big table of default sort weights, then each locale has a table of "exceptions" that's patched into the big table at run-time...)
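
The "default table plus exceptions" idea I'm guessing at could look something like this. It is purely a sketch of the guess: the struct, the function name, and the one-byte-per-character table layout are invented for illustration and are not taken from any real .nls format.

```c
#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 0x10000   /* one sort weight per 16-bit character */

/* Hypothetical per-locale override: this character gets this weight. */
struct weight_exception {
    unsigned short ch;
    unsigned char  sort_weight;
};

/* Copy the default weights, then patch in the locale's exceptions.
 * Both tables must hold TABLE_SIZE entries. */
void build_locale_table(unsigned char *locale_table,
                        const unsigned char *default_table,
                        const struct weight_exception *exceptions,
                        size_t count)
{
    size_t i;

    memcpy(locale_table, default_table, TABLE_SIZE);
    for (i = 0; i < count; i++)
        locale_table[exceptions[i].ch] = exceptions[i].sort_weight;
}
```

The default table stays untouched, so several locale tables can be derived from it.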
Unfortunately, I'm not aware of a source for such sort weight data.