Dear All,
My name is Artur and I'm participating in Google Summer of Code 2017 for Wine. Under Nikolay's supervision, I'm working on implementation of Unicode normalization. I probably should have introduced myself some time ago to share results of my research and my ideas, but I also wanted to wait until I could illustrate my points with some code.
I'm not sure whether everyone is familiar with Unicode normalization forms, so here's a short summary of concepts and terms I may use. For detailed information, please see [1].
- Characters that look the same on the screen may be encoded in a few different ways. Accented Latin letters may constitute a familiar example. The letter 'é' can be U+00E9 (LATIN SMALL LETTER E WITH ACUTE), but it also can be represented as the sequence U+0065 U+0301 (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT). Many of 1114112 (if I remember the number correctly) Unicode characters are decomposable, i.e. they can also be represented by a different sequence of characters.
- Text in normalization form NFD ('D' stands for 'decomposition') does not contain any decomposable characters, e.g. there is U+0065 U+0301 instead of U+00E9. The combining characters (such as the above-mentioned acute accent) have a well defined order, using the concept of combining classes. [2 §3.11]
- Text in normalization form NFKD ('K' stands for 'compatibility') has the properties of NFD, plus characters that are typographical variants of other characters (e.g. superscripts, ligatures, etc.) are mapped to 'plain' characters.
- Text that will be normalized into the form NFC and NFKC ('C' stands for 'composition') has to be fully decomposed first (NFD and NFKD, respectively) and starters (characters with a combining class zero) plus following non- starters (combining class greater than zero) can be composed into single characters provided that they exist in Unicode and are not present in [3].
I've developed a draft implementation of NFD & NFKD (mapping characters to decomposed sequences as well as reordering combining characters) and written a test program that uses official Unicode test data. [4]
My goal was to provide support for codepoints greater than 0xFFFF, encoded in UTF-16 as surrogate pairs. More on that below.
The attachment [gsoc-v2.patch] is a diff of my working branch against master. Most notable changes include:
- Necessary changes to tools/make_unicode.
- New function wine_decompose_string().
- Draft implementation of two Windows API functions: NormalizeString() and IsNormalizedString() (the latter temporarily returns TRUE). [5][6]
Rationale and comments:
- tools/make_unicode did not collect all necessary data (most notably, compatibility mappings), so a new data structure, @decomp_table_full has been added. Its unification with @decomp_table is yet to be resolved.
- The procedure dump_decompose_table() in that script has been rewritten to accomodate all aspects described above and below. Two helper subroutines have been added.
- However, the C code section embedded within the script has grown substantially, so it may be desirable to move it to a separate file.
- Adding wine_decompose_string() makes sense since surrogate pairs behave like a string and the Unicode Canonical Ordering Algorithm (for combining characters) is an in-place bubble sort on the string, per Unicode specification.
- While the code in libs/port/decompose.c (and embedded in tools/make_unicode) implements normalization, it currently serves the purpose of confirmation that the generated data is correct. It hasn't been yet optimized or cleaned up, because the organization of the data is open to discussion now, so the code may still change substantially.
- It has been previously assumed that a decomposition mapping of any character is two characters long. This is certainly not true for NFKD, where a character like U+FDFA decomposes to a string of length 18. Therefore, the tables in libs/port/decompose.c contain pairs of offsets to null-terminated mappings in another table (decomp_data, first in that file), one entry for NFD and one for NFKD, instead of two-character mappings. Moreover, since many NFD and NFKD are identical, they have been placed in the same table, with identical mappings merged and fully expanded (i.e. any decomposable characters in the data have been replaced by fully decomposed sequences), to minimize runtime overhead.
- The original idea of multi-stage tables (using the most significant bits of a codepoint to obtain an offset to another table etc.) has been preserved, although the tables are no longer combined into a single table. This is because some sub-tables can use a narrower data type, which saves space, especially when data for all Unicode codepoints is included.
- Tables of combining classes have a similar structure.
- Mappings for characters above 0xFFFF are encoded as UTF-16 (using surrogate pairs), but a single codepoint (UTF-32 if you like) is used for table indexing. Setting $utflim in make_unicode to 65536 is the simplest way to disable support for such characters, but supporting surrogate pairs should not affect any text-related Wine component in a negative way.
- As composition depends directly on decomposition, it is the next thing to work on.
The attachment [unorm_test.c] is my test program. It should build fine with mingw-w64 or winegcc without any specific compiler flags. It looks for [4] in its directory or accepts a path to that file as an argument. Option -d makes it run only decomposition tests, option -s limits the range to 0xFFFF. No tests should fail with the -s option and 80 of 187460 should fail without it due to incomplete support for surrogate pairs in the reordering part. For details of failed tests, please run with the -v option. Is there a place for such a program somewhere in the Wine source tree?
I'd be more than happy to answer your questions, address your comments, and listen to your opinions regarding Unicode normalization.
Regards, Artur Świgoń
[1] http://www.unicode.org/reports/tr15/ [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf [3] http://www.unicode.org/Public/10.0.0/ucd/CompositionExclusions.txt [4] http://www.unicode.org/Public/10.0.0/ucd/NormalizationTest.txt [5] https://msdn.microsoft.com/en-us/library/windows/desktop/dd319093(v=vs.85).a... [6] https://msdn.microsoft.com/en-us/library/windows/desktop/dd318671(v=vs.85).a...
Hello,
On 7/25/17 4:33 PM, Artur Świgoń wrote:
Dear All,
My name is Artur and I'm participating in Google Summer of Code 2017 for Wine. Under Nikolay's supervision, I'm working on implementation of Unicode normalization. I probably should have introduced myself some time ago to share results of my research and my ideas, but I also wanted to wait until I could illustrate my points with some code.
Very cool! This is a problem I ran into with Japanese unicode string comparisons a while ago so it is great it will be addressed! Then we will have to investigate the CompareStringW, and family, behavior.
- Mappings for characters above 0xFFFF are encoded as UTF-16 (using surrogate pairs), but a single codepoint (UTF-32 if you like) is used for table indexing. Setting $utflim in make_unicode to 65536 is the simplest way to disable support for such characters, but supporting surrogate pairs should not affect any text-related Wine component in a negative way.
There is some super basic work on non-BMP unicode glyphs and surrogate pairs in Uniscribe (usp10). I wrote a quick decode_surrogate_pair() function to help get a DWORD unicode value for the surrogate pair. So you can look at that if you are interested!
Thanks! -aric