Fabian Maurer <dark.shadow4@web.de> writes:
Hello Alexandre,
Multi-language support, Japanese, Korean, multi-char sequences, surrogates, linguistic mappings, etc.
There are a million things that need to be supported for proper sorting. You don't have to implement them all, but it should be clear from your approach that they can be added. Which in practice means you need to at least prototype most of them.
Well, they can be added, it's just that I left them out for the initial versions... Short breakdown:
- Multi-language: The character is looked up in the current language; as a
fallback the default is used. Currently, only the default is implemented.
I don't see any language support, there's just one big sortkey table. Yes, that's what the current code is doing too, but if we are rewriting it, we should get the architecture right.
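As a rough illustration of the lookup-with-fallback idea discussed here, a minimal sketch might keep one default table plus small per-language override tables. Every name and weight below is hypothetical and is not taken from the patch or from the Microsoft data; the only point is that the language dimension exists in the structure from the start.

```python
A_UMLAUT = "\u00e4"

# Hypothetical default weights; real values would come from the sort key data.
DEFAULT_TABLE = {"a": 10, A_UMLAUT: 11, "b": 20, "z": 260}

# Hypothetical per-language overrides, holding only entries that differ
# from the default table.
LANGUAGE_TABLES = {
    "sv": {A_UMLAUT: 270},  # illustrative: placed after "z" for this language
}


def lookup_weight(char, language):
    """Return the weight for char, preferring the language-specific table."""
    overrides = LANGUAGE_TABLES.get(language, {})
    if char in overrides:
        return overrides[char]
    # Fall back to the default table; None means "no entry at all".
    return DEFAULT_TABLE.get(char)


print(lookup_weight(A_UMLAUT, "sv"))  # uses the override
print(lookup_weight(A_UMLAUT, "en"))  # falls back to the default
```

With this shape, adding a language means adding an override table rather than changing the lookup code.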
- Multi-char sequences: You mean when a single codepoint is encoded as more
than one WCHAR? That is supported; Windows seems to treat each WCHAR separately.
I mean when multiple chars map to one sortkey. The COMPRESSION sections in the Microsoft table.
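To make "multiple chars map to one sortkey" concrete, here is a minimal sketch of a longest-match lookup. The tables and weights are invented for illustration, and the sketch makes no assumption about how the COMPRESSION sections are actually laid out; it only shows that the lookup has to consider sequences before single characters.

```python
# Hypothetical single-character weights.
SINGLE = {"a": 10, "c": 30, "h": 80, "i": 90}

# Hypothetical multi-character entries: the whole sequence gets one weight.
SEQUENCES = {"ch": 85}

MAX_SEQUENCE_LEN = max(len(seq) for seq in SEQUENCES)


def weights(text):
    """Map text to a list of weights, trying the longest sequence first."""
    result = []
    i = 0
    while i < len(text):
        longest = min(MAX_SEQUENCE_LEN, len(text) - i)
        for length in range(longest, 1, -1):
            chunk = text[i:i + length]
            if chunk in SEQUENCES:
                result.append(SEQUENCES[chunk])
                i += length
                break
        else:
            # No multi-character entry matched at this position.
            result.append(SINGLE[text[i]])
            i += 1
    return result


print(weights("chi"))
print(weights("hic"))
```

In this toy table "chi" produces two weights while "hic" produces three, which is the behaviour a per-WCHAR lookup cannot express.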
- Linguistic mappings: Not sure what you mean, sorry
NORM_LINGUISTIC_CASING and the like.
Question: How should I prove it works? I can't possibly add all of that in the first draft.
The usual way is to add a bunch of tests with todo_wine, and then send a patch series with each patch removing the corresponding todos.
We only have tests for a very small number of strings, that's clearly not proper coverage. Some way of systematically generating test strings should be considered.
Like, random strings from a known seed? I intentionally didn't do that, because of performance concerns.
Not necessarily random, but some interesting data. For instance the normalization tests can run the entire test suite from unicode.org, you may be able to find something similar. Or build your own somehow.
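One possible way to "build your own" data without randomness is to enumerate every short string over a small hand-picked alphabet. The sketch below is only an assumption about what such a generator could look like; the alphabet is an arbitrary example (ASCII letters, an accented letter, punctuation, a hiragana letter, and a character outside the BMP that needs a surrogate pair in UTF-16).

```python
import itertools

# Arbitrary example alphabet; a real test would pick characters that
# exercise the features under discussion.
INTERESTING = ["a", "A", "\u00e4", "-", "'", "\u3042", "\U00010400"]


def generate_strings(alphabet, max_len):
    """Yield every string of length 1..max_len over alphabet, in a fixed order."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            yield "".join(combo)


strings = list(generate_strings(INTERESTING, 2))
print(len(strings))
```

The output is deterministic, and the number of strings is known in advance (the sum of len(alphabet)**k for k up to max_len), so the run time can be bounded by choosing the alphabet size and maximum length.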
Also testing sort keys directly, like you did in the first try (but without depending on the exact values).
I have that planned, yes. Do you want that in the first version already?
The tests should come before the code, or at the same time.
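Testing sort keys "without depending on the exact values", as mentioned above, could be done by checking that the byte order of two sort keys agrees with the result of comparing the strings directly. The sketch below shows only the shape of such a check; `toy_sort_key`, `toy_compare` and `broken_compare` are hypothetical stand-ins, and a real test would call the actual sort key and comparison functions instead.

```python
def order(x, y):
    """Return -1, 0 or 1, like a three-way comparison."""
    return (x > y) - (x < y)


def find_inconsistencies(strings, sort_key, compare):
    """Return all pairs where sort key order and compare() disagree."""
    mismatches = []
    for a in strings:
        for b in strings:
            key_result = order(sort_key(a), sort_key(b))
            cmp_result = order(compare(a, b), 0)
            if key_result != cmp_result:
                mismatches.append((a, b))
    return mismatches


# Hypothetical stand-ins: a case-insensitive key and a matching comparison.
def toy_sort_key(s):
    return s.lower().encode("utf-8")


def toy_compare(a, b):
    return order(a.lower(), b.lower())


# A deliberately inconsistent comparison (case-sensitive) to show detection.
def broken_compare(a, b):
    return order(a, b)


SAMPLES = ["a", "B", "b", "ab"]
print(find_inconsistencies(SAMPLES, toy_sort_key, toy_compare))
print(find_inconsistencies(SAMPLES, toy_sort_key, broken_compare))
```

Because only the relative order is checked, the test keeps passing if the table values change, as long as the key generation and the comparison stay consistent with each other.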
Note that we most likely want to use a Windows-compatible NLS file, like we are now using for codepage or normalization tables. I can work on that part.
I have to admit, I don't know what you mean by that. I don't know about NLS files.
This is new stuff. Look at the nls directory, and at the make_unicode script.