Hello Alexandre,
Multi-language support, Japanese, Korean, multi-char sequences, surrogates, linguistic mappings, etc.
There are a million things that need to be supported for proper sorting. You don't have to implement them all, but it should be clear from your approach that they can be added. Which in practice means you need to at least prototype most of them.
Well, they can be added, it's just that I left them out for the initial versions... Short breakdown:
- Multi-language: The character is looked up in the table for the current language; as a fallback, the default table is used. Currently, only the default is implemented.
- Japanese: Main reason why I did all of this. Special case, but supported by the tables.
- Korean: Handled under Jamo. Special case, but supported by the tables. I have not implemented it properly yet because it is a lot of work.
- Multi-char sequences: You mean when a single code point is encoded as more than one WCHAR? That is supported; Windows seems to treat each WCHAR separately.
- Surrogates: Windows seems to treat each WCHAR on its own.
- Linguistic mappings: Not sure what you mean, sorry
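To illustrate the multi-language point, here is a rough sketch of the lookup with fallback. The struct layout and the names (weight_entry, weight_table, find_weight, lookup_weight) are made up for this mail and are not the actual table format or the code in my patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one weight per UTF-16 code unit, sorted by code unit. */
struct weight_entry
{
    uint16_t ch;
    uint32_t weight;
};

struct weight_table
{
    const struct weight_entry *entries;
    size_t count;
};

/* Binary search in one table; returns NULL if the character has no entry. */
const uint32_t *find_weight(const struct weight_table *table, uint16_t ch)
{
    size_t lo = 0, hi;

    if (!table) return NULL;
    hi = table->count;
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (table->entries[mid].ch == ch) return &table->entries[mid].weight;
        if (table->entries[mid].ch < ch) lo = mid + 1;
        else hi = mid;
    }
    return NULL;
}

/* Try the language-specific table first, then fall back to the default one.
 * Returns 1 and stores the weight on success, 0 if neither table has it. */
int lookup_weight(const struct weight_table *lang, const struct weight_table *def,
                  uint16_t ch, uint32_t *weight)
{
    const uint32_t *found;

    assert(weight != NULL);
    found = find_weight(lang, ch);
    if (!found) found = find_weight(def, ch);
    if (!found) return 0;
    *weight = *found;
    return 1;
}
```

Passing NULL as the language table gives the "default only" behaviour that the current version has.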
Question: How should I prove it works? I can't possibly add all of that in the first draft.
For instance you do 10 memory allocations before even starting to compare anything. That's clearly not cheap.
I understand. But for a dynamically sized sort key I need dynamic buffers. Maybe I could put the initial buffers on the stack?
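What I have in mind is roughly the following. This is only a sketch; the names (sortkey_buf, SORTKEY_INLINE_SIZE) and the 256-byte initial size are placeholders I picked for this mail, not something from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SORTKEY_INLINE_SIZE 256 /* placeholder size, would need measuring */

/* Buffer that starts out in inline storage (on the stack when the struct is a
 * local variable) and only moves to the heap once it outgrows that storage.
 * The struct must not be copied, since data may point into inline_data. */
struct sortkey_buf
{
    unsigned char *data;
    size_t len;
    size_t capacity;
    unsigned char inline_data[SORTKEY_INLINE_SIZE];
};

void sortkey_buf_init(struct sortkey_buf *buf)
{
    buf->data = buf->inline_data;
    buf->len = 0;
    buf->capacity = sizeof(buf->inline_data);
}

/* Appends n bytes; returns 1 on success, 0 on allocation failure or overflow. */
int sortkey_buf_append(struct sortkey_buf *buf, const void *src, size_t n)
{
    assert(buf != NULL && src != NULL);
    if (n > buf->capacity - buf->len)
    {
        size_t new_cap = buf->capacity;
        unsigned char *new_data;

        while (new_cap - buf->len < n)
        {
            if (new_cap > SIZE_MAX / 2) return 0;
            new_cap *= 2;
        }
        if (buf->data == buf->inline_data)
        {
            /* First spill: copy the inline contents to the heap. */
            new_data = malloc(new_cap);
            if (!new_data) return 0;
            memcpy(new_data, buf->data, buf->len);
        }
        else
        {
            new_data = realloc(buf->data, new_cap);
            if (!new_data) return 0;
        }
        buf->data = new_data;
        buf->capacity = new_cap;
    }
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
    return 1;
}

void sortkey_buf_free(struct sortkey_buf *buf)
{
    if (buf->data != buf->inline_data) free(buf->data);
}
```

With that, short strings would not allocate at all, and only long ones would pay for a malloc.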
We only have tests for a very small number of strings, that's clearly not proper coverage. Some way of systematically generating test strings should be considered.
Like, random strings from a known seed? I intentionally didn't do that, because of performance concerns.
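If that is what you mean, a seeded generator could look something like this. It is a sketch only: the function names are invented, xorshift32 is just one simple choice of generator, and I restricted the range to 0x0020..0xD7FF so that no surrogates are produced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32: small deterministic generator, same seed gives same sequence. */
uint32_t test_rng_next(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills buf with len UTF-16 code units in [0x0020, 0xD7FF] plus a terminating
 * zero, so buf must have room for len + 1 elements. */
void generate_test_string(uint32_t *state, uint16_t *buf, size_t len)
{
    const uint32_t first = 0x0020, last = 0xD7FF;
    size_t i;

    assert(*state != 0); /* xorshift32 stays at zero forever if seeded with 0 */
    for (i = 0; i < len; i++)
        buf[i] = (uint16_t)(first + test_rng_next(state) % (last - first + 1));
    buf[len] = 0;
}
```

Because the seed is fixed, a failing string can be reproduced on any machine by rerunning with the same seed.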
Also testing sort keys directly, like you did in the first try (but without depending on the exact values).
I have that planned, yes. Do you want that in the first version already?
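One way to do it without depending on the exact values would be to check only that the byte order of two sort keys agrees with the comparison result. A sketch, with helper names I made up; it assumes sort keys are compared bytewise and that the comparison result uses the CompareString convention (1 = less, 2 = equal, 3 = greater):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytewise comparison of two sort keys; a key that is a strict prefix of the
 * other sorts first. Returns -1, 0 or 1. */
int compare_sortkeys(const unsigned char *a, size_t len_a,
                     const unsigned char *b, size_t len_b)
{
    size_t n = len_a < len_b ? len_a : len_b;
    int r;

    assert(a != NULL && b != NULL);
    r = memcmp(a, b, n);
    if (r) return r < 0 ? -1 : 1;
    if (len_a == len_b) return 0;
    return len_a < len_b ? -1 : 1;
}

/* cstr_result is expected to be 1, 2 or 3 (less, equal, greater). Returns 1 if
 * the sort keys order the two strings the same way, 0 otherwise. */
int sortkey_order_matches(const unsigned char *a, size_t len_a,
                          const unsigned char *b, size_t len_b, int cstr_result)
{
    return compare_sortkeys(a, len_a, b, len_b) == cstr_result - 2;
}
```

The test would then generate the sort keys for each pair of strings and only assert that sortkey_order_matches() holds, so a change in the table values would not break it as long as the ordering stays the same.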
When there are differences between Windows versions we want to use the latest, since that's the one that will continue to work in the future. In this case it means using the most recent table.
Okay then. If that's important, I can change the table.
Note that we most likely want to use a Windows-compatible NLS file, like we are now using for codepage or normalization tables. I can work on that part.
I have to admit, I don't know what you mean by that. I don't know about NLS files.
Regards, Fabian Maurer