Hi Juan,
I took the liberty of answering to the list. I hope you don't mind.
Juan Lang wrote:
> Hi Shachar, something you said earlier this year at WineConf caught my ear, and I think I'm finally understanding enough to ask about it. You said Windows uses UCS-2 LE (I think you did, anyway) rather than UTF-16. Is that true?
I most certainly didn't say that. I may have mentioned UCS-4, but to the best of my knowledge at the time, Windows uses UTF-16.
As a result of your email, however, I tried to check it again. The results are somewhat inconclusive. I generated a file (attached) containing two musical symbols in Unicode. These are from codepoints above U+10000, and are therefore unrepresentable in UCS-2.
Opening this file up in Notepad reveals a mixed result. On the one hand, the symbols appear as two unknown characters, which seems to suggest that at least Notepad understands this is UTF-16LE (a UCS-2 reader would treat each surrogate pair as two separate units and show four characters). On the other hand, I did not manage to find any font that has glyphs for codepoints above U+10000.
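For reference, here is a minimal sketch of how a codepoint above U+FFFF becomes a surrogate pair. I'm using U+1D11E, MUSICAL SYMBOL G CLEF, as a stand-in; the attached file may contain different symbols:

    #include <cstdio>

    int main()
    {
        unsigned long cp = 0x1D11E;        // MUSICAL SYMBOL G CLEF
        unsigned long v  = cp - 0x10000;   // 20 bits remain
        unsigned hi = 0xD800 | (v >> 10);  // high (lead) surrogate
        unsigned lo = 0xDC00 | (v & 0x3FF);// low (trail) surrogate

        // In a UTF-16LE file each unit is stored least significant
        // byte first, so these appear on disk as 34 D8 1E DD.
        std::printf("U+%05lX -> %04X %04X\n", cp, hi, lo); // D834 DD1E
        return 0;
    }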
I'm hoping one of our far-eastern speakers jumps in with more insight (Mike?). Are surrogates used on Windows? How common are they?
> Does that then imply that Windows really can't handle characters outside the Basic Multilingual Plane? Does that also imply that WCHARs are in fact fixed-width in Windows?
As far as I know, they are not. Sorry.
> I'm planning to write a tool to detect the following problematic bit of code:
>
>     char str[] = "hi", *p = str + sizeof(str) - 1;
>     p--;
>
> At least, it's problematic when str contains double-byte characters.
I'm not sure what you're aiming to achieve. Are you trying to hit the beginning of the last character of the string? If so, then you do, indeed, have a problem here.
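To illustrate with concrete bytes (a made-up string whose last character is a Shift-JIS-style double-byte character):

    // 'a' followed by the two-byte character 0x95 0x5C, then the NUL.
    char str[] = "a\x95\x5C";
    char *p = str + sizeof(str) - 1;  // points at the terminating '\0'
    p--;                              // points at the trail byte 0x5C --
                                      // the MIDDLE of the last character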
> If the types of str and p were WCHAR[] and WCHAR * instead, and WCHARs were fixed width, there wouldn't be a problem.
> If you have thoughts on similar programming bugs relating to internationalization (use of strchr and strrchr, for example) I'd be interested to hear them.
In the past I have written programs that had to handle MBCS (the non-Unicode Japanese encoding, i.e. Shift-JIS). This is an encoding in which some characters are one byte and some are two. The best I could come up with was to build a wrapper around std::string that internally used two bytes per character. When it loaded a string, it would check, character by character, whether each one was double-byte, so that each internal slot held exactly one character. This allowed random access, as well as both forward AND backward scanning.
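Something like this minimal sketch (the names and the lead-byte ranges are mine, reconstructed from memory, not the original code):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Stand-in for a real lead-byte test such as IsDBCSLeadByte() on
    // Windows. For Shift-JIS, lead bytes are 0x81-0x9F and 0xE0-0xFC.
    static bool is_dbcs_lead(unsigned char c)
    {
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    }

    // Each slot holds exactly one character: the single (or lead) byte,
    // plus the trail byte, or 0 if the character is one byte wide.
    class MbcsString
    {
        struct Slot { char lead, trail; };
        std::vector<Slot> slots_;

    public:
        explicit MbcsString(const std::string &raw)
        {
            for (std::size_t i = 0; i < raw.size(); ++i) {
                Slot s = { raw[i], 0 };
                if (is_dbcs_lead(raw[i]) && i + 1 < raw.size())
                    s.trail = raw[++i];      // consume the trail byte too
                slots_.push_back(s);
            }
        }

        std::size_t length() const { return slots_.size(); } // in characters

        // Random access by character index. A backward scan is just a
        // loop from length() - 1 down to 0.
        std::string at(std::size_t i) const
        {
            std::string c(1, slots_[i].lead);
            if (slots_[i].trail)
                c += slots_[i].trail;
            return c;
        }
    };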
Fortunately, UTF is much better than MBCS. Given a code unit in either UTF-8 or UTF-16, it's fairly easy to figure out whether it's part of a longer sequence (a continuation byte or a surrogate), and which part. If you have assurance that the string you are handling is well formed, you can scan a UTF string backwards fairly easily.
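For example, a backward step in each encoding looks roughly like this (my sketch, assuming well-formed input):

    // UTF-8: continuation bytes have the form 10xxxxxx, so step back
    // over them until we land on a lead byte.
    const char *utf8_prev(const char *start, const char *p)
    {
        do {
            --p;
        } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
        return p;
    }

    // UTF-16: a low (trail) surrogate is in 0xDC00-0xDFFF, so step back
    // one extra unit to reach the high surrogate that starts the pair.
    const unsigned short *utf16_prev(const unsigned short *start,
                                     const unsigned short *p)
    {
        --p;
        if (p > start && *p >= 0xDC00 && *p <= 0xDFFF)
            --p;
        return p;
    }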
> Thanks, --Juan
>
> Do you Yahoo!? Declare Yourself - Register online to vote today! http://vote.yahoo.com
Do you want a gmail account?
Shachar
--- Shachar Shemesh <wine-devel@shemesh.biz> wrote:
> I took the liberty of answering to the list. I hope you don't mind.
Not at all.
> I most certainly didn't say that. I may have mentioned UCS-4, but to the best of my knowledge at the time, Windows uses UTF-16.
Ah. It might have been Chris Hertel who said that, then. The Samba folks may see that on the wire.
> > WCHARs are in fact fixed-width in Windows?
>
> As far as I know, they are not. Sorry.
Okay. That's fine. I'm just trying to understand the encodings correctly.
> > I'm planning to write a tool to detect the following problematic bit of code:
> >
> >     char str[] = "hi", *p = str + sizeof(str) - 1;
> >     p--;
> >
> > At least, it's problematic when str contains double-byte characters.
> I'm not sure what you're aiming to achieve. Are you trying to hit the beginning of the last character of the string? If so, then you do, indeed, have a problem here.
Yes, that's what the code's doing. I'm actually doing a research project for a class. My project partner and I are thinking of using static analysis to detect this sort of bug. We can probably just use lexical analysis to detect other bogus things, like strchr and strrchr. We're thinking some tools like this might help catch some internationalization bugs.
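One concrete case worth flagging (my example; this is the classic Shift-JIS backslash problem): the character U+8868 encodes in Shift-JIS as the bytes 0x95 0x5C, and 0x5C is also the backslash, so strrchr on a path can land on the trail byte of a double-byte character:

    #include <cstdio>
    #include <cstring>

    int main()
    {
        // "C:\" followed by U+8868 in Shift-JIS (bytes 0x95 0x5C).
        // The trail byte 0x5C happens to equal '\\'.
        const char path[] = "C:\\\x95\x5C";

        // strrchr knows nothing about lead/trail bytes, so it finds
        // the trail byte at offset 4 instead of the real separator
        // at offset 2 -- the "filename" would start mid-character.
        const char *sep = std::strrchr(path, '\\');
        std::printf("separator found at offset %d\n", (int)(sep - path));
        return 0;
    }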
> In the past I have written programs that had to handle MBCS (the non-Unicode Japanese encoding, i.e. Shift-JIS). This is an encoding in which some characters are one byte and some are two. The best I could come up with was to build a wrapper around std::string that internally used two bytes per character. When it loaded a string, it would check, character by character, whether each one was double-byte, so that each internal slot held exactly one character. This allowed random access, as well as both forward AND backward scanning.
That seems reasonable.
> Fortunately, UTF is much better than MBCS. Given a code unit in either UTF-8 or UTF-16, it's fairly easy to figure out whether it's part of a longer sequence (a continuation byte or a surrogate), and which part. If you have assurance that the string you are handling is well formed, you can scan a UTF string backwards fairly easily.
Indeed. Like you said, it's the MBCS/DBCS encodings that are particularly bad in this respect.
> Do you want a gmail account?
Got one, haven't used it much yet.
Thanks, --Juan