Hi,
On Redhat 9 I get errors like this one when doing a 'make htmlpages':
Malformed UTF-8 character (unexpected non-continuation byte 0x6e, immediately after start byte 0xfc) in substitution (s///) at ../../tools/c2man.pl line 313, <SOURCE_FILE> line 2.
This is because I have LANG="en_US.UTF-8" as part of my environment, and perl will now switch to character semantics (as opposed to byte semantics) when it detects a Unicode character set. Wine source files contain characters with ordinals > 127 (it looks like the Wine sources are ISO_8859-1) and of course, these usually don't also form valid UTF-8 sequences.
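The failure can be reproduced outside Perl. The following is only a sketch in Python, not what c2man.pl does: the sample bytes are made up to match the 0xfc, 0x6e pair from the error message, and `try_decode` is a hypothetical helper.

```python
# "fünf" as ISO-8859-1 bytes: 0xfc (ü) is followed by 0x6e (n),
# the same byte pair the Perl error message complains about.
raw = b"f\xfcnf"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")         # None: not a valid UTF-8 sequence
as_latin1 = try_decode(raw, "iso-8859-1")  # decodes, every byte is a character

print(as_utf8, as_latin1)
```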
Off hand I see three solutions (in order of increasing acceptability):
1. Convert Wine source files to ASCII-7 or UTF-8
2. Set character set to "C" or "ISO8859-1" prior to running perl on the sources
3. Force perl back into using byte semantics
1. Most non-ASCII-7 characters are in C comments (in the names of authors, e.g. Ove Kåven). But there are files like dlls/x11drv/keyboard.c that contain them as part of a C string. Going this way would mean these characters would have to be escaped.
Of course this is a step backwards. It would degrade readability of the sources and probably offend some awkwardly named authors ;^) It would also require a change of practice, which is hard to accomplish.
Converting to UTF-8 seems more promising. C strings still need to be escaped but then our Hungarian authors can finally have their names spelled properly in the sources! Still, there are more programs that have to interpret C source files and I estimate that most of them do not yet handle UTF-8 properly (though vi and emacs are amongst the capable).
2. Changing the character set beforehand will get rid of the errors and is much less controversial than the above solution ;) But it's still cumbersome to have to do so.
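Option 2 amounts to overriding the locale variables for the child process only. A minimal sketch in Python, assuming a hypothetical wrapper `run_with_locale`; the child command here is a stand-in that just echoes its environment, since the real command line for the Perl tool is not shown in this thread.

```python
import os
import subprocess
import sys


def run_with_locale(cmd, locale="C"):
    """Run cmd with LANG and LC_ALL overridden; return its stdout as text."""
    env = dict(os.environ)   # copy, so the caller's environment is untouched
    env["LANG"] = locale
    env["LC_ALL"] = locale   # LC_ALL takes precedence over LANG
    done = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    return done.stdout.decode("ascii", errors="replace")


# Stand-in child process: reports the LC_ALL value it was started with.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
seen = run_with_locale(child)
print(seen.strip())
```

The cumbersome part is that every invocation of every Perl tool has to go through such a wrapper.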
3. This will work regardless of the character set specified in the user's environment. The attached patch does this for c2man.pl
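The patch itself is not reproduced here. As a rough analogue only, this is what byte semantics looks like in Python: the pattern and the data are both bytes, so nothing is ever decoded and the locale cannot matter. The pattern and the sample line are illustrative assumptions, not taken from c2man.pl.

```python
import re

# A bytes pattern makes the regex engine work on raw bytes.
COMMENT_START = re.compile(rb"/\*+\s*")


def strip_comment_start(line: bytes) -> bytes:
    """Remove a leading C comment opener from a raw source line."""
    return COMMENT_START.sub(b"", line, count=1)


# ISO-8859-1 bytes that are not valid UTF-8 (0xe5 is å).
line = b"/* Copyright Ove K\xe5ven */"
result = strip_comment_start(line)
print(result)
```

The non-ASCII byte passes through the substitution untouched, whatever LANG is set to.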
Bye,
-Hans
Changelog: Force perl to use byte semantics
Hans Leidekker wrote:
Hi,
This is because I have LANG="en_US.UTF-8" as part of my environment, and perl will now switch to character semantics (as opposed to byte semantics) when it detects a Unicode character set. Wine source files contain characters with ordinals > 127 (it looks like the Wine sources are ISO_8859-1)
No, they are in whatever locale the string is. In particular, the entire keyboard code is filled to the brim with strings, each with a different locale. I'm talking about functional code here, not something which is only inside comments.
Another place where everything is in a different locale is the resources.
and of course, these usually don't also form valid UTF-8 sequences.
Off hand I see three solutions (in order of increasing acceptability):
- Convert Wine source files to ASCII-7 or UTF-8
No can do ASCII. A hebrew "שלום" will not look good, or at all, for that matter, in ASCII. UTF-8 may work for resources, if the resource compiler is adjusted accordingly, but not inside the code, where the encoding actually matters for the code that parses it.
- Set character set to "C" or "ISO8859-1" prior to
running perl on the sources
That sounds better, I think... What does perl do with the sources again?
- Most non-ASCII-7 characters are in C comments (in
the names of authors, e.g. Ove Kåven). But there are files like dlls/x11drv/keyboard.c that contain them as part of a C string. Going this way would mean these characters would have to be escaped.
I offered that some time ago. This can also mean that the strings can be unicode proper. The general consensus at the time was that this should not be the case, so that the maintainer of the language can easily check their layout.
Converting to UTF-8 seems more promising. C strings still need to be escaped but then our Hungarian authors can finally have their names spelled properly in the sources! Still, there are more programs that have to interpret C source files and I estimate that most of them do not yet handle UTF-8 properly (though vi and emacs are amongst the capable).
Plus you have not solved the functional strings problem.
Shachar
On Sat, 17 May 2003, Shachar Shemesh wrote:
No, they are in whatever locale the string is. In particular, the entire keyboard code is filled to the brim with strings, each with a different locale. I'm talking about functional code here, not something which is only inside comments.
I know Wine sources are not declared as adhering to any particular character set, but when I display them using ISO_8859-1 I see the least distortions. That's why I said "it looks like" they are ISO_8859-1.
No can do ASCII. A hebrew "ש×××" will not look good, or at all, for that matter, in ASCII.
That's obvious. Hebrew won't look good in ISO_8859-1 either. Then, like I said, your option is to "escape" characters outside ASCII-7, like Germans do with their umlauts. If that Hebrew string you presented is your name, then "Shachar" could be seen as an escaped ASCII-7 notation for it, couldn't it?
UTF-8 may work for resources, if the resource compiler is adjusted accordingly, but not inside the code, where the encoding actually matters for the code that parses it.
- Set character set to "C" or "ISO8859-1" prior to
running perl on the sources
That sounds better, I think... What does perl do with the sources again?
By Perl I in fact mean any Wine tool that's written in Perl. Mostly running regexps on the sources is what they do I guess.
Plus you have not solved the functional strings problem.
What do you mean by "functional strings"?
-Hans
Hans Leidekker wrote:
On Sat, 17 May 2003, Shachar Shemesh wrote:
No, they are in whatever locale the string is. In particular, the entire keyboard code is filled to the brim with strings, each with a different locale. I'm talking about functional code here, not something which is only inside comments.
I know Wine sources are not declared as adhering to any particular character set, but when I display them using ISO_8859-1 I see the least distortions. That's why I said "it looks like" they are ISO_8859-1.
That's because people with names outside of the 8859-1 charset rarely assume that any client will be able to read their name, so they write it in Latin letters (the Japanese call it "Romaji"). European names, on the other hand, rarely have pure-Latin transcriptions, because the letters are too similar. Irony.
No can do ASCII. A hebrew "שלו×" will not look good, or at all, for that matter, in ASCII.
As your locale is UTF-8, you made my string twice as long `-)
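The doubling is easy to check. A small Python sketch, assuming the original string was sent in ISO-8859-8 (the thread does not state the encoding) and that the receiving side displayed the re-encoded UTF-8 bytes as ISO-8859-1:

```python
word = "שלום"

legacy = word.encode("iso-8859-8")  # one byte per Hebrew letter
utf8 = word.encode("utf-8")         # two bytes per Hebrew letter

# What a display that assumes ISO-8859-1 makes of the UTF-8 bytes:
# every letter turns into "×" (0xd7) followed by a second character.
misread = utf8.decode("iso-8859-1")

print(len(legacy), len(utf8), len(misread))
```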
That's obvious. Hebrew won't look good in ISO_8859-1 either.
No, but it will, at least, be preserved. Not critical to comments, but it is critical to non-Latin strings.
Then, like I said, your option is to "escape" characters outside ASCII-7, like Germans do with their umlauts.
Care to show what you mean?
If that Hebrew string you presented is your name,
Nah, far too long for that. My name is just three letters. Get the full story at http://www.shemesh.biz/sun.html.
then "Shachar" could be seen as an escaped ASCII-7 notation for it, couldn't it?
If you mean that instead of writing "שחר", I should write "\xfa\xe8\xf9", then I think you are talking non-practical solutions here. It took me less than a second to write the native version - I just typed it. It took me almost a minute to write the escaped version, and I can only speculate as to whether I got it right. I just redid it, because I have, in fact, not got it right. What CJK people are expected to do is not something I would like to contemplate. In addition to that, no one, not even Hebrew speakers, can be reasonably expected to understand what is written there. That is a major source of problems.
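A hand-written escape can at least be checked mechanically. A sketch in Python, assuming ISO-8859-8 as the target encoding (the message does not say which one was meant); `c_escape` is a hypothetical helper. Under that assumption the computed bytes are \xf9\xe7\xf8, each one lower than the hand-typed version above, which rather supports the point about how error-prone this is.

```python
def c_escape(text, encoding):
    """Render text as C-style \\xNN escapes in the given single-byte encoding."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode(encoding))


escaped = c_escape("שחר", "iso-8859-8")
print(escaped)
```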
Having said that, there is one place I did exactly this in the Wine sources. In dlls/commdlg/fontdlg.c, you can find, near the beginning of the file, a table of the characters that the font dialog should display for the corresponding locale. The entries in that table are in UTF-16, as I couldn't make each string of a different locale. As a result, they are, indeed, unreadable. As this is not a true string, but simply a few characters to demo a font, I'm hoping it will not matter much.
UTF-8 may work for resources, if the resource compiler is adjusted accordingly, but not inside the code, where the encoding actually matters for the code that parses it.
- Set character set to "C" or "ISO8859-1" prior to
running perl on the sources
That sounds better, I think... What does perl do with the sources again?
By Perl I in fact mean any Wine tool that's written in Perl. Mostly running regexps on the sources is what they do I guess.
Then I vote for this. 8859-1 will not distort the sources, which is all that is really required.
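The reason 8859-1 cannot distort anything is that all 256 byte values map to a character, so any file survives a decode and re-encode unchanged. A short Python sketch of that property, with a lossy UTF-8 read for contrast (the sample bytes are illustrative):

```python
all_bytes = bytes(range(256))

# ISO-8859-1 maps every byte to exactly one character and back.
round_trip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")

# Reading ISO-8859-1 data as UTF-8 with replacement loses the original byte.
sample = b"K\xe5ven"
lossy = sample.decode("utf-8", errors="replace").encode("utf-8")

print(round_trip == all_bytes, lossy == sample)
```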
Plus you have not solved the functional strings problem.
What do you mean by "functional strings"?
I mean strings that actually perform some function, as opposed to comments. The most prominent example is keyboard.c, where each string is of a different encoding. The code at fontdlg.c is also an example.
-Hans
Much though I like UTF-8, I think it is totally and utterly inappropriate for handling the Wine code. Like it or not, MS chose UTF-16 (actually, they chose UCS-2, and then made it UTF-16 when it was invented, IIRC), and that's what Wine must choose as well. Given that fact, it makes no sense to have strings inside Wine in UTF-8, as that would require runtime conversions. If the strings are not UTF-8, there is no reason to make the comments so.
Shachar