Charles Davis cdavis@mymail.mines.edu wrote:
On 11/9/10 12:13 PM, James Mckenzie wrote:
No, it is not a bug in GNU sed. The authors.c file needs to have the erroneous characters for the language used by MacOSX changed to be acceptable?
That ain't gonna fly. I think we should explicitly use a UTF-8 locale (like en_US.UTF-8 or some such) instead of the C locale when sed goes over the AUTHORS file.
Don't shoot the messenger. Maybe we can force the use of sed if it exists in the /usr/bin directory then to get around the 'brokenness' of GNU sed on the Mac? If not, it is a real bear to set the language on a Mac per previous discussions on the Users list.
James McKenzie
On 11/9/10 1:58 PM, James Mckenzie wrote:
Don't shoot the messenger.
Sorry.
The problem with your first idea--removing the bad characters directly from the authors.c file--is that we'd need to use a utility like sed or awk to implement it automatically--which puts us right back where we started. (We could use diff/patch, but is it worth the effort to maintain a patch for this? And would AJ let us put the patch file in Wine? And if not, where would we put it?)
Maybe we can force the use of sed if it exists in the /usr/bin directory then to get around the 'brokenness' of GNU sed on the Mac?
Maybe. But that seems like a hack. A better way might be to detect if we're on Mac OS and using GNU sed; in that case, we use /usr/bin/sed. That's less of a hack, but still a hack.
If not, it is a real bear to set the language on a Mac per previous discussions on the Users list.
That was about setting LANG. Wine always obeys LC_*, and so does sed.
It's not the language that's the problem. It's the encoding. The AUTHORS file is encoded in UTF-8, but GNU sed isn't using UTF-8 because we told it not to (i.e. we told it to use MacRoman because that's the default encoding for the C locale). If we tell it to use UTF-8 (by setting LC_ALL to, for example, 'en_US.UTF-8'), it will process the file correctly.
Unfortunately, I just remembered that the name of the UTF-8 encoding is different on Mac OS ('UTF-8') and Linux ('utf8'). That might prevent us from setting LC_ALL differently. We might end up having to hack around this the way either you or I described.
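For what it's worth, the "detect Mac OS plus GNU sed" idea could be sketched roughly as below. This is only an illustration, not what the Wine build does: the helper name `choose_sed` is made up, and it assumes that GNU sed answers `sed --version` with a line containing "GNU" while BSD sed rejects that flag.

```sh
# Hypothetical helper: decide which sed to run.
# $1 is the OS name (e.g. from `uname -s`); $2 is the first line of
# `sed --version` output, or empty if sed rejected the flag.
choose_sed() {
    os=$1
    version=$2
    case "$os" in
        Darwin)
            case "$version" in
                # GNU sed is first in PATH on a Mac: use the system sed instead.
                *GNU*) echo /usr/bin/sed; return ;;
            esac ;;
    esac
    # Everywhere else, keep whatever sed is first in PATH.
    echo sed
}

SED=$(choose_sed "$(uname -s)" "$(sed --version 2>/dev/null | head -n 1)")
echo "using: $SED"
```

Passing the OS name and version string in as arguments keeps the decision itself easy to check without needing a Mac at hand.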
Chip
On 9 November 2010 22:13, Charles Davis cdavis@mymail.mines.edu wrote:
Unfortunately, I just remembered that the name of the UTF-8 encoding is different on Mac OS ('UTF-8') and Linux ('utf8'). That might prevent us from setting LC_ALL differently. We might end up having to hack around this the way either you or I described.
You could use autoconf to detect: 1/ broken handling of UTF-8 characters by sed; 2/ name of LC_ALL flag that handles UTF-8
NOTE: You will need to enumerate available locales as the user may not have en_US present with UTF-8 encoding (e.g. a Spanish-only or Chinese-only system).
Something like:
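cat > get_locale.sh << 'EOF'
# Print the first installed locale under which the sed run succeeds.
# (The sed script argument is omitted here, as in the rest of this sketch.)
locale -a | while read -r locale ; do
    if LC_ALL=$locale sed < authors.c > /dev/null ; then
        echo "$locale"
        exit
    fi
done
EOF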
This should print a locale that can process the UTF-8 file. It needs cleaning up a bit, but that is the basis of it.
HTH, - Reece
On 11/9/10 3:29 PM, Reece Dunn wrote:
This should print a locale that can process the UTF-8 file. It needs cleaning up a bit, but that is the basis of it.
Thanks Reece.
Charles: You want to do this?
James McKenzie
On 11/9/10 7:58 PM, James McKenzie wrote:
Thanks Reece.
Charles: You want to do this?
I'm on it.
If you have a patch ready, though, go for it.
Chip
On 11/9/10 8:02 PM, Charles Davis wrote:
I'm on it.
If you have a patch ready, though, go for it.
No, I'm stuck with a problem in richedit. Besides, you have more Mac-specific knowledge than I do, and I'm happy to say that. Although, if you need a test 'victim', I'm here for you.
James McKenzie
On Nov 9, 2010, at 4:29 PM, Reece Dunn wrote:
You could use autoconf to detect: 1/ broken handling of UTF-8 characters by sed; 2/ name of LC_ALL flag that handles UTF-8
In theory, you only need to set LC_CTYPE, not any other aspect of the locale. And for that, you don't need the language or country. On Mac OS X, the encoding can be bare, such as LC_CTYPE=UTF-8.
The Makefile used to set LANG, then commit 492ac292b918a3369900532e4edfadaeeba32064 changed it to LC_ALL. That wasn't explained. I assume it was because LANG could be superseded by LC_* variables in the user's environment, and that is undesirable.
Perhaps another approach would be to explicitly unset LC_ALL and export LC_CTYPE=UTF-8.
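The precedence I'm assuming here (LC_ALL first, then the category variable, then LANG, with an empty value counting as unset) can be sketched in shell. The function name `effective_ctype` is made up for illustration, and the final fallback to "C" is an assumption; POSIX leaves that default to the implementation.

```sh
# Print the locale name that would be consulted for LC_CTYPE:
# LC_ALL if set and non-empty, else LC_CTYPE, else LANG, else "C" (assumed default).
effective_ctype() {
    echo "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# What the current environment resolves to.
effective_ctype

# The approach suggested above, run in a subshell so the caller's
# environment is left alone: with LC_ALL out of the way, LC_CTYPE decides.
(
    unset LC_ALL
    LC_CTYPE=UTF-8
    export LC_CTYPE
    effective_ctype
)
```

This only shows which variable wins; whether the C library then accepts the value is a separate question.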
On Nov 9, 2010, at 4:13 PM, Charles Davis wrote:
Unfortunately, I just remembered that the name of the UTF-8 encoding is different on Mac OS ('UTF-8') and Linux ('utf8').
Are you sure about that? Checking on a couple of Linux systems here, the "locale" command reports:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
...
Hmm. However, using a bare encoding for LC_CTYPE doesn't seem to fly on Linux. Darn, so close to a simple fix. :(
-Ken
--- On Wed, 10/11/10, Ken Thomases ken@codeweavers.com wrote:
From: Ken Thomases ken@codeweavers.com
Subject: Re: AUTHORS list and the C locale on Mac OS X
To: "Reece Dunn" msclrhd@googlemail.com
Cc: "wine-devel" wine-devel@winehq.org
Date: Wednesday, 10 November, 2010, 20:08
Are you sure about that? Checking on a couple of Linux systems here, the "locale" command reports:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
...
Mine (Fedora x86_64) does the utf8 thing:
# locale
LANG=en_GB.utf8
LC_CTYPE="en_GB.utf8"
...
So there is some truth in the reporter's assertion; what it means is that it varies between different Linuxes!
On Nov 10, 2010, at 2:27 PM, Hin-Tak Leung wrote:
Mine (Fedora x86_64) does the utf8 thing:
# locale
LANG=en_GB.utf8
LC_CTYPE="en_GB.utf8"
...
So there is some truth in the reporter's assertion; what it means is that it varies between different Linuxes!
I should have been clearer. The output just reflects your environment. So, you have LANG set to en_GB.utf8. I had LANG set to en_US.UTF-8. My only point was to say that the "UTF-8" form is acceptable. It was not to suggest that "utf8" is not, nor that one or the other is a standard.
The real question is: does the Linux C library accept 'UTF-8' in the environment variables? I believe it does, which is useful because that's what Mac OS X requires. (It doesn't accept "utf8".)
For example, the following reports just fine on some Linux systems here:
LC_ALL=en_GB.UTF-8 locale
As does your case:
LC_ALL=en_GB.utf8 locale
But the following both produce some diagnostics indicating that the C library is choking on the value:
LC_ALL=en_GB.bogus locale
LC_ALL=en_GB.UTF-9 locale
I take this to mean it's a legitimate test of whether a value is valid. Further, it indicates that (at least some) Linuxes take either form.
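That test could be wrapped up along these lines. It is only a sketch: the name `locale_is_usable` is invented, and it relies on the behaviour seen above, namely that `locale` complains on stderr when the C library cannot use the value. A C library that silently accepts any name would make every candidate look fine.

```sh
# Succeed if `locale` prints no diagnostics with LC_ALL set to "$1".
# stderr is captured; the normal listing on stdout is thrown away.
locale_is_usable() {
    [ -z "$(LC_ALL=$1 locale 2>&1 >/dev/null)" ]
}

for candidate in en_GB.UTF-8 en_GB.utf8 en_GB.bogus; do
    if locale_is_usable "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: rejected"
    fi
done
```

The results depend entirely on which locales are installed, so the output will differ from machine to machine.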
Regards, Ken
--- On Wed, 10/11/10, Ken Thomases ken@codeweavers.com wrote:
I take this to mean it's a legitimate test of whether a value is valid. Further, it indicates that (at least some) Linuxes take either form.
On my system (Fedora 14 x86_64), the valid locales are stored in /usr/share/X11/locale/, which is part of libX11-common,
together with /usr/share/X11/locale/locale.alias, which defines aliases (like the lowercase/uppercase forms with or without "-" above).
I had the impression that these things used to be in glibc-common or glibc-locale, but it seems that they have moved.
On 10 November 2010 22:45, Ken Thomases ken@codeweavers.com wrote:
I take this to mean it's a legitimate test of whether a value is valid. Further, it indicates that (at least some) Linuxes take either form.
I'm getting the same behaviour (Ubuntu 10.10) -- LC_ALL accepts either utf8 or UTF-8 for en_GB, en_IE, etc. The caveat here is that the primary locale needs to exist (and presumably needs to have a UTF-8 variant present).
That is, as I don't have a French locale (fr_FR) installed on my machine, the following reports errors:
LC_ALL=fr_FR.UTF-8 locale
This means that systems that don't have the English locale installed (en_US or en_GB, whichever is chosen) will still fail.
What is wrong with iterating over the content of `locale -a` or `locale -a | grep -F utf8` to find a UTF-8 based locale? Or even:
LC_ALL=`locale -a | grep -F utf8 | head -n 1` sed ... authors.c
- Reece
On Nov 10, 2010, at 5:00 PM, Reece Dunn wrote:
I'm getting the same behaviour (Ubuntu 10.10) -- LC_ALL accepts either utf8 or UTF-8 for en_GB, en_IE, etc. The caveat here is that the primary locale needs to exist (and presumably needs to have a UTF-8 variant present).
That is, as I don't have a French locale (fr_FR) installed on my machine, the following reports errors:
LC_ALL=fr_FR.UTF-8 locale
This means that systems that don't have the English locale installed (en_US or en_GB, whichever is chosen) will still fail.
Understood.
What is wrong with iterating over the content of `locale -a` or `locale -a | grep -F utf8` to find a UTF-8 based locale? Or even:
LC_ALL=`locale -a | grep -F utf8 | head -n 1` sed ... authors.c
Nothing's terribly wrong, although I'd make that grep check for either "utf8" or "UTF-8", or it won't work on Mac OS X.
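A grep that takes both spellings might look like the following. This is a sketch only: the function name `first_utf8_locale` is made up, and it assumes the codeset is the last thing on each line of `locale -a` output, so names carrying an @modifier after the codeset are skipped.

```sh
# Read locale names (one per line, as printed by `locale -a`) on stdin
# and print the first one whose codeset is spelled "utf8" or "UTF-8",
# ignoring case.
first_utf8_locale() {
    grep -i -E '\.utf-?8$' | head -n 1
}

# Hypothetical use; the sed script is left out, as in the line quoted above:
#   LC_ALL=$(locale -a | first_utf8_locale) sed ... authors.c

# Demonstration on a fixed list instead of the real `locale -a` output.
printf 'C\nPOSIX\nen_GB.utf8\nen_US.UTF-8\n' | first_utf8_locale
# prints en_GB.utf8
```

If no UTF-8 locale is installed the function prints nothing, so a caller would need to check for an empty result before setting LC_ALL.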
The main point of my first email was the vain hope that setting LC_CTYPE=UTF-8 would be enough, but that hope was dashed. I sent the email, anyway, for no great reason.
Regards, Ken
Well, it looks like this whole discussion has just been rendered moot.
AJ committed 40977bf1d2f0f11a24fd9330dffac264fced2306 to Wine, which makes shell32 store the AUTHORS file as a resource instead of using sed to turn it into an array of strings.
Chip