Consider the following MSVC program:

--------------------- cut -------------------------
// PruebaOpenDlg.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#include <windows.h>

int main(int argc, char* argv[])
{
    OPENFILENAME ofn;       // common dialog box structure
    char szFile[260];       // buffer for file name

    // Initialize OPENFILENAME
    ZeroMemory(&ofn, sizeof(OPENFILENAME));
    ofn.lStructSize = sizeof(OPENFILENAME);
    ofn.hwndOwner = NULL;
    ofn.lpstrFile = szFile;
    ofn.nMaxFile = sizeof(szFile);
    ofn.lpstrFilter = "All\0*.*\0Text\0*.TXT\0";
    ofn.nFilterIndex = 1;
    ofn.lpstrFileTitle = NULL;
    ofn.nMaxFileTitle = 0;
    ofn.lpstrInitialDir = NULL;
//  ofn.Flags = OFN_PATHMUSTEXIST | OFN_FILEMUSTEXIST;

    // Display the Open dialog box.
    memset(szFile, 0, sizeof(szFile));
    if (GetOpenFileName(&ofn)==TRUE)
    {
        char * p;
        FILE * hFile;

        // Print the returned name, then each of its bytes in hex
        printf("Chosen filename is: %s\n", ofn.lpstrFile);
        printf("Byte encoding is :");
        for (p = ofn.lpstrFile; *p; p++)
        {
            printf(" (%c %02x)", *p, *p);
        }
        printf("\n");

        // Check whether the returned name can actually be opened
        hFile = fopen(ofn.lpstrFile, "rb");
        if (hFile != NULL)
        {
            fclose(hFile);
            puts("File is readable through specified filename");
        }
        else
        {
            printf("Unable to reach file through %s - %s\n", ofn.lpstrFile, strerror(errno));
        }
    }
    return 0;
}
--------------------- cut -------------------------
Consider also the following Linux environment: home directory is /home/alex, and is mapped to drive F: in dosdevices. The home directory contains a directory named gatón (the string contains a [U+00F3 LATIN SMALL LETTER O WITH ACUTE] and is UTF-8 encoded as 0x67 0x61 0x74 0xC3 0xB3 0x6E), inside of which a sample file exists, which is to be selected by the Open File dialog. All tests were made in a Fedora Core 4 system with a *default* LANG=es_EC.UTF-8.
The symptom is that when wine runs with a UTF-8 locale (as specified by the LANG environment variable) and an attempt is made to choose a filename that is UTF-8 encoded in the filesystem, GetOpenFileNameA may return a byte string that CreateFile and other file functions are unable to map to a valid filename. Whether GetOpenFileNameA returns a valid filename or not seems to depend on the way the navigation is performed. That is, if the application starts the Open File dialog from the current directory, and the user navigates by directory change only, the invalid filename will be returned. However, if the user first chooses a drive letter (such as F:) and then navigates from there, the filename returned is a valid one.
The following tests illustrate the behavior. For each entry, the first two lines are the conditions for the test. The remaining three lines are the actual output from the supplied program, copied and pasted from the console. The instances of \uffff seen are from invalid character encodings displayed in the console.
Case 1: LANG=en_US
From current directory /home/alex:
Chosen filename is: f:\gatón\Barenaked Ladies - One Week.mp3
Byte encoding is : (f 66) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff ffffffc3) (\uffff ffffffb3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
File is readable through specified filename
Case 2: LANG=en_US
From explicit choice from drive F: :
Chosen filename is: F:\gatón\Barenaked Ladies - One Week.mp3
Byte encoding is : (F 46) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff ffffffc3) (\uffff ffffffb3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
File is readable through specified filename
Case 3: LANG=es_EC
From current directory /home/alex:
Chosen filename is: f:\gatón\Barenaked Ladies - One Week.mp3
Byte encoding is : (f 66) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff ffffffc3) (\uffff ffffffb3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
File is readable through specified filename
Case 4: LANG=es_EC
From explicit choice from drive F: :
Chosen filename is: F:\gatón\Barenaked Ladies - One Week.mp3
Byte encoding is : (F 46) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff ffffffc3) (\uffff ffffffb3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
File is readable through specified filename
Case 5: LANG=es_EC.UTF-8
From current directory /home/alex:
Chosen filename is: f:\gatón\Barenaked Ladies - One Week.mp3
Byte encoding is : (f 66) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff ffffffc3) (\uffff ffffffb3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
Unable to reach file through f:\gatón\Barenaked Ladies - One Week.mp3 - No such file or directory
Case 6: LANG=es_EC.UTF-8
From explicit choice from drive F: :
Chosen filename is: F:\gat\uffffn\Barenaked Ladies - One Week.mp3
Byte encoding is : (F 46) (: 3a) (\ 5c) (g 67) (a 61) (t 74) (\uffff fffffff3) (n 6e) (\ 5c) (B 42) (a 61) (r 72) (e 65) (n 6e) (a 61) (k 6b) (e 65) (d 64) ( 20) (L 4c) (a 61) (d 64) (i 69) (e 65) (s 73) ( 20) (- 2d) ( 20) (O 4f) (n 6e) (e 65) ( 20) (W 57) (e 65) (e 65) (k 6b) (. 2e) (m 6d) (p 70) (3 33)
File is readable through specified filename
Case 5 (LANG=es_EC.UTF-8, navigating from the current directory) is the incorrect one, but it is also the easiest to hit in the UTF-8 locales.
This problem is significant because all Fedora distributions since at least Fedora Core 2 have UTF-8 support, which is probably enabled in non-US locales. Other popular distributions probably have this UTF-8 support enabled too.

I am posting this on wine-devel instead of creating a bug report because I wanted to receive some comments on what the expected behavior should be before trying to submit a patch myself. Unless somebody says otherwise, I would try to submit a patch that makes case 5 behave like case 6, by modifying the encoding of the ANSI string to match what the file-open functions expect for the filename.

However, this essentially requires an answer to the following question: should non-Unicode strings that represent filenames be UTF-8 encoded, or locale encoded? In the UTF-8 locales, GetOpenFileNameA sometimes seems to return UTF-8 encoded strings, but the file-open functions expect locale-encoded ones (in my case, ISO-8859-1), hence the incorrect behavior. How would the answer change (if at all) for Chinese or Japanese locales, which need multibyte characters?
Alex Villacís Lasso
Hi Alex,
On Monday 21 November 2005 17:23, Alex Villacís Lasso wrote:
> Whether GetOpenFileNameA returns a valid filename or not seems to depend on the way the navigation is performed. That is, if the application starts the Open File dialog from the current directory, and the user navigates by directory change only, the invalid filename will be returned. However, if the user first chooses a drive letter (such as F:) and then navigates from there, the filename returned is a valid one.
Seems to be a bug in the unixfs namespace extension. I will have a look into this as soon as time permits.
Bye,
Michael Jung wrote:
> Hi Alex,
> On Monday 21 November 2005 17:23, Alex Villacís Lasso wrote:
>> Whether GetOpenFileNameA returns a valid filename or not seems to depend on the way the navigation is performed. That is, if the application starts the Open File dialog from the current directory, and the user navigates by directory change only, the invalid filename will be returned. However, if the user first chooses a drive letter (such as F:) and then navigates from there, the filename returned is a valid one.
> Seems to be a bug in the unixfs namespace extension. I will have a look into this as soon as time permits.
> Bye,
I was rather hoping for an explanation of which is the "correct" behavior for a UTF-8 locale:
1) Open File Dialog returns a UTF-8 encoded string (visible to the application, current behavior), and open-file functions expect UTF-8
2) Open File Dialog returns locale-encoded string (even in the UTF-8 locale), and open-file functions expect locale-encoding (as they do now)
Your comment strongly suggests (2) is the correct approach, but what happens in East Asian locales? Or am I just demonstrating a lack of knowledge on how non-UTF8 encodings work?
Alex Villacís Lasso
Alex Villacís Lasso <a_villacis@palosanto.com> writes:
> I was rather hoping for an explanation of which is the "correct" behavior for a UTF-8 locale:
> 1) Open File Dialog returns a UTF-8 encoded string (visible to the application, current behavior), and open-file functions expect UTF-8
> 2) Open File Dialog returns locale-encoded string (even in the UTF-8 locale), and open-file functions expect locale-encoding (as they do now)
2) is the right behavior. The A functions always return strings in the Ansi codepage, not in the Unix one. There is no Windows locale that uses UTF-8 as Ansi codepage, so if a UTF-8 string is returned to the application that's a bug.
> Your comment strongly suggests (2) is the correct approach, but what happens in East Asian locales? Or am I just demonstrating a lack of knowledge on how non-UTF8 encodings work?
Asian locales will use one of the double-byte codepages.
On Tue, 22 Nov 2005 06:54, Alexandre Julliard wrote:
> 2) is the right behavior. The A functions always return strings in the Ansi codepage, not in the Unix one. There is no Windows locale that uses UTF-8 as Ansi codepage, so if a UTF-8 string is returned to the application that's a bug.
This is of course correct for GetOpenFileNameA and GetSaveFileNameA.
For the UNIX path entry points in the UNIX paths branch (WineOSFSGetOpenFileNameA and WineOSFSGetSaveFileNameA), the correct behaviour is to return the string in the UNIX locale character set.
The difficulty for the UnixFS code is that it sometimes has to convert using the UNIX locale character set (CP_UNIX), and sometimes with the Win32 one (CP_ACP), depending on the context. It may be that in some cases the wrong one is being used.
Hi,
On Monday 21 November 2005 18:38, Alex Villacís Lasso wrote:
> I was rather hoping for an explanation of which is the "correct" behavior for a UTF-8 locale:
Sorry. I guess I'm not that competent when it comes to character encoding stuff. But then, Alexandre and Troy already answered your question anyway.
On Monday 21 November 2005 20:54, Alexandre Julliard wrote:
> Asian locales will use one of the double-byte codepages.
Are you saying that on an Asian system CP_ACP is actually a double-byte encoding? Is anybody on the list using an Asian locale on her system? Does it break the unixfs extension?
Alex, could you please check whether the attached patch fixes the problem?
Bye,
On Tue, 22 Nov 2005 19:35, Michael Jung wrote:
> Are you saying that on an Asian system CP_ACP is actually a double-byte encoding? Is anybody on the list using an Asian locale on her system? Does it break the unixfs extension?
The Chinese, Japanese and Korean code pages (932, 936, 949 and 950) use two-byte sequences for some characters and a single byte for others (most notably the 7-bit ASCII range). There are some complications. For example, where a double-byte character is sent through a CHAR (not CHAR *) interface that eventually needs to pass the single CHAR on to an interface taking a WCHAR, the CHAR is treated as if it were in CP1252, and for symmetry the reverse conversion does the same thing in such cases.