This happens in the code that unmaps a message which was mapped from ASCII to Unicode. See windows/winproc.c, function WINPROC_UnmapMsg32ATo32W:
    case WM_GETTEXTLENGTH:
    case CB_GETLBTEXTLEN:
    case LB_GETTEXTLEN:
        /* there may be one DBCS char for each Unicode char */
        return result * 2;
What is the correct way to handle double-byte characters in this situation? How does Windows handle this? At the very least, can we return the doubled value only when the system metric SM_DBCSENABLED is true? We could have a switch in the config file for this system metric.
I came across this issue when using the default combo box control implementation in Delphi 6. I assume the same issue also exists for edit controls. The returned length is correct if I comment out the code above.
The existing behavior is a possible cause of a bug when entering serial numbers: the cursor jumps to the next edit field when only half of the text has been entered.
Thanks, Andriy
__________________________________________________ Do You Yahoo!? Yahoo! - Official partner of 2002 FIFA World Cup http://fifaworldcup.yahoo.com
I am not sure about the specific case, but I do have some experience with handling DBCS in general.
When you use TCHAR and define MBCS (which is the default with VCC; MS doing something nice for a change), TCHAR is (if my memory serves me correctly) a plain char. This means that it is the same size as a regular char.
The thing to understand when working with MBCS is that a single byte does not necessarily mean a single character. You get a stream of bytes, some will be 1 byte/character, and some 2.
NUL and newline are guaranteed never to be misrepresented. For that reason alone, most byte-by-byte processing will work on MBCS without a problem. If you are doing no string processing at all, you can simply ignore the MBCS possibility entirely.
Things do become messy if you want to do character-based calculations (i.e. I have 7 characters in the string, despite it being 10 bytes long), if you are looking for a particular character ('\' is a nasty example), or if you want to traverse the string backwards.
Traversing a MBCS string is akin to a forward iterator in STL. You have a macro (isleadbyte, IIRC) that lets you know whether the next byte is alone or part of a double byte. You are allowed to save the pointer and return to it, but when traversing the string backwards, it is very difficult for you to know whether the previous byte is a single character or not.
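A minimal sketch of the forward-only traversal described above. It assumes Shift-JIS (code page 932) lead-byte ranges, hard-coded so the example does not depend on the current locale; real code would ask isleadbyte() or IsDBCSLeadByte() instead. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Count characters (not bytes) by walking forward from the start. */
size_t sjis_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s) {
        /* A lead byte followed by a non-NUL byte forms one 2-byte character. */
        if (sjis_is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

For a 4-byte string made of one ASCII letter, one 2-byte character, and another ASCII letter, sjis_char_count() reports 3 characters while strlen() would report 4 bytes.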
Another problem is that the second byte of an MBCS character may be something you would find interesting on its own. As I said before, one nasty example is parsing a path and looking for '\' separators. There are some Japanese characters that, when encoded in MBCS, result in two bytes, the second one being '\'. When the proper locale is loaded, Windows knows not to treat this '\' as a directory separator, but your programs may fail to do so (does wine?).
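To illustrate the separator problem, here is a hedged sketch that finds the last real '\' in a path while skipping trail bytes. It again assumes Shift-JIS lead-byte ranges; the helper names are hypothetical, and this is not how Windows or Wine implement it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
int path_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return a pointer to the last real '\\' separator, or NULL if none. */
const char *sjis_last_separator(const char *path)
{
    const char *last = NULL;

    assert(path != NULL);
    while (*path) {
        if (path_is_lead_byte((unsigned char)*path) && path[1] != '\0') {
            /* Skip both bytes: the trail byte may be 0x5C without
             * being a separator. */
            path += 2;
        } else {
            if (*path == '\\')
                last = path;
            path++;
        }
    }
    return last;
}
```

With the bytes 0x95 0x5C (a single Shift-JIS character) at the end of "C:\", a plain strrchr(path, '\\') would land on the trail byte, while this walk returns the separator after the drive letter.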
These are the main issues when working with MBCS. I hope I have managed to help.
Shachar
Shachar, thank you for the detailed response.
MSDN says that the message returns the number of TCHARs. It looks like a two-byte MBCS character counts as 2 TCHARs. Nasty stuff :-(
Andriy
Not as nasty as when you find out that when you use CString, the same thing happens there. The MSDN used to say something along the lines of "CString fully supports MBCS", with a footnote stating that all the problems that happen when you work with strings still happen, and you take care of them the same way.
I wound up making a wrapper around the STL string class, using the same interface. I did this by using a basic_string<japanese_char>, where japanese_char was a class I wrote. It allocated two bytes for storage, and would collect two bytes if isleadbyte returned true, and one if false. This allowed the class to provide a random access iterator into the string (as opposed to a forward iterator, which is all you can afford using the usual MBCS).
The entire thing makes you appreciate UTF-8. Not only does it provide a bidirectional iterator, but if you are only parsing for ASCII characters, you can completely ignore the fact that it's a UTF-8 string, and parse it as if it were an ASCII string.
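The same idea can be sketched in plain C: expand the byte stream into fixed-width 16-bit units so that the n-th character can be indexed directly. This is only an analogue of the approach described, not the japanese_char class itself; the function names and the Shift-JIS lead-byte ranges are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC. */
int expand_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Store each character of src in one 16-bit slot of dst.
 * Returns the number of characters written (at most dst_cap). */
size_t sjis_expand(const char *src, unsigned short *dst, size_t dst_cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src && n < dst_cap) {
        unsigned char b = (unsigned char)*src;

        if (expand_is_lead_byte(b) && src[1] != '\0') {
            /* Two-byte character: pack lead and trail byte together. */
            dst[n++] = (unsigned short)((b << 8) | (unsigned char)src[1]);
            src += 2;
        } else {
            /* Single-byte character. */
            dst[n++] = b;
            src += 1;
        }
    }
    return n;
}
```

After expansion, dst[i] is the i-th character regardless of how many bytes it took in the original string, which is what makes random access possible.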
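A short sketch of why UTF-8 can be walked backwards: continuation bytes always have the bit pattern 10xxxxxx, so stepping back over them finds the start of the previous character. The function names are hypothetical and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Continuation bytes in UTF-8 always look like 10xxxxxx. */
int utf8_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Return the start of the character that ends just before p.
 * Never moves before start. Assumes valid UTF-8. */
const char *utf8_prev(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p >= start);
    if (p == start)
        return start;
    do {
        p--;
    } while (p > start && utf8_is_continuation((unsigned char)*p));
    return p;
}
```

Because no byte of a multi-byte UTF-8 sequence is below 0x80, a byte-wise search for an ASCII character such as '/' can never match in the middle of another character.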
Shachar
Andriy Palamarchuk wrote:
This happens in code which unmaps message, mapped from ASCII to Unicode. See windows/winproc.c, function WINPROC_UnmapMsg32ATo32W:
case WM_GETTEXTLENGTH: case CB_GETLBTEXTLEN: case LB_GETTEXTLEN: /* there may be one DBCS char for each Unicode char */ return result * 2;
What is the correct way to handle double-byte characters in this situation?
The best approach I could think of is to send an internal message from this location that returns the lengths of both the Unicode and the ASCII strings. This message would be processed only by our own controls. If the Unicode lengths are the same, both values were generated by our code for the same text, and I return the A length. If the lengths are different, the length was not generated by our code, and I keep the existing behavior (return double the original Unicode length).
This method looks pretty safe and gives correct behavior in almost all cases.
Comments, suggestions? Andriy
I think the correct thing to do is return the output of *_tcslen*. Since we don't have TCHARs inside wine, this translates to using wcslen if we're a UNICODE function, or strlen if we're not (notice that while _mbslen returns the number of characters in the string, strlen returns the number of bytes in the string. See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/...).
Also, assuming you are implementing this over my GetFontLanguageInfo patch (the title was "Extremely preliminary BiDi patch", not committed yet), you can use that function to find out whether special MBCS processing is necessary for the current locale. As I said in that email, I am not sure it is worth it, performance-wise.
If you do decide to use my patch, notice that GetFontLanguageInfo is a skeleton. I did include an MBCS pattern for those languages I happened to know required MBCS, but as this is far from my main field of expertise, errors are not unlikely.
Shachar
--- Shachar Shemesh wine-devel@sun.consumer.org.il wrote:
I think the correct thing to do is return the output of *_tcslen*. Since we don't have TCHARs inside wine, this translates to using wcslen if we're a UNICODE function, or strlen if we're not (notice that while _mbslen returns the number of characters in the string, strlen returns the number of bytes in the string. See http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/...).
Also, assuming you are implementing this over my GetFontLanguageInfo patch (the title was "Extremely preliminary BiDi patch", not committed yet), you can use that function to find out whether special MBCS processing is necessary for the current locale. As I said in that email, I am not sure it is worth it, performance-wise.
If you do decide to use my patch, notice that GetFontLanguageInfo is a skeleton. I did include an MBCS pattern for those languages I happened to know required MBCS, but as this is far from my main field of expertise, errors are not unlikely.
Shachar, this is a great explanation, but as I understand it, the processing you mentioned should be done in the window procedures of the controls (e.g. the edit field). Can you submit a bug for this? You can give better information than I can and add dependencies on other bugs about MBCS, BiDi, etc. (I guess this bug will be assigned to you as the component owner ;-) Can you also make it depend on bug 791?
To my rather limited knowledge, the problem with returning double the size lies in a different area: a chain of window procedures may contain W procedures in some places and A procedures in others. If an A procedure gets results from a W procedure, the returned results have to be mapped from Unicode to ASCII. The problem is that there is no way of knowing whether the returned number of Unicode characters corresponds to an MBCS string with 2-byte characters or to a plain string of 1-byte characters. The current code assumes the worst possible case: that every Unicode character corresponds to a 2-byte MBCS character.
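The gap between the worst-case guess and the real length can be shown with a small model. This is only a sketch: model_dbcs_byte_length() is a hypothetical stand-in that treats ASCII as one byte and everything else as two, whereas a real conversion would go through WideCharToMultiByte with the active code page.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical model of a DBCS code page: ASCII converts to one
 * byte, every other character to two bytes. */
size_t model_dbcs_byte_length(const wchar_t *s)
{
    size_t bytes = 0;

    assert(s != NULL);
    for (; *s; s++)
        bytes += ((unsigned long)*s < 0x80) ? 1 : 2;
    return bytes;
}

/* What the unmapping code can do when it only has a character
 * count and no access to the text itself. */
size_t worst_case_byte_length(size_t unicode_chars)
{
    return unicode_chars * 2;
}
```

For a four-character ASCII serial number segment, the model gives 4 bytes but the worst-case estimate gives 8, which is the kind of mismatch reported for the edit field.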
I submitted a bug for this: http://bugs.winehq.com/show_bug.cgi?id=791
Andriy
Andriy Palamarchuk wrote:
Shachar, this is a great explanation, but as I understand it, the processing you mentioned should be done in the window procedures of the controls (e.g. the edit field). Can you submit a bug for this? You can give better information than I can and add dependencies on other bugs about MBCS, BiDi, etc. (I guess this bug will be assigned to you as the component owner ;-) Can you also make it depend on bug 791?
OK, I think I am lacking context here. I will have a look at the actual source you were referring to sometime during the coming day or two.
Just one clarification. The very fact that the same person who is doing the BiDi support also happens to have experience with MBCS does not mean the two are in any way related. One is centered around west Asia and north Africa (a.k.a. the Middle East), while the other is in east Asia (a.k.a. the Far East).
I don't mind receiving ownership over all non-western languages, as I speak a BiDi language, and have experience in programming for an MBCS language (and between the two you almost cover the entire range of problematic languages), but that does not mean that the two are, in any way, related.
BTW, bringing us to the "ownership" issue. Is wine going to have a "Credits" file, sorted by lines of code submitted or similar criteria? It would be nice to have my name there in case I am ever fired for answering WINE emails while at work ;-).
To my rather limited knowledge, the problem with returning double the size lies in a different area: a chain of window procedures may contain W procedures in some places and A procedures in others. If an A procedure gets results from a W procedure, the returned results have to be mapped from Unicode to ASCII. The problem is that there is no way of knowing whether the returned number of Unicode characters corresponds to an MBCS string with 2-byte characters or to a plain string of 1-byte characters. The current code assumes the worst possible case: that every Unicode character corresponds to a 2-byte MBCS character.
Like I said before - I'll have a look at the actual context and voice my opinion.
I submitted a bug for this: http://bugs.winehq.com/show_bug.cgi?id=791
Andriy
Shachar
--- Shachar Shemesh wine-devel@sun.consumer.org.il wrote: [skipped]
I don't mind receiving ownership over all non-western languages, as I speak a BiDi language, and have experience in programming for an MBCS language (and between the two you almost cover the entire range of problematic languages), but that does not mean that the two are, in any way, related.
Hidenori Takeshima did a lot of work with MBCS (this is why I'm CCing him on the thread), and Dimitry Timoshkov has extensive experience with internationalization issues.
To clear my position - having a bug assigned to you does not mean you are required to work on it. I already assigned this issue to myself.
BTW, bringing us to the "ownership" issue. Is wine going to have a "Credits" file, sorted by lines of code submitted or similar criteria?
I do not like the idea of having such an "official" list, like the list of authors we have now. A contribution is a gift, and one does not appreciate gifts based on their price tag. However, I do not see anything wrong with creating such a list to satisfy curiosity.
It would be nice to have my name there in case I am ever fired for answering WINE emails while at work ;-).
Do not worry, after you are fired you'll have some time to create a script that parses the archive of the wine-cvs mailing list and generates this information ;-) The Changelog and wine-devel can also tell a lot about your work.
Andriy
Andriy Palamarchuk wrote:
Can you submit a bug for this? You can give better information than I can and add dependencies on other bugs about MBCS, BiDi, etc. (I guess this bug will be assigned to you as the component owner ;-) Can you also make it depend on bug 791?
Submitted http://bugs.winehq.com/show_bug.cgi?id=794
Marked that this bug depends on 791 (which was already resolved at the time of submitting my bug, but I figured that we better transfer our knowledge into the system).
What say you about placing a FIXME inside that case, so we tell the user we are merely guessing here?
Shachar
--- Shachar Shemesh wine-devel@sun.consumer.org.il wrote: [skipped]
What say you about placing a FIXME inside that case, so we tell the user we are merely guessing here?
In response to my comments, Alexandre said that this is how it is supposed to work: http://www.winehq.com/hypermail/wine-devel/2002/06/0242.html
Alexandre, so will this work correctly, or will it no longer affect anything after your patch from yesterday?
Andriy
Andriy Palamarchuk apa3a@yahoo.com writes:
On my comments Alexandre said that this is how it is supposed to work: http://www.winehq.com/hypermail/wine-devel/2002/06/0242.html
Alexandre, so this will work correctly or after your yesterday's patch it won't affect anything?
Well, there may still be similar bugs in other places, but the message translation itself is correct according to MSDN.
--- Alexandre Julliard julliard@winehq.com wrote: [skipped]
Well, there may still be similar bugs in other places, but the message translation itself is correct according to MSDN.
I suggest closing the bug for now and waiting for new test cases.
Andriy
Alexandre Julliard wrote:
Andriy Palamarchuk apa3a@yahoo.com writes:
On my comments Alexandre said that this is how it is supposed to work: http://www.winehq.com/hypermail/wine-devel/2002/06/0242.html
Alexandre, so this will work correctly or after your yesterday's patch it won't affect anything?
Well, there may still be similar bugs in other places, but the message translation itself is correct according to MSDN.
Personally, I could not find any references to it in MSDN. I have not touched this function before, but it appears to me to be an internal WINE function, right?
The MSDN page for WM_GETTEXTLENGTH only says that this should be the number of TCHARs returned. It says nothing about it being an upper bound. If so, any time we reach this code, we potentially return the wrong value, hence my suggestion for a "FIXME" there.
Shachar
--- Shachar Shemesh wine-devel@sun.consumer.org.il wrote: [skipped]
The WM_GETTEXTLENGTH MSDN only says that this should be the number of TCHARs returned. nothing about it being an upper bound. If so, any time we reach this code, we potentially return the wrong value, and hence my suggestion for a "FIXME" there.
It also says: "Under certain conditions, the DefWindowProc function returns a value that is larger than the actual length of the text. This occurs with certain mixtures of ANSI and Unicode, and is due to the system allowing for the possible existence of double-byte character set (DBCS) characters within the text. The return value, however, will always be at least as large as the actual length of the text; you can thus always use it to guide buffer allocation."
So I assume that a correct application *should* account for this possibility, but I do not count on all applications doing so ;-)
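A sketch of the calling pattern the MSDN text implies: treat the reported length only as an upper bound for allocation, then trust the length returned when the text is actually fetched. The fake_* functions and control_text are hypothetical stand-ins so the example is self-contained; in a real program they would be SendMessageA calls with WM_GETTEXTLENGTH and WM_GETTEXT.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical control contents, for illustration only. */
const char *control_text = "1234";

/* Stand-in for WM_GETTEXTLENGTH: may over-report, never under-report. */
size_t fake_get_text_length(void)
{
    return strlen(control_text) * 2;
}

/* Stand-in for WM_GETTEXT: copies at most size - 1 bytes and
 * returns the number of bytes copied, excluding the terminator. */
size_t fake_get_text(char *buf, size_t size)
{
    size_t n = strlen(control_text);

    if (size == 0)
        return 0;
    if (n >= size)
        n = size - 1;
    memcpy(buf, control_text, n);
    buf[n] = '\0';
    return n;
}

/* Returns a malloc'ed copy of the text; stores the real length in *len. */
char *read_control_text(size_t *len)
{
    size_t bound;
    char *buf;

    assert(len != NULL);
    bound = fake_get_text_length();
    buf = malloc(bound + 1);
    if (!buf)
        return NULL;
    *len = fake_get_text(buf, bound + 1);  /* trust this, not the bound */
    return buf;
}
```

An application that instead compares the reported length against a field limit (for example, to decide when to move the cursor to the next edit field) will misbehave when the bound is larger than the actual text.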
Andriy
Andriy Palamarchuk apa3a@yahoo.com writes:
The best approach I could think of is to send an internal message from this location which returns lengths of Unicode and ASCII strings. This message will be processed only by our controls. If lengths of the Unicode strings are the same this means that both are generated by our code for the same text and I return the A length. If the lengths are different this means length was generated not in our code and I keep existing behavior (return double original Unicode length).
Actually our code should never have that problem because we have both ASCII and Unicode winprocs for all controls. But there were a few ASCII/Unicode mismatches in the combobox, I believe it should be fixed now.
Alexandre,
--- Alexandre Julliard julliard@winehq.com wrote:
Andriy Palamarchuk apa3a@yahoo.com writes:
The best approach I could think of is to send an internal message from this location which returns lengths of Unicode and ASCII strings. This message will be processed only by our controls. If lengths of the Unicode strings are the same this means that both are generated by our code for the same text and I return the A length. If the lengths are different this means length was generated not in our code and I keep existing behavior (return double original Unicode length).
Actually our code should never have that problem because we have both ASCII and Unicode winprocs for all controls. But there were a few ASCII/Unicode mismatches in the combobox, I believe it should be fixed now.
Your patches fix the logic in the controls, but not the problem of mapping the return value of the WM_GETTEXTLENGTH message from Unicode to ASCII.
I just refreshed Wine and tested, and I still see the bug. You can check it yourself using the test case in the bug: http://bugs.winehq.com/show_bug.cgi?id=791 Comment out the code snippet mentioned above and it works.
Andriy
Andriy Palamarchuk apa3a@yahoo.com writes:
Your patches fix the controls logic but not the problem of mapping return values of WM_GETTEXTLENGTH message from Unicode to ASCII.
That's not a problem, that's the way it's supposed to work.
Just refreshed Wine and tested. Still see the bug. You can check it yourself using the test case in the bug: http://bugs.winehq.com/show_bug.cgi?id=791
Seems to work just fine here. Are you sure you updated your tree properly?
--- Alexandre Julliard julliard@winehq.com wrote:
Andriy Palamarchuk apa3a@yahoo.com writes:
Just refreshed Wine and tested. Still see the bug. You can check it yourself using the test case in the bug:
Seems to work just fine here. Are you sure you updated your tree properly?
Yes, it works for me now. I had corrected the wrong clock time on my machine after the CVS update and had not bothered to update from CVS again.
Thanks!
Andriy