Hello,
In trying to get shell32 a little bit more Unicodified I came across this function ParseFieldA which is taken from shellord.c. I'm quite unfamiliar with Unicode so I still have to learn a lot.
I have finally found most of the string manipulation functions which work for Unicode but when it comes down to simple character comparison I'm a little bit in the dark here.
Some code snippets elsewhere in Wine make me believe that for the English charset WCHAR == char is actually mostly true. However, I wonder whether this can be relied on in code. For instance, the Unicode version of ParseField would in that case look like the function below, but I would really like someone else's opinion on whether the code
if (*src++ == ',') nField--;
actually works as expected on all systems, independent of the charsets used for the local languages.
And has anyone a good idea what the semi-stub would mean for this function? Maybe that it should ignore commas in quoted strings?
/*************************************************************************
 * ParseFieldW                  [internal]
 *
 * copies a field from a ',' delimited string
 *
 * first field is nField = 1
 */
DWORD WINAPI ParseFieldW(LPCWSTR src, DWORD nField, LPWSTR dst, DWORD len)
{
    WARN("(%s,0x%08lx,%p,%ld) semi-stub.\n", debugstr_w(src), nField, dst, len);

    if (!src || !src[0] || !dst || !len)
        return 0;

    /* skip n fields delimited by ',' */
    while (nField > 1)
    {
        if (*src == 0x0) return FALSE;
        if (*src++ == ',') nField--;
    }

    /* copy part till the next ',' to dst */
    while ( *src != 0x0 && *src != ',' && (len--)>0 )
        *(dst++) = *(src++);

    /* finalize the string */
    *dst = 0x0;

    return TRUE;
}
Rolf Kalbermatter
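For anyone who wants to try the function outside the Wine tree, here is a minimal sketch under stated assumptions: wchar_t stands in for WCHAR, size_t for DWORD, and the name parse_field_w is made up for illustration. It follows the posted logic with one deliberate difference: the posted copy loop can store len characters before writing the terminator, so the sketch stops at len - 1 to keep the terminator inside the buffer. It does not attempt quoted-string handling.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for ParseFieldW. wchar_t replaces WCHAR and
 * size_t replaces DWORD so this builds without the Wine headers. */
int parse_field_w(const wchar_t *src, unsigned nField, wchar_t *dst, size_t len)
{
    if (!src || !src[0] || !dst || !len)
        return 0;

    /* skip nField - 1 fields delimited by ',' */
    while (nField > 1) {
        if (*src == L'\0')
            return 0;            /* ran out of fields */
        if (*src++ == L',')
            nField--;
    }

    /* copy up to the next ',' and keep one slot free for the terminator */
    while (*src != L'\0' && *src != L',' && len > 1) {
        *dst++ = *src++;
        len--;
    }

    /* finalize the string */
    *dst = L'\0';
    return 1;
}
```

Calling parse_field_w(L"one,two,three", 2, buf, 8) with an 8-element buffer leaves "two" in buf.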
Hi Rolf, your code should work fine. The beauty of Unicode characters is that you are guaranteed that their size is always the same (well, this would be strictly correct with 32-bit Unicode characters, aka UCS-4; but everybody settles for UCS-2, which uses 16-bit characters and works for almost all languages in the world). So, once you have a WCHAR (= unsigned short, i.e. 16 bits), you can use the ++ and -- operators and know you are moving to the next or previous character; on the other hand, using ++ and -- on a char* buffer doesn't guarantee that you land on the next character, as you may be looking at the second byte of a multi-byte character. This is the reason why string manipulation routines should be done in Unicode...
Alberto
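The multi-byte hazard can be sketched as follows. The example data is an assumption chosen for illustration and is not from this thread: the kanji U+8868 is encoded in Shift-JIS as the two bytes 0x95 0x5C, and the second byte has the same value as an ASCII backslash. A byte-wise scan therefore reports a backslash that is not there, while a scan over 16-bit units does not. The helper names are hypothetical, and uint16_t stands in for WCHAR.

```c
#include <stddef.h>
#include <stdint.h>

/* Example data (assumed for illustration): the same single character,
 * U+8868, once as Shift-JIS bytes and once as one 16-bit unit. */
const unsigned char sjis_text[]  = { 0x95, 0x5C, 0x00 };
const uint16_t      utf16_text[] = { 0x8868, 0x0000 };

/* Count bytes equal to needle in a NUL-terminated byte string. */
size_t count_byte(const unsigned char *s, unsigned char needle)
{
    size_t n = 0;
    for (; *s != 0; s++)
        if (*s == needle)
            n++;
    return n;
}

/* Count 16-bit units equal to needle in a 0-terminated unit string. */
size_t count_unit(const uint16_t *s, uint16_t needle)
{
    size_t n = 0;
    for (; *s != 0; s++)
        if (*s == needle)
            n++;
    return n;
}
```

Here count_byte(sjis_text, 0x5C) finds one false match, because it lands on the trail byte of the two-byte character, whereas count_unit(utf16_text, 0x5C) finds none.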
On Fri, 6 Dec 2002, Rolf Kalbermatter wrote:
> I have finally found most of the string manipulation functions which work for Unicode but when it comes down to simple character comparison I'm a little bit in the dark here.
> Some code snippets elsewhere in Wine make me believe that for the English charset WCHAR == char is actually mostly true. However, I wonder whether this can be relied on in code. For instance, the Unicode version of ParseField would in that case look like the function below, but I would really like someone else's opinion on whether the code
> if (*src++ == ',') nField--;
> actually works as expected on all systems, independent of the charsets used for the local languages.
It should. All the current code in Wine assumes that ASCII is the common denominator on any system it's built on (a reasonable assumption, since pretty much all charsets used today incorporate ASCII); porting Wine to a non-ASCII system is going to be next to impossible. Still, the code above should work in any case as long as the *compiler* respects ASCII (since ASCII is a subset of Unicode). It doesn't matter what charset the actual *user* uses.
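As a small sketch of why the comparison depends only on the compiler: in C the literal ',' is an int whose value comes from the compiler's execution character set, and the 16-bit unit is promoted to int before the comparison, so the two are compared by value. The function name count_fields and the use of uint16_t in place of WCHAR are assumptions for illustration, and the _Static_assert needs a C11 compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Refuse to build if the compiler does not map ',' to its ASCII value,
 * which is also the Unicode code point U+002C. */
_Static_assert(',' == 0x2C, "execution character set is not ASCII-compatible");

/* Count the ','-delimited fields in a 0-terminated string of 16-bit units. */
size_t count_fields(const uint16_t *s)
{
    size_t n = 1;
    for (; *s != 0; s++)
        if (*s == ',')   /* both sides are ints here; compared by value */
            n++;
    return n;
}
```

For the unit sequence 'a', ',', 0x05D0, ',', 'b' this returns 3, whatever locale or code page the program later runs under.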
Rolf Kalbermatter wrote:
> Hello,
> In trying to get shell32 a little bit more Unicodified I came across this function ParseFieldA which is taken from shellord.c. I'm quite unfamiliar with Unicode so I still have to learn a lot.
> I have finally found most of the string manipulation functions which work for Unicode but when it comes down to simple character comparison I'm a little bit in the dark here.
> Some code snippets elsewhere in Wine make me believe that for the English charset WCHAR == char is actually mostly true. However, I wonder whether this can be relied on in code. For instance, the Unicode version of ParseField would in that case look like the function below, but I would really like someone else's opinion on whether the code
> if (*src++ == ',') nField--;
> actually works as expected on all systems, independent of the charsets used for the local languages.
It's ok to compare a WCHAR with a known char ('A'), but not two WCHARs together.
Explanation - We (as well as Windows) use UTF-16 (UCS-2?) to represent characters. Most common Unicode characters in Europe, Africa, America, Australia and the Middle East fit nicely into this 16-bit range, and there are no problems. Some characters, including some used in Eastern Asia, however, don't.
The characters that don't fit in are represented using Surrogates - i.e. - each character takes two WCHARS to represent. The Unicode standard has been very wise in selecting the surrogates, however. Both first and second WCHARs of any given surrogate are taken from a range that is not allocated for any other character of Unicode. This means that if you are looking for a Hebrew "Aleph", scanning with a piece of code that looks something like: while (*str++ != 0x5d0) is guaranteed not to match anything except "Aleph". This means that if it's a specific character you are looking for, and you know it's not a surrogate, your code will work.
However! If you are trying to look for an occurrence of one character inside a string, and neither string nor character are known to you at the time of writing the code, this technique may fail miserably. The reason is that if the character you are looking for is a surrogate, both first and second WCHARs may appear, separately, in other chars (all surrogates themselves, but still).
Bear that in mind, and everything will be ok.
Shachar
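The surrogate case can be sketched like this. The characters are assumptions picked for illustration: U+1D11E is encoded as the pair 0xD834 0xDD1E and U+1D11F as 0xD834 0xDD1F, so both share the same first WCHAR. The names find_unit and find_code_point are hypothetical, and uint16_t stands in for WCHAR.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first 16-bit unit equal to unit, or -1 if absent. */
long find_unit(const uint16_t *s, uint16_t unit)
{
    for (long i = 0; s[i] != 0; i++)
        if (s[i] == unit)
            return i;
    return -1;
}

/* Index of the first occurrence of code point cp, or -1 if absent.
 * A high surrogate followed by a low surrogate is decoded as one character. */
long find_code_point(const uint16_t *s, uint32_t cp)
{
    long i = 0;
    while (s[i] != 0) {
        uint32_t c = s[i];
        long start = i++;
        if (c >= 0xD800 && c <= 0xDBFF && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (s[i] - 0xDC00);
            i++;
        }
        if (c == cp)
            return start;
    }
    return -1;
}
```

In the unit sequence 0x05D0, 0xD834, 0xDD1F, 0xD834, 0xDD1E, the Aleph is found safely at index 0 by either function. Searching for the single unit 0xD834 stops at index 1, inside U+1D11F, while find_code_point with 0x1D11E reports index 3, where that character really starts.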