Sorry to resurrect an old subject, and please excuse my ignorance; I am only just getting started on this "character encoding" stuff.
While I was working on the DrawText functions over the past many months I started wondering about when it would fail. (I'm pedantic and such things fascinate me!). The main concern I have is how to walk a W string correctly. For example while "ellipsifying" text we will need to "move the pointer to the previous character" which is currently done by decrementing the pointer by 1. But from what I currently understand that won't work if there are surrogate pairs.
MSDN these days keeps talking about surrogate pairs and the fact that strings of WCHARs might include them, so I started to get the impression that the inputs to the W functions are truly UTF-16. I have just been looking at the discussions from April 2000 on this subject, and I get the impression that we believe it too.
My understanding of UTF-16 is that a surrogate pair is quite obvious and doesn't require any context information; if you come across a 16 bit value in 0xDC00 to 0xDFFF then it must be the low surrogate and there must be a high surrogate just before it.
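As a rough illustration of that context-free test, here is a minimal sketch in C. The helper names are hypothetical, and uint16_t stands in for a 16-bit WCHAR so the code stays portable; the ranges and the decoding arithmetic are the ones UTF-16 defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; uint16_t stands in for a 16-bit WCHAR. */

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low pair into the code point it encodes
 * (always in the range 0x10000..0x10FFFF). */
static uint32_t decode_surrogate_pair(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

Note that the "must be a high surrogate just before it" part only holds for well-formed input; a lone low surrogate is still possible in a malformed string.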
So I presumed that CharNextW must be the function that correctly walks a W string, and I tried to find out what its behaviour was for the error cases (e.g. a high surrogate with no low surrogate). On finding that CharNextW didn't do anything clever (it just incremented by 1), I tested it and found that it always incremented by 1.
So now I am confused.
1. Does anyone know under what circumstances CharNextW isn't +1 (apart from when pointing at the terminating 0)?
2. Is e.g. XP really using UTF-16 or is it actually still UCS-2?
3. Have we thought about how we should handle walking along a W string (in a fashion that doesn't reduce the speed to a crawl)? I guess that in the short term I am expecting some sort of macro or inline.
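For question 3, the kind of inline I have in mind might look like the sketch below. This is only an illustration under stated assumptions: the names utf16_next and utf16_prev are hypothetical, uint16_t stands in for a 16-bit WCHAR, and an unpaired surrogate is simply treated as a single unit. The backward step takes the start of the string, in the same spirit as CharPrevW, so that it never reads before the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; uint16_t stands in for a 16-bit WCHAR. */
static inline int is_high_surrogate(uint16_t u) { return (u & 0xFC00) == 0xD800; }
static inline int is_low_surrogate(uint16_t u)  { return (u & 0xFC00) == 0xDC00; }

/* Advance by one code point: two units for a valid surrogate pair,
 * otherwise one. Stays put on the terminating 0. */
static inline const uint16_t *utf16_next(const uint16_t *p)
{
    if (*p == 0)
        return p;
    /* p[1] is safe to read: p[0] is non-zero, so at worst p[1] is the terminator. */
    if (is_high_surrogate(p[0]) && is_low_surrogate(p[1]))
        return p + 2;
    return p + 1;
}

/* Step back by one code point without going before 'start'. */
static inline const uint16_t *utf16_prev(const uint16_t *start, const uint16_t *p)
{
    assert(p >= start);
    if (p == start)
        return p;
    p--;
    /* If we landed on the second half of a valid pair, step over the first half too. */
    if (p > start && is_low_surrogate(p[0]) && is_high_surrogate(p[-1]))
        p--;
    return p;
}
```

The common case costs one or two mask-and-compare operations per step, so it should not reduce anything to a crawl.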
Bill Medland
On Wed, 9 Jan 2002, Bill Medland wrote:
While I was working on the DrawText functions over the past many months I started wondering about when it would fail. (I'm pedantic and such things fascinate me!). The main concern I have is how to walk a W string correctly. For example while "ellipsifying" text we will need to "move the pointer to the previous character" which is currently done by decrementing the pointer by 1. But from what I currently understand that won't work if there are surrogate pairs.
If you're concerned about that, surrogate pairs are the least of your worries. You should also be concerned about Unicode combining (or composite) characters. I think they might be identified with ctypes C3_NONSPACING and C3_DIACRITIC and that kind of stuff...
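To make the combining-character concern concrete, here is a deliberately simplified sketch of stepping back over a base character plus its combining marks. It assumes, purely for illustration, that only the Combining Diacritical Marks block (U+0300 to U+036F) counts as combining; on Windows one would more likely ask GetStringTypeW with CT_CTYPE3 and look at flags such as C3_NONSPACING and C3_DIACRITIC. The names are hypothetical and uint16_t stands in for a 16-bit WCHAR.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: only U+0300..U+036F is treated as combining. Real code
 * would need a proper character-type lookup covering all such marks. */
static int is_combining_mark(uint16_t u) { return u >= 0x0300 && u <= 0x036F; }

/* Step back to the start of the previous base character, skipping any
 * combining marks attached to it. Never goes before 'start'. */
static const uint16_t *prev_cluster(const uint16_t *start, const uint16_t *p)
{
    assert(p >= start);
    if (p == start)
        return p;
    p--;
    while (p > start && is_combining_mark(*p))
        p--;
    return p;
}
```

With this, backing up from just after "e" + U+0301 lands on the "e" rather than on the accent, which is what ellipsifying would want.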
1. Does anyone know under what circumstances CharNextW isn't +1 (apart from when pointing at the terminating 0)
Have you tried a high surrogate followed by a low surrogate, on a Microsoft OS recent enough that Microsoft *might* have thought about preparing it for dealing with surrogates?
2. Is e.g. XP really using UTF-16 or is it actually still UCS-2?
I don't know. But it probably ought to be UTF-16.
3. Have we thought about how we should handle walking along a W string (in a fashion that doesn't reduce the speed to a crawl). I guess that in the short term I am expecting some sort of macro or inline.
With p++, perhaps? There aren't very many circumstances where that is going to be a problem (where Unicode composite characters are not also), are there?
Thanks for the response
"Ove Kaaven" ovehk@ping.uio.no wrote in message news:Pine.LNX.4.21.0201092154570.4461-100000@mizar.ping.uio.no...
On Wed, 9 Jan 2002, Bill Medland wrote:
While I was working on the DrawText functions over the past many months I started wondering about when it would fail. (I'm pedantic and such things fascinate me!). The main concern I have is how to walk a W string correctly. For example while "ellipsifying" text we will need to "move the pointer to the previous character" which is currently done by decrementing the pointer by 1. But from what I currently understand that won't work if there are surrogate pairs.
If you're concerned about that, surrogate pairs are the least of your worries. You should also be concerned about Unicode combining (or composite) characters. I think they might be identified with ctypes C3_NONSPACING and C3_DIACRITIC and that kind of stuff...
Good point.
1. Does anyone know under what circumstances CharNextW isn't +1 (apart from when pointing at the terminating 0)
Have you tried a high surrogate followed by a low surrogate, on a Microsoft OS recent enough that Microsoft *might* have thought about preparing it for dealing with surrogates?
Well, that's the complication. I am lazy so I don't fancy the work involved in learning enough to put together a font that actually uses a surrogate pair so that I can test it with ExtTextOut, which is why I took the easy route of assuming that was what CharNextW was for. I guess I'll have to do the hard work since that family of functions are the ones I have seen that seem to suggest they are UTF-16 compatible.
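Building the test string itself is the easy part, independent of the font question. A minimal sketch, assuming a hypothetical helper name and uint16_t standing in for a 16-bit WCHAR, that splits a code point above U+FFFF into its surrogate pair:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode a supplementary-plane code point
 * (0x10000..0x10FFFF) as a high/low surrogate pair.
 * out[0] receives the high surrogate, out[1] the low surrogate. */
static void encode_surrogate_pair(uint32_t cp, uint16_t out[2])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* bottom 10 bits */
}
```

The resulting two units can be dropped into a W string to see how CharNextW or the text-output functions treat them.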
2. Is e.g. XP really using UTF-16 or is it actually still UCS-2?
I don't know. But it probably ought to be UTF-16.
3. Have we thought about how we should handle walking along a W string (in a fashion that doesn't reduce the speed to a crawl). I guess that in the short term I am expecting some sort of macro or inline.
With p++, perhaps? There aren't very many circumstances where that is going to be a problem (where Unicode composite characters are not also), are there?
No, so they should both probably be handled together.
Ah well, when I am next in there I'll probably just add a FIXME UTF-16 comment or something.
Thanks again