Sorry to resurrect an old subject, and please excuse my ignorance; I am only just getting started on this "character encoding" stuff.
While I was working on the DrawText functions over the past many months I started wondering about when they would fail. (I'm pedantic and such things fascinate me!) My main concern is how to walk a W string correctly. For example, while "ellipsifying" text we need to "move the pointer to the previous character", which is currently done by decrementing the pointer by 1. But from what I currently understand, that won't work if the string contains surrogate pairs.
I see that MSDN these days keeps talking about surrogate pairs and the fact that strings of WCHARs might include them, and I started to get the impression that the inputs to the W functions are truly UTF-16. I have just been looking at the discussions from April 2000 on this subject and I get the impression that we believe it too.
My understanding of UTF-16 is that a surrogate pair is quite obvious and doesn't require any context information; if you come across a 16-bit value in the range 0xDC00 to 0xDFFF then it must be the low surrogate, and there must be a high surrogate (0xD800 to 0xDBFF) just before it.
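To make that concrete, here is a minimal sketch of the classification in C. The names (wchar16, is_high_surrogate and so on) are hypothetical and only for illustration; wchar16 is a stand-in for a 16-bit WCHAR so that the fragment compiles without the Windows headers.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit WCHAR. */
typedef uint16_t wchar16;

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static inline int is_high_surrogate(wchar16 ch)
{
    return (ch & 0xFC00) == 0xD800;
}

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static inline int is_low_surrogate(wchar16 ch)
{
    return (ch & 0xFC00) == 0xDC00;
}

/* Combine a valid high/low pair into the code point it encodes
 * (0x10000..0x10FFFF). The caller is assumed to have checked both halves. */
static inline uint32_t surrogate_pair_to_code_point(wchar16 hi, wchar16 lo)
{
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

Because the two ranges don't overlap each other or any ordinary character, a single mask-and-compare on one 16-bit value is enough to tell which half of a pair (if either) you are looking at.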
So I presumed that CharNextW must be the function that correctly walks a W string, and I tried to find out what its behaviour was for the error cases (e.g. a high surrogate with no low surrogate). On finding that CharNextW didn't do anything clever (it just incremented by 1), I ran some tests and found that CharNextW always incremented by 1.
So now I am confused.
1. Does anyone know under what circumstances CharNextW isn't +1 (apart from when pointing at the terminating 0)?
2. Is e.g. XP really using UTF-16, or is it actually still UCS-2?
3. Have we thought about how we should handle walking along a W string (in a fashion that doesn't reduce the speed to a crawl)? I guess that in the short term I am expecting some sort of macro or inline.
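
As a rough sketch of the kind of inline that question 3 has in mind, something along these lines would step forwards or backwards by one character while keeping surrogate pairs together. All names here (wchar16, wstr_next, wstr_prev and the two macros) are hypothetical, and the handling of unpaired surrogates (treat them as a single character) is an assumption, not a statement of what Windows does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit WCHAR. */
typedef uint16_t wchar16;

#define IS_HIGH_SURROGATE_16(ch) (((ch) & 0xFC00) == 0xD800)
#define IS_LOW_SURROGATE_16(ch)  (((ch) & 0xFC00) == 0xDC00)

/* Advance by one character. Stays put on the terminating 0.
 * A well-formed surrogate pair counts as one character; an unpaired
 * surrogate is (by assumption) stepped over as a single unit. */
static inline const wchar16 *wstr_next(const wchar16 *p)
{
    if (!*p) return p;
    /* p[1] is safe to read: p[0] is non-zero, so p[1] is at worst the terminator. */
    if (IS_HIGH_SURROGATE_16(p[0]) && IS_LOW_SURROGATE_16(p[1]))
        return p + 2;
    return p + 1;
}

/* Step back by one character, never moving before the start of the string. */
static inline const wchar16 *wstr_prev(const wchar16 *start, const wchar16 *p)
{
    if (p <= start) return start;
    p--;
    /* If we landed on the second half of a pair, back up over the first half too. */
    if (p > start && IS_LOW_SURROGATE_16(p[0]) && IS_HIGH_SURROGATE_16(p[-1]))
        p--;
    return p;
}
```

For ordinary text the extra cost is one mask-and-compare per step, so it shouldn't reduce anything to a crawl. The ellipsifying code would call wstr_prev(start, p) where it currently decrements the pointer by 1.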
Bill Medland