Is W really UTF-16? - wine-devel

9 Jan 2002


Sorry to resurrect an old subject, and please excuse my ignorance; I am only
just getting started on this "character encoding" stuff.
While working on the DrawText functions over the past several months I
started wondering when they would fail (I'm pedantic and such things
fascinate me!).  The main concern I have is how to walk a W string
correctly.  For example, while "ellipsifying" text we need to "move the
pointer to the previous character", which is currently done by decrementing
the pointer by 1.  But from what I currently understand, that won't work if
there are surrogate pairs.
I see that MSDN these days keeps talking about surrogate pairs and the fact
that strings of WCHARs might include them, and I started to get the
impression that the inputs to the W functions were truly UTF-16.  I have
just been looking at the discussions from April 2000 on this subject and I
get the impression that we believe it too.
My understanding of UTF-16 is that a surrogate pair is quite obvious and
doesn't require any context information; if you come across a 16-bit value
in 0xDC00 to 0xDFFF then it must be the low surrogate and there must be a
high surrogate (0xD800 to 0xDBFF) just before it.
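
To make the "previous character" case concrete, here is a minimal sketch in C
of what a surrogate-aware step back might look like.  The names wchar16,
is_high_surrogate, is_low_surrogate and prev_char16 are made up for
illustration (they are not existing Wine or Windows functions), and treating
an unpaired surrogate as a single character is an assumption, not documented
behaviour.

```c
#include <stddef.h>

/* Hypothetical stand-in for WCHAR so the sketch builds without windows.h. */
typedef unsigned short wchar16;

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(wchar16 c)
{
    return c >= 0xD800 && c <= 0xDBFF;
}

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(wchar16 c)
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

/* Step back one character from p, never moving before start.
 * Two units are skipped only for a well-formed high+low pair;
 * an unpaired surrogate is (by assumption) treated as one unit. */
static const wchar16 *prev_char16(const wchar16 *start, const wchar16 *p)
{
    if (p <= start)
        return start;
    p--;
    if (p > start && is_low_surrogate(*p) && is_high_surrogate(p[-1]))
        p--;
    return p;
}
```

The start pointer is passed in so that the look-behind at p[-1] can never read
before the beginning of the buffer.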
So I presumed that CharNextW must be the function that correctly walks a W
string and I tried to find out what its behaviour was for the error cases
(e.g. high surrogate with no low surrogate).  On finding that CharNextW
doesn't do anything clever (it just increments by 1), I tested it and found
that it always incremented by 1.
So now I am confused.
1. Does anyone know under what circumstances CharNextW isn't +1 (apart from
when pointing at the terminating 0)?
2. Is e.g. XP really using UTF-16 or is it actually still UCS-2?
3. Have we thought about how we should handle walking along a W string (in a
fashion that doesn't reduce the speed to a crawl)?  I guess that in the
short term I am expecting some sort of macro or inline.
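
As a rough idea of the kind of inline meant in question 3, the following C
sketch advances by one character using only two mask-and-compare tests per
step.  The names wchar16, char_len16, next_char16 and count_chars16 are
hypothetical, and the choice to advance by a single unit over an unpaired
surrogate is an assumption rather than anything CharNextW is known to do.

```c
#include <stddef.h>

/* Hypothetical stand-in for WCHAR so the sketch builds without windows.h. */
typedef unsigned short wchar16;

/* Number of 16-bit units in the character at p:
 * 0 at the terminator, 2 for a well-formed surrogate pair, otherwise 1.
 * Reading p[1] is safe because p[0] is known to be non-zero here. */
static inline size_t char_len16(const wchar16 *p)
{
    if (!*p)
        return 0;
    if ((p[0] & 0xFC00) == 0xD800 && (p[1] & 0xFC00) == 0xDC00)
        return 2;
    return 1;
}

/* Advance to the next character; stays put on the terminating 0. */
static inline const wchar16 *next_char16(const wchar16 *p)
{
    return p + char_len16(p);
}

/* Example use: count characters rather than 16-bit units. */
static size_t count_chars16(const wchar16 *s)
{
    size_t n = 0;
    while (*s) {
        s = next_char16(s);
        n++;
    }
    return n;
}
```

For text with no surrogates the first mask test fails immediately, so the
cost over a plain increment should be one extra comparison per character.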
Bill Medland