Re: UCS2 vs. UTF16 question

18 Oct 2004


--- Shachar Shemesh wine-devel@shemesh.biz wrote:
...
I took the liberty of answering to the list. I hope
you don't mind.
Not at all.
...
I most certainly didn't say that. I may have mentioned UCS4, but to the
best of my knowledge at the time, Windows uses UTF-16.
Ah.  It might have been Chris Hertel that said that
then.  The samba folks may see that on the wire.
...
...
WCHARs are in fact fixed-width in Windows?
As far as I know, they are not. Sorry.
Okay.  That's fine.  I'm just trying to understand the
encodings correctly.
...
...
I'm planning to write a tool to detect the following problematic bit
of code:
char str[] = "hi", *p = str + sizeof(str) - 1;
p--;
At least, it's problematic when str contains
double-byte characters.
I'm not sure what you are aiming at achieving. Are you trying to hit
the beginning of the last character of the string? If so, then you do,
indeed, have a problem here.
Yes, that's what the code's doing.  I'm actually doing
a research project for a class.  My project partner
and I are thinking of using static analysis to detect
this sort of bug.  We can probably just use lexical
analysis to detect other bogus things, like strchr and
strrchr.  We're thinking some tools like this might
help catch some internationalization bugs.
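To make the failure concrete, here is a minimal sketch of finding
the start of the last character by scanning forward from the
beginning of the string instead of doing p--. The lead-byte ranges
are an assumption (CP932-style Shift-JIS), and is_lead_byte and
dbcs_last_char are hypothetical names made up for illustration. On
Windows, IsDBCSLeadByte and CharPrevA would be the things to look at
instead.

```c
#include <stddef.h>

/* Assumed CP932-style (Shift-JIS) lead-byte ranges; illustration only. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: return a pointer to the start of the last
 * character of the NUL-terminated DBCS string s, or s itself if the
 * string is empty. It scans forward because a trail byte cannot be
 * told apart from a single-byte character when looking backward. */
static const char *dbcs_last_char(const char *s)
{
    const char *last = s;

    while (*s) {
        last = s;
        /* A lead byte followed by a non-NUL byte is one two-byte character. */
        if (is_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
    }
    return last;
}
```

For the two-byte sequence 0x83 0x41, the naive p-- lands on 0x41,
which looks like an ASCII 'A', while dbcs_last_char returns the
address of the 0x83 lead byte.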
...
In the past I have written programs that had to do MBCS (the
non-Unicode Japanese encoding). This is an encoding in which some
characters are one byte, and some two. The best I could come up with
was to build a wrapper around std::string that had two bytes per
character internally. When you loaded a string, it would check
character by character for whether it's a double byte, and then have
each string location contain exactly one character. This allowed
random access, as well as both forward AND backwards scanning.
That seems reasonable.
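A rough sketch of that idea in plain C rather than a std::string
wrapper: expand the string once into an array with one 16-bit unit
per character, after which indexing and backward scans are trivial.
The lead-byte ranges are again an assumption (CP932-style
Shift-JIS), and is_lead_byte and dbcs_to_fixed are hypothetical
names, not anything from the original wrapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed CP932-style (Shift-JIS) lead-byte ranges; illustration only. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: expand the NUL-terminated DBCS string src into
 * out, storing exactly one character per 16-bit element. Returns the
 * number of characters written, which is at most cap. */
static size_t dbcs_to_fixed(const char *src, uint16_t *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    while (*s && n < cap) {
        if (is_lead_byte(*s) && s[1] != 0) {
            /* Two-byte character: pack lead and trail into one unit. */
            out[n++] = (uint16_t)((s[0] << 8) | s[1]);
            s += 2;
        } else {
            /* Single-byte character. */
            out[n++] = *s;
            s += 1;
        }
    }
    return n;
}
```

After the conversion, out[n - 1] is the last character and out[i] is
the i-th one, at the cost of doubling the storage for single-byte
text.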
...
Fortunately, UTF is much better than MBCS. Given a code unit in
either UTF-8 or UTF-16, it's fairly easy to figure out whether it's
part of a multi-byte sequence (UTF-8) or a surrogate pair (UTF-16),
and what part. If you have assurance that the string you are handling
is a well formed one, you can do backward scans of a UTF string
fairly easily.
Indeed.  Like you said, it's the MBCS/DBCS encodings
that are particularly bad in this respect.
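For comparison, a minimal sketch of stepping back one character in
UTF-8 and in UTF-16. Both helpers assume well-formed input and
start < p. The names utf8_prev and utf16_prev are hypothetical, made
up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: step back from p to the start of the previous
 * UTF-8 character. Continuation bytes always look like 10xxxxxx, so
 * they can be skipped without knowing anything about earlier text. */
static const unsigned char *utf8_prev(const unsigned char *start,
                                      const unsigned char *p)
{
    do {
        p--;
    } while (p > start && (*p & 0xC0) == 0x80);
    return p;
}

/* Hypothetical helper: step back one character in UTF-16. A low
 * surrogate (DC00..DFFF) preceded by a high surrogate (D800..DBFF)
 * is the second half of a pair, so step back one more unit. */
static const uint16_t *utf16_prev(const uint16_t *start, const uint16_t *p)
{
    p--;
    if (p > start && (*p & 0xFC00) == 0xDC00 && (p[-1] & 0xFC00) == 0xD800)
        p--;
    return p;
}
```

The point is that each step looks at no more than a few code units,
whereas the DBCS case needs a scan from the start of the string.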
...
Do you want a gmail account?
Got one, haven't used it much yet.
Thanks,
--Juan
