Re: Automatic ANSI<>Unicode message translation

5 Aug 2005


      On Wed, 27 Jul 2005 20:34, Alexandre Julliard wrote:
> ...
> Since there is no way of knowing if the target window uses the same
> code page, or even if its code page won't change between the time the
> message is stored in the queue and when it is retrieved, the only sane
> approach is to store messages in the queue in Unicode. Only
> SendMessage calls that bypass the queue avoid the translation. I'm
> pretty sure that this is what Windows does too, if you have a test
> demonstrating the opposite I'd be very interested to see it.

I have just finished running a series of tests using the attached programs - 
msgchar, msgchar2 and msgchar3.
* The short version:
WM_CHAR messages are delivered "immediately" whether sent by SendMessage or 
PostMessage. Where there is a conversion from A->W or W->A, the conversion is 
performed using a modified CP1252 table regardless of the values of 
CP_THREAD_ACP and CP_ACP. Effectively, when SendMessageA is used to send a 
message to an ANSI window created by a thread with a different code page, no 
translation is performed. This was tested on a Win2k system with a default 
ACP of CP1252 (Western Europe) and a WinXP system with a default ACP of 
CP950 (Chinese Traditional).
The table used for the conversion differs from the real CP1252 table in that 
characters 0x81, 0x8D, 0x8F, 0x90 and 0x9D, which are unassigned in CP1252, 
are converted to and from the Unicode characters with the same value (+). 
This results in a round-trippable conversion via Unicode. For WM_CHAR, 
PostMessageA to an ANSI window will therefore always work provided the data 
is in the code page expected by the recipient. SendMessageA and PostMessageA 
to a Unicode window, and SendMessageW and PostMessageW to an ANSI window, are 
only guaranteed to get the correct result if CP1252 is used or the messages 
are limited to characters in the range 0x00-0x7F (assuming nothing exotic has 
been done, like setting the ACP to an EBCDIC code page).
(+) - The CP1252 table in libs/unicode/c_1252.c does the same thing, but it 
seems Microsoft's CP1252 table also does this despite the fact that every 
published document on the code page says those characters are undefined.
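
As an illustration only (this is not taken from the attached test programs), 
the modified table described above can be modelled in Python. The sketch 
assumes that Python's built-in "cp1252" codec follows the published table and 
rejects the five unassigned bytes; the names build_modified_cp1252, 
MODIFIED_CP1252 and TO_BYTE are made up for this example.

```python
def build_modified_cp1252():
    """Return a 256-entry list mapping each byte value to a Unicode character."""
    table = []
    for b in range(256):
        try:
            # Published CP1252 mapping, as implemented by Python's codec.
            table.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Unassigned in published CP1252: map to the code point with the
            # same value, which is the behaviour the post reports.
            table.append(chr(b))
    return table


MODIFIED_CP1252 = build_modified_cp1252()

# Reverse mapping (Unicode character -> byte) for the W->A direction.
TO_BYTE = {ch: b for b, ch in enumerate(MODIFIED_CP1252)}

# Bytes in 0x80-0x9F that ended up mapped to their own value.
UNASSIGNED = [b for b in range(0x80, 0xA0) if MODIFIED_CP1252[b] == chr(b)]
print([hex(b) for b in UNASSIGNED])

# Every byte survives a trip through Unicode and back.
ROUND_TRIPS = all(TO_BYTE[MODIFIED_CP1252[b]] == b for b in range(256))
print(ROUND_TRIPS)
```

Because the mapping is one-to-one over all 256 byte values, any byte sequence 
survives an A->W->A trip unchanged, whatever code page it was really in.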
* The long version:
The first two programs create windows after setting the thread locale to be 
Chinese Traditional, which results in a CP_THREAD_ACP of 950. They then 
create additional threads with a locale of Japanese, which gets a 
CP_THREAD_ACP of 932. They create windows using both the W and A versions of 
the RegisterClass and CreateWindow API calls, and then test sending messages 
using SendMessageA and SendMessageW for Unicode character 0x6893 (CP932 0x88 
0xB2 and CP950 0xB1 0xEA). The difference between msgchar and msgchar2 is 
that the first uses GetMessageA/DispatchMessageA and the second uses 
GetMessageW/DispatchMessageW.
The third program tests more characters and the 950->932 direction and 
950->950 transmissions, and was used to verify that a modified CP1252 is what 
is being used.
Note that 0x88, which is a lead byte in CP932, is one of the characters that 
maps outside the Latin1 page in Unicode (CP1252 0x88 is Unicode 0x02C6), 
which makes double-byte characters beginning with that code ideal for these 
tests.
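
To show why such lead bytes are convenient, the following Python sketch lists 
the CP932 lead bytes whose modified-CP1252 mapping falls outside Latin-1, so 
that a CP1252 conversion is easy to spot in the output. It assumes the 
commonly documented CP932 lead-byte ranges (0x81-0x9F and 0xE0-0xFC) and 
Python's "cp1252" codec; the helper name cp1252_modified is hypothetical.

```python
# Assumption: CP932 lead bytes are 0x81-0x9F and 0xE0-0xFC.
CP932_LEAD_BYTES = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))


def cp1252_modified(b):
    """Map one byte to a code point using the modified CP1252 table."""
    try:
        return ord(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        # Unassigned in published CP1252: same-valued code point.
        return b


# Lead bytes that a CP1252 conversion visibly changes (result above U+00FF).
outside_latin1 = [b for b in CP932_LEAD_BYTES if cp1252_modified(b) > 0xFF]

print(hex(cp1252_modified(0x88)))  # 0x2c6
print(0x88 in outside_latin1)      # True
```

Lead bytes in 0xE0-0xFC map to the same value in Latin-1, so a CP1252 
conversion of those bytes would look identical to no conversion at all.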
The results were surprising. No matter what I did, when SendMessageW was used 
to send WM_CHAR to a window registered with RegisterClassA, the conversion 
was performed using a modified CP1252 - even if the system code page and 
thread code page for the receiving thread were CP950. When sending WM_CHAR 
using SendMessageA to a window registered with RegisterClassW, the conversion 
was also performed using CP1252 - even if the code page and thread code page 
for the receiving thread were CP950 (and for the sending thread were CP932).
When using SendMessageA to send WM_CHAR to a window registered with 
RegisterClassA, no conversion is performed even if the threads have different 
values for CP_THREAD_ACP.
In other words, where a conversion is performed it is always based on the 
modified CP1252, which has the effect that no visible conversion is ever 
performed for A->A messages.
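
The observed SendMessage behaviour can be summarised as a small Python model. 
This is a sketch of the results reported here, not of any Windows or Wine 
code: deliver_wm_char and the two tables are hypothetical names, each byte of 
a double-byte character is assumed to travel as its own WM_CHAR, and the 
value an ANSI window receives for a character with no modified-CP1252 
equivalent is left as None because it is not stated above.

```python
def build_tables():
    """Build byte->code point and code point->byte maps for modified CP1252."""
    byte_to_cp = {}
    for b in range(256):
        try:
            byte_to_cp[b] = ord(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            byte_to_cp[b] = b  # unassigned byte: same-valued code point
    cp_to_byte = {cp: b for b, cp in byte_to_cp.items()}
    return byte_to_cp, cp_to_byte


BYTE_TO_CODEPOINT, CODEPOINT_TO_BYTE = build_tables()


def deliver_wm_char(value, sender, receiver):
    """Model the wParam seen by the receiver; sender/receiver are 'A' or 'W'."""
    if sender == receiver:
        return value                        # A->A and W->W: no conversion
    if sender == "A":
        return BYTE_TO_CODEPOINT[value]     # A->W: modified CP1252, per byte
    return CODEPOINT_TO_BYTE.get(value)     # W->A: None if not representable


# U+6893 as the CP932 bytes 0x88 0xB2, one WM_CHAR per byte.
dbcs = "\u6893".encode("cp932")
a_to_a = [deliver_wm_char(b, "A", "A") for b in dbcs]
a_to_w = [deliver_wm_char(b, "A", "W") for b in dbcs]
print([hex(v) for v in a_to_a])  # ['0x88', '0xb2']
print([hex(v) for v in a_to_w])  # ['0x2c6', '0xb2']
```

In this model the Unicode window sees U+02C6 followed by U+00B2 rather than 
U+6893, while the ANSI window sees the original bytes untouched.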
In the first 3 sets of results (the ones listed as "results.*"), W->W 
PostMessages lose information because GetMessageA and DispatchMessageA are 
used. The next 2 sets of results (listed as "results2.*") do not show this 
loss, suggesting that messages are stored in the queue in "Unicode" based on 
the modified CP1252 conversion.
A 5 second delay was used between all calls to SendMessage and PostMessage, 
and lead bytes were never held back to wait for the trail bytes.
Obviously Windows is fundamentally broken in the way it handles this. The 
general rule for applications has to be that if IsWindowUnicode is true, use 
SendMessageW with the Unicode character, and if false, use SendMessageA with 
the ANSI character, preferably knowing the code page expected by the 
recipient. Applications should avoid GetMessageA, TranslateMessageA and 
DispatchMessageA and use the W ones exclusively (since they might be 
processing a message for a Unicode window - perhaps the rich edit control?).
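
The sending rule can be sketched in Python as a pure function that decides 
which call to make and with what data. It does not call the Win32 API: 
plan_wm_char is a hypothetical helper, window_is_unicode stands for the 
result of IsWindowUnicode, and recipient_codepage is a Python codec name that 
the caller is assumed to know for the target window.

```python
def plan_wm_char(ch, window_is_unicode, recipient_codepage):
    """Return (api_name, values) describing how to send one character."""
    if window_is_unicode:
        # Unicode window: send UTF-16 code units with SendMessageW.
        raw = ch.encode("utf-16-le")
        units = [int.from_bytes(raw[i:i + 2], "little")
                 for i in range(0, len(raw), 2)]
        return ("SendMessageW", units)
    # ANSI window: send the bytes in the code page the recipient expects.
    return ("SendMessageA", list(ch.encode(recipient_codepage)))


api, values = plan_wm_char("\u6893", True, "cp932")
print(api, [hex(v) for v in values])   # SendMessageW ['0x6893']

api, values = plan_wm_char("\u6893", False, "cp932")
print(api, [hex(v) for v in values])   # SendMessageA ['0x88', '0xb2']
```

Keeping the decision separate from the actual call makes it easy to check 
that a sender never relies on the A<->W conversion for anything outside the 
0x00-0x7F range.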
I also ran the program under Wine to test its behaviour, which does not match 
the behaviour of Windows at all.
The source to the test programs is attached, together with the output of the 
tests.
