Regression in lstrcmpiA (occurred in late June, NLS related)

Troy Rollo

1 Oct 2003

7:30 a.m.

When lstrcmpiA was moved from ole2nls.c to locale.c (around 28th June), the results of comparisons in some cases became reversed. For example, the underscore now compares as greater than alphabetic characters, whereas it used to compare as less than alphabetic characters. The older behaviour was consistent with Win2k.

The output below is from the following source:

---begin test program---
#include <windows.h>
#include <stdio.h>

char *test_strings[] =
{
    "_", "A", "a", "z", "Z", 0
};

void test_string(char *pch)
{
    char **ppch = test_strings;

    while (*ppch)
    {
        printf("%s\t%s\t%d\n", pch, *ppch, lstrcmpiA(pch, *ppch));
        ++ppch;
    }
}

int main(int argc, char **argv)
{
    char **ppch = test_strings;

    while (*ppch)
        test_string(*ppch++);
    return 0;
}
---end test program---

---Wine output from immediately before the change---
_    _    0
_    A    -1
_    a    -1
_    z    -1
_    Z    -1
A    _    1
A    A    0
A    a    0
A    z    -1
A    Z    -1
a    _    1
a    A    0
a    a    0
a    z    -1
a    Z    -1
z    _    1
z    A    1
z    a    1
z    z    0
z    Z    0
Z    _    1
Z    A    1
Z    a    1
Z    z    0
Z    Z    0
---End---

---Wine output from immediately after the change---
_    _    0
_    A    1
_    a    1
_    z    1
_    Z    1
A    _    -1
A    A    0
A    a    0
A    z    -1
A    Z    -1
a    _    -1
a    A    0
a    a    0
a    z    -1
a    Z    -1
z    _    -1
z    A    1
z    a    1
z    z    0
z    Z    0
Z    _    -1
Z    A    1
Z    a    1
Z    z    0
Z    Z    0
---End---

Troy Rollo

1 Oct

7:59 a.m.

Further investigation reveals another problem in lstrcmpiA: MSDN documents this function as performing what it calls a "word sort", under which the words "co-op" and "coop" sort to the same place. This is almost a correct description of what happens: if the strings come out the same after the word sort, it appears to do a regular comparison as well. The attached files demonstrate Wine's divergence in this regard as well as the original regression.
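For readers without the attachments, the two-pass behaviour described above can be modelled roughly as follows. This is only a sketch of the description in this message, not the Windows or Wine implementation: the function names are hypothetical, it handles ASCII only, it assumes hyphen and apostrophe are the characters the word sort ignores, and the direction of the tie-break has not been checked against Windows.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Fold src into dst: lower-case every byte and, if drop_punct is set,
   skip '-' and '\''. dst_size must exceed strlen(src). */
static void fold(const char *src, char *dst, size_t dst_size, int drop_punct)
{
    size_t n = 0;

    assert(strlen(src) < dst_size);
    for (; *src; ++src) {
        if (drop_punct && (*src == '-' || *src == '\''))
            continue;
        dst[n++] = (char)tolower((unsigned char)*src);
    }
    dst[n] = '\0';
}

/* Hypothetical model, not the Windows code: returns -1, 0 or 1.
   Inputs are limited to 255 bytes by the fixed buffers. */
int word_sort_compare(const char *a, const char *b)
{
    char fa[256], fb[256];
    int r;

    /* Pass 1: "word sort", hyphens and apostrophes carry no weight. */
    fold(a, fa, sizeof fa, 1);
    fold(b, fb, sizeof fb, 1);
    r = strcmp(fa, fb);

    /* Pass 2: on a tie, fall back to a regular case-insensitive compare. */
    if (r == 0) {
        fold(a, fa, sizeof fa, 0);
        fold(b, fb, sizeof fb, 0);
        r = strcmp(fa, fb);
    }
    return (r > 0) - (r < 0);
}
```

With this model, word_sort_compare("co-op", "cone") is positive because "co-op" is ranked as if it were "coop", whereas a plain strcmp puts "co-op" first because '-' has a lower byte value than 'n'. "co-op" and "coop" tie in the first pass and are separated only by the fallback.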

Dmitry Timoshkov

8:25 a.m.

"Troy Rollo" wine@troy.rollo.name wrote:

...

When lstrcmpiA was moved from ole2nls.c to locale.c, (around 28th June) the results of comparisons in some cases became reversed. For example, the underscore now returns as greater than alphabetic characters, whereas it used to return as less than alphabetic characters.

Yes, I'm aware of the problem. Current CX Office CVS has fixes for all the differences we have found so far. Unfortunately, a proper fix requires changing the Unicode sort weight tables, which are generated automatically from the unicode.org database, and we (Alexandre, me, other people at Codeweavers) don't yet know how to make that fit with future imports from unicode.org.

The Unicode weight tables from MS and unicode.org have a huge number of differences in many absolutely unexpected places...

...

The older behaviour was consistent with Win2k.

... and only with Latin1 locale, failing with others.

-- Dmitry.

Troy Rollo

2 Oct

1:10 a.m.

On Wed, 1 Oct 2003 18:25, Dmitry Timoshkov wrote:

...

...
The older behaviour was consistent with Win2k.

... and only with Latin1 locale, failing with others.

Yes, but this also means it worked for ASCII-7. Right now it doesn't even work for that. This creates problems for some applications, such as those that incorrectly use lstrcmpA to do binary searches on internal ordered keyword tables where the keywords can include punctuation characters or underscores. They fail to find some of their keywords, and the result is spurious errors. Since the ASCII-7 range is the same regardless of character set, this wrong use of lstrcmpA happens to work on Windows if all the keywords in such a table are limited to that range.
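The failure mode is easy to reproduce without Windows. The sketch below is an illustration only: the keyword table and both comparators are made up, and they stand in for "underscore sorts below letters" (the Win2k and older Wine behaviour reported above) and "underscore sorts above letters" (the behaviour after the change). A binary search is only valid if the comparator agrees with the order the table was built in.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*str_cmp_fn)(const char *, const char *);

/* Made-up keyword table, stored with '_' ordered before the letters. */
static const char *const keywords[] = { "_line", "_time", "begin", "end", "while" };
#define KEYWORD_COUNT (sizeof keywords / sizeof keywords[0])

/* Byte-wise compare in which '_' is given the sort key underscore_key. */
static int compare_keyed(const char *a, const char *b, int underscore_key)
{
    for (;; ++a, ++b) {
        int ka = (*a == '_') ? underscore_key : (unsigned char)*a;
        int kb = (*b == '_') ? underscore_key : (unsigned char)*b;

        if (ka != kb)
            return ka < kb ? -1 : 1;
        if (*a == '\0')
            return 0;
    }
}

/* Stand-in for the older behaviour: '_' below every letter. */
int cmp_underscore_low(const char *a, const char *b)
{
    return compare_keyed(a, b, 1);
}

/* Stand-in for the regressed behaviour: '_' above every letter. */
int cmp_underscore_high(const char *a, const char *b)
{
    return compare_keyed(a, b, 0x100);
}

/* Plain binary search; returns the index of key, or -1 if not found. */
int find_keyword(const char *const *table, size_t count,
                 const char *key, str_cmp_fn cmp)
{
    size_t lo = 0, hi = count;

    assert(table != NULL && key != NULL && cmp != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int r = cmp(key, table[mid]);

        if (r == 0)
            return (int)mid;
        if (r < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Calling find_keyword(keywords, KEYWORD_COUNT, "_line", cmp_underscore_low) finds the entry at index 0, while the same call with cmp_underscore_high returns -1 even though the keyword is in the table. Keywords without an underscore, such as "end", are still found with either comparator, which is why only some lookups fail.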

Dmitry Timoshkov

2:47 a.m.

"Troy Rollo" wine@troy.rollo.name wrote:

...

Yes, but it this also means it worked for ASCII-7. Right now it doesn't even work for that. This creates problems for some applications, such as those that incorrectly use lstrcmpA to do binary searches on internal ordered keyword tables where the keywords can include punctuation characters or underscores. It means they fail to find some of their keywords, the result being spurious error results. Since the ASCII-7 range is the same regardless of character set, this wrong use of lstrcmpA happens to work on Windows if all the keywords in such a table are limited to that range.

The source of all of this is the difference between the MS and unicode.org sort weight tables. Unfortunately, there is no easy way to make the unicode.org database look like the MS one...

-- Dmitry.

Troy Rollo

4:42 a.m.

On Thu, 2 Oct 2003 12:47, Dmitry Timoshkov wrote:

...

The source of all of this is the difference between MS and unicode.org sort weight tables. There is no an easy way to make unicode.org database look like the MS one unfortunately...

Well right now it's not using any table at all - it's just going through to strncmpiW, which is essentially a word-by-word comparison. Presumably the issue now is copyright on the MS version of the table. Do you have anything written down on the differences that you can give me so I can look for work-arounds?

Dmitry Timoshkov

9:28 a.m.

"Troy Rollo" wine@troy.rollo.name wrote:

...

Well right now it's not using any table at all - it's just going through to strncmpiW, which is essentially a word-by-word comparison. Presumably the issue now is copyright on the MS version of the table. Do you have anything written down on the differences that you can give me so I can look for work-arounds?

I'm attaching the current diff between CX Office and WineHQ CVS, edited manually to remove unrelated parts, and ignoring that some parts of dlls/kernel/tests/locale.c that are missing from the CX Office CVS got removed. The diff is provided solely to demonstrate exactly what fixes were made and for testing; it's not yet ready for inclusion into WineHQ for the reasons explained earlier.

Some areas of interest are the CompareString test suite, the changes to the unicode collation table, and the changes in the CompareString implementation.

P.S. Sorry, I compressed the diff since only a few of you might be interested in looking at the really boring details...

-- Dmitry.

Uwe Bonnes

7:42 a.m.

...

"Dmitry" == Dmitry Timoshkov dmitry@baikal.ru writes:

Dmitry> "Troy Rollo" wine@troy.rollo.name wrote:
>> Yes, but it this also means it worked for ASCII-7. Right now it
>> doesn't even work for that. This creates problems for some
>> applications, such as those that incorrectly use lstrcmpA to do
>> binary searches on internal ordered keyword tables where the keywords
>> can include punctuation characters or underscores. It means they fail
>> to find some of their keywords, the result being spurious error
>> results. Since the ASCII-7 range is the same regardless of character
>> set, this wrong use of lstrcmpA happens to work on Windows if all the
>> keywords in such a table are limited to that range.

Dmitry> The source of all of this is the difference between MS and
Dmitry> unicode.org sort weight tables. There is no an easy way to make
Dmitry> unicode.org database look like the MS one unfortunately...

Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

Bye

--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------

Dmitry Timoshkov

9:34 a.m.

"Uwe Bonnes" bon@elektron.ikp.physik.tu-darmstadt.de wrote:

...

Dmitry> The source of all of this is the difference between MS and
Dmitry> unicode.org sort weight tables. There is no an easy way to make
Dmitry> unicode.org database look like the MS one unfortunately...
Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

I really hope that we could find a solution without doing that.

-- Dmitry.

Troy Rollo

10:06 p.m.

On Thu, 2 Oct 2003 19:34, Dmitry Timoshkov wrote:

...

...
Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

I really hope that we could find a solution without doing that.

Indeed - since doing that would compromise redistribution in Australia. There is a seminal case in which a table contained in a computer program was held to have copyright separately to the computer program itself. Thus to be distributable here (at least), the table either needs to be capable of generation or computation from established objective rules (which would tend to negate copyright), or a method of reproducing the result without the table would need to be devised.

Jakob Eriksson

11:49 a.m.

Uwe Bonnes wrote:

...

Dmitry> The source of all of this is the difference between MS and Dmitry> unicode.org sort weight tables. There is no an easy way to make Dmitry> unicode.org database look like the MS one unfortunately...

Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

Wouldn't the clean-room way be to write regression tests that pass on Windows?

regards, Jakob

Dmitry Timoshkov

2:19 p.m.

"Jakob Eriksson" jakob@vmlinux.org wrote:

...

...
Dmitry> The source of all of this is the difference between MS and Dmitry> unicode.org sort weight tables. There is no an easy way to make Dmitry> unicode.org database look like the MS one unfortunately...

Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

Wouldn't the clean-room way be to write regression tests that pass on Windows?

That's the approach we have chosen so far.

-- Dmitry.

Jeff Smith

2:49 p.m.

--- Dmitry Timoshkov dmitry@baikal.ru wrote:

...

"Jakob Eriksson" jakob@vmlinux.org wrote:

...
...
Dmitry> The source of all of this is the difference between MS and Dmitry> unicode.org sort weight tables. There is no an easy way to make Dmitry> unicode.org database look like the MS one unfortunately...

Can we perhaps write a tool that dumps those tables on a running MS system as header files that wine can use? Would this be allowable?

Wouldn't the clean-room way be to write regression tests that pass on Windows?

That's the approach we have chosen so far.

-- Dmitry.

You mean something like:

=======================================================================
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* One single-character string for each of the 96 codes 0x20..0x7f. */
char test_strings[96][2];

/* qsort callback: order two strings the way lstrcmpi does. */
int xyz (const void * y, const void * z)
{
    return lstrcmpi(y, z);
}

int main(int argc, char *argv[])
{
    int i;

    for (i=0; i<96; i++)
        sprintf (test_strings[i], "%c", i+0x20);
    qsort (&test_strings[0][0], 96, 2, xyz);
    for (i=0; i<96; i++)
    {
        printf (" 0x%02x '%s'", (unsigned char)test_strings[i][0], test_strings[i]);
        /* Start a new line whenever the next string compares unequal. */
        if ((i == 95) || (lstrcmpi(test_strings[i], test_strings[i+1])))
            printf ("\n");
    }

    return 0;
}
=======================================================================
[On Windows 2000 Pro]
0x7f '⌂' 0x27 ''' 0x2d '-' 0x20 ' ' 0x21 '!' 0x22 '"' 0x23 '#' 0x24 '$' 0x25 '%' 0x26 '&' 0x28 '(' 0x29 ')' 0x2a '*' 0x2c ',' 0x2e '.' 0x2f '/' 0x3a ':' 0x3b ';' 0x3f '?' 0x40 '@' 0x5b '[' 0x5c '\' 0x5d ']' 0x5e '^' 0x5f '_' 0x60 '`' 0x7b '{' 0x7c '|' 0x7d '}' 0x7e '~' 0x2b '+' 0x3c '<' 0x3d '=' 0x3e '>' 0x30 '0' 0x31 '1' 0x32 '2' 0x33 '3' 0x34 '4' 0x35 '5' 0x36 '6' 0x37 '7' 0x38 '8' 0x39 '9' 0x61 'a' 0x41 'A' 0x62 'b' 0x42 'B' 0x43 'C' 0x63 'c' 0x44 'D' 0x64 'd' 0x45 'E' 0x65 'e' 0x66 'f' 0x46 'F' 0x47 'G' 0x67 'g' 0x48 'H' 0x68 'h' 0x69 'i' 0x49 'I' 0x4a 'J' 0x6a 'j' 0x6b 'k' 0x4b 'K' 0x6c 'l' 0x4c 'L' 0x6d 'm' 0x4d 'M' 0x6e 'n' 0x4e 'N' 0x6f 'o' 0x4f 'O' 0x50 'P' 0x70 'p' 0x51 'Q' 0x71 'q' 0x72 'r' 0x52 'R' 0x53 'S' 0x73 's' 0x74 't' 0x54 'T' 0x75 'u' 0x55 'U' 0x76 'v' 0x56 'V' 0x77 'w' 0x57 'W' 0x58 'X' 0x78 'x' 0x59 'Y' 0x79 'y' 0x5a 'Z' 0x7a 'z'
=======================================================================

-- Jeff Smith


Dmitry Timoshkov

2:57 p.m.

"Jeff Smith" whydoubt@yahoo.com wrote:

...

You mean something like:

[skipped]

Exactly. I have something like that here, the only difference is that I'm dumping full unicode range 0-0xffff, not only first 96 characters.

-- Dmitry.

Shachar Shemesh

8 p.m.

Dmitry Timoshkov wrote:

...

"Jeff Smith" whydoubt@yahoo.com wrote:

...
You mean something like:

[skipped]

Exactly. I have something like that here, the only difference is that I'm dumping full unicode range 0-0xffff, not only first 96 characters.

Isn't the full unicode range significantly larger than 0-0xffff? What about aggregates? CJK etc?

Shachar

-- Shachar Shemesh Open Source integration consultant Home page & resume - http://www.shemesh.biz/

Troy Rollo

10:20 p.m.

On Fri, 3 Oct 2003 06:00, Shachar Shemesh wrote:

...

Dmitry Timoshkov wrote:

...

...
Exactly. I have something like that here, the only difference is that I'm dumping full unicode range 0-0xffff, not only first 96 characters.

Isn't the full unicode range significantly larger than 0-0xffff? What about agregates? CJK etc?

The full unicode range (UCS4) is represented by a 32 bit number. Windows uses UTF-16 (not UCS2 as the documentation I think suggests), in which characters in the range dc00-dfff are used in two word sequences to represent the UCS4 characters 0x10000 to 0x10ffff. Thus to deal with the full range of characters Windows can theoretically represent you'd have to have a table with 0x110000-0x400 = 0x10fc00 entries.
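As a side note on the arithmetic, here is a minimal sketch of how UTF-16 maps a code point above 0xFFFF onto two 16-bit words. The function names are hypothetical. Note that the Unicode standard reserves the whole block D800-DFFF for surrogates (D800-DBFF for the first word, DC00-DFFF for the second), so by that definition the number of code points that are not surrogates works out to 0x110000 - 0x800 = 0x10F800 rather than the figure given above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of 16-bit words
   written to out (1 or 2), or 0 if cp is a surrogate or above 0x10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}

/* Code points in 0..0x10FFFF that are not surrogates. */
uint32_t encodable_code_points(void)
{
    return 0x110000u - 0x800u;
}
```

For example, U+10000 encodes as D800 DC00 and U+10FFFF as DBFF DFFF, so the 0x400 high surrogates combined with the 0x400 low surrogates cover exactly the 0x100000 code points above the 16-bit range.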

Dimitrie O. Paun

3:25 p.m.

On October 2, 2003 10:19 am, Dmitry Timoshkov wrote:

...

That's the approach we have chosen so far.

So, what's the problem with doing something like so:

For all x,y in Unicode print x,y,lstrcmpi(x,y)

(It will generate maybe close to 30GB of output, but it's OK)

Run this on Windows and Wine, compare the result, and generate a sort of patch file to apply to the unicode.org tables. For added points, we can run this on multiple versions of Windows, and only look at things that are immutable between versions...

-- Dimi.
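The compare-and-diff idea can be sketched in portable C. Everything here is a stand-in: reference_cmp and candidate_cmp are hypothetical single-character comparators playing the roles of "lstrcmpi on Windows" and "lstrcmpi on Wine", and the record format is invented. For the 16-bit range there are 65536 x 65536 = 4,294,967,296 ordered pairs; at an assumed 7 bytes or so per output line that comes to roughly 30 GB, which matches the estimate above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

typedef int (*char_cmp_fn)(unsigned a, unsigned b);

static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Stand-in for the reference result: '_' sorts below everything else. */
int reference_cmp(unsigned a, unsigned b)
{
    long ka = (a == '_') ? -1 : (long)a;
    long kb = (b == '_') ? -1 : (long)b;

    return (ka > kb) - (ka < kb);
}

/* Stand-in for the result under test: plain code order. */
int candidate_cmp(unsigned a, unsigned b)
{
    return (a > b) - (a < b);
}

/* Compare every ordered pair in [first, last] with both functions and
   count the pairs whose sign differs. If out is not NULL, each differing
   pair is written as "x y reference candidate". */
unsigned long diff_orderings(char_cmp_fn ref, char_cmp_fn cand,
                             unsigned first, unsigned last, FILE *out)
{
    unsigned long differences = 0;
    unsigned x, y;

    assert(first <= last && last < UINT_MAX);
    for (x = first; x <= last; ++x) {
        for (y = first; y <= last; ++y) {
            int r = sign(ref(x, y));
            int c = sign(cand(x, y));

            if (r != c) {
                ++differences;
                if (out)
                    fprintf(out, "%04x %04x %d %d\n", x, y, r, c);
            }
        }
    }
    return differences;
}

/* Number of ordered pairs over the 16-bit range 0..0xffff. */
uint64_t bmp_pair_count(void)
{
    return (uint64_t)0x10000 * 0x10000;
}
```

Only the differing pairs need to be kept, so the "patch" would normally be far smaller than the full dump. Over the range 0x5d to 0x61, for instance, the two stand-in comparators disagree on just four pairs, all of them involving the underscore.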

Troy Rollo

10:08 p.m.

On Thu, 2 Oct 2003 21:49, Jakob Eriksson wrote:

...

Wouldn't the clean-room way be to write regression tests that pass on Windows?

This doesn't help avoid the copyright on the table if you in fact reproduce the table.

Dimitrie O. Paun

10:21 p.m.

On Fri, 3 Oct 2003, Troy Rollo wrote:

...

This doesn't help avoid the copyright on the table if you in fact reproduce the table.

Why is that? We're talking here about lstrcmpiA() behaviour, why would a test for

For all x,y in Unicode: print x,y,lstrcmpiA(x,y)

violate the copyright?

-- Dimi.

Troy Rollo

10:30 p.m.

On Fri, 3 Oct 2003 08:21, Dimitrie O. Paun wrote:

...

Why is that? We're talking here about lstrcmpiA() behaviour, why would a test for
For all x,y in Unicode:
print x,y,lstrcmpiA(x,y)

violate the copyright?

I think the suggestion was that the regression tests be used to fabricate the table and then include the resulting fabricated table in Wine. If so, the result would still be copied, although by an indirect means.

Dimitrie O. Paun

10:47 p.m.

On Fri, 3 Oct 2003, Troy Rollo wrote:

...

On Fri, 3 Oct 2003 08:21, Dimitrie O. Paun wrote:

...
Why is that? We're talking here about lstrcmpiA() behaviour, why would a test for
For all x,y in Unicode:
print x,y,lstrcmpiA(x,y)

violate the copyright?
I think the suggestion was that the regression tests be used to fabricate the table and then include the resulting fabricated table in Wine. If so, the result would still be copied, although by an indirect means.

I don't think the result is still copied; if so, then you would never be able to run tests. But this is not what I suggested anyway. I said to run the above on Windows and on Wine (which is based on the unicode.org tables). Compare the results, and generate the differences. Use that as a 'patch' to future unicode.org table updates.

-- Dimi.

Troy Rollo

11:30 p.m.

On Fri, 3 Oct 2003 08:47, Dimitrie O. Paun wrote:

...

I said to run the above on Windows and on Wine (which is based on the unicode.org tables). Compare the results, and generate the differences. Use that as a 'patch' to future unicode.org table updates.

Yes, this is a problem for copyright. The result still counts as copied, at least in Australia, the UK and New Zealand. It's arguable in the United States that given Microsoft's position you could bring it within Feist, but if you're using a mechanism that relies on the contents of the table and will necessarily produce the same table, it counts as copying.

Incidentally, going through the differences, is the value for character code 0x34 correct in the Crossover version? All the other characters in the Basic Latin range that have differences are punctuation characters (in fact all the Basic Latin range punctuation characters have differences). 0x34, however, is the digit '4', and it would seem odd that it would differ in ways the other digits don't.

Dimitrie O. Paun

3 Oct

4:02 a.m.

On October 2, 2003 07:30 pm, Troy Rollo wrote:

...

Yes, this is a problem for copyright. The result still counts as copied, at least in Australia, the UK and New Zealand.

This doesn't make any sense. It means that we can _never_ have correct behaviour, no matter what we do, even if we magically come up with the same table. This is insane.

-- Dimi.

Troy Rollo

5:34 a.m.

On Fri, 3 Oct 2003 14:02, Dimitrie O. Paun wrote:

...

This doesn't make any sense.

Well when the High Court of Australia considered it they said it was unsatisfactory, which is their way of saying "it sucks, but that's the way it is."

...

It means that we can _never_ have correct behaviour, no matter what we do, even if we magically come up with the same table. This is insane.

In some cases it amounts to that. This is why it's important to try to come up with some way of expressing the contents of the table without the table, or of finding objective rules that can generate the table.

Having compared a few versions of the allkeys database it seems that there have been some changes to the ordering of characters between versions, which leads me to wonder if Microsoft were just using an earlier version of the table. Microsoft's documentation suggests they adhere to version 2.0 of the Unicode standard, whereas the allkeys.txt file immediately accessible on the unicode.org web site is version 3.1.1.

Here's the versions I can find:

2.1.9d8  http://www.unicode.org/reports/tr10/basekeys.txt
2.1.9d8  http://www.unicode.org/reports/tr10/compkeys.txt
3.1.1    http://www.unicode.org/reports/tr10/allkeys-3.1.1.txt
3.1.1d3  http://www.unicode.org/reports/tr10/allkeys-3.1.1d3.txt
3.0.0d5  http://www.unicode.org/reports/tr10/allkeys-4.0.0d5.txt

The 2.1.9d8 file seems after a quick look to be closer to the Crossover version of the table - for example, it has many of the different types of space characters sorted near 0020, which is an aspect of the Crossover table not present in the table based on allkeys.txt (3.1.1), so the theory that Microsoft's results are just based on an earlier version of the standard table is starting to look like it has merit.

Shachar Shemesh

8:21 a.m.

Troy Rollo wrote:

...

The 2.1.9d8 file seems after a quick look to be closer to the Crossover version of the table - for example, it has many of the different types of space characters sorted near 0020, which is an aspect of the Crossover table not present in the table based on allkeys.txt (3.1.1), so the theory that Microsoft's results are just based on an earlier version of the standard table is starting to look like it has merit.

Logically, it doesn't make sense that they did anything else. After all - why would they?

Even if it's not the case, there may be several possible workarounds for this issue. I have a lawyer I can consult about this matter, but let's rule out the Unicode 2.0 theory first. I have access to the Unicode 2.0 (printed) book, if that's any help to anyone.

Shachar

-- Shachar Shemesh Open Source integration consultant Home page & resume - http://www.shemesh.biz/

Dmitry Timoshkov

11:38 a.m.

"Troy Rollo" wine@troy.rollo.name wrote:

...

The 2.1.9d8 file seems after a quick look to be closer to the Crossover version of the table - for example, it has many of the different types of space characters sorted near 0020, which is an aspect of the Crossover table not present in the table based on allkeys.txt (3.1.1), so the theory that Microsoft's results are just based on an earlier version of the standard table is starting to look like it has merit.

I've asked a question regarding unicode support and sorting on microsoft.public.win32.programmer.international (26-28 Jun 2003) and have the following answers (UCA == Unicode Collation Algorithm):

"Michael (michka) Kaplan [MS]" michkap@online.microsoft.com wrote:

...

Collation on Windows does not use the UCA -- it predates the UCA and it supports more languages. It is architecturally prepared to handle more languages in the future, and frankly no one wanted to cut the functionality enough to make it UCA-compatible. :-)

and another one:

...

No, it is not. Unicode's weights have been a part of the UCA, which was first a DRAFT Unicode Technical Report in March of 1997. It did not lose its DRAFT status until November of 1999 and not a Unicode Technical Standard until August of 1999.

Windows, on the other hand, has had its architecture in place since NT 3.1 shipped, over a decade ago. How could it be based on the Unicode sort weight tables, which did not exist at that time even in draft form?

-- Dmitry.

Troy Rollo

8 Oct

3:03 a.m.

On Fri, 3 Oct 2003 21:38, Dmitry Timoshkov wrote:

...

I've asked a question regarding unicode support and sorting on microsoft.public.win32.programmer.international (26-28 Jun 2003) and have the following answers (UCA == Unicode Collation Algorithm):

Based on the lines of inquiry this opened up, the tables would almost certainly be within Feist in the US (and similarly probably OK to copy in Canada), but would definitely be within "industrious collection" copyright protection in Australia, New Zealand and the UK.

Of course if we can identify a unicode.org version that's much closer to the Microsoft tables so that only minor adjustments are necessary, the industrious collection copyright can be bypassed.

If that proves not to be possible, then the only choice legally is likely to be to use the closest version (or amalgam) of the unicode.org tables, but provide a facility to allow people in the US to substitute a Windows version of the *.nls files (found, for example in c:\winnt\system32 - sortkey.nls, for instance, is simply 65536 entries of four bytes in length with the expected format).

wine-devel@winehq.org
