Hello Alex,
I'll have a go at reviewing your patches as Alexandre didn't take them as is.
On 10/13/2014 05:54 AM, Alex Henrie wrote:
--- a/dlls/kernel32/locale.c +++ b/dlls/kernel32/locale.c
I'm pulling this diff chunk to the front as it is wrong and some of your code is over-complicated based on that assumption.
@@ -1963,7 +2119,7 @@ BOOL WINAPI EnumSystemCodePagesW( CODEPAGE_ENUMPROCW lpfnCodePageEnum, DWORD fla
- flags [I] Character mapping flags
- src [I] Source string buffer
- srclen [I] Length of src (in bytes), or -1 if src is NUL terminated
- dst [O] Destination buffer
- dst [O] Destination buffer, or NULL to compute the required length
- dstlen [I] Length of dst (in WCHARs), or 0 to compute the required length
The only case where dst == NULL is when dstlen == 0 but that already implies computing the required length. dst == NULL and dstlen != 0 is an error and MultiByteToWideChar() bails out.
@@ -1954,6 +1954,162 @@ BOOL WINAPI EnumSystemCodePagesW( CODEPAGE_ENUMPROCW lpfnCodePageEnum, DWORD fla
/***********************************************************************
write_to_w_string
- Helper for utf7_mbstowcs
- RETURNS
- TRUE on success, FALSE on error
- */
+static inline BOOL write_to_w_string(WCHAR *dst, int dstlen, int *index, WCHAR character)
This isn't a good name for the helper as that's not what it does. It is more a "count and optionally write". Also prefixing it utf7 would make it clear that is just a helper for utf7_mbstowcs and not a generic function and thus would remove the need for the comment that it is just a helper for utf7_mbstowcs.
I know you're using this helper to avoid code duplication between the length calculation and the conversion unlike wine_utf8_mbstowcs(). But it complicates the handling of available space in dst. Without seeing the code for both approaches I'm not sure what is best so I'll review it as is.
+{
- if (dst)
You have to force dst = NULL in utf7_mbstowcs() for this to work. Just checking for dstlen > 0 should avoid that.
- {
if (*index >= dstlen)
return FALSE;
dst[*index] = character;
- }
- (*index)++;
- return TRUE;
+}
+/***********************************************************************
utf7_mbstowcs
- UTF-7 to UTF-16 string conversion, helper for MultiByteToWideChar
- RETURNS
- On success, the number of characters written
- On dst buffer overflow, -1
What about the case where it fails to decode a character? E.g. wine_utf8_mbstowcs() has -2 for that case which then translates to SetLastError(ERROR_NO_UNICODE_TRANSLATION). I looked at the tests but you didn't seem to test the GetLastError in the cases with the invalid input characters.
- */
+static int utf7_mbstowcs(const char *src, int srclen, WCHAR *dst, int dstlen) +{
- static const WCHAR base64_decoding_table[] = {
Why is this table an array of WCHAR? The entries aren't WCHARs.
/* \0 */ -1, /* \x01 */ -1, /* \x02 */ -1, /* \x03 */ -1,
This table is really hard to read even in a editor that does syntax highlighting. As you just need 4 characters per column you could have sixteen columns and merge the comments to one comment at the end of the line. Something like this would be more readable as well as being more compact: -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* 0x00 - 0x15 */ .... /* '0' - '?' */
/* \x04 */ -1, /* \x05 */ -1, /* \x06 */ -1, /* \a */ -1,
/* \b */ -1, /* \t */ -1, /* \n */ -1, /* \v */ -1,
/* \f */ -1, /* \r */ -1, /* \x0E */ -1, /* \x0F */ -1,
/* \x10 */ -1, /* \x11 */ -1, /* \x12 */ -1, /* \x13 */ -1,
/* \x14 */ -1, /* \x15 */ -1, /* \x16 */ -1, /* \x17 */ -1,
/* \x18 */ -1, /* \x19 */ -1, /* \x1A */ -1, /* \e */ -1,
/* \x1C */ -1, /* \x1D */ -1, /* \x1E */ -1, /* \x1F */ -1,
/* */ -1, /* ! */ -1, /* " */ -1, /* # */ -1,
/* $ */ -1, /* % */ -1, /* & */ -1, /* ' */ -1,
/* ( */ -1, /* ) */ -1, /* * */ -1, /* + */ 62,
/* , */ -1, /* - */ -1, /* . */ -1, /* / */ 63,
/* 0 */ 52, /* 1 */ 53, /* 2 */ 54, /* 3 */ 55,
/* 4 */ 56, /* 5 */ 57, /* 6 */ 58, /* 7 */ 59,
/* 8 */ 60, /* 9 */ 61, /* : */ -1, /* ; */ -1,
/* < */ -1, /* = */ -1, /* > */ -1, /* ? */ -1,
/* @ */ -1, /* A */ 0, /* B */ 1, /* C */ 2,
/* D */ 3, /* E */ 4, /* F */ 5, /* G */ 6,
/* H */ 7, /* I */ 8, /* J */ 9, /* K */ 10,
/* L */ 11, /* M */ 12, /* N */ 13, /* O */ 14,
/* P */ 15, /* Q */ 16, /* R */ 17, /* S */ 18,
/* T */ 19, /* U */ 20, /* V */ 21, /* W */ 22,
/* X */ 23, /* Y */ 24, /* Z */ 25, /* [ */ -1,
/* \ */ -1, /* ] */ -1, /* ^ */ -1, /* _ */ -1,
/* ` */ -1, /* a */ 26, /* b */ 27, /* c */ 28,
/* d */ 29, /* e */ 30, /* f */ 31, /* g */ 32,
/* h */ 33, /* i */ 34, /* j */ 35, /* k */ 36,
/* l */ 37, /* m */ 38, /* n */ 39, /* o */ 40,
/* p */ 41, /* q */ 42, /* r */ 43, /* s */ 44,
/* t */ 45, /* u */ 46, /* v */ 47, /* w */ 48,
/* x */ 49, /* y */ 50, /* z */ 51, /* { */ -1,
/* | */ -1, /* } */ -1, /* ~ */ -1, /* \x7F */ -1
- };
- const char *source_end = src + srclen;
- int dest_index = 0;
- DWORD byte_pair = 0;
- short offset = 0;
- /* MultiByteToWideChar guarantees that srclen > 0 */
- assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
- if (!dstlen)
dst = NULL;
Unneeded if the helper just checks for dstlen > 0.
- do
Should be just "while (src < source_end)" to avoid the assert from above.
- {
if (*src == '+')
{
src++; /* skip the + sign */
if (src >= source_end)
break;
if (*src == '-')
{
/* just a plus sign escaped as +- */
if (!write_to_w_string(dst, dstlen, &dest_index, '+'))
return -1;
src++;
continue;
}
do
{
WCHAR sextet = *src;
Same as for base64_decoding_table, sextet is not a WCHAR.
if (sextet == '-')
{
/* skip over the dash and end base64 decoding */
/* the current, unfinished byte pair is discarded */
src++;
offset = 0;
break;
}
else if (sextet <= 127)
{
sextet = base64_decoding_table[sextet];
if (sextet == (WCHAR)-1)
Unneeded cast. Using a signed type for sextet also avoids the -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if you want to keep an unsigned type.
{
/* -1 means that the next character of src is not part of a base64 sequence */
This upper part of the comment is better placed in the declaration of base64_decoding_table.
/* in other words, all sextets in this base64 sequence have been processed */
/* the current, unfinished byte pair is discarded */
offset = 0;
break;
}
}
else
{
/* the next character of src is > 127 and therefore not part of a base64 sequence */
/* the current, unfinished byte pair is NOT discarded in this case */
/* this is probably a bug in Windows */
break;
}
byte_pair = (byte_pair << 6) | sextet;
offset += 6;
if (offset >= 16)
{
/* this byte pair is done */
if (!write_to_w_string(dst, dstlen, &dest_index, (byte_pair >> (offset - 16)) & 0xFFFF))
return -1;
offset -= 16;
}
/* this sextet is done */
src++;
} while (src < source_end);
}
else
{
/* we have to convert to unsigned char in case *src > 127 */
if (!write_to_w_string(dst, dstlen, &dest_index, (unsigned char)*src))
return -1;
src++;
}
- } while (src < source_end);
- return dest_index;
+}
+/***********************************************************************
MultiByteToWideChar (KERNEL32.@)
- Convert a multibyte character string into a Unicode string.
bye michael
2014-10-14 16:12 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
@@ -1963,7 +2119,7 @@ BOOL WINAPI EnumSystemCodePagesW( CODEPAGE_ENUMPROCW lpfnCodePageEnum, DWORD fla
- flags [I] Character mapping flags
- src [I] Source string buffer
- srclen [I] Length of src (in bytes), or -1 if src is NUL terminated
- dst [O] Destination buffer
- dst [O] Destination buffer, or NULL to compute the required length
- dstlen [I] Length of dst (in WCHARs), or 0 to compute the required length
The only case where dst == NULL is when dstlen == 0 but that already implies computing the required length. dst == NULL and dstlen != 0 is an error and MultiByteToWideChar() bails out.
Good point. I've removed this chunk and its WideCharToMultiByte counterpart.
/***********************************************************************
write_to_w_string
- Helper for utf7_mbstowcs
- RETURNS
- TRUE on success, FALSE on error
- */
+static inline BOOL write_to_w_string(WCHAR *dst, int dstlen, int *index, WCHAR character)
This isn't a good name for the helper as that's not what it does. It is more a "count and optionally write". Also prefixing it utf7 would make it clear that is just a helper for utf7_mbstowcs and not a generic function and thus would remove the need for the comment that it is just a helper for utf7_mbstowcs.
I've renamed them to utf7_write_w and utf7_write_c.
I know you're using this helper to avoid code duplication between the length calculation and the conversion unlike wine_utf8_mbstowcs(). But it complicates the handling of available space in dst. Without seeing the code for both approaches I'm not sure what is best so I'll review it as is.
+{
- if (dst)
You have to force dst = NULL in utf7_mbstowcs() for this to work. Just checking for dstlen > 0 should avoid that.
Good point. I've made this change as you suggested.
+/***********************************************************************
utf7_mbstowcs
- UTF-7 to UTF-16 string conversion, helper for MultiByteToWideChar
- RETURNS
- On success, the number of characters written
- On dst buffer overflow, -1
What about the case where it fails to decode a character? E.g. wine_utf8_mbstowcs() has -2 for that case which then translates to SetLastError(ERROR_NO_UNICODE_TRANSLATION). I looked at the tests but you didn't seem to test the GetLastError in the cases with the invalid input characters.
The UTF-7 routines in Windows do not check for Unicode compliance. Still, I've added error code checking to all the tests now, which will hopefully make that more clear.
- static const WCHAR base64_decoding_table[] = {
Why is this table an array of WCHAR? The entries aren't WCHARs.
I've changed them to unsigned chars.
/* \0 */ -1, /* \x01 */ -1, /* \x02 */ -1, /* \x03 */ -1,
This table is really hard to read even in a editor that does syntax highlighting. As you just need 4 characters per column you could have sixteen columns and merge the comments to one comment at the end of the line. Something like this would be more readable as well as being more compact: -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* 0x00 - 0x15 */
I've changed it to something similar while still listing every individual character:
/* \0 \x01 \x02 \x03 \x04 \x05 \x06 \a */ -1, -1, -1, -1, -1, -1, -1, -1,
- /* MultiByteToWideChar guarantees that srclen > 0 */
- assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
A while loop adds an unnecessary check, but whatever. You and Sebastian Lackner both requested it, so I concede.
sextet = base64_decoding_table[sextet];
if (sextet == (WCHAR)-1)
Unneeded cast. Using a signed type for sextet also avoids the -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if you want to keep an unsigned type.
Using a signed char in an array subscript causes Wchar-subscripts. I'd just be trading one warning for another.
/* -1 means that the next character of src is not part of a base64 sequence */
This upper part of the comment is better placed in the declaration of base64_decoding_table.
I really think it makes more sense where it is.
2014-10-14 16:59 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
dlls/kernel32/tests/codepage.c | 751 ++++++++++++++++++++++++++++++++++ +++++++ 1 file changed, 751 insertions(+)
while more tests are good that's a lot of lines to have for just one function. So splitting the WCHAR => UTF-7 from the UTF-7 => WCHAR tests should help make it more manageable.
The trouble is that the utf16_to_utf7_test's are reversible, so I can't cleanly categorize the tests as "encoding tests" or "decoding tests", because some tests are both.
Also some of the strings are huge. Most of those seem to be for the WCHAR => UTF-7 and could be probably generated by just using the directly_encodable_table[] from patch 2. If the tests cannot be generated from the table then please split the strings into more manageable chunks, as is they are unreadable / unmaintainable.
Those strings never need to be changed, ever. It's by far easier to generate them once and inline them. I'm not sure that splitting them up will make them much easier to read, either.
Also the due to the size of the patch it is hard to discern what the difference between the different test structs are. Especially as the names are basically equivalent in this context:
- utf16_to_utf7_tests and wcstombs_tests
- utf7_to_utf16_tests and mbstowcs_tests
Could you suggest better names?
Anyway, thanks for taking the time to review this patch set. You made a lot of good points. The latest changes (incorporating your suggestions) are at https://github.com/alexhenrie/wine
-Alex
On 10/15/2014 03:16 AM, Alex Henrie wrote:
2014-10-14 16:12 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
+/***********************************************************************
utf7_mbstowcs
- UTF-7 to UTF-16 string conversion, helper for MultiByteToWideChar
- RETURNS
- On success, the number of characters written
- On dst buffer overflow, -1
What about the case where it fails to decode a character? E.g. wine_utf8_mbstowcs() has -2 for that case which then translates to SetLastError(ERROR_NO_UNICODE_TRANSLATION). I looked at the tests but you didn't seem to test the GetLastError in the cases with the invalid input characters.
The UTF-7 routines in Windows do not check for Unicode compliance. Still, I've added error code checking to all the tests now, which will hopefully make that more clear.
- static const WCHAR base64_decoding_table[] = {
Why is this table an array of WCHAR? The entries aren't WCHARs.
I've changed them to unsigned chars.
You don't have to use the smallest possible type. A short or even int would work too. And then you can use a signed type which doesn't visually conflicts with the -1 in the table.
/* \0 */ -1, /* \x01 */ -1, /* \x02 */ -1, /* \x03 */ -1,
This table is really hard to read even in a editor that does syntax highlighting. As you just need 4 characters per column you could have sixteen columns and merge the comments to one comment at the end of the line. Something like this would be more readable as well as being more compact: -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, /* 0x00 - 0x15 */
I've changed it to something similar while still listing every individual character:
/* \0 \x01 \x02 \x03 \x04 \x05 \x06 \a */ -1, -1, -1, -1, -1, -1, -1, -1,
Still not as compact as I would have done it but it is readable and that's what matters.
- /* MultiByteToWideChar guarantees that srclen > 0 */
- assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
A while loop adds an unnecessary check, but whatever. You and Sebastian Lackner both requested it, so I concede.
Uh? Trying to avoid a loop check with a do..while loop is a micro-optimization. That's the job of the compiler. Our job is to optimize the code for readability.
sextet = base64_decoding_table[sextet];
if (sextet == (WCHAR)-1)
Unneeded cast. Using a signed type for sextet also avoids the -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if you want to keep an unsigned type.
Using a signed char in an array subscript causes Wchar-subscripts. I'd just be trading one warning for another.
No, it doesn't. Wchar-subscripts complains only about the naked char as that can be either signed or unsigned depending on the platform. Using signed char lets the compiler know that you're aware of the issue and it will leave you alone.
/* -1 means that the next character of src is not part of a base64 sequence */
This upper part of the comment is better placed in the declaration of base64_decoding_table.
I really think it makes more sense where it is.
I was trying to reduce the verbosity of utf7_mbstowcs(). The multi line comments drown the code in between. If you can fit the essence into one comment line, that works too. utf7_wcstombs(), which has less comments, is much more pleasant to read.
2014-10-14 16:59 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
dlls/kernel32/tests/codepage.c | 751 ++++++++++++++++++++++++++++++++++ +++++++ 1 file changed, 751 insertions(+)
while more tests are good that's a lot of lines to have for just one function. So splitting the WCHAR => UTF-7 from the UTF-7 => WCHAR tests should help make it more manageable.
The trouble is that the utf16_to_utf7_test's are reversible, so I can't cleanly categorize the tests as "encoding tests" or "decoding tests", because some tests are both.
Aha.
Also some of the strings are huge. Most of those seem to be for the WCHAR => UTF-7 and could be probably generated by just using the directly_encodable_table[] from patch 2. If the tests cannot be generated from the table then please split the strings into more manageable chunks, as is they are unreadable / unmaintainable.
Those strings never need to be changed, ever. It's by far easier to
Being able to change them is just the icing on the top. We cannot read them, that's the problem.
generate them once and inline them. I'm not sure that splitting them up will make them much easier to read, either.
How should we know those are generated? Or that you didn't miss one of the 256 characters? Those are just big blobs of data. Tests in Wine serve a dual purpose. They don't only make sure that Wine code won't regress but they also document the found behavior. For that the tests need to be readable, akin to comments.
If you generate the tests then generate them programmatically in the test. Something along the lines of:
/* tests which one-byte characters are base64-encoded and which are not */ for (i = 0; i < 256; i++) { wstr[0] = i; ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str, sizeof(str), ...) if (utf7_can_directly_encode(i)) ok(ret == 1 && str[0] == i, "'%c' not represented directly in UTF-7\n", i); else ok(ret == x && str[0] == '+' && str[x-1] == '-', "..."); }
It is: - shorter than the big blob approach - far more readable - it better conveys the intent of the test - trivially to validate
Other tests from wcstombs_tests can be done similarly, with a separate for loop for better readability.
Also the due to the size of the patch it is hard to discern what the difference between the different test structs are. Especially as the names are basically equivalent in this context:
- utf16_to_utf7_tests and wcstombs_tests
- utf7_to_utf16_tests and mbstowcs_tests
Could you suggest better names?
Not at this stage.
Anyway, thanks for taking the time to review this patch set. You made a lot of good points. The latest changes (incorporating your suggestions) are at https://github.com/alexhenrie/wine
You're welcome.
bye michael
2014-10-15 15:15 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
On 10/15/2014 03:16 AM, Alex Henrie wrote:
2014-10-14 16:12 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
- /* MultiByteToWideChar guarantees that srclen > 0 */
- assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
A while loop adds an unnecessary check, but whatever. You and Sebastian Lackner both requested it, so I concede.
Uh? Trying to avoid a loop check with a do..while loop is a micro-optimization. That's the job of the compiler. Our job is to optimize the code for readability.
There's no way that the compiler is verifying that the passed srclen is always greater than 0 and internally converting the loop to a do-while loop as a consequence. The compiler is not that smart. Nevertheless, I'll leave it as a while loop for you.
sextet = base64_decoding_table[sextet];
if (sextet == (WCHAR)-1)
Unneeded cast. Using a signed type for sextet also avoids the -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if you want to keep an unsigned type.
Using a signed char in an array subscript causes Wchar-subscripts. I'd just be trading one warning for another.
No, it doesn't. Wchar-subscripts complains only about the naked char as that can be either signed or unsigned depending on the platform. Using signed char lets the compiler know that you're aware of the issue and it will leave you alone.
OK. I've switched to signed char and eliminated this typecast.
/* -1 means that the next character of src is not part of a base64 sequence */
This upper part of the comment is better placed in the declaration of base64_decoding_table.
I really think it makes more sense where it is.
I was trying to reduce the verbosity of utf7_mbstowcs(). The multi line comments drown the code in between. If you can fit the essence into one comment line, that works too. utf7_wcstombs(), which has less comments, is much more pleasant to read.
This is where we just disagree. I think the comments are helpful, both because they help the reader to understand the UTF-7 algorithm and because they explain what behavior is a bug in Windows and what is not. I have, however, removed a couple of the smaller comments where the code is self-explanatory.
Also some of the strings are huge. Most of those seem to be for the WCHAR => UTF-7 and could be probably generated by just using the directly_encodable_table[] from patch 2. If the tests cannot be generated from the table then please split the strings into more manageable chunks, as is they are unreadable / unmaintainable.
Those strings never need to be changed, ever. It's by far easier to
Being able to change them is just the icing on the top. We cannot read them, that's the problem.
generate them once and inline them. I'm not sure that splitting them up will make them much easier to read, either.
How should we know those are generated? Or that you didn't miss one of the 256 characters? Those are just big blobs of data. Tests in Wine serve a dual purpose. They don't only make sure that Wine code won't regress but they also document the found behavior. For that the tests need to be readable, akin to comments.
If you generate the tests then generate them programmatically in the test. Something along the lines of:
/* tests which one-byte characters are base64-encoded and which are not */ for (i = 0; i < 256; i++) { wstr[0] = i; ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str, sizeof(str), ...) if (utf7_can_directly_encode(i)) ok(ret == 1 && str[0] == i, "'%c' not represented directly in UTF-7\n", i); else ok(ret == x && str[0] == '+' && str[x-1] == '-', "..."); }
This algorithm doesn't check whether the character was actually encoded the way that Windows would encode it. Checking that the character and subsequent characters are encoded correctly is particularly important because of the Windows bug with characters < 0.
I have now rewritten the tests to programatically generate the input, but there is no way to programatically generate the output without duplicating pretty much the entire Windows UTF-7 algorithms in the tests, which would defeat the purpose of having tests.
As before, the latest code is sitting in https://github.com/alexhenrie/wine/commits/master
-Alex
On 10/16/2014 01:38 AM, Alex Henrie wrote:
2014-10-15 15:15 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
On 10/15/2014 03:16 AM, Alex Henrie wrote:
2014-10-14 16:12 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
- /* MultiByteToWideChar guarantees that srclen > 0 */ +
assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
A while loop adds an unnecessary check, but whatever. You and Sebastian Lackner both requested it, so I concede.
Uh? Trying to avoid a loop check with a do..while loop is a micro-optimization. That's the job of the compiler. Our job is to optimize the code for readability.
There's no way that the compiler is verifying that the passed srclen is always greater than 0 and internally converting the loop to a
It doesn't have to verify srclen as it already knows it is > 0. As MultiByteToWideChar() is the sole user of your function and it uses it only from one place it will most likely just inline it.
do-while loop as a consequence. The compiler is not that smart.
Even if the compiler isn't that smart it just doesn't matters. Even ignoring the fact that you're trading a check in the while loop for a check in assert() it still doesn't matters.
Nevertheless, I'll leave it as a while loop for you.
As I'm passionate about this topic and to save me time in the future by just referencing this email here is a tl;dr full explanation.
Computer Cost: ============== Here is a quick back of the envelope calculation for the costs of the extra check: - The cost for the check in the do while loop will be srclen * cost whereas for the while loop it will be (srclen+1) * cost. - For big srclen the +1 becomes irrelevant. Worst case is srclen == 1 with two times the cost of the check. - What's the cost of the loop check compared to the body of the loop? Lets pick 10% so the overhead of the extra check will melt down to 0.1 * cost of the check in the worst case scenario of srclen == 1. - Reduce the cost overhead again for the whole MultiByteToWideChar() - Factor in the fact that the call to MultiByteToWideChar() won't be in the hot path of the application. - Also the absolute cost of the extra check is cheap and a modern CPU might just paper over it with its branch prediction. - So even with a synthetic test using srclen == 2 I could bet you will have a hard time measuring the overhead. And that only if the compiler is stupid of course.
Reviewer Cost: ============== Now to the costs of the do while loop on the reviewer side: - Look at the diff in the email client - What's that first do while loop for? Scroll down as the big comments make the loop end not fit on one page. - Uh, he isn't checking srclen to be > 0 and uses it as array index. Meh.
Reviewer will branch at that point: - Code is so bad it isn't worth spending time on it. Alexandre with you're not checking the srclen. But I figure others have done it too.
A few will continue: - Why is he doing that? At 2+ Million lines of code we can't just remember all the gory details. - Switch to the window / terminal / desktop with the Wine source code - Load the locale.c file into the editor - Look for MultiByteToWideChar() - Check that indeed srclen cannot ever be < 0 - Go back to review the diff. Oh where was I?
So minutes of reviewer time wasted on a micro optimization. Heavily reduced the number of reviewers willing to look at your patch. For what? Just for a couple of CPU cycles if and only if the compiler is that stupid and the CPU isn't that smart to guess it right in the branch prediction.
Reviewer time is the *most scarce* resource, even scarcer than developer time writing the code in the first place. CPU cycles on the other hand are plenty and cheap.
sextet =
base64_decoding_table[sextet]; + if (sextet == (WCHAR)-1)
Unneeded cast. Using a signed type for sextet also avoids the -Wsign-compare compiler warning. Alternatively use ~0 instead of -1 if you want to keep an unsigned type.
Using a signed char in an array subscript causes Wchar-subscripts. I'd just be trading one warning for another.
No, it doesn't. Wchar-subscripts complains only about the naked char as that can be either signed or unsigned depending on the platform. Using signed char lets the compiler know that you're aware of the issue and it will leave you alone.
OK. I've switched to signed char and eliminated this typecast.
/* -1 means that the next
character of src is not part of a base64 sequence */
This upper part of the comment is better placed in the declaration of base64_decoding_table.
I really think it makes more sense where it is.
I was trying to reduce the verbosity of utf7_mbstowcs(). The multi line comments drown the code in between. If you can fit the essence into one comment line, that works too. utf7_wcstombs(), which has less comments, is much more pleasant to read.
This is where we just disagree. I think the comments are helpful, both
Sorry, I didn't mean having just one line of comment for the whole function. Just condensing the multi line comments in preferably one or two lines of comment. E.g. on the first review I was wondering why you have an else branch with just comments in it, I totally oversaw the break in there.
because they help the reader to understand the UTF-7 algorithm and because they explain what behavior is a bug in Windows and what is not. I have, however, removed a couple of the smaller comments where the code is self-explanatory.
Good. Self explanatory code trumps comments any time.
Also some of the strings are huge. Most of those seem to be for the WCHAR => UTF-7 and could be probably generated by just using the directly_encodable_table[] from patch 2. If the tests cannot be generated from the table then please split the strings into more manageable chunks, as is they are unreadable / unmaintainable.
Those strings never need to be changed, ever. It's by far easier to
Being able to change them is just the icing on the top. We cannot read them, that's the problem.
generate them once and inline them. I'm not sure that splitting them up will make them much easier to read, either.
How should we know those are generated? Or that you didn't miss one of the 256 characters? Those are just big blobs of data. Tests in Wine serve a dual purpose. They don't only make sure that Wine code won't regress but they also document the found behavior. For that the tests need to be readable, akin to comments.
If you generate the tests then generate them programmatically in the test. Something along the lines of:
/* tests which one-byte characters are base64-encoded and which are not */ for (i = 0; i < 256; i++) { wstr[0] = i; ret = WideCharToMultiByte(CP_UTF7, flags, wstr, 2, str, sizeof(str), ...) if (utf7_can_directly_encode(i)) ok(ret == 1 && str[0] == i, "'%c' not represented directly in UTF-7\n", i); else ok(ret == x && str[0] == '+' && str[x-1] == '-', "..."); }
This algorithm doesn't check whether the character was actually encoded the way that Windows would encode it. Checking that the
That's not the essence of the test. It just tests what your comment says it should test.
character and subsequent characters are encoded correctly is particularly important because of the Windows bug with characters < 0.
WCHAR is unsigned so there are no WCHARs < 0. I guess you talk about WCHARs in the range 128 - 255. That would be of course an extra test for loop. But see below too.
I have now rewritten the tests to programatically generate the input, but there is no way to programatically generate the output without duplicating pretty much the entire Windows UTF-7 algorithms in the tests, which would defeat the purpose of having tests.
If you are just want to check the exact mappings that Windows does between WCHAR and UTF-7 it isn't hard to generate. And you can even check the encoding for the whole WCHAR range with a for loop from 0 to WCHAR_MAX: - At 64K WCHARs the src buffer will eat just 128KB. Fill it with a for loop. - The dest buffer assuming a worst case 1 WCHAR to 4 char encoding will be just 256KB. - WideCharToMultiByte() that. - Calculate the SHA1 sum of the UTF-7 stream and compare it to a previously computed SHA1. - MultiByteToWideChar the UTF-7 stream you got. Either compare it directly to the WCHAR src if the operation happens to be symmetric or calculate again the SHA1 sum and compare it to the stored.
Then you can reserve the other tests for the interesting stuff where you really have to store both src and dest. But those should be short strings anyway.
As before, the latest code is sitting in https://github.com/alexhenrie/wine/commits/master
bye michael
2014-10-16 6:25 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
On 10/16/2014 01:38 AM, Alex Henrie wrote:
2014-10-15 15:15 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
On 10/15/2014 03:16 AM, Alex Henrie wrote:
2014-10-14 16:12 GMT-06:00 Michael Stefaniuc mstefani@redhat.com:
- /* MultiByteToWideChar guarantees that srclen > 0 */ +
assert(srclen > 0);
This assert isn't needed. Just rewriting the first do..while loop to a while loop avoids it. That eliminates 3 lines of code of which one is a comment pointing to an unneeded assert.
A while loop adds an unnecessary check, but whatever. You and Sebastian Lackner both requested it, so I concede.
Uh? Trying to avoid a loop check with a do..while loop is a micro-optimization. That's the job of the compiler. Our job is to optimize the code for readability.
There's no way that the compiler is verifying that the passed srclen is always greater than 0 and internally converting the loop to a
It doesn't have to verify srclen as it already knows it is > 0. As MultiByteToWideChar() is the sole user of your function and it uses it only from one place it will most likely just inline it.
OK, you're right. At O1 or higher, the compiler appears to inline the function and rewrite the loop as a do-while loop.
Reviewer Cost:
Now to the costs of the do while loop on the reviewer side:
- Look at the diff in the email client
- What's that first do while loop for? Scroll down as the big comments make the loop end not fit on one page.
- Uh, he isn't checking srclen to be > 0 and uses it as array index. Meh.
Let's imagine that the reviewer is looking at the while loop version. He might say "why is this code using a while loop if the condition is guaranteed to be true the first time through?" Then he'd have to try to think through whether the compiler will optimize that out or not.
This whole problem could be avoided with a do-while loop and one line of comment explaining that srclen is guaranteed to be > 0.
I have now rewritten the tests to programatically generate the input, but there is no way to programatically generate the output without duplicating pretty much the entire Windows UTF-7 algorithms in the tests, which would defeat the purpose of having tests.
If you are just want to check the exact mappings that Windows does between WCHAR and UTF-7 it isn't hard to generate. And you can even check the encoding for the whole WCHAR range with a for loop from 0 to WCHAR_MAX:
- At 64K WCHARs the src buffer will eat just 128KB. Fill it with a for loop.
- The dest buffer assuming a worst case 1 WCHAR to 4 char encoding will be just 256KB.
- WideCharToMultiByte() that.
- Calculate the SHA1 sum of the UTF-7 stream and compare it to a previously computed SHA1.
- MultiByteToWideChar the UTF-7 stream you got. Either compare it directly to the WCHAR src if the operation happens to be symmetric or calculate again the SHA1 sum and compare it to the stored.
Then you can reserve the other tests for the interesting stuff where you really have to store both src and dest. But those should be short strings anyway.
Let's imagine that someone else tinkers with the UTF-7 code, and accidentally introduces a bug. If the tests only compare hashes, how would he have any idea what exactly broke?
If the purpose of the tests is purely to check conformance, then hashes would be fine. But if you want them to aid in debugging as well, they need to compare against unprocessed output.
On second thought, I've rewritten the large tests per your previous suggestion. They no longer check that the exact UTF-7 encoding is correct, but this is not a big deal because other tests already check exact UTF-7 encodings thoroughly. The new tests instead test each character, including characters 0 through FFFF if testing WCHARs.
-Alex
The code would probably benefit from a helper function for the base64 decoding, instead of having a nested do-while loop. Also, it's pointless to have an else-block if you unconditionally break / return in the preceding if-block. Similarly, if you have an unconditional break / return in an else-block, it probably should have just been an if-block instead.
2014-10-17 1:53 GMT-06:00 Henri Verbeet hverbeet@gmail.com:
The code would probably benefit from a helper function for the base64 decoding, instead of having a nested do-while loop. Also, it's pointless to have an else-block if you unconditionally break / return in the preceding if-block. Similarly, if you have an unconditional break / return in an else-block, it probably should have just been an if-block instead.
OK. I've rewritten the inner loop of utf7_mbstowcs to not use else blocks.
I'm not sold on moving the inner loop to its own function, but we'll see what Alexandre says.
-Alex