Re: [PATCH v2 0/4] MR585: Improve is_gecko_path() for handling non-ASCII characters.

8 Jul 2023

Hi Jacek,

On 8/9/22 18:18, Jacek Caban (@jacek) wrote:
...
Jacek Caban (@jacek) commented about dlls/kernelbase/path.c:
...
      {
          INT ih;
          WCHAR buf[5] = L"0x";

          memcpy(buf + 2, src + 1, 2*sizeof(WCHAR));
          buf[4] = 0;
          StrToIntExW(buf, STIF_SUPPORT_HEX, &ih);
          next = (WCHAR) ih;
          src += 2; /* Advance to end of escape */

          if (flags & URL_UNESCAPE_AS_UTF8)
          {
              utf8_buf[utf8_len++] = ih;
              utf16_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8_buf, utf8_len, NULL, 0);
              if (!utf16_len)
                  continue;

This doesn't seem reliable. For example, if there is a non-escaped char between escaped multi-byte values, you will end up combining the characters surrounding the non-escaped one. See JSGlobal_decodeURI for an example of how it can be handled.

Sorry for the long delay, it has been quite a while!
Last time I tried the approach used in JSGlobal_decodeURI(), but I found
that it doesn't handle 4-byte UTF-8 sequences very well, so I put this
on hold.
Anyway, this came back to my attention recently. In this version, I use
get_utf8_len() and the first byte of the UTF-8 sequence to calculate the
length of the sequence. Hopefully this handles both the 'non-escaped
characters between multi-byte escaped characters' case and 4-byte
UTF-8 sequences. Tests for these cases are added accordingly.
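
To illustrate the idea outside of the patch, here is a minimal, portable sketch. It is not the code from the series: utf8_seq_len(), parse_escape() and unescape_utf8() are hypothetical names, it works on narrow strings and emits raw UTF-8 bytes instead of converting to UTF-16, and it does not reject overlong forms or surrogates the way MultiByteToWideChar() with MB_ERR_INVALID_CHARS would. The point is only that the lead byte fixes how many escaped continuation bytes must follow directly, so an unescaped character in between can never be merged into a sequence.

```c
#include <stddef.h>

/* Hypothetical helper: length implied by a UTF-8 lead byte,
 * or 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0; /* continuation byte or invalid lead */
}

static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse a "%XX" escape at s; return the byte value, or -1 if there is none. */
static int parse_escape(const char *s)
{
    int hi, lo;

    if (s[0] != '%') return -1;
    if ((hi = hex_val(s[1])) < 0) return -1;
    if ((lo = hex_val(s[2])) < 0) return -1;
    return hi * 16 + lo;
}

/* Decode %XX escapes from src into dst (dst needs strlen(src) + 1 bytes).
 * A multi-byte sequence is decoded only if all of its bytes are escaped
 * and adjacent; otherwise the text is copied through unchanged.
 * Returns the number of bytes written, not counting the terminator. */
static size_t unescape_utf8(const char *src, char *dst)
{
    size_t out = 0;

    while (*src)
    {
        int first = parse_escape(src);
        int len = first < 0 ? 0 : utf8_seq_len((unsigned char)first);
        int i;

        /* Each continuation byte must be the very next escape. */
        for (i = 1; i < len; i++)
        {
            int b = parse_escape(src + 3 * i);
            if (b < 0x80 || b > 0xBF) { len = 0; break; }
        }

        if (!len)
        {
            dst[out++] = *src++; /* not a complete sequence: keep literally */
            continue;
        }

        for (i = 0; i < len; i++)
            dst[out++] = (char)parse_escape(src + 3 * i);
        src += 3 * len;
    }
    dst[out] = 0;
    return out;
}
```

With this sketch, "%E4%BD%A0" and "%F0%9F%98%80" decode to 3 and 4 bytes respectively, while "%E4a%BD%A0" is left untouched because the 'a' interrupts the sequence announced by the 0xE4 lead byte.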
Thanks
