https://bugs.winehq.org/show_bug.cgi?id=39002
Bug ID: 39002
Summary: ISAXXMLParser ignores charset property
Product: Wine
Version: unspecified
Hardware: x86
OS: Linux
Status: UNCONFIRMED
Severity: minor
Priority: P2
Component: msxml3
Assignee: wine-bugs@winehq.org
Reporter: ott@mirix.org
Distribution: ---
ISAXXMLParser (or, to be precise, its implementation in Wine) detects the encoding of a document in two ways. internal_parseBuffer uses xmlDetectCharEncoding plus some custom heuristics, while internal_parseStream relies on xmlDetectCharEncoding alone, through xmlCreatePushParserCtxt. Neither path takes the charset property into account, which violates the specification:
"This setting [charset property] takes priority over the default encoding, which is implicitly UTF-16, or over the encoding specified in the byte order mark (BOM) of the XML document header." (https://msdn.microsoft.com/en-us/library/ms757826%28v=vs.85%29.aspx)
Moreover, ISAXXMLParser returns E_NOTIMPL when the charset property is set. It should at least accept ASCII, UTF-8 and UTF-16, since these encodings are already detected by internal_parseBuffer and internal_parseStream and are supported by libxml2. A better option would be to support every encoding that Wine supports, re-encode the input as UTF-8 (when possible) and supply the UTF-8 encoded data to libxml2 via custom IO callbacks.
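To illustrate the priority order quoted above, here is a minimal sketch in C. It is not Wine code: the enum, the helper names and resolve_encoding are hypothetical, and mapping a bare "utf-16" to little-endian is an assumption made only for this example.

```c
#include <stddef.h>

/* Hypothetical encoding identifiers for this sketch (not Wine or libxml2 types). */
typedef enum {
    ENC_UNKNOWN = 0,
    ENC_ASCII,
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE
} sketch_encoding;

/* Case-insensitive comparison of two ASCII strings. */
static int ascii_ieq(const char *a, const char *b)
{
    for (; *a && *b; a++, b++) {
        unsigned char ca = (unsigned char)*a, cb = (unsigned char)*b;
        if (ca >= 'A' && ca <= 'Z') ca = (unsigned char)(ca - 'A' + 'a');
        if (cb >= 'A' && cb <= 'Z') cb = (unsigned char)(cb - 'A' + 'a');
        if (ca != cb) return 0;
    }
    return *a == *b;
}

/* Map a charset property value to an encoding; unknown names yield ENC_UNKNOWN. */
static sketch_encoding encoding_from_name(const char *name)
{
    if (!name) return ENC_UNKNOWN;
    if (ascii_ieq(name, "us-ascii") || ascii_ieq(name, "ascii")) return ENC_ASCII;
    if (ascii_ieq(name, "utf-8")) return ENC_UTF8;
    if (ascii_ieq(name, "utf-16")) return ENC_UTF16LE; /* assumption: LE */
    return ENC_UNKNOWN;
}

/* Look for a byte order mark at the start of the buffer. */
static sketch_encoding encoding_from_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return ENC_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return ENC_UTF16BE;
    return ENC_UNKNOWN;
}

/* Priority: explicit charset property, then BOM, then the implicit UTF-16 default. */
sketch_encoding resolve_encoding(const char *charset,
                                 const unsigned char *buf, size_t len)
{
    sketch_encoding enc = encoding_from_name(charset);
    if (enc != ENC_UNKNOWN) return enc;
    enc = encoding_from_bom(buf, len);
    if (enc != ENC_UNKNOWN) return enc;
    return ENC_UTF16LE; /* default; endianness is an assumption of this sketch */
}
```

With this ordering, a caller that sets the charset to "UTF-8" gets UTF-8 even if the buffer starts with a UTF-16 BOM; the BOM is consulted only when no usable charset was set.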
--- Comment #1 from Nikolay Sivov bunglehead@gmail.com ---
Hi, Matthias.
Thanks for reporting this. Do you have a test application that demonstrates the issue?
--- Comment #2 from Nikolay Sivov bunglehead@gmail.com ---
Matthias, again, do you have an application that depends on this?
--- Comment #3 from Matthias-Christian Ott ott@mirix.org ---
I have an application that sets this property and fails with an error because ISAXXMLParser returns E_NOTIMPL. Unfortunately, it is an internal application and I can't share it. I only test it with Wine; the production environment is Microsoft Windows.
I can provide a minimal test case if necessary. But I think you can easily construct an XML document that is not encoded in Unicode and for which the heuristics fail. An application should be able to set the encoding based on external information (see appendix section F.2 of the XML 1.0 Recommendation), so it would set the charset property of ISAXXMLParser to the appropriate encoding for the document.
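As a rough illustration of such a document and of the re-encoding approach suggested in the report, consider an ISO-8859-1 document without an encoding declaration. The byte 0xE9 ("é") is not valid UTF-8 on its own, so only external information can identify the encoding. The sketch below re-encodes ISO-8859-1 bytes as UTF-8 before they would be handed to the parser; latin1_to_utf8 is a hypothetical helper, not a Wine or libxml2 function.

```c
#include <stddef.h>

/*
 * Re-encode ISO-8859-1 bytes as UTF-8.
 * Returns the number of bytes the UTF-8 form needs. The output in dst is
 * complete only if the return value is <= dst_cap; otherwise it is truncated.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t src_len,
                      unsigned char *dst, size_t dst_cap)
{
    size_t i, out = 0;

    for (i = 0; i < src_len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* ASCII range maps to a single identical byte. */
            if (out < dst_cap)
                dst[out] = c;
            out += 1;
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            if (out + 1 < dst_cap) {
                dst[out] = (unsigned char)(0xC0 | (c >> 6));
                dst[out + 1] = (unsigned char)(0x80 | (c & 0x3F));
            }
            out += 2;
        }
    }
    return out;
}
```

A caller that knows from external information that the document is ISO-8859-1 could convert each chunk this way and feed the result to the parser as UTF-8. Other single-byte encodings would need a lookup table instead of the direct mapping used here.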