Saving XML as UTF-8 with MSXML

3

I am trying to load a simple XML file (encoded as UTF-8):

<?xml version="1.0" encoding="UTF-8"?>
<Test/>

and save it with MSXML from VBScript:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")

xmlDoc.Load("C:\test.xml")

xmlDoc.Save "C:\test.xml" 

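The problem is that MSXML saves the file in ANSI encoding rather than UTF-8, even though the original file is UTF-8. The MSDN MSXML documentation says the save() method writes the file in the encoding the XML declares:
"Character encoding is based on the encoding attribute in the XML declaration [...]. When no encoding attribute is specified, the default setting is UTF-8."
At least on my machine, that is clearly not what happens.
How can I make MSXML save the file as UTF-8?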

2
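I don't see the behavior you report. When I run that code, it saves the XML document as UTF-8. I get a UTF-8 declaration, and the actual strings are UTF-8 as well. - Cheeso
Yes, it may well be just my machine (Win2k3) and my colleague's machine (Win2k8 64-bit) that show this problem. It would be great if someone could explain definitively why the behavior differs between machines. - stung
4 Answers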
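
Since the two reports above disagree about what actually ends up on disk, it can help to inspect the saved bytes directly rather than trust an editor's encoding label. The following Python sketch is not part of the original thread; `describe_encoding` is a hypothetical helper name, and its classification is only a heuristic based on the byte order mark and on whether the bytes decode cleanly.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes by BOM and decodability (heuristic only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 with BOM"
    try:
        data.decode("ascii")
        return "pure ASCII (also valid UTF-8)"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8 without BOM"
    except UnicodeDecodeError:
        return "not UTF-8 (possibly an ANSI code page)"


# The question's sample document contains only ASCII characters.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<Test/>'
print(describe_encoding(sample))

# A document with an accented character, saved in windows-1252.
ansi_sample = "<Test>\u00e9</Test>".encode("cp1252")
print(describe_encoding(ansi_sample))
```

To check a real file, read it in binary mode first, for example `describe_encoding(open(path, "rb").read())`, where `path` is whatever file MSXML wrote.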

4
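You can use two other classes in MSXML to encode the XML correctly and write it to an output stream. Here is a helper method I wrote that writes to a generic IStream: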
class procedure TXMLHelper.WriteDocumentToStream(const Document60: IXMLDOMDocument2; const stream: IStream; Encoding: string = 'UTF-8');
var
    writer: IMXWriter;
    reader: IVBSAXXMLReader;
begin
{
    From http://support.microsoft.com/kb/275883
    INFO: XML Encoding and DOM Interface Methods

    MSXML has native support for the following encodings:
        UTF-8
        UTF-16
        UCS-2
        UCS-4
        ISO-10646-UCS-2
        UNICODE-1-1-UTF-8
        UNICODE-2-0-UTF-16
        UNICODE-2-0-UTF-8

    It also recognizes (internally using the WideCharToMultibyte API function for mappings) the following encodings:
        US-ASCII
        ISO-8859-1
        ISO-8859-2
        ISO-8859-3
        ISO-8859-4
        ISO-8859-5
        ISO-8859-6
        ISO-8859-7
        ISO-8859-8
        ISO-8859-9
        WINDOWS-1250
        WINDOWS-1251
        WINDOWS-1252
        WINDOWS-1253
        WINDOWS-1254
        WINDOWS-1255
        WINDOWS-1256
        WINDOWS-1257
        WINDOWS-1258
}

    if Document60 = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: Document60 cannot be nil');
    if stream = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: stream cannot be nil');

    // Set properties on the XML writer - including BOM, XML declaration and encoding
    writer := CoMXXMLWriter60.Create;
    writer.byteOrderMark := True; //Determines whether to write the Byte Order Mark (BOM). The byteOrderMark property has no effect for BSTR or DOM output. (Default True)
    writer.omitXMLDeclaration := False; //Forces the IMXWriter to skip the XML declaration. Useful for creating document fragments. (Default False)
    writer.encoding := Encoding; //Sets and gets encoding for the output. (Default "UTF-16")
    writer.indent := True; //Sets whether to indent output. (Default False)
    writer.standalone := True;

    // Set the XML writer to the SAX content handler.
    reader := CoSAXXMLReader60.Create;
    reader.contentHandler := writer as IVBSAXContentHandler;
    reader.dtdHandler := writer as IVBSAXDTDHandler;
    reader.errorHandler := writer as IVBSAXErrorHandler;
    reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);
    reader.putProperty('http://xml.org/sax/properties/declaration-handler', writer);


    writer.output := stream; //The resulting document will be written into the provided IStream

    // Now pass the DOM through the SAX handler, and it will call the writer
    reader.parse(Document60);

    writer.flush;
end;

To save to a file, I call the stream version with a TFileStream:
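class procedure TXMLHelper.WriteDocumentToFile(const Document60: IXMLDOMDocument2; const filename: string; Encoding: string='UTF-8');
var
    fs: TFileStream;
    adapter: IStream;
begin
    fs := TFileStream.Create(filename, fmCreate or fmShareDenyWrite);
    try
        // TFileStream is not an IStream; wrap it in a TStreamAdapter (Classes unit).
        // The default ownership (soReference) leaves freeing fs to this procedure.
        adapter := TStreamAdapter.Create(fs);
        TXMLHelper.WriteDocumentToStream(Document60, adapter, Encoding);
        adapter := nil; // release the adapter before the stream is freed
    finally
        fs.Free;
    end;
end;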

You can translate these functions into whatever language you like. These are in Delphi.


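Borrowing James Bond's hobby, I'm going to try to resurrect this thread. In the C++ type library I have the following definition of the putProperty method: HRESULT ISAXXMLReader::putProperty(unsigned short * pwchName, const _variant_t & varValue). It takes an unsigned short pointer as its first parameter. Do you know whether there is an enum or #define for the supported properties, or how I should specify the "lexical-handler" and "declaration-handler" properties? - Robertas
@wenaxus I don't even know what lexical-handler and declaration-handler are! :) You would be better off asking this as a brand-new question. - Ian Boyd
OK, I just saw that you use them in your answer: reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);, so I thought I would give it a try. - Robertas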

3

So I guess MSXML cannot save as UTF-8 if the file contains no Unicode bytes? - stung
3
If a file contains only ASCII characters (apart from a leading BOM), then by definition there is no difference between an ASCII file and a UTF-8 file. - Kyle Alons
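
The point in the comment above can be checked directly. This Python sketch is not from the original thread; it encodes the question's sample document in three encodings and compares the bytes. Using `cp1252` as the "ANSI" code page is an assumption for illustration, since the actual ANSI code page depends on the machine's locale.

```python
# The question's sample document, as a Unicode string.
xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<Test/>\n'

as_ascii = xml_text.encode("ascii")
as_utf8 = xml_text.encode("utf-8")
as_cp1252 = xml_text.encode("cp1252")  # assumed stand-in for "ANSI"

# For pure-ASCII content the three byte sequences are identical,
# so no tool can tell from the bytes which encoding was "used".
print(as_ascii == as_utf8 == as_cp1252)

# A single non-ASCII character makes the encodings diverge.
accented = "<Test>\u00e9</Test>"
print(accented.encode("utf-8"))   # the accented letter takes two bytes
print(accented.encode("cp1252"))  # the accented letter takes one byte
```

So a file holding only `<Test/>` and an ASCII declaration may be reported as ANSI by an editor even when it is also perfectly valid UTF-8; the difference only becomes observable once the document contains a non-ASCII character or a BOM.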

2

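When you call load, MSXML does not copy the encoding from the processing instruction into the document it creates. The document therefore carries no encoding, and MSXML appears to pick whatever it likes. In my environment it picks UTF-16, which I don't want.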

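The workaround is to supply the processing instruction yourself and specify the encoding there. If you know the document has no processing instruction, the code is very simple: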

Set pi = xmlDoc.createProcessingInstruction("xml", _
         "version=""1.0"" encoding=""windows-1250""")
If xmlDoc.childNodes.Length > 0 Then
  Call xmlDoc.insertBefore(pi, xmlDoc.childNodes.Item(0))
End If

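If the document might already contain an xml processing instruction, it must be removed first (so the code below has to run before the code above). I don't know how to do this with selectNodes, so I simply iterate over all the top-level nodes: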
' Walk the top-level nodes backwards so removals do not shift pending indexes
For ich = xmlDoc.childNodes.Length - 1 To 0 Step -1
  Set ch = xmlDoc.childNodes.Item(ich)
  If ch.nodeTypeString = "processinginstruction" And ch.nodeName = "xml" Then
    xmlDoc.removeChild ch
  End If
Next

Apologies if the code does not run as-is: I adapted it from a working version that was written in a custom language rather than VBScript.

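I searched everywhere for a way to remove a processing instruction from an XML file using VBScript; this is the only example of a way to do it that I have seen (and of course I felt silly once I found it). selectNodes does not seem to work because the processing instruction is not inside the document element, which is where selectNodes and XPath start searching. - Pow-Ian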

0
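This workaround seems to work: pass the XML string to ADODB.Stream to save the file.
Update: the processing instruction is lost, for the following reason.
According to https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms755989(v=vs.85), the xml property always returns a Unicode string. That is, the xml property of DOMDocument converts the document from its original encoding to Unicode. As a result, the original encoding attribute is removed.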
Sub Facturae()
On Error GoTo ExceptionHandling

Dim i As Integer

'Declare document objects
Dim xDoc As MSXML2.DOMDocument60
Dim xRoot As MSXML2.IXMLDOMElement

'Create new DOMDocument
Set xDoc = New DOMDocument60

'https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms755989(v=vs.85)
'Add the XML declaration as a processing instruction:
'Dim xmlDecl As MSXML2.IXMLDOMProcessingInstruction
'Set xmlDecl = xDoc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-8' standalone='yes'")
'xDoc.appendChild xmlDecl

'Create the root element
Set xRoot = xDoc.createElement("fe:Facturae")
xDoc.appendChild xRoot

'The namespace declarations are attributes on the root element, so you can add them using:
xRoot.setAttribute "xmlns:ds", "http://www.w3.org/2000/09/xmldsig#"
xRoot.setAttribute "xmlns:fe", "http://www.facturae.es/Facturae/2009/v3.2/Facturae"
'xDoc.DocumentElement.setAttribute "xmlns:ds", "http://www.w3.org/2000/09/xmldsig#"
'xDoc.DocumentElement.setAttribute "xmlns:fe", "http://www.facturae.es/Facturae/2009/v3.2/Facturae"

'Add child to root
'Create security element
Dim objSecElem As MSXML2.IXMLDOMElement
Set objSecElem = xDoc.createElement("Security")
xRoot.appendChild objSecElem

Dim str(1 To 3) As String
str(1) = "A"
str(2) = "B"
str(3) = "C"

Dim objProp As IXMLDOMElement
For i = 1 To UBound(str)
    Set objProp = xDoc.createElement(str(i))
    objSecElem.appendChild objProp
    objProp.Text = i
Next i

Dim objStream As Stream, strData As String, sFilePath As String
'Debug.Print xDoc.XML
strData = xDoc.XML
sFilePath = ThisWorkbook.Path & "\my_file.xml"

Set objStream = CreateObject("ADODB.Stream")
objStream.Type = adTypeText
objStream.charset = "utf-8"
objStream.LineSeparator = adCRLF
objStream.Open
objStream.WriteText "<?xml version='1.0' encoding='utf-8' standalone='yes'?>", adWriteLine
objStream.WriteText strData, adWriteChar
objStream.SaveToFile sFilePath, adSaveCreateOverWrite

'Save the XML file
'xDoc.Save sFilePath

CleanUp:
    On Error Resume Next
    objStream.Close
    Exit Sub
ExceptionHandling:
    MsgBox "Error: " & Err.Description
    Resume CleanUp
    'https://stackoverflow.com/questions/52205786/excel-vba-global-variables-are-assigned-when-workbook-is-opened-get-erased-if?rq=1
    'error handler
    'Note the unreachable Resume at the end of the error handler. It's unreachable because Resume ExitProcedure will always execute first.
    'So when you get the messagebox, you can use ctrl+break to break into code, which will then take you to the Resume ExitProcedure.
    'You can then drag the yellow arrow over to the unreachable Resume, press F8 to step once which will then take you back to the line that caused the error.
    Resume 'for debugging
End Sub
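
The idea behind the workaround above is language-independent: take the Unicode string that the xml property returns (which no longer carries an encoding attribute), prepend the declaration yourself, and encode the result as UTF-8 when writing. The following Python sketch illustrates just that step under stated assumptions; `serialize_utf8` and its `with_bom` flag are hypothetical names, not part of MSXML or ADODB.

```python
def serialize_utf8(xml_body: str, with_bom: bool = False) -> bytes:
    """Prepend a UTF-8 XML declaration and encode the document as UTF-8."""
    declaration = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
    # "utf-8-sig" is Python's codec name for UTF-8 preceded by a BOM.
    codec = "utf-8-sig" if with_bom else "utf-8"
    return (declaration + xml_body).encode(codec)


# Stand-in for the string returned by the DOM's xml property.
body = "<Security><A>1</A><B>2</B><C>\u00e9</C></Security>"

plain = serialize_utf8(body)
marked = serialize_utf8(body, with_bom=True)

print(plain[:5])    # starts directly with the declaration
print(marked[:3])   # starts with the three BOM bytes
```

To my understanding, ADODB.Stream in text mode with charset "utf-8" writes a BOM at the start of the file, which corresponds to the `with_bom=True` case here; whether a BOM is wanted depends on the consumer of the file.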
