Preserving the original doctype and declaration of an XML file parsed with lxml.etree

Question

python lxml doctype xml-declaration

18

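I'm using Python's lxml library to read an XML document, modify it, and write it back out, but the original DOCTYPE and XML declaration disappear. Is there an easy way to put them back in, either through lxml or some other solution?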

- incognito2

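Have you read the documentation for the tostring method? I believe it preserves the DOCTYPE automatically. - John Keyes

You can add a doctype and declaration with the tostring method, but I would need to parse that information out first. lxml doesn't seem to retain the doctype or declaration in the first place. - incognito2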
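
For the case raised in the comment above, where only an Element is at hand, a minimal sketch: the parsed prolog can be read from `docinfo` and handed back through the `doctype` argument of `tostring`. The sample document and the `item` element are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document; any XML with a DOCTYPE should behave the same.
SOURCE = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root><item>old</item></root>
'''

root = etree.fromstring(SOURCE)
docinfo = root.getroottree().docinfo  # prolog details recorded by the parser

root.find("item").text = "new"  # modify the document

# Serialize the Element itself and pass the DOCTYPE back in explicitly.
result = etree.tostring(
    root,
    xml_declaration=True,
    encoding=docinfo.encoding,
    doctype=docinfo.doctype,
)
print(result.decode(docinfo.encoding))
```

As far as I can tell, `docinfo.doctype` carries only the root name and the public/system identifiers, so an internal DTD subset would not be reproduced this way; serializing the ElementTree, as in the answers below, is the safer route in that case.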

2 Answers

7

You can also preserve the DOCTYPE and the XML declaration when using fromstring():

import sys
from lxml import etree

# Bytes input: on Python 3, lxml rejects str input that carries an encoding declaration.
xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
 <head>
 <title>example</title>
 </head>
 <body>
 <p>This is an example</p>
 </body>
</html>'''

tree = etree.fromstring(xml).getroottree()  # or etree.parse(file)
# Encoded output is bytes, so write to the binary stdout buffer.
tree.write(sys.stdout.buffer, xml_declaration=True, encoding=tree.docinfo.encoding)

Output:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
 <title>example</title>
 </head>
 <body>
 <p>This is an example</p>
 </body>
</html>

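Note that the XML declaration (with the correct encoding) and the doctype are both present. It even (possibly incorrectly) uses ' instead of " in the XML declaration and adds a Content-Type to <head>. For @John Keyes's example input, it produces the same result as etree.tostring() in his answer.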
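
To see what the parser recorded from the prolog before serializing, the `docinfo` object can be inspected directly. A small sketch, using a shortened version of the XHTML input above (the shortening is only for brevity):

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
 <head><title>example</title></head>
 <body><p>This is an example</p></body>
</html>'''

docinfo = etree.fromstring(xml).getroottree().docinfo

print(docinfo.xml_version)  # version from the XML declaration
print(docinfo.encoding)     # encoding from the XML declaration
print(docinfo.root_name)    # root element name given in the DOCTYPE
print(docinfo.public_id)    # PUBLIC identifier
print(docinfo.system_url)   # SYSTEM identifier (the DTD URL)
print(docinfo.doctype)      # DOCTYPE reassembled as one string
```

The default parser does not fetch the DTD, so this should run without network access.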

- jfs

- John Keyes · Accepted Answer

Summary:

# adds declaration with version and encoding regardless of
# which attributes were present in the original declaration
# expects utf-8 encoding (encode/decode calls)
# depending on your needs you might want to improve that
from lxml import etree
from xml.dom.minidom import parseString

xml1 = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''
xml2 = '''\
<root>...</root>
'''


def has_xml_declaration(xml):
    return parseString(xml).version


def process(xml):
    t = etree.fromstring(xml.encode()).getroottree()
    if has_xml_declaration(xml):
        print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
    else:
        print(etree.tostring(t).decode())


process(xml1)
process(xml2)

The following will include the DOCTYPE and the XML declaration:

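from io import BytesIO
from lxml import etree

# Bytes input: on Python 3, lxml rejects str input that carries an encoding declaration.
tree = etree.parse(BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
 <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
  <root>
   <a>&tasty;</a>
 </root>
'''))

docinfo = tree.docinfo
# tostring() returns bytes when an encoding is given; decode it for printing.
print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))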

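Note that tostring does not preserve the DOCTYPE if you create an Element (e.g. with fromstring); it only works when you process the XML with parse.

Update: as J.F. Sebastian pointed out, my assertion about fromstring was wrong.

Here is some code that highlights the differences between Element and ElementTree serialization: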

from io import BytesIO
from lxml import etree

# Bytes input: on Python 3, lxml rejects str input that carries an encoding declaration.
xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
 <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
  <root>
   <a>&tasty;</a>
 </root>
'''

# get the ElementTree using parse
parse_tree = etree.parse(BytesIO(xml_bytes))
encoding = parse_tree.docinfo.encoding
result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
print("%s\nparse ElementTree:\n%s\n" % ('-' * 20, result.decode(encoding)))

# get the ElementTree using fromstring
fromstring_tree = etree.fromstring(xml_bytes).getroottree()
encoding = fromstring_tree.docinfo.encoding
result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
print("%s\nfromstring ElementTree:\n%s\n" % ('-' * 20, result.decode(encoding)))

# DOCTYPE is lost, and no access to encoding (tostring falls back to ASCII)
fromstring_element = etree.fromstring(xml_bytes)
result = etree.tostring(fromstring_element, xml_declaration=True)
print("%s\nfromstring Element:\n%s\n" % ('-' * 20, result.decode('ascii')))

This outputs:

--------------------
parse ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
   <a>eggs</a>
 </root>

--------------------
fromstring ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
   <a>eggs</a>
 </root>

--------------------
fromstring Element:
<?xml version='1.0' encoding='ASCII'?>
<root>
   <a>eggs</a>
 </root>
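
Putting this together for the read, modify, write-back case in the question, a rough sketch might look like the following. The helper name `rewrite_preserving_prolog`, the callback, and the sample file are hypothetical; the key point is that the ElementTree (not the root Element) is what gets written.

```python
import os
import tempfile

from lxml import etree


def rewrite_preserving_prolog(path, modify):
    """Parse the file at path, apply modify(root), and write it back in place.

    Hypothetical helper: writing the ElementTree keeps the DOCTYPE, and the
    declaration is re-added using the encoding recorded in docinfo.
    """
    tree = etree.parse(path)
    modify(tree.getroot())
    tree.write(path, xml_declaration=True, encoding=tree.docinfo.encoding)


def set_spam(root):
    # Example modification: change the text of the <a> element.
    root.find("a").text = "spam"


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.xml")
    with open(path, "wb") as f:
        f.write(
            b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
            b'<!DOCTYPE root SYSTEM "test">\n'
            b'<root><a>eggs</a></root>\n'
        )
    rewrite_preserving_prolog(path, set_spam)
    with open(path, "rb") as f:
        rewritten = f.read()

print(rewritten.decode("iso-8859-1"))
```

Note that `xml_declaration=True` adds a declaration even when the input had none; the has_xml_declaration check in the summary code above is one way to handle that.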