Compare two XML files with Python

Question

15

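I am new to programming in Python and am a bit confused by some of its concepts. I want to compare two XML files, and these files are quite large. Below is an example of the kind of files I want to compare.

XML file 1: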

<xml>
    <property1>
          <property2>    
               <property3>

               </property3>
          </property2>    
    </property1>    
</xml>

XML file 2:

<xml>
    <property1>
        <property2>    
            <property3> 
                <property4>

                </property4>    
            </property3>
        </property2>    
    </property1>

</xml>

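The names property1 and property2 are placeholders; the real files use different names. There are many properties in the XML files. I want to compare these two XML files.

I am using the lxml parser to try to compare the two files and print the differences between them.

I don't know how to parse and compare them automatically.

I tried reading about the lxml parser, but I couldn't understand how to use it for my problem.

Can someone please tell me how to solve this?

Code snippets would be very useful.

One more question: am I following the right approach, or am I missing something? Please correct me and point out any relevant concepts you know of.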

- sankar

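What are you looking for in the output? If it is just the differences, you may want to use diff on Linux or fc on Windows. - gkusner

Actually, I want to know which part of the file has changed. - sankar

4 Answers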

7

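My approach is to convert each XML document into an xml.etree.ElementTree and iterate through each level. I also included the ability to ignore a list of attributes during the comparison.

The first code block contains the class used: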

import xml.etree.ElementTree as ET
import logging

class XmlTree():

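    def __init__(self):
        # xml_compare() logs through self.logger, so it must be created here
        self.logger = logging.getLogger('xml_compare')
        self.logger.setLevel(logging.DEBUG)
        self.hdlr = logging.FileHandler('xml-comparison.log')
        self.formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
        self.hdlr.setFormatter(self.formatter)
        self.logger.addHandler(self.hdlr)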

    @staticmethod
    def convert_string_to_tree( xmlString):

        return ET.fromstring(xmlString)

    def xml_compare(self, x1, x2, excludes=[]):
        """
        Compares two xml etrees
        :param x1: the first tree
        :param x2: the second tree
        :param excludes: list of string of attributes to exclude from comparison
        :return:
            True if both files match
        """

        if x1.tag != x2.tag:
            self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
            return False
        for name, value in x1.attrib.items():
            if not name in excludes:
                if x2.attrib.get(name) != value:
                    self.logger.debug('Attributes do not match: %s=%r, %s=%r'
                                 % (name, value, name, x2.attrib.get(name)))
                    return False
        for name in x2.attrib.keys():
            if not name in excludes:
                if name not in x1.attrib:
                    self.logger.debug('x2 has an attribute x1 is missing: %s'
                                 % name)
                    return False
        if not self.text_compare(x1.text, x2.text):
            self.logger.debug('text: %r != %r' % (x1.text, x2.text))
            return False
        if not self.text_compare(x1.tail, x2.tail):
            self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
            return False
        # Element.getchildren() was removed in Python 3.9; list(element) is the replacement
        cl1 = list(x1)
        cl2 = list(x2)
        if len(cl1) != len(cl2):
            self.logger.debug('children length differs, %i != %i'
                         % (len(cl1), len(cl2)))
            return False
        i = 0
        for c1, c2 in zip(cl1, cl2):
            i += 1
            if not c1.tag in excludes:
                if not self.xml_compare(c1, c2, excludes):
                    self.logger.debug('children %i do not match: %s'
                                 % (i, c1.tag))
                    return False
        return True

    def text_compare(self, t1, t2):
        """
        Compare two text strings
        :param t1: text one
        :param t2: text two
        :return:
            True if a match
        """
        if not t1 and not t2:
            return True
        if t1 == '*' or t2 == '*':
            return True
        return (t1 or '').strip() == (t2 or '').strip()

The second code block contains a couple of XML samples and their comparison:

xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)

comparator = XmlTree()

if comparator.xml_compare(tree1, tree2, ["from"]):
    print("XMLs match")
else:
    print("XMLs don't match")

Most of the credit for the code must go to syawar

- danimirror

This code needs some changes to run on Python 3.5, as follows:

def __init__(self):
    self.logger = logging.getLogger('xml_compare')
    self.logger.setLevel(logging.DEBUG)
    self.hdlr = logging.FileHandler('xml-comparison.log', encoding='utf-8') # Change 1: specify the file encoding
    self.formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s') # Change 2: separators between the fields of the format string
    self.hdlr.setLevel(logging.DEBUG)
    self.hdlr.setFormatter(self.formatter)
    self.logger.addHandler(self.hdlr)

- Roy learns to code

1

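Maybe I am missing something, but this method stops checking the XML as soon as it finds a difference. It is a good starting point, but it does not work as expected... - ivoruJavaBoy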
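One possible way to address the point raised in this comment is to collect mismatches in a list instead of returning at the first one. The following is only a rough sketch, not the answer author's code: the function name collect_differences and the path format are made up for illustration, children are paired purely by position, and surrounding whitespace in text is assumed to be insignificant.

```python
import xml.etree.ElementTree as ET


def collect_differences(e1, e2, path=""):
    """Return a list of human-readable differences between two elements.

    Hypothetical helper: it keeps walking the tree and records every
    mismatch instead of stopping at the first one.
    """
    here = f"{path}/{e1.tag}"
    if e1.tag != e2.tag:
        # Different tags: the subtrees are not comparable, so do not descend
        return [f"{here}: tag {e1.tag!r} != {e2.tag!r}"]

    diffs = []
    # Check every attribute name present on either side
    for name in sorted(set(e1.attrib) | set(e2.attrib)):
        v1, v2 = e1.attrib.get(name), e2.attrib.get(name)
        if v1 != v2:
            diffs.append(f"{here}: attribute {name!r}: {v1!r} != {v2!r}")

    # Assumption: leading/trailing whitespace in text is not significant
    t1, t2 = (e1.text or "").strip(), (e2.text or "").strip()
    if t1 != t2:
        diffs.append(f"{here}: text {t1!r} != {t2!r}")

    children1, children2 = list(e1), list(e2)
    if len(children1) != len(children2):
        diffs.append(f"{here}: {len(children1)} children != {len(children2)} children")
    # Children are paired by position; unpaired extras are only reported via the count
    for c1, c2 in zip(children1, children2):
        diffs.extend(collect_differences(c1, c2, here))
    return diffs


xml1 = "<note><to>Tove</to><from>Jani</from><body>Hi</body></note>"
xml2 = "<note><to>Tina</to><from>Daniel</from><body>Hi</body></note>"
for line in collect_differences(ET.fromstring(xml1), ET.fromstring(xml2)):
    print(line)
```

Pairing children by position is a simplification: an element inserted in the middle of a long list will make every later sibling show up as changed, which is the kind of case a dedicated tree-diff tool handles better.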

5

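If your goal is to compare XML content and attributes rather than just the files byte for byte, the problem has some subtleties, so there is no one-size-fits-all solution.

You need to know what matters in your XML files.

Usually the order in which attributes are listed within an element tag does not matter. That is, two XML files that differ only in the order of element attributes should generally be treated as the same.

But that is only the generic part.

The tricky part depends on the application. For example, whitespace formatting of some elements in the file may be insignificant, and may have been added to the XML just to make it readable, and so on.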
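To make these two points concrete, here is a small illustrative sketch using xml.etree.ElementTree. The sample elements are made up, and treating surrounding whitespace as insignificant is an assumption that may not hold for your data.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<item id="1" name="x">  value  </item>')
b = ET.fromstring('<item name="x" id="1">value</item>')

# Attributes are exposed as a dict, so their order in the file does not affect equality
same_attributes = a.attrib == b.attrib

# Whether surrounding whitespace matters is an application decision;
# here it is assumed to be insignificant and is stripped before comparing
same_text = (a.text or "").strip() == (b.text or "").strip()

print(same_attributes, same_text)
```

A plain text comparison of the two strings would report them as different, even though the parsed content is the same under these assumptions.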

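Recent versions of the ElementTree module have a canonicalize() function that handles the simple cases by putting the XML string into a canonical form.

I used this function in the unit tests of a recent project to compare known XML output against the output of a package that sometimes changes the attribute order. In that case, whitespace in text elements was insignificant but was sometimes used for formatting.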

import xml.etree.ElementTree as ET
def _canonicalize_XML( xml_str ):
    """ Canonicalizes XML strings, so they are safe to 
        compare directly. 
        Strips white space from text content."""

    if not hasattr( ET, "canonicalize" ):
        raise Exception( "ElementTree missing canonicalize()" )

    root = ET.fromstring( xml_str )
    rootstr = ET.tostring( root )
    return ET.canonicalize( rootstr, strip_text=True )

Use it like this:

file1 = ET.parse('file1.xml')
file2 = ET.parse('file2.xml')

canon1 = _canonicalize_XML( ET.tostring( file1.getroot() ) )
canon2 = _canonicalize_XML( ET.tostring( file2.getroot() ) )

print( canon1 == canon2 )

In my distribution, Python 2 does not have the canonicalize() function, but Python 3 does.
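The equality check above only says whether the files differ, not where. One way to also see which part changed might be to re-indent the canonical form and run it through difflib, as in the sketch below. This is not part of the answer's code: the helper name canonical_lines and the sample strings are invented for illustration, and ET.indent() requires Python 3.9 or later.

```python
import difflib
import xml.etree.ElementTree as ET


def canonical_lines(xml_text):
    """Canonicalize XML, then re-indent it so each element sits on its own line.

    Hypothetical helper; whitespace around text is assumed to be insignificant.
    """
    canon = ET.canonicalize(xml_text, strip_text=True)
    root = ET.fromstring(canon)
    ET.indent(root)  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode").splitlines()


xml1 = '<note b="2" a="1"><to>Tove</to><from>Jani</from></note>'
xml2 = '<note a="1" b="2"><to>Tove</to><from>Daniel</from></note>'

diff = list(difflib.unified_diff(
    canonical_lines(xml1), canonical_lines(xml2),
    fromfile="file1.xml", tofile="file2.xml", lineterm=""))
print("\n".join(diff))
```

Because canonicalization sorts the attributes first, the differing attribute order on the note element does not appear in the diff; only the changed from element does.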

- Steve White

This is a great answer; I think it should be the accepted one in 2023. - djanowski

1

Another script using xml.etree. It's ugly, but it works :)

#!/usr/bin/env python

import sys
import xml.etree.ElementTree as ET

from termcolor import colored

tree1 = ET.parse(sys.argv[1])
root1 = tree1.getroot()

tree2 = ET.parse(sys.argv[2])
root2 = tree2.getroot()

class Element:
    def __init__(self,e):
        self.name = e.tag
        self.subs = {}
        self.atts = {}
        for child in e:
            self.subs[child.tag] = Element(child)

        for att in e.attrib.keys():
            self.atts[att] = e.attrib[att]

        print("name: %s, len(subs) = %d, len(atts) = %d" % ( self.name, len(self.subs), len(self.atts) ))

    def compare(self,el):
        if self.name!=el.name:
            raise RuntimeError("Two names are not the same")
        print("----------------------------------------------------------------")
        print(self.name)
        print("----------------------------------------------------------------")
        for att in self.atts.keys():
            v1 = self.atts[att]
            if att not in el.atts.keys():
                v2 = '[NA]'
                color = 'yellow'
            else:
                v2 = el.atts[att]
                if v2==v1:
                    color = 'green'
                else:
                    color = 'red'
            print(colored("first:\t%s = %s" % ( att, v1 ), color))
            print(colored("second:\t%s = %s" % ( att, v2 ), color))

        for subName in self.subs.keys():
            if subName not in el.subs.keys():
                # termcolor has no 'purple'; 'magenta' is the closest supported color
                print(colored("first:\thas got %s" % ( subName), 'magenta'))
                print(colored("second:\thasn't got %s" % ( subName), 'magenta'))
            else:
                self.subs[subName].compare( el.subs[subName] )



e1 = Element(root1)
e2 = Element(root2)

e1.compare(e2)

- psilouette

- Nick Bastin · Accepted Answer

12

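This is actually a fairly challenging problem (because what counts as a "difference" is often in the eye of the beholder: there will be semantically "equivalent" information that you probably don't want flagged as a difference).

You could try xmldiff, which is based on the work in the paper Change Detection in Hierarchically Structured Information.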

- Nick Bastin

1

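xmldiff is GPL-licensed. Does that mean that if I use it, I have to open-source my code? - guettli

2

Necro reply for reference: the GPL means you must provide the source code to users if they ask for it. It does not mean you must publish it to everyone (nor for free), and you can always impose additional restrictions on users through a separate contract. - Giacomo Lacava

@guettli Also, note that since it is under the LGPL (a GPL-family license), all code you open-source that uses xmldiff must also be under the LGPL. If you don't plan to open-source your project, just use it. - Dmytro Chasovskyi

1

Also worth mentioning: since version 2.0, xmldiff has moved to the MIT license, which should avoid licensing headaches. - Dmytro Chasovskyi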