两个字典的递归差异(键和值)?

49

我有一个Python字典,称为d1,以及稍后时间的该字典版本,称为d2。我想找到d1d2之间的所有更改,也就是添加、删除或更改的所有内容。棘手的部分是,值可以是整数、字符串、列表或字典,因此需要进行递归处理。以下是目前我的代码:

def dd(d1, d2, ctx=""):
    print "Changes in " + ctx
    for k in d1:
        if k not in d2:
            print k + " removed from d2"
    for k in d2:
        if k not in d1:
            print k + " added in d2"
            continue
        if d2[k] != d1[k]:
            if type(d2[k]) not in (dict, list):
                print k + " changed in d2 to " + str(d2[k])
            else:
                if type(d1[k]) != type(d2[k]):
                    print k + " changed to " + str(d2[k])
                    continue
                else:
                    if type(d2[k]) == dict:
                        dd(d1[k], d2[k], k)
                        continue
    print "Done with changes in " + ctx
    return

除非值为列表,否则它工作得很好。 我无法想出一个优雅的处理列表的方法,而不必在if(type(d2) == list)之后重复这个函数的大型、稍微更改版本。

有什么想法吗?

编辑:这与此帖子不同,因为键可以更改。


如果它们在两个不同的字典中具有相同的键,我认为:1个被移除;8个添加(在相同的键下)。如果它们在不同的键下,则它们是不同的元素。 - Alex
这可能很快变得棘手。顺序重要吗?如果将 8 移到开头:[8, 1, 2, 3, 4, 5, 6, 7],顺序是否重要,或者只考虑存在/不存在(集合)?列表中是否可以包含嵌套字典,该字典又包含一个列表等等? - samplebias
你能给一个输出失败的例子吗? - Nick ODell
@samplebias:是的。列表可以包含字典,而字典可以包含……这就是无穷无尽的循环。我现在并不真正需要元组,但这并没有什么帮助。 - Alex
@lafrasu:我正在将这些字典存储在mongodb中,应用程序需要能够将嵌套对象推入对象中。 - Alex
显示剩余3条评论
10个回答

77

如果您想要递归查找差异,我已经为Python编写了一个包:

https://github.com/seperman/deepdiff

安装

从PyPi安装:

pip install deepdiff

使用示例

导入

>>> from deepdiff import DeepDiff
>>> from pprint import pprint
>>> from __future__ import print_function # In case running on Python 2

同一对象返回空

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = t1
>>> print(DeepDiff(t1, t2))
{}

物品类型已更改

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:"2", 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{ 'type_changes': { 'root[2]': { 'newtype': <class 'str'>,
                                 'newvalue': '2',
                                 'oldtype': <class 'int'>,
                                 'oldvalue': 2}}}

物品的价值已经发生改变

>>> t1 = {1:1, 2:2, 3:3}
>>> t2 = {1:1, 2:4, 3:3}
>>> pprint(DeepDiff(t1, t2), indent=2)
{'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}

项目已添加和/或删除

>>> t1 = {1:1, 2:2, 3:3, 4:4}
>>> t2 = {1:1, 2:4, 3:3, 5:5, 6:6}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff)
{'dic_item_added': ['root[5]', 'root[6]'],
 'dic_item_removed': ['root[4]'],
 'values_changed': {'root[2]': {'newvalue': 4, 'oldvalue': 2}}}

字符串差异

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world"}}
>>> t2 = {1:1, 2:4, 3:3, 4:{"a":"hello", "b":"world!"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { 'root[2]': {'newvalue': 4, 'oldvalue': 2},
                      "root[4]['b']": { 'newvalue': 'world!',
                                        'oldvalue': 'world'}}}

字符串差异 2

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world!\nGoodbye!\n1\n2\nEnd"}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n1\n2\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'values_changed': { "root[4]['b']": { 'diff': '--- \n'
                                                '+++ \n'
                                                '@@ -1,5 +1,4 @@\n'
                                                '-world!\n'
                                                '-Goodbye!\n'
                                                '+world\n'
                                                ' 1\n'
                                                ' 2\n'
                                                ' End',
                                        'newvalue': 'world\n1\n2\nEnd',
                                        'oldvalue': 'world!\n'
                                                    'Goodbye!\n'
                                                    '1\n'
                                                    '2\n'
                                                    'End'}}}

>>> 
>>> print (ddiff['values_changed']["root[4]['b']"]["diff"])
--- 
+++ 
@@ -1,5 +1,4 @@
-world!
-Goodbye!
+world
 1
 2
 End

类型更改

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":"world\n\n\nEnd"}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'type_changes': { "root[4]['b']": { 'newtype': <class 'str'>,
                                      'newvalue': 'world\n\n\nEnd',
                                      'oldtype': <class 'list'>,
                                      'oldvalue': [1, 2, 3]}}}

列表差异

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3, 4]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{'iterable_item_removed': {"root[4]['b'][2]": 3, "root[4]['b'][3]": 4}}

列表差异2:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'iterable_item_added': {"root[4]['b'][3]": 3},
  'values_changed': { "root[4]['b'][1]": {'newvalue': 3, 'oldvalue': 2},
                      "root[4]['b'][2]": {'newvalue': 2, 'oldvalue': 3}}}

忽略顺序或重复项的列表差异:(与上面相同的字典)

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, 3]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 3, 2, 3]}}
>>> ddiff = DeepDiff(t1, t2, ignore_order=True)
>>> print (ddiff)
{}

包含字典的列表:

>>> t1 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:1, 2:2}]}}
>>> t2 = {1:1, 2:2, 3:3, 4:{"a":"hello", "b":[1, 2, {1:3}]}}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (ddiff, indent = 2)
{ 'dic_item_removed': ["root[4]['b'][2][2]"],
  'values_changed': {"root[4]['b'][2][1]": {'newvalue': 3, 'oldvalue': 1}}}

集合:

>>> t1 = {1, 2, 8}
>>> t2 = {1, 2, 3, 5}
>>> ddiff = DeepDiff(t1, t2)
>>> pprint (DeepDiff(t1, t2))
{'set_item_added': ['root[3]', 'root[5]'], 'set_item_removed': ['root[8]']}
< p>命名元组:

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> t1 = Point(x=11, y=22)
>>> t2 = Point(x=11, y=23)
>>> pprint (DeepDiff(t1, t2))
{'values_changed': {'root.y': {'newvalue': 23, 'oldvalue': 22}}}

自定义对象:

>>> class ClassA(object):
...     a = 1
...     def __init__(self, b):
...         self.b = b
... 
>>> t1 = ClassA(1)
>>> t2 = ClassA(2)
>>> 
>>> pprint(DeepDiff(t1, t2))
{'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}

新增对象属性:

>>> t2.c = "new attribute"
>>> pprint(DeepDiff(t1, t2))
{'attribute_added': ['root.c'],
 'values_changed': {'root.b': {'newvalue': 2, 'oldvalue': 1}}}

感谢 @LukasN.P.Egger。 - Seperman
@MohitC 它不兼容 Python 2.6。(在 Github 存储库顶部,它说明了与哪些版本兼容)。你为什么要使用 Python 2.6? - Seperman
@Seperman,有没有DeepDiff(t1,t2).is_equal方法,还是我需要执行str(DeepDiff(t1,t2))==“ {}”?我只需要知道它们是否相等... - Shoham
1
绝对是救命稻草,谢谢! - undefined
1
非常欢迎你,@Nahko! - undefined
显示剩余8条评论

22

这是一个受到温斯顿·尤尔特启发的实现。

def recursive_compare(d1, d2, level='root'):
    if isinstance(d1, dict) and isinstance(d2, dict):
        if d1.keys() != d2.keys():
            s1 = set(d1.keys())
            s2 = set(d2.keys())
            print('{:<20} + {} - {}'.format(level, s1-s2, s2-s1))
            common_keys = s1 & s2
        else:
            common_keys = set(d1.keys())

        for k in common_keys:
            recursive_compare(d1[k], d2[k], level='{}.{}'.format(level, k))

    elif isinstance(d1, list) and isinstance(d2, list):
        if len(d1) != len(d2):
            print('{:<20} len1={}; len2={}'.format(level, len(d1), len(d2)))
        common_len = min(len(d1), len(d2))

        for i in range(common_len):
            recursive_compare(d1[i], d2[i], level='{}[{}]'.format(level, i))

    else:
        if d1 != d2:
            print('{:<20} {} != {}'.format(level, d1, d2))

if __name__ == '__main__':
    d1={'a':[0,2,3,8], 'b':0, 'd':{'da':7, 'db':[99,88]}}
    d2={'a':[0,2,4], 'c':0, 'd':{'da':3, 'db':7}}

    recursive_compare(d1, d2)

将返回:

root                 + {'b'} - {'c'}
root.a               len1=4; len2=3
root.a[2]            3 != 4
root.d.db            [99, 88] != 7
root.d.da            7 != 3

如果值是包含不同元素顺序的字典列表,那么这将失败,对吗? - Konstantin

10

一种选择是将遇到的任何列表转换为带有索引作为键的字典。例如:

# add this function to the same module
def list_to_dict(l):
    return dict(zip(map(str, range(len(l))), l))

# add this code under the 'if type(d2[k]) == dict' block
                    elif type(d2[k]) == list:
                        dd(list_to_dict(d1[k]), list_to_dict(d2[k]), k)

以下是您在评论中提供的示例字典的输出:

>>> d1 = {"name":"Joe", "Pets":[{"name":"spot", "species":"dog"}]}
>>> d2 = {"name":"Joe", "Pets":[{"name":"spot", "species":"cat"}]}
>>> dd(d1, d2, "base")
Changes in base
Changes in Pets
Changes in 0
species changed in d2 to cat
Done with changes in 0
Done with changes in Pets
Done with changes in base

注意,这将按索引逐个比较,因此需要进行一些修改才能使其对于添加或删除的列表项正常工作。


8

一个想法:您可以尝试使用面向对象的方法,派生自己的字典类来跟踪任何对其进行的更改(并报告它们)。这似乎比尝试比较两个字典具有许多优点...其中一个在末尾提到。

为了展示如何完成这项工作,这里是一个相当完整且经过最小测试的示例实现,应该可以在Python 2和3中使用:

import sys

_NUL = object()  # unique object

if sys.version_info[0] > 2:
    def iterkeys(d, **kw):
        return iter(d.keys(**kw))
else:
    def iterkeys(d, **kw):
        return d.iterkeys(**kw)


class TrackingDict(dict):
    """ Dict subclass which tracks all changes in a _changelist attribute. """
    def __init__(self, *args, **kwargs):
        super(TrackingDict, self).__init__(*args, **kwargs)
        self.clear_changelist()
        for key in sorted(iterkeys(self)):
            self._changelist.append(AddKey(key, self[key]))

    def clear_changelist(self):  # additional public method
        self._changelist = []

    def __setitem__(self, key, value):
        modtype = ChangeKey if key in self else AddKey
        super(TrackingDict, self).__setitem__(key, value)
        self._changelist.append(modtype(key, self[key]))

    def __delitem__(self, key):
        super(TrackingDict, self).__delitem__(key)
        self._changelist.append(RemoveKey(key))

    def clear(self):
        deletedkeys = self.keys()
        super(TrackingDict, self).clear()
        for key in sorted(deletedkeys):
            self._changelist.append(RemoveKey(key))

    def update(self, other=_NUL):
        if other is not _NUL:
            otherdict = dict(other)  # convert to dict if necessary
            changedkeys = set(k for k in otherdict if k in self)
            super(TrackingDict, self).update(other)
            for key in sorted(iterkeys(otherdict)):
                if key in changedkeys:
                    self._changelist.append(ChangeKey(key, otherdict[key]))
                else:
                    self._changelist.append(AddKey(key, otherdict[key]))

    def setdefault(self, key, default=None):
        if key not in self:
            self[key] = default  # will append an AddKey to _changelist
        return self[key]

    def pop(self, key, default=_NUL):
        if key in self:
            ret = self[key]  # save value
            self.__delitem__(key)
            return ret
        elif default is not _NUL:  # default specified
            return default
        else:  # not there & no default
            self[key]  # allow KeyError to be raised

    def popitem(self):
        key, value = super(TrackingDict, self).popitem()
        self._changelist.append(RemoveKey(key))
        return key, value

# change-tracking record classes

class DictMutator(object):
    def __init__(self, key, value=_NUL):
        self.key = key
        self.value = value
    def __repr__(self):
        return '%s(%r%s)' % (self.__class__.__name__, self.key,
                             '' if self.value is _NUL else ': '+repr(self.value))

class AddKey(DictMutator): pass
class ChangeKey(DictMutator): pass
class RemoveKey(DictMutator): pass

if __name__ == '__main__':
    import traceback
    import sys

    td = TrackingDict({'one': 1, 'two': 2})
    print('changelist: {}'.format(td._changelist))

    td['three'] = 3
    print('changelist: {}'.format(td._changelist))

    td['two'] = -2
    print('changelist: {}'.format(td._changelist))

    td.clear()
    print('changelist: {}'.format(td._changelist))

    td.clear_changelist()

    td['newkey'] = 42
    print('changelist: {}'.format(td._changelist))

    td.setdefault('another') # default None value
    print('changelist: {}'.format(td._changelist))

    td.setdefault('one more', 43)
    print('changelist: {}'.format(td._changelist))

    td.update(zip(('another', 'one', 'two'), (17, 1, 2)))
    print('changelist: {}'.format(td._changelist))

    td.pop('newkey')
    print('changelist: {}'.format(td._changelist))

    try:
        td.pop("won't find")
    except KeyError:
        print("KeyError as expected:")
        traceback.print_exc(file=sys.stdout)
    print('...and no change to _changelist:')
    print('changelist: {}'.format(td._changelist))

    td.clear_changelist()
    while td:
        td.popitem()
    print('changelist: {}'.format(td._changelist))

注意,与字典的简单比较不同,这个类会告诉你哪些键被添加了然后又被删除了——换句话说,它保留完整的历史记录,直到其_changelist被清除。

输出:

changelist: [AddKey('one': 1), AddKey('two': 2)]
changelist: [AddKey('one': 1), AddKey('two': 2), AddKey('three': 3)]
changelist: [AddKey('one': 1), AddKey('two': 2), AddKey('three': 3), ChangeKey('two': -2)]
changelist: [AddKey('one': 1), AddKey('two': 2), AddKey('three': 3), ChangeKey('two': -2), RemoveKey('one'), RemoveKey('three'), RemoveKey('two')]
changelist: [AddKey('newkey': 42)]
changelist: [AddKey('newkey': 42), AddKey('another': None)]
changelist: [AddKey('newkey': 42), AddKey('another': None), AddKey('one more': 43)]
changelist: [AddKey('newkey': 42), AddKey('another': None), AddKey('one more': 43), ChangeKey('another': 17), AddKey('one': 1), AddKey('two': 2)]
changelist: [AddKey('newkey': 42), AddKey('another': None), AddKey('one more': 43), ChangeKey('another': 17), AddKey('one': 1), AddKey('two': 2), RemoveKey('newkey')]
KeyError as expected:
Traceback (most recent call last):
  File "trackingdict.py", line 122, in <module>
    td.pop("won't find")
  File "trackingdict.py", line 67, in pop
    self[key]  # allow KeyError to be raised
KeyError: "won't find"
...and no change to _changelist:
changelist: [AddKey('newkey': 42), AddKey('another': None), AddKey('one more': 43), ChangeKey('another': 17), AddKey('one': 1), AddKey('two': 2), RemoveKey('newkey')]
changelist: [RemoveKey('one'), RemoveKey('two'), RemoveKey('another'), RemoveKey('one more')]

7

你的函数应该先检查其参数的类型,编写函数以便可以处理列表、字典、整数和字符串。这样你就不必重复任何内容,只需递归调用即可。

伪代码:

def compare(d1, d2):
     if d1 and d2 are dicts
            compare the keys, pass values to compare
     if d1 and d2 are lists
            compare the lists, pass values to compare
     if d1 and d2 are strings/ints
            compare them

7

根据Serge的建议,我发现这个解决方案有助于快速地获取两个字典是否“完全匹配”的布尔返回值:

import json

def match(d1, d2):
    return json.dumps(d1, sort_keys=True) == json.dumps(d2, sort_keys=True)

1

在递归访问对象时,考虑使用hasattr(obj, '__iter__')。如果一个对象实现了__iter__方法,则表示该对象可以进行迭代。


0

0

这是一个示例,可以轻松扩展以处理其他Python数据类型:

def deep_compare(a, b) -> bool:
    if type(a) is not type(b): return False

    if type(a) is dict:
        if not deep_compare(list(a.keys()), list(b.keys())): return False
        if not deep_compare(list(a.values()), list(b.values())): return False
    elif isinstance(a, (list, tuple, set)):
        for a_i, b_i in zip(a, b):
            if not deep_compare(a_i, b_i): return False
    else:  # scalar, bool, str
        if a != b: return False

    return True

0
您可以尝试以下简单实现。
def recursive_compare(obj1, obj2):
""" Compare python objects recursively, support type:
"int, float, long, basestring, set, datetime, date, dict, Sequence"

Example:
>>> recursive_compare([1, 2, 3], [1, 2, 3])
>>> True
>>> recursive_compare([1, 2, 3], [1, 2, 4])
>>> False
>>> recursive_compare({'a': 1}, {'a': 2})
>>> False
"""

def _diff(obj1, obj2):
    # exclude type basestring for backward-compatible python2:
    # <str, unicode>
    if type(obj1) != type(obj2) and not isinstance(obj1, basestring):
        return False

    elif isinstance(obj1,
                    (int, float, long, basestring, set, datetime, date)):
        if obj1 != obj2:
            return False

    elif isinstance(obj1, dict):
        keys = obj1.viewkeys() & obj2.viewkeys()
        if obj1 and len(keys) == 0 \
            or keys.difference(set(obj1.keys())) \
                or keys.difference(set(obj2.keys())):
            return False

        for k in keys:
            if _diff(obj1[k], obj2[k]) is False:
                return False

    elif isinstance(obj1, collections.Sequence):
        # require sorted sequence object
        if len(obj1) != len(obj2):
            return False

        for i in range(len(obj1)):
            if _diff(obj1[i], obj2[i]) is False:
                return False

    else:
        raise TypeError('do not support type {} to compare'.format(
            type(obj1)))

return False if _diff(obj1, obj2) is False else True

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接