Python中快速数据比较

Question

Python中快速数据比较

3

我希望能够比对两个长度不同的字典形式的大量数据。

post = {0: [0.96180319786071777, 0.37529754638671875], 
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881],...}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122],...}

我们需要获得的答案是5065：[0.80861318111419678，0.76381617784500122]。这是因为我们只比较值而不涉及索引。

我只是使用这个键-值对来记住数据的顺序。如果需要，数据类型可以替换为列表/集合。我需要找到那些不共同存在于字典中的元素的键:值（索引和值）对。

我正在使用非常简单的代码...

new = {}
found = []

for i in range(0, len(post)): 
    found= []
    for j in range(0, len(pre)): 
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
                found.append(j)             
                break
    if found:
        for f in found: pre.pop(f)

new {} 包含我需要的元素。

我面临的问题是这个过程太慢了。有时需要超过一个小时的时间来处理数据。有时数据可以更大。我需要它更快。

是否有一种有效的方法来实现我的目标？如果可能的话，我希望我们不依赖除了Python 2.5（64位）捆绑的包之外的其他外部包。

谢谢大家。

- sbetharia

2个回答

1

您的问题可能在于嵌套的for循环与使用range()，这会每次创建一个新列表，导致速度变慢。通过直接迭代pre和post并避免嵌套，您可能会获得一些自动加速。

post = {0: [0.96180319786071777, 0.37529754638671875],
       10: [0.20612385869026184, 0.17849941551685333],
       20: [0.20612400770187378, 0.17510984838008881]}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
    5065: [0.80861318111419678, 0.76381617784500122]}

'''Create sets of values, independent of dict key for O(1) lookup'''
post_set=set(map(tuple, post.values()))
pre_set=set(map(tuple, pre.values()))

'''Iterate through each structure only once, filtering items that are found in
   the sets we created earlier, updating new_diff'''
from itertools import ifilterfalse

new_diff=dict(ifilterfalse(lambda x: tuple(x[1]) in pre_set, post.items()))
new_diff.update(ifilterfalse(lambda x: tuple(x[1]) in post_set, pre.items()))

new_diff 现在是一个 dict，其中每个值都不在 post 和 pre 中找到，同时保留原始索引。

>>> print new_diff
{5065: [0.80861318111419678, 0.76381617784500122]}

- Austin Marshall

set(post.items())

错误：TypeError: 不可哈希类型：'list'

- sbetharia

在我的情况下，字典 a = dict(a=['A','B'], b=['C','D'])，我只需要比较这 2 个字典的值。你的示例代码会给出以下错误：set(post.items())

错误：TypeError: unhashable type: 'list'

- sbetharia

啊，我之前没注意到。我已经更新了我的答案来弥补。 - Austin Marshall

我不确定索引数是否也参与了比较。我们只需要比较值，而不是索引数。我会在问题中更新我的示例。 - sbetharia

当我们转换为（key，tuple（value））时，索引号也会参与比较，对吗？如果我错了，请纠正我。 - sbetharia

你的帖子特别说明了“我需要找出那些在字典中不共有的元素的键值（索引和值）对。” - Austin Marshall

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Amber · Accepted Answer

这基本上就是 set 的设计目的（计算项目集合之间的差异）。唯一需要注意的是，放入 set 中的元素需要是可哈希的，而 list 不是。然而，tuple 是可哈希的，因此如果你转换为 tuple，就可以将其放入 set 中：

post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())

items_in_only_one_set = post_set ^ pre_set

关于 set 的更多信息请参见：http://docs.python.org/library/stdtypes.html#set

在计算差异后获取原始索引，您可能需要生成反向查找表：

post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())

然后，您只需通过字典查找给定元组的索引：

index = post_indices.get(a_tuple, pre_indices.get(a_tuple))