Python：快速大整数键字典

Question

Python：快速大整数键字典

pythonoptimizationdictionaryintbiginteger

3

我有一个包含>10,000个整数项的列表。这些项的值可能非常高，高达10^27。现在我想创建所有项的配对并计算它们的和。然后我想寻找具有相同和的不同配对。

例如：

l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...

pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...

pairs[7]的内容是我要找的，它给出了两个相同值和的数对。

我已经按照以下方式实现了它，但我想知道是否可以更快地完成。目前，在一台快速计算机上，对于10,000个项目，需要超过6小时的时间。（正如我所说，l的值以及pairs的键是int类型，最大可达到10^27）

l = [4,3,6,1]
pairs = {}
for i in range( len( l  )  ):
    for j in range(i+1, len( l ) ):
        s = l[i] + l[j]
        if not s in pairs:
            pairs[s] = []
        pairs[s].append((i,j))

# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}

编辑：我想添加一些背景信息，如Simon Stelling所要求的。

目标是寻找形式类比，例如

lays : laid :: says : said

在像单词列表这样的列表中

[ lays, lay, laid, says, said, foo, bar ... ]

我已经有一个函数analogy(a,b,c,d)，如果a:b::c:d则返回True。然而，我需要检查从列表中创建的所有可能的四元组，这将是大约O((n^4)/2)的复杂度。

作为预处理器，我想使用字符计数属性。它表示每个字符在(a,d)和(b,c)中具有相同的计数。例如，在“layssaid”中我们有2个a，而在“laidsays”中也是如此。

因此，目前的想法是：

为每个单词创建一个“字符计数向量”，并将其表示为整数（列表中的项l）
创建pairs中的所有配对，并查看是否存在“配对簇”，即特定字符计数向量总和的多个配对。

它可以工作，但速度较慢。复杂度降至O((n^2)/2)，但仍然很高，尤其是字典查找和插入操作太频繁了。

- Georg Jähnig

3

问题不在于整数的大小，而是你有一个双重嵌套循环涵盖了所有项目，这是O(n^2)的。 - blubb

你真的等了6个小时直到它完成了吗？不过我建议看一下PEP8 :) - Ant

1

同时，range返回一个包含所有展开项的完整列表，请使用xrange代替。 - Fredrik Pihl

是的，又一个来自P？NP类的任务。 - Zaur Nasibov

4个回答

1

只是一个提示。请查看itertools.combinations。

这不完全符合您的要求（因为它存储的是值对，而不是索引对），但它可以作为起始代码。

from itertools import combinations
for (a, b) in combinations(l, 2):
    pairs.setdefault(a + b, []).append((a, b))

- Don

0

最终，我想出了自己的解决方案，平均只需要计算时间的一半。

基本思路：不是每次都读写增长的字典n^2次，而是首先将所有的总和收集到一个列表中。然后对列表进行排序，在排序后的列表中查找相邻的相同项。

这是代码：

from operator import itemgetter

def getPairClusters( l ):

    # first, we just store all possible pairs sequentially
    # clustering will happen later
    pairs = []

    for i in xrange( len( l)  ):
        for j in xrange(i+1, len( l ) ):
            pair = l[i] + l[j]
            pairs.append( ( pair, i, j ) )
    pairs.sort(key=itemgetter(0))

    # pairs = [ (4, 1, 3), (5, 0, 3), (7, 0, 1), (7, 2, 3), (9, 1, 2), (10, 0, 2)]
    # a list item of pairs now contains a tuple (like (4, 1, 3)) with
    # * the sum of two l items: 4
    # * the index of the two l items: 1, 3

    # now clustering starts
    # we want to find neighbouring items as
    # (7, 0, 1), (7, 2, 3)
    # (since 7=7)

    pairClusters = []

    # flag if we are within a cluster
    # while iterating over pairs list
    withinCluster = False

            # iterate over pair list
    for i in xrange(len(pairs)-1):
        if not withinCluster:
            if pairs[i][0] == pairs[i+1][0]:
                # if not within a cluster
                # and found 2 neighbouring same numbers:
                # init new cluster
                pairCluster = [ ( pairs[i][1], pairs[i][2] ) ]
                withinCluster = True
        else:
            # if still within cluster
            if pairs[i][0] == pairs[i+1][0]:
                pairCluster.append( ( pairs[i][1], pairs[i][2] ) )
            # else cluster has ended
            # (next neighbouring item has different number)
            else:
                pairCluster.append( ( pairs[i][1], pairs[i][2] ) )
                pairClusters.append(pairCluster)
                withinCluster = False

    return pairClusters

l = [4,3,6,1]

print getPairClusters(l)

- Georg Jähnig

0

上面SimonStelling的评论是正确的；生成所有可能的对是基本上很慢的，除了改变你的算法之外，你无法做任何事情。从itertools中使用的正确函数是product；你可以通过不创建额外的变量或执行不必要的列表索引来获得一些小的改进，但在底层这些仍然都是O(n^2)。以下是我会怎么做的：

from itertools import product
l = [4,3,6,1]
pairs = {}
for (m,n) in product(l,repeat=2):
    pairs.setdefault(m+n, []).append((m,n))

- glyphobet

正确的itertools函数是combinations()，正如Don所建议的那样。product()会给出太多结果，例如对于一对(4,1)和(1,4)。 - Georg Jähnig

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- blubb · Accepted Answer

有些琐碎的优化技巧，例如将常量值缓存到本地变量中并使用xrange而不是range：

pairs = {}
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i+1, len_l):
        s = l[i] + l[j]
        res = pairs.setdefault(s, [])
        res.append((i,j))

然而，更明智的做法可能是不要预先计算列表，而是在概念层面上优化方法。您想要实现的内在目标是什么？您真的只想计算您的操作吗？还是您将使用该结果进行其他操作？那个其他操作是什么？