使用列表和字典在Python中优化具有数字比较的循环效率

Question

使用列表和字典在Python中优化具有数字比较的循环效率

3

我有一个包含整数的列表：candidates = [1, 2, 3, 4, 5, 16, 20]。这个列表可能包含超过100万个项。

我有一个字典number_ranges，它的键是一个整数，值是一个包含最小值和最大值范围对象的列表。当前该字典包含约500k个键。

{
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
}

我现在正在遍历这个列表：

for candidate in candidates:
    number = search_in_range(candidate, number_ranges)

我检查number_ranges中的范围是否有与candidates相匹配的数字，如果有，就返回将在后续使用的键。

def search_in_range(candidate, number_ranges):
    for number_range_key in number_ranges:
        for number in number_ranges[number_range_key]:
            if int(number['start']) <= candidate <= int(number['end']):
                return {"key": number_range_key, "candidate": candidate}

当我运行这个程序时，我发现处理1000个数字需要大约40秒的时间。这意味着如果我有100万个数字，我需要超过11个小时才能完成处理。

('2018-12-19 16:22:47', 'Read', 1000)
('2018-12-19 16:23:30', 'Read', 2000)
('2018-12-19 16:24:10', 'Read', 3000)
('2018-12-19 16:24:46', 'Read', 4000)
('2018-12-19 16:25:26', 'Read', 5000)
('2018-12-19 16:25:59', 'Read', 6000)
('2018-12-19 16:26:39', 'Read', 7000)
('2018-12-19 16:27:28', 'Read', 8000)
('2018-12-19 16:28:15', 'Read', 9000)
('2018-12-19 16:28:57', 'Read', 10000)

期望的输出是从 number_ranges 中返回匹配范围内和用于查找该键的candidate 数字的键，即在函数 search_in_range 中返回 {"key": number_range_key, "candidate": candidate}。

在Python中，有哪些推荐的方法来优化此算法？

- drjz

3

请问您能否编辑这个问题，展示您期望的输出结果（针对给定的数据），并解释您是如何得出这个结果的？ - cs95

也许可以预处理字典，仅保留不重叠的情况。就像你的例子中 key:16 只需要检查它是否在 15 到 20 之间一次即可。此外，如果你有一个键包含了 15 到 18 和 18 到 25 的情况，那么将它们合并成 15 到 25。 - CodeCollector

1

你的格式似乎不匹配。你确定number_ranges是一个字典吗？你把它写成了一个字典列表，并在代码中将其作为列表处理（"for number_range_key in number_ranges"而不是"for number_range_key in number_ranges.keys()"）。 - Stuart Buckingham

1

@StuartBuckingham 直接迭代 dict 对象会直接迭代键。实际上，在 Python 2 中，for key in my_dict.keys() 是一种反模式，而在 Python 3 中我仍然认为它是多余的。 - juanpa.arrivillaga

@juanpa.arrivillaga 你说得对。我认为 for number_range_key, number_range_value in number_ranges.items(): 更好，因为它可以避免另一个 O(1) 的查找。 - Stuart Buckingham

3个回答

2

稍加重组后，您的代码就成为了一个经典的区间树问题。

看一下这个包https://pypi.org/project/intervaltree/ 与普通的区间树唯一的不同之处在于您有一些覆盖多个间隔的项，但是将它们分成单个间隔会非常容易，例如{16.1：{"start"：15，"end"：20}，16.2：{"start"：16，"end"：18}}

通过使用intervaltree包，创建了一个平衡的二叉搜索树，比使用嵌套for循环更有效率。此解决方案对于搜索每个候选者的时间复杂度为O（logn），而for循环的时间复杂度为O（n）。如果有100万以上的候选者，则intervaltree包将比接受的嵌套for循环答案快得多。

- Stuart Buckingham

0

尽管这个问题已经有了一个被接受的答案，但我想为其他人补充一点，这种情况确实值得创建反向查找。这是一次性的头疼，随着候选列表的增长，将节省大量实际时间。字典查找是O(1)的，如果你需要执行多个查找，你应该考虑创建一个反向映射。

number_ranges = [
    {5: [{"start": 0, "end": 9}]},
    {16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]}
]

from collections import defaultdict

reversed_number_ranges = defaultdict(set) #returns an empty set, avoiding key errors.


for number in number_ranges:
    for k,v in number.items(): 
        ranges = set() #create a set of values which fall within range
        for range_dict in v:
            ranges.update(range(range_dict["start"], range_dict["end"] + 1)) #assuming "end" is included. remove the +1 for right exclusive.
        for i in ranges:
            reversed_number_ranges[i].add(k) #add the key for each location in a range.


candidates = [1, 2 ,3, 4 , 5, 16, 20]

for candidate in candidates:
    print(candidate, reversed_number_ranges[candidate])

输出：

1 {5}
2 {5}
3 {5}
4 {5}
5 {5}
16 {16}
20 {16}

- Paritosh Singh

1

很好，但是如果起点和终点之间的差距很大，会有很大的问题。最好与二分法结合使用。 - tobias_k

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- tobias_k · Accepted Answer

你的候选人列表已经排序，所以相反地：循环遍历number_ranges中的字典，并使用bisect来二分搜索匹配的候选人。这将把复杂度从O(n*m)降低到O(n*logm*k)，其中n表示字典数量，m表示候选人数量，k表示平均匹配的候选人数量。

（注意：我改变了你的number_ranges的格式，从一个只有单个元素的set和dict组成的集合，改为更有意义的dict）

candidates = [1, 2, 3, 4, 5, 16, 20]
number_ranges = {
    5: [{"start": 0, "end": 9}],
    16: [{"start": 15, "end": 20}, {"start": 16, "end": 18}]
}

import bisect

for key, values in number_ranges.items():
    for value in values:
        start, end = value["start"], value["end"]
        lower = bisect.bisect_left(candidates, start)
        upper = bisect.bisect_right(candidates, end)
        for cand in range(lower, upper):
            res = {"key": key, "candidate": candidates[cand]}
            print(value, res)

输出：

{'start': 0, 'end': 9} {'key': 5, 'candidate': 1}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 2}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 3}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 4}
{'start': 0, 'end': 9} {'key': 5, 'candidate': 5}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 16}
{'start': 15, 'end': 20} {'key': 16, 'candidate': 20}
{'start': 16, 'end': 18} {'key': 16, 'candidate': 16}

如果实际上未对candidates进行排序，或者您希望结果按照候选人而不是按字典顺序排序，则可以将其作为预处理或后处理步骤进行排序。