Finding tuples in a very large list

4

Suppose I have a list of tuples:

tuple_library = [('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l')]

What I want to do is check whether any of the following tuples exist in the tuple library:
search_list = [('a','a','1'), ('m', '1', 'l')]


def search_the_tupple(t_lib, s_list):
    for item in t_lib:
        if item in s_list:
            return item

print(search_the_tupple(tuple_library, search_list))

This code works fine when the tuple library and the search list are small, but as both of them grow it takes much longer to complete.
How do we fix this?

Sort then binary search? - rdas
2
If you need to do this repeatedly, put it into a set. Also, fix the error in your code - as posted it does not run at all. - Patrick Artner
3
If you have a collection of strings, or of tuples containing only hashable objects, a set is the way to go. - Reti43
1
@0 As it stands the code does not run. Putting the larger list into a set makes lookups a constant O(1) instead of the list's O(N). Putting both into sets is faster still - a set intersection gives you all the matches, not just the first one that the fixed code above returns. - Patrick Artner
2
Set intersection is a good option, as long as you don't need frequent indexing and don't care about duplicates. - Niteya Shah
6 Answers

11
  1. Convert tuple_library and search_list to Python sets using set()
  2. Return the intersection of those two sets, i.e. every element that appears in both tuple_library and search_list
tuple_library = [('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l')]
search_list = [('a','a','1'), ('m', '1', 'l')]


def search_the_tupple(t_lib, s_list):
    return set(t_lib).intersection(set(s_list))


print(search_the_tupple(tuple_library, search_list))

This assumes you want to retain the order of elements within each tuple (so that ('m', '1', 'l') will be found, but ('m', 'l', '1') will not).

As a side note: it makes no difference whether you write set(t_lib).intersection(s_list) or do it the other way around.
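
In fact, since intersection() accepts any iterable, the second set() call is optional. A quick check, using the lists defined above:

print(set(tuple_library).intersection(search_list))   # {('m', '1', 'l')}
print(set(search_list).intersection(tuple_library))   # same result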


To get the result from the OP's code, use next((s for s in t_lib if s in the_intersection), None) - Aaron
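
A minimal sketch of that suggestion (the function name search_first_match is illustrative; tuple_library and search_list are as defined in the answer above):

def search_first_match(t_lib, s_list):
    the_intersection = set(t_lib).intersection(s_list)
    # first tuple in library order that is also in the search list, else None
    return next((s for s in t_lib if s in the_intersection), None)

print(search_first_match(tuple_library, search_list))  # ('m', '1', 'l')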

1
You have the following ways of getting an answer:
  1. A plain search using the needle in haystack operator.
  2. Convert your big haystack of tuples to a set and look the needles up in it.
  3. Convert both your needles and your haystack to sets and return the intersection.
As the stats at the end show, approach 1 gets slower and slower as your haystack grows. Approaches 2 and 3 both beat approach 1, with 3 performing best.
In fact, you get the best performance if you convert your haystack to a set just once and then use that set for approach 3 - set intersection - for the lifetime of your program.
I have not included sorting the big list in O(nlogn) and then searching it with bisect in O(mlogn), because that will not do better than the O(m) of approach 3.
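For reference, a minimal sketch of what that sort-and-bisect approach might look like (my illustration; it is not part of the benchmark below):

from bisect import bisect_left

def bisect_search(tuples_to_search, big_tuple_list):
    sorted_haystack = sorted(big_tuple_list)   # O(n log n), done once
    n = len(sorted_haystack)
    found = set()
    for needle in tuples_to_search:            # m lookups, O(log n) each
        i = bisect_left(sorted_haystack, needle)
        if i < n and sorted_haystack[i] == needle:
            found.add(needle)
    return found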
Here is sample code for measuring the relative performance. Plug in your own sample sizes and see how it behaves on your machine configuration.
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
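    # Approach 1: plain linear search of the raw list for each needle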
    ans = set()
    for single_tuple in tuples_to_search:
        if single_tuple in big_tuple_list:
            ans.add(single_tuple)
    return ans

def set_conversion_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
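    # Approach 2: convert the haystack to a set, then do O(1) lookups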
    big_tuple_set = set(big_tuple_list)
    ans = set()
    for single_tuple in tuples_to_search:
        if single_tuple in big_tuple_set:
            ans.add(single_tuple)
    return ans

def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
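    # Approach 3: convert both needles and haystack to sets and intersect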
    tuples_to_search_set = set(tuples_to_search)
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search_set.intersection(big_tuple_set)

def search_in_converted_set(tuples_to_search, big_tuple_set):
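    # Approach 2 with the haystack set already built outside the timed call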
    ans = set()
    for single_tuple in tuples_to_search:
        if single_tuple in big_tuple_set:
            ans.add(single_tuple)
    return ans

def intersect_search_tuples_in_converted_set(tuples_to_search, big_tuple_set):
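    # Approach 3 with the haystack set already built outside the timed call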
    tuples_to_search_set = set(tuples_to_search)
    return tuples_to_search_set.intersection(big_tuple_set)
    
def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
                        data_values[randint(random_lower_limit, random_upper_limit)], 
                        data_values[randint(random_lower_limit, random_upper_limit)], 
                        data_values[randint(random_lower_limit, random_upper_limit)]
                    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)

def convert_list_to_set(big_list):
    return set(big_list)

if __name__ == '__main__':

    test_combinations = [(10, 5), (100, 5), (1000, 5)]

    functions_to_test = [
                        normal_search_multiple_tuples_in_tuples_list,
                        set_conversion_search_tuples_in_tuples_list,
                        intersect_search_tuples_in_tuples_list
                        ]

    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        big_tuple_list, tuples_to_search = get_required_data(big_tuple_list_size, tuples_to_search_size)

        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(tuples_to_search, big_tuple_list)', globals=globals())}''')
        print()

    big_tuple_list = get_big_data(1000000)

    before = time.time()
    big_tuple_set = convert_list_to_set(big_tuple_list)
    print(f'Time taken for one-time job of converting list to set: {time.time() - before}')

    for small_sample_size in [5, 10, 50, 100]:
        tuples_to_search = get_small_data(big_tuple_list, small_sample_size)
        for func in [search_in_converted_set, intersect_search_tuples_in_converted_set]:
            print(f'''Time taken by {func.__name__} for searching {small_sample_size} tuples: {timeit.timeit('func(tuples_to_search, big_tuple_set)', globals=globals())}''')
        print()

On my machine I got the following stats (by default, timeit runs 1M iterations; pass the number argument to change that). For 1M items the memory footprint stayed under 85 MB.

For a run of searching 5 needles in 10 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 1.972483009998541
Time taken by set_conversion_search_tuples_in_tuples_list: 1.5855916219989012
Time taken by intersect_search_tuples_in_tuples_list: 1.448762740001257

For a run of searching 5 needles in 100 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 18.045348233001278
Time taken by set_conversion_search_tuples_in_tuples_list: 6.803115938000701
Time taken by intersect_search_tuples_in_tuples_list: 6.169837320001534

For a run of searching 5 needles in 1000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 177.56795450999925
Time taken by set_conversion_search_tuples_in_tuples_list: 56.36051504599891
Time taken by intersect_search_tuples_in_tuples_list: 46.09082598700115

Time taken for one-time job of converting list to set: 0.17918157577514648
Time taken by search_in_converted_set for searching 5 tuples: 1.241585361000034
Time taken by intersect_search_tuples_in_converted_set for searching 5 tuples: 1.01804721499866

Time taken by search_in_converted_set for searching 10 tuples: 2.011633182002697
Time taken by intersect_search_tuples_in_converted_set for searching 10 tuples: 1.3614019449996704

Time taken by search_in_converted_set for searching 50 tuples: 9.395802918999834
Time taken by intersect_search_tuples_in_converted_set for searching 50 tuples: 5.388387926999712

Time taken by search_in_converted_set for searching 100 tuples: 19.021971608002787
Time taken by intersect_search_tuples_in_converted_set for searching 100 tuples: 12.412382661001175

I have some comments on your benchmark that are too long to fit here; see Update 2 in my answer. - Booboo

0
I would simply split the list of tuples into N sub-lists, where N is the number of processors on my machine, create a multiprocessing pool of size N, and search the N sub-lists in parallel. I would also initialize each process in the pool with a set built from search_list, so that testing whether an element of a sub-list is one of the wanted tuples becomes an O(1) operation (assuming search_list is really larger than 2 elements). The worker function search_the_tuple then checks each tuple of the sub-list it is passed against that set and returns the first match. As soon as the first of the parallel processes returns a non-None result, the remaining processes are terminated.
The question is whether converting a sub-list to a set and performing the intersection, which is done by library routines, is significantly faster than looping over the sub-list and testing set membership, which is mostly Python bytecode.
import multiprocessing as mp


def init_pool(s_list):
    global search_list
    search_list = set(s_list)

def search_the_tuple(t_sub_lib):
    for item in t_sub_lib:
        if item in search_list:
            return item


def split(lst, n):
    # a generator expression to split the library of tuples into n (nearly) equal sub-lists:
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))


def main():

    search_list = [('a','a','1'), ('m', '1', 'l')]

    tuple_library = [
        ('g', 'z', '1'), ('h', '3', 'b'), ('i', 'a', 'l'), ('j', 'z', '1'), ('k', '3', 'b'), ('l', '1', 'l'),
        ('m', 'z', '1'), ('n', '3', 'b'), ('o', '1', 'l'), ('p', 'z', '1'), ('q', '3', 'b'), ('r', '1', 'l'),
        ('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l'), ('d', 'z', '1'), ('e', '3', 'b'), ('f', '1', 'l')
    ]

    n_processors = mp.cpu_count()
    with mp.Pool(n_processors, initializer=init_pool, initargs=(search_list,)) as pool:
        t_sub_libs = split(tuple_library, n_processors)
        for t in pool.imap_unordered(search_the_tuple, t_sub_libs):
            if t is not None:
                print(t)
                break
    # When we exit this block, terminate will be called on pool and
    # all processes will be killed.

# required for Windows:
if __name__ == '__main__':
    main()

Output:

('m', '1', 'l')

Update

I ran a small benchmark to see whether converting the list to a set and performing an intersection pays off. In this benchmark the list has 20,000 elements and the element to be found is placed at the very end, to give the set conversion every advantage:

import timeit

s = set([(19999, 19999), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))

Output:

0.9409163000000262
1.8975496000000476

Yet even here the set-conversion version, s2, was almost twice as slow. But when the value being searched for sits in the middle of the list (the average case), we get:

import timeit

s = set([(10000, 10000), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))

Prints:

0.5094996000000265
1.8004526000001988

As you would expect, s1 now runs almost twice as fast.

Conclusions

  1. You should convert search_list to a set, since it will be searched repeatedly. That may be the only optimization available to you (see the sketch after this list).
  2. You have only said that tuple_library is very large, which is rather unspecific. My benchmarks suggest there is no advantage in converting your tuple_library to a set when it is searched just once.
  3. Although I have shown how to divide and conquer with multiprocessing, it is not at all clear (again, not knowing how large your tuple_library really is; it may not even matter) that multiprocessing will improve performance - it could even hurt. There is overhead in creating a process pool, and overhead in passing the sub-lists to processes that live in different address spaces. You just have to try it and see whether it helps.
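
To make point 1 concrete, a minimal sketch of the pattern being recommended (the set is built once, up front, and reused for every search):

search_set = {('a', 'a', '1'), ('m', '1', 'l')}  # built once, up front

def search_the_tuple(t_lib):
    # O(len(t_lib)) scan with O(1) membership tests against the precomputed set
    for item in t_lib:
        if item in search_set:
            return item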

Update 2

There are problems with @lllrnr101's benchmark. I would have offered this as a comment, but unfortunately it is too long. First, the function normal_search_multiple_tuples_in_tuples_list should be written as:

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t

Note that in the OP's question, the OP returns after finding the first match and makes no attempt to find all matches. And converting at least tuples_to_search to a set is presumably more efficient, since it will be searched repeatedly.
Second, the only two strategies that really need to be compared are normal_search_multiple_tuples_in_tuples_list and intersect_search_tuples_in_tuples_list. The benchmarks of the latter that first convert the list to a set and then search it do not correctly account for the cost of that conversion, and they do not match the actual problem (the printed time for converting the list to a set is the cost of a single conversion, while the other printed times are the cost of 1,000,000 iterations).
Third, the OP specifies that the list in question is very large, whatever that means, but I take it to be larger than 10, 100 and 1000. So larger lists should be benchmarked, with fewer timeit iterations so that the run completes in a reasonable amount of time. That leads to the following benchmark:
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t


def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    tuples_to_search_set = set(tuples_to_search)
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search_set.intersection(big_tuple_set)


def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
                        data_values[randint(random_lower_limit, random_upper_limit)],
                        data_values[randint(random_lower_limit, random_upper_limit)],
                        data_values[randint(random_lower_limit, random_upper_limit)]
                    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)


if __name__ == '__main__':

    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]

    functions_to_test = [
                        normal_search_multiple_tuples_in_tuples_list,
                        intersect_search_tuples_in_tuples_list
                        ]

    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        big_tuple_list, tuples_to_search = get_required_data(big_tuple_list_size, tuples_to_search_size)

        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(tuples_to_search, big_tuple_list)', number=1000, globals=globals())}''')
        print()

Output:

For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.681931
Time taken by intersect_search_tuples_in_tuples_list: 0.5806513

For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.8495027999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.8799418999999999

For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 8.51018
Time taken by intersect_search_tuples_in_tuples_list: 8.115366799999999

There is not a great deal of difference in the times. My benchmark showed that not converting the list to a set was considerably faster than converting it, for the case of looking for an element in the middle of the list. The difference between that benchmark and this one is that mine does not convert the tuples_to_search list to a set inside the timed function; that set is created up front, because the OP knows what they are looking for. If we remove the cost of creating that set from the benchmark, we have:

import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    for t in big_tuple_list:
        if t in tuples_to_search:
            return t


def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search.intersection(big_tuple_set)


def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
                        data_values[randint(random_lower_limit, random_upper_limit)],
                        data_values[randint(random_lower_limit, random_upper_limit)],
                        data_values[randint(random_lower_limit, random_upper_limit)]
                    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)


if __name__ == '__main__':

    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]

    functions_to_test = [
                        normal_search_multiple_tuples_in_tuples_list,
                        intersect_search_tuples_in_tuples_list
                        ]

    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        big_tuple_list, tuples_to_search = get_required_data(big_tuple_list_size, tuples_to_search_size)

        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        s = set(tuples_to_search)
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(s, big_tuple_list)', number=1000000, globals=globals())}''')
        print()

Output:

For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16306289999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.6508408

For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16933260000000006
Time taken by intersect_search_tuples_in_tuples_list: 0.6473307000000001

For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.1777871999999996
Time taken by intersect_search_tuples_in_tuples_list: 0.7130949000000002

You can clearly see that you should not be converting the list to a set. Finally, if that is still not convincing, here is my benchmark in which the set conversion of the tuples we are trying to match is done inside the search functions, and the element of the searched list that matches one of them is its last element. That should give the set-intersection approach its greatest possible advantage:

import timeit


def s1(s, lst):
    s = set(s)
    for i in lst:
        if i in s:
            return i

def s2(s, lst):
    s = set(s)
    s2 = set(lst)
    return s & s2


lst = [(i,i) for i in range(20000)]
s = [(19999, 19999), (21000, 21000)] # match will be last element of lst
# look for a s in lst:
print(timeit.timeit('s1(s, lst)', number=1000, globals=globals()))
print(timeit.timeit('s2(s, lst)', number=1000, globals=globals()))

Output:

0.887862600000517
1.8508867000000464

0
In short: use set. If you can, compute the sets ahead of time.
The first column below is the original approach, the second is binary search, the third is set intersection, and the fourth is set intersection with precomputed sets.
Note the difference between the first and fourth columns at the 10k and 100k sample sizes. Once the sample size exceeds the sample space (at 100k samples and above), the linear search next((s for s in t_lib if s in intersect), None) takes more time, because the input items are no longer unique.
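
The sample-space arithmetic behind that remark (each tuple is 3 characters drawn from lowercase letters plus digits, as in the code below):

from string import ascii_lowercase, digits

print(len(ascii_lowercase + digits) ** 3)  # 36 ** 3 = 46656 < 100000, so duplicates are guaranteed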
demo       0.14000 0.36631 0.43345 0.32987
Sample size 1
avg        0.20465 0.51461 0.78238 0.62021
min        0.20071 0.45612 0.77142 0.60785
max        0.20777 0.53482 0.79569 0.63588
Sample size 10
avg        0.29292 0.36985 0.15805 0.10697
min        0.29023 0.33839 0.15505 0.10555
max        0.29462 0.38843 0.16004 0.10892
Sample size 100
avg        2.25065 0.57967 0.09441 0.05153
min        0.31502 0.22779 0.07239 0.02602
max        2.48248 0.62714 0.09965 0.05538
Sample size 1000
avg        1.04594 0.35947 0.06731 0.01942
min        0.10262 0.33380 0.06457 0.01773
max        3.87742 0.43988 0.07285 0.02422
Sample size 10000
avg        0.05340 0.48607 0.11466 0.03703
min        0.00144 0.47951 0.11063 0.03608
max        0.17092 0.49262 0.12124 0.03801
Sample size 100000
avg        0.01492 0.63902 0.18956 0.04588
min        0.00152 0.63142 0.18007 0.04313
max        0.03624 0.65317 0.21257 0.05027
Sample size 1000000
avg        0.00236 2.98343 0.71825 0.02207
min        0.00008 2.93934 0.69734 0.01961
max        0.00484 3.05418 0.74362 0.02691

Code:

from bisect import bisect_left
from functools import partial
from random import Random
from string import ascii_lowercase, digits
from timeit import timeit

LETTERS = ascii_lowercase + digits


def next_triplet(r=Random(123456)):  # fixed seed
    return tuple(r.choices(LETTERS, k=3))  # len(sample space) = 36 ** 3 = 46656


def search_the_tupple(t_lib, s_list):
    for item in t_lib:
        if item in s_list:
            return (item)


def search_the_tupple_bisect(t_lib, s_list):
    s_list = sorted(s_list)
    size = len(s_list)

    for item in t_lib:
        i = bisect_left(s_list, item)

        if i < size and s_list[i] == item:
            return item


def search_the_tupple_set(t_lib, s_list):
    intersect = set(t_lib).intersection(s_list)
    return next((s for s in t_lib if s in intersect), None)


def search_the_tupple_set_precomputed(t_lib, t_set, s_set):
    intersect = t_set.intersection(s_set)
    return next((s for s in t_lib if s in intersect), None)


def run(tuple_library, search_list):
    number = max(5, int(3 * 1000000 / (len(tuple_library) + len(search_list))))  # keep the runtime manageable

    functions = (
        # pack the test function and arguments together
        partial(search_the_tupple, tuple_library, search_list),
        partial(search_the_tupple_bisect, tuple_library, search_list),
        partial(search_the_tupple_set, tuple_library, search_list),
        partial(search_the_tupple_set_precomputed, tuple_library, set(tuple_library), set(search_list)),
    )

    results = [f() for f in functions]
    assert len(set(results)) == 1, results  # make sure no error

    return [timeit(f, number=number) for f in functions]


def main():
    def print_timing(title, timings):
        print(f'{title:10s}', ' '.join(f'{t:.5f}' for t in timings))

    tuple_library = [('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l')]
    search_list = [('a', 'a', '1'), ('m', '1', 'l')]

    print_timing('demo', run(tuple_library, search_list))

    for sample_size in [10 ** s for s in range(7)]:  # 10 ** 5 = 100_000, which is greater than the sample space
        print('Sample size', sample_size)

        timing_results = []

        for run_index in range(10):
            tuple_library = [next_triplet() for _ in range(sample_size)]
            search_list = [next_triplet() for _ in range(sample_size)]
            timing_results.append(run(tuple_library, search_list))

        timing_min = []
        timing_max = []
        timing_avg = []

        for fn_index, fn_timings in enumerate(zip(*timing_results)):
            timing_avg.append(sum(fn_timings) / len(fn_timings))
            timing_min.append(min(fn_timings))
            timing_max.append(max(fn_timings))

        print_timing('avg', timing_avg)
        print_timing('min', timing_min)
        print_timing('max', timing_max)


if __name__ == '__main__':
    main()

0

First, let's analyze the time complexity of the standard approach when searching a list of tuples for the tuples of another list, with n elements per tuple.

Now, to pronounce two tuples equal, we have to compare the n elements of one tuple with the n elements of the other, which takes n comparisons.

Now, we have to do that for every element of the search list - say it has m of them.

Now, we have to do both of the above for every element of the tuple library - say it has p of them.

So the total number of operations here is n * m * p.

Now, if the tuple library is large, the time needed to carry out this search is very long.

To reduce it, we can proceed as follows:

  1. First, take the first tuple of the search list, ('a','a','1'), and compare its first element ('a' in this case) with the first element of every tuple in tuple_library.
  2. Put the matching tuples into a hash table or dictionary, keyed by that first element ('a') with the list of matching tuples as the value, and remove those tuples from the original search list.
  3. Now move on to the next element of the tuple ('a' again) and compare it against all the tuples found in the list above. This way you are searching a much smaller list.
  4. Put every match into the dictionary under a key made of the two elements ('aa' in this case), removing them from the earlier list, and continue until every element of the tuple has been matched.
  5. Once this is done, your dictionary has 3 keys ('a', 'aa' and 'aa1'), each holding the list of tuples that contain that prefix.
  6. Next, move on to the next tuple in the search list and take its first element - if it is present in the dictionary, take the next element - check whether both are present - and so on, until you have matched every element of the tuple or found a mismatch, in which case you search using the list referenced by the key made up of all the elements that did match.
  7. If the first element of a tuple does not appear in the dictionary, start the process over.
The result of this process is that for each subsequent tuple in the search list, you have to search fewer and fewer tuples from the tuple library. So instead of searching n elements, you only search log(n) elements.
Your total running time therefore becomes log(n) * m * p. And since the biggest list is the tuple library, cutting the number of searches there brings the time complexity down substantially.
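
A much-simplified sketch of the indexing idea described above, covering only the first-level key (the helper name build_first_element_index is illustrative, not from the answer):

from collections import defaultdict

def build_first_element_index(tuple_library):
    index = defaultdict(list)
    for t in tuple_library:
        # key: first element; value: all tuples that start with it
        index[t[0]].append(t)
    return index

tuple_library = [('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l')]
index = build_first_element_index(tuple_library)
# Only tuples sharing the first element need any further comparison:
print([t for t in index.get('m', []) if t == ('m', '1', 'l')])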

-2
Create a dictionary. Use the tuples from your tuples_library as the dictionary keys, then check whether each tuple in the search list exists as a key in the dictionary. When would you use tuples as dictionary keys?
tuple_search_dict = {
    ('a', 'z', '1'): ["present"],
    ('r', '3', 'b'): ["present"],
    ('m', '1', 'l'): ["present"],
}

tuple_library = [('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l')]
search_list = [('a', 'a', '1'), ('m', '1', 'l')]

for each in search_list:
    try:
        tuple_search_dict[each]
        print(f"This tuple EXISTS in the list: {each}")
    except KeyError:
        print(f"This tuple DOES NOT EXIST in your list: {each}")

1
This simply duplicates the "set" suggestion. - Prune
It's not a duplicate - it just checks whether the key exists. Sample code added. - django-d
