How to efficiently check whether a string contains at least one element from each of two lists

3
I have two lists and a list of sentences, as shown below.
list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']

sentences = ["mining data using neural networks has become a trend", "data mining is easy with python", "machine learning is my favorite", "data mining and machine learning are awesome", "data sources and data can been used for deep learning purposes", "data, deep learning and neural networks"]

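I want to pick out the sentences that contain at least one element from list1 and at least one element from list2. That is, the output should be: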

["mining data using neural networks has become a trend", "data mining and machine learning are awesome", "data sources and data can been used for deep learning purposes", "data, deep learning and neural networks"]

My current code is as follows.

for sentence in sentences:
    for terms in list1:
        for words in list2:
            if terms in sentence:
                if words in sentence:
                    print(sentence)

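However, this code has O(n^3) time complexity, which is inefficient. Is there a more efficient way to do this in Python?

I'm happy to provide more details if needed.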


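That would _maximize_ your complexity/runtime. See the answer below (sets) for a better approach. - WestCoastProjects
It might not actually reduce the theoretical complexity, but one approach could be regular expressions. Check whether the sentence matches a regex pattern created with '|'.join(list1) (escaping omitted for clarity); if it does, you know the sentence contains at least one item from list1. Then do the same for list2. - Michael Butscher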
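A minimal sketch of the regex idea from the comment above, with the escaping put back in. The helper name build_pattern is made up for illustration, and search() is used rather than match() so that a term can appear anywhere in the sentence. Treat it as one possible reading of the suggestion, not a measured speedup.

```python
import re

list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']
sentences = [
    "mining data using neural networks has become a trend",
    "data mining is easy with python",
    "machine learning is my favorite",
    "data mining and machine learning are awesome",
    "data sources and data can been used for deep learning purposes",
    "data, deep learning and neural networks",
]

def build_pattern(terms):
    # Hypothetical helper: join the terms into one alternation.
    # re.escape protects any regex metacharacters inside a term.
    return re.compile('|'.join(re.escape(term) for term in terms))

pattern1 = build_pattern(list1)
pattern2 = build_pattern(list2)

# search() scans the whole sentence, so this is a substring test like `in`.
matched = [s for s in sentences if pattern1.search(s) and pattern2.search(s)]
print(matched)
```

Like the `in` operator, this matches substrings, so 'data' would also match inside a longer word such as 'database'.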
3 Answers

4

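Sets are more efficient than lists. If you want to find sentences that contain single words from both "lists", you can use the intersection operator (&) to check each sentence against the two "lists" instead of using nested loops and if statements: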

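set1 = set(list1)
set2 = set(list2)
# Works for single-word terms only: each sentence is split into a set of words,
# which must overlap with set1 and, separately, with set2.
[sentence for sentence in sentences
 if set(sentence.split()) & set1 and set(sentence.split()) & set2]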

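However, since your lists appear to contain phrases (sequences of words), it is hard to avoid multiple loops. You can at least break out of a loop as soon as a match is found, or continue to the next sentence when none is found. The loops over the two lists you are matching against also don't need to be nested inside each other.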
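result = []
for sentence in sentences:
    # Look for any list1 term and stop at the first hit
    for word in list1:
        if word in sentence:
            break
    else:
        # for-else: runs only if the loop never hit break (no list1 term found)
        continue
    # Same check for list2
    for word in list2:
        if word in sentence:
            break
    else:
        continue
    # Reached only when both loops ended with a break
    result.append(sentence)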

Result:

['mining data using neural networks has become a trend',
 'data mining and machine learning are awesome',
 'data sources and data can been used for deep learning purposes',
 'data, deep learning and neural networks']

1
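How does the logic in this answer work? For example, bool({1,2,3} & {1,2} & {3}) returns False - iz_
Set operations can't help here, because a sentence string may contain a list item as a substring. That can't be tested with set operations. - Michael Butscher
@busybear, your new solution doesn't work. I only get ['mining data using neural networks has become a trend'] - iz_
@Tomothy32 Oh, you're right. I need a continue rather than a break - busybear
"matches" is now unnecessary, because "result.append" can only be reached if no "continue" happened before it. - Michael Butscher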
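To make the point about substrings in the comments above concrete, here is a small sketch comparing a word-set intersection with a plain substring test on one of the question's sentences. The variable names are only for illustration.

```python
list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']
sentence = "data, deep learning and neural networks"

# Splitting on whitespace keeps punctuation attached and breaks phrases apart.
words = set(sentence.split())

# Set intersection compares whole tokens only.
set_hit1 = bool(words & set(list1))  # 'data,' is not equal to 'data'
set_hit2 = bool(words & set(list2))  # multi-word phrases are never single tokens

# Substring tests find the terms regardless of tokenization.
substr_hit1 = any(term in sentence for term in list1)
substr_hit2 = any(term in sentence for term in list2)

print(set_hit1, set_hit2, substr_hit1, substr_hit2)
```

This prints False False True True: the set-based test misses both lists for a sentence that the question expects to be selected.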

4
You can take advantage of the short-circuiting behavior of all and any to improve performance:
list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']
sentences = ["mining data using neural networks has become a trend", "data mining is easy with python", "machine learning is my favorite", "data mining and machine learning are awesome", "data sources and data can been used for deep learning purposes", "data, deep learning and neural networks"]

for sentence in sentences:
    if all(any(term in sentence for term in lst) for lst in (list1, list2)):
        print(sentence)
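The same all/any check could be wrapped in a small function that collects the results instead of printing them and accepts any number of term lists. The function name contains_one_from_each is invented here for illustration; this is a sketch, not part of the original answer.

```python
def contains_one_from_each(sentence, *term_lists):
    """Return True if sentence contains at least one term from every list."""
    # any() stops at the first matching term in a list;
    # all() stops at the first list that has no matching term.
    return all(any(term in sentence for term in terms) for terms in term_lists)

list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']
sentences = [
    "mining data using neural networks has become a trend",
    "data mining is easy with python",
    "machine learning is my favorite",
    "data mining and machine learning are awesome",
    "data sources and data can been used for deep learning purposes",
    "data, deep learning and neural networks",
]

result = [s for s in sentences if contains_one_from_each(s, list1, list2)]
print(result)
```

Note that with no term lists at all, all() over an empty sequence is True, so every sentence would pass.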

2
Try reducing the nested loops, like this:

list1 = ['data mining', 'data sources', 'data']
list2 = ['neural networks', 'deep learning', 'machine learning']

sentences = ["mining data using neural networks has become a trend", "data mining is easy with python", "machine learning is my favorite", "data mining and machine learning are awesome", "data sources and data can been used for deep learning purposes", "data, deep learning and neural networks"]

matches_list_1 = set()
matches_list_2 = set()

for index, sentence in enumerate(sentences):
    for terms in list1:
        if terms in sentence:
            matches_list_1.add(index)
    for terms in list2:
        if terms in sentence:
            matches_list_2.add(index)

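# sorted() keeps the sentences in their original order (set iteration order is not guaranteed)
for index in sorted(matches_list_1 & matches_list_2):
    print(sentences[index])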


Content provided by Stack Overflow.