Fastest way to compare common items between two lists

3

I have two lists like this:

listt = [["a","abc","zzz","xxx","abc","abc"],["yyy","ggg","abc","cccc"]]

I have another query list like this:

queryList = ["abc","cccc","abc","yyy"]

queryList and listt[0] have 2 "abc" in common.

queryList and listt[1] have 1 "abc", 1 "cccc", and 1 "yyy" in common.

So I would like the output to be:

[2,3] #2 = Total common items between queryList & listt[0]
      #3 = Total common items between queryList & listt[1]

Currently I am using loops to do this, but it seems slow. I will have millions of lists, each containing thousands of items.

listt = [["a","abc","zzz","xxx","abc","abc"],["yyy","ggg","abc","cccc"]]
queryList = ["abc","cccc","abc","yyy"]

totalMatch = []
for hashtree in listt:
    matches = 0
    tempQueryHash = queryList.copy()
    for hash in hashtree:
        for i in range(len(tempQueryHash)):
            if tempQueryHash[i]==hash:
                matches +=1
                tempQueryHash[i] = "" #Don't Match the same block twice.
                break

    totalMatch.append(matches)
print(totalMatch)

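That will be several gigabytes of data, so you may want to use something other than lists, and possibly something other than Python... - Thomas
@mkrieger1 But the query list contains only two "abc", so there are only two matches. - Rahul
Which format do you recommend? @JohnGordon A dictionary? - Rahul
@Thomas My data is in MySQL... so maybe I should look for an SQL solution, right? Damn... I'm so stupid. - Rahul
Does this answer your question? Intersection of two lists including duplicates? - mkrieger1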
3 Answers

2

Well, I'm still learning the basics of Python. But according to this older Stack Overflow post, something like the following should work:

from collections import Counter
listt = [["a","abc","zzz","xxx","abc","abc"],["yyy","ggg","abc","cccc"]]
queryList = ["abc","cccc","abc","yyy"]
OutputList = [len(list((Counter(x) & Counter(queryList)).elements())) for x in listt]
# [2, 3]

I'll keep looking for other approaches...
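
For readers unfamiliar with `Counter`, here is a small sketch of why the `&` operator gives the desired count. It treats both lists as multisets and keeps, for each key, the smaller of the two counts. The variable names `first`, `query`, and `common` are only illustrative.

```python
from collections import Counter

# Counts for listt[0] and for the query list from the question.
first = Counter(["a", "abc", "zzz", "xxx", "abc", "abc"])  # 'abc' appears 3 times
query = Counter(["abc", "cccc", "abc", "yyy"])             # 'abc' appears 2 times

# Multiset intersection: min(count) per key; keys with a zero count are dropped.
common = first & query
print(common)                # Counter({'abc': 2})
print(sum(common.values()))  # 2
```

Because the intersection never keeps more copies of an item than the query list holds, each query item is matched at most once, which is the behaviour the question asks for.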


1
Thanks, it is 2x faster. - Rahul
1
Glad I could help. Don't forget to also thank the poster of the original content, Rahul =). The link is included in the post. - JvdV

2

An improvement on JvdV's answer.

Basically, sum the values instead of counting the elements, and cache the Counter for the query list.

from collections import Counter
listt = [["a","abc","zzz","xxx","abc","abc"],["yyy","ggg","abc","cccc"]]
queryList = ["abc","cccc","abc","yyy"]
queryListCounter = Counter(queryList)
OutputList = [sum((Counter(x) & queryListCounter).values()) for x in listt]
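
Which variant is faster may depend on the data and the Python version, so it is worth measuring. The following is a rough timing sketch using `timeit` on synthetic data; the vocabulary, the list sizes, and the function names `via_elements` and `via_sum` are assumptions made for illustration, not figures from the question.

```python
import random
import timeit
from collections import Counter

random.seed(0)
# Synthetic stand-in data (assumed sizes, not the asker's real data).
vocab = [f"h{i}" for i in range(500)]
lists = [random.choices(vocab, k=1000) for _ in range(200)]
query_counts = Counter(random.choices(vocab, k=1000))  # built once, reused

def via_elements():
    # Expand the intersection into a list of elements and take its length.
    return [len(list((Counter(x) & query_counts).elements())) for x in lists]

def via_sum():
    # Sum the per-key counts of the intersection directly.
    return [sum((Counter(x) & query_counts).values()) for x in lists]

# Both variants must agree before their timings are worth comparing.
assert via_elements() == via_sum()

for fn in (via_elements, via_sum):
    print(fn.__name__, timeit.timeit(fn, number=5))
```

Both variants use the cached `query_counts`, so the only difference being timed is how the intersection is reduced to a single number.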

JvdV's answer is slightly faster, but your caching idea is good too. - Rahul
Interestingly, sum(aCounter.values()) is slower than len(list(aCounter.elements())). - Yosua

0
You can list the matches between listt and queryList and count the number of matches made.
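# Caveat: this compares every pair, so a duplicate in queryList is counted again (prints 4 for listt[1], not 3).
output = [i == z for i in listt[1] for z in queryList]
print(output.count(True))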
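
Comparing every pair counts a duplicate query item more than once. One possible alternative, sketched here with a hypothetical helper named `count_common`, keeps a budget of remaining query items and uses up one unit per match, so each query item is matched at most once, as in the question's original loop.

```python
from collections import Counter

def count_common(items, query_counts):
    """Count items that can be paired one-to-one with entries of the query."""
    remaining = dict(query_counts)  # per-list copy so the cached counts stay intact
    matches = 0
    for item in items:
        if remaining.get(item, 0) > 0:
            remaining[item] -= 1    # use up one occurrence from the query
            matches += 1
    return matches

listt = [["a", "abc", "zzz", "xxx", "abc", "abc"], ["yyy", "ggg", "abc", "cccc"]]
queryList = ["abc", "cccc", "abc", "yyy"]

query_counts = Counter(queryList)   # built once and reused for every list
result = [count_common(x, query_counts) for x in listt]
print(result)  # [2, 3]
```

This makes a single pass over each list with constant-time dictionary lookups, instead of rescanning the query list for every item.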
