Finding the intersection/difference between Python lists

Question

Finding the intersection/difference between Python lists

10

I have two Python lists:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

I need to filter out every element of a whose first item also appears in b. In this case, I should get:

c = [('why', 4), ('throw', 9), ('you', 1)]

What is the most efficient way to do this?

- khan

Why not use the intersection method? It works on sets, but you can probably get it to work nicely here ;) - Henrik Andersson

Why is this question tagged numpy? Do you need a numpy solution? - bmu
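To make the set-based suggestion in the comment above concrete, here is a minimal sketch. It assumes that "similar" means the first element of a tuple in a equals a string in b; the names keys, shared and remaining are illustrative only. The intersection gives the words to drop, so it is the difference that selects the tuples to keep.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

keys = {word for word, _ in a}   # first elements of the tuples in a
shared = keys & set(b)           # intersection: words present in both
remaining = keys - set(b)        # difference: words that appear only in a

# Sets are unordered, so rebuild the result by walking a in its original order.
c = [item for item in a if item[0] in remaining]
print(c)
```

Because the final step still iterates over a, the original order of the tuples is preserved.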

7 Answers

5

A list comprehension should do the trick:

c = [item for item in a if item[0] not in b]

Or with a dictionary comprehension:

d = dict(a)
# unpack both key and value; on Python 2, d.iteritems() also works
c = {key: value for key, value in d.items() if key not in b}

- Blender

Did you mean {key: value for key, value in d.iteritems() if key not in b}? - Eric
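One caveat with the dictionary route, sketched here under the assumption that a may contain repeated first elements (the sample data in the question does not): dict() keeps only the last value for each key, whereas the list comprehension keeps every tuple. The list a_dup is made up for illustration.

```python
b = ['the', 'when', 'send', 'we', 'us']
a_dup = [('why', 4), ('when', 3), ('why', 7)]  # hypothetical input with a repeated key

# List comprehension: every tuple whose first element is not in b is kept.
as_list = [item for item in a_dup if item[0] not in b]

# Dictionary route: dict(a_dup) collapses the two 'why' entries into one.
as_dict = {key: value for key, value in dict(a_dup).items() if key not in b}

print(as_list)
print(as_dict)
```

If duplicates matter, staying with a list of tuples avoids this loss.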

2

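in is fine, but you should at least use a set for b. If you have numpy you could of course also try np.in1d, but whether it is actually faster is something you should test yourself.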

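import numpy as np

# ruthless copy, but use the set...
b_set = set(b)
filtered = [i for i in a if i[0] not in b_set]

# with numpy (note: if you create the array like this, you must specify the
# maximum string length up front, here 10; otherwise just use an object array,
# which is slower (likely not worth it) but safe).
# 'U10' is a unicode string field; build b_arr from the list b, not from the set.
a_arr = np.array(a, dtype=[('key', 'U10'), ('val', int)])
b_arr = np.asarray(b)

# newer NumPy releases recommend np.isin for the same membership test
mask = ~np.in1d(a_arr['key'], b_arr)
filtered = a_arr[mask]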

Sets also have methods such as difference. They are probably not useful here, but they are handy in general.

- seberg

+1 for numpy. I hadn't seen your answer before posting mine. For larger datasets, in1d is about 2x faster than the list comprehension. - bmu

2

Since this is tagged numpy, here is a numpy solution using numpy.in1d, benchmarked against the list comprehension:

In [1]: a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

In [2]: b = ['the', 'when', 'send', 'we', 'us']

In [3]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [4]: b_ar = np.array(b)

In [5]: %timeit filtered = [i for i in a if not i[0] in b]
1000000 loops, best of 3: 778 ns per loop

In [6]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
10000 loops, best of 3: 31.4 us per loop

For 5 records, the list comprehension is faster.

For larger datasets, however, the numpy solution is about twice as fast as the list comprehension:

In [7]: a = a * 1000

In [8]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [9]: %timeit filtered = [i for i in a if not i[0] in b]
1000 loops, best of 3: 647 us per loop

In [10]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
1000 loops, best of 3: 302 us per loop

- bmu

0

Try this:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

c=[]

for x in a:
    if x[0] not in b:
        c.append(x)
print(c)

Demo: http://ideone.com/zW7mzY

- Arpit

It's the other way around: the OP wants c to contain the items that are not in b. - Eric

1

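This looks like the "C++ way" rather than the "Python way" ;) - yo'

@tohecz, C++ doesn't support the in operator. - Arpit

@Arpit No, but in essence this uses an explicit loop for a container operation, which is not how Python is meant to be written. - yo'

I'm still rooting for intersection! :] - Henrik Andersson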

0

The simple way

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
c = []  # a list to store the required tuples
# compare the first element of each tuple in a against the elements of b
for i in a:
    if i[0] not in b:
        c.append(i)
print(c)

- user13320096

-1

Using filter:

c = filter(lambda (x, y): False if x in b else True, a)

- Rahul Banerjee

1

X in Y is already a boolean expression in Python. - thkang

2

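@RahulBanerjee The expression False if ... else True is needlessly complicated and hard to read; just use lambda (x, y): x not in b. It is also a syntax error in Python 3, where it has to be written as lambda x: x[0] not in b, because the parameter-unpacking form you used is no longer part of the language. - lvc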

1

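Part of the problem here is that filter(lambda: ...) is inherently hard to read (compared with a filtering comprehension). Presumably you prefer this notation because it contains an if condition. - Eric

@Eric Oh really, "filter(lambda" is hard to read? That may just be your opinion, and it really depends on experience. However, the only drawback of using filter with a lambda is performance. Generators and a plain if are faster. - Reishin

@Reishin: It's not just my opinion; Guido himself wanted to remove them. See the related post here: https://dev59.com/qHA75IYBdhLWcg3w8-HZ#3013722 - Eric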
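For reference, the Python 3 spelling suggested in the comments above could look like the sketch below. In Python 3, filter returns a lazy iterator, so it is wrapped in list here; the comprehension is included only for comparison, and the name c_comprehension is illustrative.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# Python 3 removed tuple parameters in lambdas, so index into the tuple instead.
c = list(filter(lambda item: item[0] not in b, a))

# The equivalent filtering comprehension.
c_comprehension = [item for item in a if item[0] not in b]

print(c)
```

Both forms produce the same list; which one reads better is the point debated in the thread.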

- Octipi · Accepted Answer

11

A list comprehension will do it.

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
filtered = [i for i in a if not i[0] in b]

>>> print(filtered)
[('why', 4), ('throw', 9), ('you', 1)]

- Octipi

This is a more elegant approach that keeps the lists as lists rather than treating them as dictionaries... Thanks for the help. - khan

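If you use the in operator, you should convert b to a set. That changes each lookup from linear time to constant time, which makes a huge difference when b is a long list. So: c = set(b), then filtered = [i for i in a if not i[0] in c]. Note that the b in the last line has become c. Even with this short list of 5 items, it gives me a 25% speedup. With a longer list (100 items in b), it gives a 90% speedup. - Carl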
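A rough way to check the list-versus-set claim on your own machine is sketched below, using only the standard timeit module. The padded contents of b (the word0, word1, ... entries), the 1000x repetition of a, and the repeat count are assumptions made for illustration; the numbers you get will differ from those quoted above.

```python
import timeit

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)] * 1000
# Pad b to 100 entries with made-up words so the linear scan has work to do.
b = ['the', 'when', 'send', 'we', 'us'] + ['word%d' % i for i in range(95)]
b_set = set(b)  # one-time conversion; membership tests become hash lookups


def with_list():
    # Each "not in b" scans the list until it finds a match or reaches the end.
    return [i for i in a if i[0] not in b]


def with_set():
    # Each "not in b_set" is a single hash lookup on average.
    return [i for i in a if i[0] not in b_set]


print('list lookup:', timeit.timeit(with_list, number=10))
print('set lookup: ', timeit.timeit(with_set, number=10))
```

The set has to be built once up front, so the conversion pays off mainly when b is long or the filter runs many times.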