Finding common rows (intersection) in two Pandas dataframes

Question

94

Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

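I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2. (That is, if a user_id is in both df1 and df2, include the two rows in the output dataframe.)

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set, and concatenate the two filtered dataframes.

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge, but I don't think that's what I need.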

- David Chouinard

5 Answers

19

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

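This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if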

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
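
The index caveat above comes from combining the two isin() masks with &, which aligns them on row labels. A minimal sketch of one way around it, not the answer author's code, is to keep one mask per dataframe and stack the results with pd.concat (DataFrame.append was removed in pandas 2.0). The sample data and the names in_both_1, in_both_2 and common are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; the two frames deliberately differ in length.
df1 = pd.DataFrame({"user_id": ["a", "b", "c", "a"], "rating": [2, 1, 4, 0]})
df2 = pd.DataFrame({"user_id": ["c", "d", "a"], "rating": [3, 1, 2]})

# One boolean mask per frame, each aligned with its own frame's index.
in_both_1 = df1["user_id"].isin(df2["user_id"])
in_both_2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered rows; ignore_index=True avoids duplicate row labels.
common = pd.concat([df1[in_both_1], df2[in_both_2]], ignore_index=True)
print(common)
```

Because each mask is only ever applied to the dataframe it was built from, this should not depend on the two dataframes having equal lengths or matching row indices.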

- Phillip Cloud

1

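Ah, I had considered that, but it doesn't give me what I want. I'd like the two rows to appear as two separate rows in the output dataframe. This solution doubles the number of columns and uses prefixes. I don't think there's a way to use "merge" to get the two separate rows. - David Chouinard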

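+1 for merge, but it looks like the OP wants slightly different output. I've created one that looks like what he needs, but I'm not sure it's the most elegant pandas solution. - Roman Pekar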

2

This only works if df1 and df2 have the same length. Otherwise, df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id) won't broadcast together. - Korem

2

I think this can be solved with df1 = df1[df1.index.isin(df2.index)]; df2 = df2[df2.index.isin(df1.index)]. - Korem

Why is the "&" needed here? df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id) - Antonio Sesto

5

In SQL, this problem could be solved in several ways:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

or join and then unpivot (possible in SQL Server)

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_id = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

The second one could be written in pandas with something like:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30

- Roman Pekar

Can the second approach be optimized/shortened? - Aqua 4
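
One possible way to shorten the merge-then-concat approach, offered as a sketch rather than as the answer author's method, is to unpivot the merged frame with DataFrame.melt. This assumes a single value column (rating), as in the example above; with several value columns per row it would need a different reshape. The name stacked is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [1, 2, 3], "rating": [10, 15, 20]})
df2 = pd.DataFrame({"user_id": [3, 4, 5], "rating": [30, 35, 40]})

stacked = (
    pd.merge(df1, df2, on="user_id", suffixes=("_1", "_2"))
    # Turn the rating_1 / rating_2 columns into one row each.
    .melt(id_vars="user_id", value_name="rating")
    # melt records the source column name in "variable"; it is not needed here.
    .drop(columns="variable")
)
print(stacked)
```

Unlike the pd.concat version, melt returns a fresh 0..n-1 row index, so the duplicate row labels do not appear.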

2

You can use pd.Index.intersection to handle n dataframes and k columns:

import pandas as pd
from functools import reduce
from typing import Union

def dataframe_intersection(
    dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
    set_index = [d.set_index(by) for d in dataframes]
    index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
    intersected = [df.loc[index_intersection].reset_index() for df in set_index]

    return intersected

df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})

df_list = [df1, df2, df3]

This gives

>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           f      50
1        3           c      70
2        3           f      80

and

>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
   user_id business_id  rating
0        3           c      20
0        3           c      30
0        3           c      70

- xyzzyqed

0

This is a simple solution:

df1[df1 == df2].dropna()

- Rahul Sharma


- aldorath · Accepted Answer

My understanding is that this question is better answered in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

This gives s1 with 5 columns: user_id plus the other two columns from each of df1 and df2.
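
To make the five-column result concrete, here is a small sketch. The sample data is hypothetical, shaped like the question's format, and it relies on pandas' default _x/_y suffixes for the overlapping column names:

```python
import pandas as pd

# Hypothetical data in the question's format: user_id, business_id, rating.
df1 = pd.DataFrame({"user_id": [1, 2, 3], "business_id": ["a", "b", "c"], "rating": [10, 15, 20]})
df2 = pd.DataFrame({"user_id": [3, 4, 5], "business_id": ["c", "d", "e"], "rating": [30, 35, 40]})

# Inner join keeps only user_ids present in both frames.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

# The non-key columns that share a name get the default _x / _y suffixes.
print(s1.columns.tolist())
# ['user_id', 'business_id_x', 'rating_x', 'business_id_y', 'rating_y']
```

Each matching pair of rows ends up side by side in a single row, which is why the question's author wanted a different shape (two separate rows).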