Merging pandas dataframes based on column values

Question

27

I'm quite new to pandas dataframes and have run into some trouble joining two tables.

The first dataframe has just 3 columns:

DF1:

item_id    position    document_id
336        1           10
337        2           10
338        3           10
1001       1           11
1002       2           11
1003       3           11
38         10          146

The second table has two columns that are exactly the same as in the first (plus many other columns):

DF2:

item_id    document_id    col1    col2   col3    ...
337        10             ...     ...    ...
1002       11             ...     ...    ...
1003       11             ...     ...    ...

What I need is an operation that would look like this in SQL:

DF1 join DF2 on 
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id

As a result, I want to see DF2 with the 'position' column added:

item_id    document_id    position    col1   col2   col3   ...

What is a good way to do this with pandas?

- fremorie

2 Answers

0

If you are merging on all of the common columns, as in the OP, you don't even need to pass on=; simply calling merge() does the job.

merged_df = df1.merge(df2)

The reason is that when on= is not passed, pd.Index.intersection is called under the hood to determine the common columns, and the merge is performed on all of them.
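To make that concrete, here is a minimal sketch that computes the column intersection by hand and compares the implicit merge with the explicit one. The frames are rebuilt from the question; the col1..col3 values are illustrative placeholders, and the names common, implicit and explicit are introduced only for this example.

```python
import pandas as pd

# Sample frames rebuilt from the question (col1..col3 values are illustrative).
df1 = pd.DataFrame({
    "item_id": [336, 337, 338, 1001, 1002, 1003, 38],
    "position": [1, 2, 3, 1, 2, 3, 10],
    "document_id": [10, 10, 10, 11, 11, 11, 146],
})
df2 = pd.DataFrame({
    "item_id": [337, 1002, 1003],
    "document_id": [10, 11, 11],
    "col1": ["s", "d", "f"],
    "col2": [4, 5, 7],
    "col3": [7, 8, 0],
})

# Columns present in both frames; these act as join keys when on= is omitted.
common = df1.columns.intersection(df2.columns).tolist()
print(common)

# Merging without on= should match merging on the intersection explicitly.
implicit = df1.merge(df2)
explicit = df1.merge(df2, on=common)
print(implicit.equals(explicit))
```

If the two frames happen to share a column that is not meant to be a key, the implicit form will silently join on it too, so passing on= explicitly is the safer choice in that situation.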

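One notable property of merging on the common columns (with the default inner join) is that the same rows are kept no matter which dataframe is on the left and which is on the right, because rows are selected by looking for matches on the common columns. The only difference is the position of the columns: columns of the right dataframe that do not exist in the left dataframe are appended to the right of the left dataframe's columns. So unless column order matters (which is easily fixed with column selection or reindex()), it doesn't really matter which dataframe is on the right and which is on the left. In other words,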

df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)

# df12 and df21 are the same.
df12.equals(df21)     # True

This is not the case if the columns to merge on have different names and you have to pass left_on= and right_on= (see #1 in this answer).
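As a rough illustration of the left_on=/right_on= case, the sketch below assumes a hypothetical variant of df2 whose key columns are called id and doc_id (these names are not from the question). It only shows that both sets of key columns are carried into the result and how to drop the redundant pair.

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_id": [336, 337, 338, 1001, 1002, 1003, 38],
    "position": [1, 2, 3, 1, 2, 3, 10],
    "document_id": [10, 10, 10, 11, 11, 11, 146],
})

# Hypothetical variant of df2 with differently named key columns.
df2_renamed = pd.DataFrame({
    "id": [337, 1002, 1003],
    "doc_id": [10, 11, 11],
    "col1": ["s", "d", "f"],
})

# Keys are paired by position: document_id with doc_id, item_id with id.
merged = df1.merge(
    df2_renamed,
    left_on=["document_id", "item_id"],
    right_on=["doc_id", "id"],
)

# Both the left and the right key columns are present in the result.
all_cols = merged.columns.tolist()
print(all_cols)

# Drop the right-hand keys if the duplicates are not needed.
merged = merged.drop(columns=["doc_id", "id"])
print(merged)
```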

- cottontail

- jezrael · Accepted Answer

I think you need merge with the default inner join, but this requires that there are no duplicated combinations of values in the two columns:

print (df2)
   item_id  document_id col1  col2  col3
0      337           10    s     4     7
1     1002           11    d     5     8
2     1003           11    f     7     0

df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
   item_id  position  document_id col1  col2  col3
0      337         2           10    s     4     7
1     1002         2           11    d     5     8
2     1003         3           11    f     7     0

But if you need the position column in the third place:

df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
   item_id  document_id  position col1  col2  col3
0      337           10         2    s     4     7
1     1002           11         2    d     5     8
2     1003           11         3    f     7     0
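The requirement that the key combinations contain no duplicates can be checked up front. The following is a minimal sketch, not part of the original answer: it uses DataFrame.duplicated to inspect the keys, passes validate="one_to_one" so that merge raises pandas.errors.MergeError if either side has repeated keys, and moves position into the third slot with insert() and pop() as an alternative to slicing the column list. The keys variable and the col1..col3 values are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "item_id": [336, 337, 338, 1001, 1002, 1003, 38],
    "position": [1, 2, 3, 1, 2, 3, 10],
    "document_id": [10, 10, 10, 11, 11, 11, 146],
})
df2 = pd.DataFrame({
    "item_id": [337, 1002, 1003],
    "document_id": [10, 11, 11],
    "col1": ["s", "d", "f"],
    "col2": [4, 5, 7],
    "col3": [7, 8, 0],
})

keys = ["document_id", "item_id"]

# True if any (document_id, item_id) pair occurs more than once in a frame.
has_dupes_1 = df1.duplicated(subset=keys).any()
has_dupes_2 = df2.duplicated(subset=keys).any()
print(has_dupes_1, has_dupes_2)

# validate= makes merge raise MergeError when the keys are not unique on both sides.
df = pd.merge(df2, df1, on=keys, validate="one_to_one")

# Move 'position' to the third slot: pop removes it, insert puts it back at index 2.
df.insert(2, "position", df.pop("position"))
print(df)
```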