Merging multiple Pandas DataFrames

3
import pandas as pd

df1 = pd.DataFrame({'a':['id1','id2','id3'],'b':['W','W','W'],'c1':[1,2,3]})
df2 = pd.DataFrame({'a':['id1','id2','id3'],'b':['W','W','W'],'c2':[4,5,6]})
df3 = pd.DataFrame({'a':['id1','id4','id5'],'b':['Q','Q','Q'],'c1':[7,8,9]})

I am trying to concatenate df1, df2, and df3 into a single DataFrame:
a    b   c1   c2
id1  W   1    4
id2  W   2    5
id3  W   3    6
id1  Q   7    NA
id4  Q   8    NA
id5  Q   9    NA

I tried:

l = [d.set_index(['a','b']) for d in [df1,df2,df3]]
pd.concat(l, axis=1)

But the output is not what I expected:

        c1   c2   c1
a   b               
id1 W  1.0  4.0  NaN
id2 W  2.0  5.0  NaN
id3 W  3.0  6.0  NaN
id1 Q  NaN  NaN  7.0
id4 Q  NaN  NaN  8.0
id5 Q  NaN  NaN  9.0
5 Answers

1
First, merge df1 and df2 on columns a and b:

df_try_1 = df1.merge(df2, on=["a","b"])

Then concatenate the result with df3:

df_try_2 = pd.concat([df_try_1, df3], axis=0)
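For reference, the two steps above can be run end to end as in the sketch below. It reuses the question's sample frames. The ignore_index=True argument is an addition that is not part of the answer; it is only there to give the result a clean 0..5 row index.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Step 1: align df1 and df2 side by side on the shared key columns.
df_try_1 = df1.merge(df2, on=['a', 'b'])

# Step 2: append df3 underneath; df3 has no c2 column, so c2 is NaN there.
# ignore_index=True (an assumption, not in the answer) renumbers the rows.
df_try_2 = pd.concat([df_try_1, df3], axis=0, ignore_index=True)
print(df_try_2)
```

This works here because df1 and df2 share exactly the same (a, b) keys. With the default inner merge, any key present in only one of the two frames would be dropped in step 1.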

1
You can concatenate the MultiIndex Series created by DataFrame.stack:
l = [d.set_index(['a','b']).stack() for d in [df1,df2,df3]]
df = pd.concat(l).unstack().sort_index(level=[1,0], ascending=[False, True])
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN
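To see why this works, it may help to look at the intermediate objects. The sketch below reuses the question's sample frames; the names stacked, long, and wide are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# After set_index + stack, each frame is a Series whose index has three
# levels: a, b, and the original column name.
stacked = [d.set_index(['a', 'b']).stack() for d in (df1, df2, df3)]
long = pd.concat(stacked)
print(long.index.nlevels)

# unstack() moves the innermost level (the column names) back into columns,
# so the c1 values from df1 and df3 land in the same column.
wide = long.unstack()
print(wide)
```

One caveat: unstack needs every (a, b, column name) combination to be unique. If two frames supplied a value for the same key and the same column, the reshape would raise an error about duplicate index entries.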

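If each DataFrame has only 3 columns (two key columns plus one value column), you can build the list of Series with DataFrame.squeeze, or by selecting the first column with iloc[:, 0]: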
l = [d.set_index(['a','b']).squeeze() for d in [df1,df2,df3]]
keys = [x.name for x in l]
df = (pd.concat(l, axis=0, keys=keys)
        .unstack(0)
        .sort_index(level=[1,0], ascending=[False, True]))
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN

l = [d.set_index(['a','b']).iloc[:, 0] for d in [df1,df2,df3]]
keys = [x.name for x in l]
df = (pd.concat(l, axis=0, keys=keys)
        .unstack(0)
        .sort_index(level=[1,0], ascending=[False, True]))

Another idea is to chain the DataFrames in the list with DataFrame.combine_first:

from functools import reduce

dfs = [d.set_index(['a','b']) for d in [df1,df2,df3]]
df = (reduce(lambda x, y: x.combine_first(y), dfs)
        .sort_index(level=[1,0], ascending=[False, True]))
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN
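If you need this for an arbitrary list of frames, the combine_first chain can be wrapped in a small helper. This is only a sketch: combine_on_keys is a hypothetical name, not a pandas function, and it returns the keys as ordinary columns via reset_index.

```python
from functools import reduce

import pandas as pd


def combine_on_keys(frames, keys):
    """Coalesce frames on the key columns (hypothetical helper)."""
    indexed = [f.set_index(keys) for f in frames]
    # combine_first keeps the caller's non-null values and fills the gaps
    # from the next frame, so earlier frames take precedence.
    combined = reduce(lambda x, y: x.combine_first(y), indexed)
    return combined.reset_index()


df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

result = combine_on_keys([df1, df2, df3], ['a', 'b'])
print(result)
```

Keep in mind that when two frames hold different non-null values for the same key and column, combine_first silently keeps the value from the earlier frame, so the order of the list matters.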

0

Try:

a = df1.merge(df2[['a','c2']], on='a', how='left')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
l = pd.concat([a, df3])

0

This should work in your case:

df = pd.merge(pd.merge(df1, df2, how='outer', on=['a', 'b']), df3, how='outer', on=['a', 'b'])
df.set_index(['a', 'b'], inplace=True)
df.columns = ['c1', 'c2', 'c3']
print(df)

Result:

        c1   c2   c3
a   b               
id1 W  1.0  4.0  NaN
id2 W  2.0  5.0  NaN
id3 W  3.0  6.0  NaN
id1 Q  NaN  NaN  7.0
id4 Q  NaN  NaN  8.0
id5 Q  NaN  NaN  9.0
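Note that the result above keeps df3's values in a separate c3 column, while the question asks for them in c1. One possible way to fold them back is to coalesce the two columns with Series.combine_first. The sketch below assumes that no (a, b) pair has a c1 value in both df1 and df3; the '_df3' suffix and the name merged are illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Outer-merge all three frames; df3's c1 is renamed to c1_df3 so that it
# does not collide with the c1 column that came from df1.
merged = (df1.merge(df2, how='outer', on=['a', 'b'])
             .merge(df3, how='outer', on=['a', 'b'], suffixes=('', '_df3')))

# Fill the gaps in c1 from c1_df3, then drop the helper column.
merged['c1'] = merged['c1'].combine_first(merged['c1_df3'])
merged = merged.drop(columns='c1_df3')
print(merged)
```

The row order of an outer merge can differ between pandas versions, so sort explicitly if the order matters.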

0

I think merge is your best option.

df = df1.combine_first(df2)
pd.merge(df, df3, on=['a', 'b', 'c1'], how='outer')

This produces the expected output:
     a  b  c1   c2
0  id1  W   1  4.0
1  id2  W   2  5.0
2  id3  W   3  6.0
3  id1  Q   7  NaN
4  id4  Q   8  NaN
5  id5  Q   9  NaN
