How do I merge two Pandas DataFrames and aggregate a specific column?

7

I have two DataFrames:

         city  count    school
0    New York      1  school_3
1  Washington      1  School_4
2  Washington      1  School_5
3          LA      1  School_1
4          LA      1  School_4

         city  count    school
0    New York      1  School_3
1  Washington      1  School_1
2          LA      1  School_3
3          LA      2  School_4

I would like to get this result:
         city  count    school
0    New York      2  school_3
1  Washington      1  School_1
2  Washington      1  School_4
3  Washington      1  School_5
4          LA      1  School_1
5          LA      1  School_3
6          LA      3  School_4

Here is the code:
import pandas as pd

d1 = [{'city':'New York', 'school':'school_3', 'count':1},
      {'city':'Washington', 'school':'School_4', 'count':1},
      {'city':'Washington', 'school':'School_5', 'count':1},
      {'city':'LA', 'school':'School_1', 'count':1},
      {'city':'LA', 'school':'School_4', 'count':1}]


d2 = [{'city':'New York', 'school':'School_3', 'count':1},
      {'city':'Washington', 'school':'School_1', 'count':1},
      {'city':'LA', 'school':'School_3', 'count':1},
      {'city':'LA', 'school':'School_4', 'count':2}]

x1 = pd.DataFrame(d1)
x2 = pd.DataFrame(d2)
# this just returns an empty DataFrame
print(pd.merge(x1, x2))

How can I get the aggregated result?
2 Answers

7

You can do the following:

>>> pd.concat([x1, x2]).groupby(["city", "school"], as_index=False)["count"].sum()
         city    school  count
0          LA  School_1      1
1          LA  School_3      1
2          LA  School_4      3
3    New York  School_3      1
4    New York  school_3      1
5  Washington  School_1      1
6  Washington  School_4      1
7  Washington  School_5      1

Note that New York appears twice because of a typo in the data (school_3 vs. School_3).
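If the lowercase school_3 really is a typo, one way to get a single New York row is to normalize the school column before grouping. The sketch below is only an illustration under that assumption; the variable names combined and result are not from the original answer, and str.capitalize() is just one possible normalization.

```python
import pandas as pd

# Same data as in the question, including the lowercase 'school_3' entry.
x1 = pd.DataFrame({
    "city": ["New York", "Washington", "Washington", "LA", "LA"],
    "school": ["school_3", "School_4", "School_5", "School_1", "School_4"],
    "count": [1, 1, 1, 1, 1],
})
x2 = pd.DataFrame({
    "city": ["New York", "Washington", "LA", "LA"],
    "school": ["School_3", "School_1", "School_3", "School_4"],
    "count": [1, 1, 1, 2],
})

# Stack the two frames, then treat the case difference as a typo
# (assumption) by capitalizing every school name before grouping.
combined = pd.concat([x1, x2], ignore_index=True)
combined["school"] = combined["school"].str.capitalize()

result = combined.groupby(["city", "school"], as_index=False)["count"].sum()
print(result)
```

With the names normalized, New York / School_3 collapses into one row with count 2. The rows come out sorted by city and school rather than in the order shown in the question, and the school name is spelled School_3 rather than school_3.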

5
Here is a slightly different implementation from @elyase's solution, using pandas.DataFrame.merge(...):
x1.merge(x2,on=['city', 'school', 'count'], how='outer').groupby(['city', 'school'], as_index=False)['count'].sum()

Timed with %timeit in an IPython notebook, this approach is slightly faster than @elyase's (by less than 1 ms):

100 loops, best of 3: 6.25 ms per loop  #using concat(...) with @elyase's solution
100 loops, best of 3: 5.49 ms per loop #using merge(...) in this solution
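
For anyone who wants to repeat the comparison outside a notebook, here is a rough sketch using the standard-library timeit module. The helper names with_concat and with_merge are made up for this example, and the numbers will vary with the machine and the pandas version, so a difference this small may not reproduce.

```python
import timeit

import pandas as pd

x1 = pd.DataFrame({
    "city": ["New York", "Washington", "Washington", "LA", "LA"],
    "school": ["school_3", "School_4", "School_5", "School_1", "School_4"],
    "count": [1, 1, 1, 1, 1],
})
x2 = pd.DataFrame({
    "city": ["New York", "Washington", "LA", "LA"],
    "school": ["School_3", "School_1", "School_3", "School_4"],
    "count": [1, 1, 1, 2],
})
keys = ["city", "school"]


def with_concat():
    # @elyase's approach: stack the frames, then group and sum.
    return pd.concat([x1, x2]).groupby(keys, as_index=False)["count"].sum()


def with_merge():
    # Outer merge on all three columns, then group and sum.
    merged = x1.merge(x2, on=keys + ["count"], how="outer")
    return merged.groupby(keys, as_index=False)["count"].sum()


n = 100  # number of calls to average over
t_concat = timeit.timeit(with_concat, number=n) / n
t_merge = timeit.timeit(with_merge, number=n) / n
print(f"concat: {t_concat * 1e3:.2f} ms per call")
print(f"merge:  {t_merge * 1e3:.2f} ms per call")
```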

Alternatively, if you want to use the pandas aggregate functionality, you can also do this:

import numpy
x1.merge(x2,on=['city', 'school', 'count'], how='outer').groupby(['city', 'school'], as_index=False).agg(numpy.sum)

The only caveat is that using agg(...) is the slowest of the three approaches.
For this data, all three approaches produce the same result:
         city    school  count
0          LA  School_1      1
1          LA  School_3      1
2          LA  School_4      3
3    New York  School_3      1
4    New York  school_3      1
5  Washington  School_1      1
6  Washington  School_4      1
7  Washington  School_5      1
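
One thing worth checking before relying on the merge-based variants: an outer merge on all three columns matches rows that are identical in both frames and keeps only one copy, so the later sum can undercount. The question's data has no such rows, which is why the results agree above. A minimal sketch with two made-up one-row frames (a and b are hypothetical, not from the question):

```python
import pandas as pd

# Two frames that contain the exact same row.
a = pd.DataFrame({"city": ["New York"], "school": ["School_3"], "count": [1]})
b = pd.DataFrame({"city": ["New York"], "school": ["School_3"], "count": [1]})

keys = ["city", "school"]

# concat keeps both rows, so the counts add up to 2.
via_concat = pd.concat([a, b]).groupby(keys, as_index=False)["count"].sum()

# The outer merge pairs the identical rows into a single row,
# so the sum stays at 1.
via_merge = (
    a.merge(b, on=keys + ["count"], how="outer")
    .groupby(keys, as_index=False)["count"]
    .sum()
)

print(via_concat)
print(via_merge)
```

If the same (city, school, count) combination can occur in both inputs, the concat-based approach is the safer choice.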
