How do I get the pandas groupby command to return a DataFrame instead of a Series?

Question

7

I'm having trouble understanding the output of pandas' groupby. I start with a DataFrame (df0) that has 5 fields/columns (zip, city, location, population, and state).

 >>> df0.info()
 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 29467 entries, 0 to 29466
 Data columns (total 5 columns):
 zip      29467 non-null object
 city     29467 non-null object
 loc      29467 non-null object
 pop      29467 non-null int64
 state    29467 non-null object
 dtypes: int64(1), object(4)
 memory usage: 1.1+ MB

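I wanted the total population of each city, but since several zip codes can belong to the same city, I thought I would use groupby followed by sum, as follows: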

  df6 = df0.groupby(['city','state'])['pop'].sum()

However, this returns a Series rather than a DataFrame:

 >>> df6.info()
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2672, in __getattr__
     return object.__getattribute__(self, name)
  AttributeError: 'Series' object has no attribute 'info'
 >>> type(df6)
 <class 'pandas.core.series.Series'>

I would like to be able to look up the population of any city with something similar to:

 df0[df0['city'].isin(['ALBANY'])]

but since I have a Series rather than a DataFrame, I can't. I also haven't been able to force a conversion into a DataFrame.

What I'd like to know now is:

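Why did I get a Series back instead of a DataFrame?
How can I get a table that lets me look up the population of a city? Can I use the Series I got from groupby, or should I have taken a different approach?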

- user1245262

1

Use the as_index parameter: df0.groupby(['city','state'], as_index=False)['pop'].sum() - Zero

pandas is so unintuitive :( I just ran into the same problem - kev

2 Answers

1

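It's hard to be sure without sample data, but given that the code you show returns a Series, you should be able to access a city's population with something like df6.loc['Albany', 'NY'] (that is, by indexing the grouped Series by city and state).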

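You get a Series because you selected a single column ('pop') to apply the group computation to. If you apply the group computation to a list of columns, you get a DataFrame. You can do this with df6 = df0.groupby(['city','state'])[['pop']].sum(). (Note the extra brackets around 'pop', which select a list of one column rather than a single column.) That said, if you can access the city data with the approach above, I'm not sure this is necessary.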
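The difference between the two selections can be checked directly. The following is a minimal sketch on made-up data: the rows, zip codes, and population figures are assumptions for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; only the column names follow the question.
df0 = pd.DataFrame({
    'zip': ['12201', '12202', '97321'],
    'city': ['ALBANY', 'ALBANY', 'ALBANY'],
    'state': ['NY', 'NY', 'OR'],
    'pop': [100, 200, 50],
})

grouped = df0.groupby(['city', 'state'])

as_series = grouped['pop'].sum()    # single column label -> Series
as_frame = grouped[['pop']].sum()   # list of column labels -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)

# Both results are indexed by (city, state), so a lookup is one .loc call.
albany_ny = as_series.loc[('ALBANY', 'NY')]
print(albany_ny)
```

In both cases the grouping keys end up in the index, not in regular columns, which is why a filter such as df6['city'].isin([...]) does not work on either result as-is.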

- BrenBarn

- jezrael · Accepted Answer

You need to pass as_index=False to groupby, or call reset_index on the result, to convert the MultiIndex into columns:

df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()

Or:

df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()

Sample:

import pandas as pd

df0 = pd.DataFrame({'city':['a','a','b'],
                   'state':['t','t','n'],
                   'pop':[7,8,9]})

print (df0)
  city  pop state
0    a    7     t
1    a    8     t
2    b    9     n

df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()
print (df6)
  city state  pop
0    a     t   15
1    b     n    9

df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()
print (df6)
  city state  pop
0    a     t   15
1    b     n    9

Finally, select with loc; if you need a scalar, add item():

print (df6.loc[df6.state == 't', 'pop'])
0    15
Name: pop, dtype: int64

print (df6.loc[df6.state == 't', 'pop'].item())
15
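Once the group keys are ordinary columns, the isin-style filter from the question should work on the aggregated table as well. A short sketch reusing the sample data above (the names totals, selected, and one_row are only illustrative):

```python
import pandas as pd

df0 = pd.DataFrame({'city': ['a', 'a', 'b'],
                    'state': ['t', 't', 'n'],
                    'pop': [7, 8, 9]})

# as_index=False keeps city and state as regular columns.
totals = df0.groupby(['city', 'state'], as_index=False)['pop'].sum()

# Same boolean-mask pattern as df0[df0['city'].isin([...])] in the question.
selected = totals[totals['city'].isin(['a'])]
print(selected)

# Combine conditions with & when a city name alone may match several states.
one_row = totals[(totals['city'] == 'a') & (totals['state'] == 't')]
print(one_row['pop'].item())
```

Note that item() raises an error unless the selection contains exactly one value, so it is only safe after a filter that is known to match a single row.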

But if you only need a lookup table, you can use the Series with a MultiIndex:

s = df0.groupby(['city','state'])['pop'].sum()
print (s)
city  state
a     t        15
b     n         9
Name: pop, dtype: int64

# select all cities with : and a single state by its label, e.g. 't'
# the output is a Series of length 1
print (s.loc[:, 't'])
city
a    15
Name: pop, dtype: int64

# if you need the output as a scalar, add item()
print (s.loc[:, 't'].item())
15
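A few more ways to work with the MultiIndex Series, sketched on the same sample data. The variable names by_state, frame, and flat are illustrative; loc, xs, to_frame, and reset_index are standard pandas methods.

```python
import pandas as pd

df0 = pd.DataFrame({'city': ['a', 'a', 'b'],
                    'state': ['t', 't', 'n'],
                    'pop': [7, 8, 9]})

s = df0.groupby(['city', 'state'])['pop'].sum()

# A full (city, state) key returns a scalar directly, so item() is not needed.
print(s.loc[('a', 't')])

# Cross-section on a named index level: every city in state 't'.
by_state = s.xs('t', level='state')
print(by_state)

# Converting the Series to a DataFrame after the fact:
frame = s.to_frame()     # keeps (city, state) as the index
flat = s.reset_index()   # moves city and state into regular columns
print(flat)
```

Which form to keep is mostly a matter of how the table will be queried: the indexed Series suits key lookups, while the flat DataFrame suits boolean filters on the city or state columns.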