Pandas pivot table: column order and subtotals

5

I'm using Pandas 0.19.

Consider the following DataFrame:

FID  admin0  admin1  admin2  windspeed  population
0    cntry1  state1  city1   60km/h     700
1    cntry1  state1  city1   90km/h     210
2    cntry1  state1  city2   60km/h     100
3    cntry1  state2  city3   60km/h     70
4    cntry1  state2  city4   60km/h     180
5    cntry1  state2  city4   90km/h     370
6    cntry2  state3  city5   60km/h     890
7    cntry2  state3  city6   60km/h     120
8    cntry2  state3  city6   90km/h     420
9    cntry2  state3  city6   120km/h    360
10   cntry2  state4  city7   60km/h     740

How can I create a table like this?
                                population
                         60km/h  90km/h  120km/h
admin0  admin1  admin2  
cntry1  state1  city1    700     210      0
cntry1  state1  city2    100     0        0
cntry1  state2  city3    70      0        0
cntry1  state2  city4    180     370      0
cntry2  state3  city5    890     0        0
cntry2  state3  city6    120     420      360
cntry2  state4  city7    740     0        0

I tried the following pivot table:

table = pd.pivot_table(df,index=["admin0","admin1","admin2"], columns=["windspeed"], values=["population"],fill_value=0)

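It mostly works, but unfortunately I cannot get the new columns sorted correctly: the 120km/h column appears before the 60km/h and 90km/h columns. How can I specify the order of the new columns?
As a second step, I also need to add subtotals for admin0 and admin1. Ideally, the table I need would look like this: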
                                population
                         60km/h  90km/h  120km/h
admin0  admin1  admin2  
cntry1  state1  city1    700     210      0
cntry1  state1  city2    100     0        0
        SUM state1       800     210      0
cntry1  state2  city3    70      0        0
cntry1  state2  city4    180     370      0
        SUM state2       250     370      0
SUM cntry1               1050    580      0
cntry2  state3  city5    890     0        0
cntry2  state3  city6    120     420      360
        SUM state3       1010    420      360
cntry2  state4  city7    740     0        0
        SUM state4       740     0        0
SUM cntry2               1750    420      360
SUM ALL                  2800    1000    360
2 Answers

5

You can do this with the reindex() method and a custom sort:

In [26]: table
Out[26]:
                     population
windspeed               120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1           0    700    210
              city2           0    100      0
       state2 city3           0     70      0
              city4           0    180    370
cntry2 state3 city5           0    890      0
              city6         360    120    420
       state4 city7           0    740      0

In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))

In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]

In [29]: table = table.reindex(columns=cols)

In [30]: table
Out[30]:
                     population
windspeed                60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1         700    210       0
              city2         100      0       0
       state2 city3          70      0       0
              city4         180    370       0
cntry2 state3 city5         890      0       0
              city6         120    420     360
       state4 city7         740      0       0
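
The 120km/h column comes first because the column labels are strings and are compared character by character, so "1" sorts before "6" and "9". As a minimal sketch of a variant of the approach above (not the answerer's code): passing values as a plain string rather than a list gives flat column labels, which can then be reordered with a numeric key. The aggfunc="sum" argument and the name ordered are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "admin0": ["cntry1"] * 6 + ["cntry2"] * 5,
    "admin1": ["state1"] * 3 + ["state2"] * 3 + ["state3"] * 4 + ["state4"],
    "admin2": ["city1", "city1", "city2", "city3", "city4", "city4",
               "city5", "city6", "city6", "city6", "city7"],
    "windspeed": ["60km/h", "90km/h", "60km/h", "60km/h", "60km/h", "90km/h",
                  "60km/h", "60km/h", "90km/h", "120km/h", "60km/h"],
    "population": [700, 210, 100, 70, 180, 370, 890, 120, 420, 360, 740],
})

# values is a string here (not a list), so the columns are a flat Index
# of windspeed labels instead of a two-level MultiIndex.
table = pd.pivot_table(df,
                       index=["admin0", "admin1", "admin2"],
                       columns="windspeed",
                       values="population",
                       aggfunc="sum",   # assumption: totals are wanted
                       fill_value=0)

# Sort the labels by their numeric speed, then select columns in that order.
ordered = sorted(table.columns, key=lambda label: int(label.replace("km/h", "")))
table = table[ordered]
print(table)
```

With flat columns the sort key receives a plain string, so the x[1] tuple indexing used in the answer above is not needed.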

This answers the first question perfectly, thank you! Any ideas on how to add the subtotals? - Andreampa

4

A solution using subtotals and MultiIndex.from_arrays. Finally, concat all the DataFrames, call sort_index, and add the overall sum:

#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
    FID  admin0  admin1 admin2  windspeed  population
0     0  cntry1  state1  city1         60         700
1     1  cntry1  state1  city1         90         210
2     2  cntry1  state1  city2         60         100
3     3  cntry1  state2  city3         60          70
4     4  cntry1  state2  city4         60         180
5     5  cntry1  state2  city4         90         370
6     6  cntry2  state3  city5         60         890
7     7  cntry2  state3  city6         60         120
8     8  cntry2  state3  city6         90         420
9     9  cntry2  state3  city6        120         360
10   10  cntry2  state4  city7         60         740

#pivoting
table = pd.pivot_table(df,
                       index=["admin0","admin1","admin2"], 
                       columns=["windspeed"], 
                       values=["population"],
                       fill_value=0)
print (table)
                    population          
windspeed                   60   90   120
admin0 admin1 admin2                     
cntry1 state1 city1         700  210    0
              city2         100    0    0
       state2 city3          70    0    0
              city4         180  370    0
cntry2 state3 city5         890    0    0
              city6         120  420  360
       state4 city7         740    0    0

#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0), 
                                       df1.index.get_level_values(1)+ '_sum', 
                                       len(df1.index) * ['']])
print (df1)
                   population          
windspeed                 60   90   120
admin0                                 
cntry1 state1_sum         800  210    0
       state2_sum         250  370    0
cntry2 state3_sum        1010  420  360
       state4_sum         740    0    0

df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''], 
                                       len(df2.index) * ['']])
print (df2)
             population          
windspeed           60   90   120
cntry1_sum         1050  580    0
cntry2_sum         1750  420  360

#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])

#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                       df.columns.get_level_values(1).astype(str) + 'km/h'])

#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
                             population               
                                 60km/h 90km/h 120km/h
admin0     admin1     admin2                          
cntry1     state1     city1         700    210       0
                      city2         100      0       0
           state1_sum               800    210       0
           state2     city3          70      0       0
                      city4         180    370       0
           state2_sum               250    370       0
cntry1_sum                         1050    580       0
cntry2     state3     city5         890      0       0
                      city6         120    420     360
           state3_sum              1010    420     360
           state4     city7         740      0       0
           state4_sum               740      0       0
cntry2_sum                         1750    420     360
All_sum                            2800   1000     360
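
Each subtotal row lands directly under its group because of how the labels sort. A small sketch with a few row labels from the table above illustrates this (the frame name demo and the column n are purely illustrative):

```python
import pandas as pd

# A handful of row labels from the result, deliberately shuffled.
rows = [
    ("All_sum", "", ""),
    ("cntry1_sum", "", ""),
    ("cntry1", "state2_sum", ""),
    ("cntry1", "state1_sum", ""),
    ("cntry1", "state2", "city3"),
    ("cntry1", "state1", "city2"),
    ("cntry1", "state1", "city1"),
]
demo = pd.DataFrame({"n": list(range(len(rows)))},
                    index=pd.MultiIndex.from_tuples(rows))

# sort_index compares the labels level by level, as strings.
sorted_rows = demo.sort_index().index.tolist()
for row in sorted_rows:
    print(row)
```

"state1" is a prefix of "state1_sum", so the subtotal sorts after the plain state1 rows but before "state2"; the same holds for "cntry1" and "cntry1_sum". "All_sum" begins with an uppercase letter and would sort before "cntry1", which is presumably why the answer appends that row with .loc only after sorting.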

Edit based on the comments:

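def f(x):
    # sum only groups with more than one row; a single-row group
    # returns None, and the resulting all-NaN row is removed by dropna
    if len(x) > 1:
        return x.sum()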

df1 = table.groupby(level=[0,1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0), 
                                       df1.index.get_level_values(1)+ '_sum', 
                                       len(df1.index) * ['']])
print (df1)
                   population              
windspeed                 60     90     120
admin0                                     
cntry1 state1_sum       800.0  210.0    0.0
       state2_sum       250.0  370.0    0.0
cntry2 state3_sum      1010.0  420.0  360.0
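
As a hedged alternative to the apply/dropna combination (a sketch, not the answerer's code), the group sizes can be used as a boolean mask on the summed frame, so single-row groups never produce a subtotal row in the first place. The names grouped, sizes and subtotals are illustrative, and aggfunc="sum" is an assumption.

```python
import pandas as pd

# Sample data from the question, with windspeed already converted to int.
df = pd.DataFrame({
    "admin0": ["cntry1"] * 6 + ["cntry2"] * 5,
    "admin1": ["state1"] * 3 + ["state2"] * 3 + ["state3"] * 4 + ["state4"],
    "admin2": ["city1", "city1", "city2", "city3", "city4", "city4",
               "city5", "city6", "city6", "city6", "city7"],
    "windspeed": [60, 90, 60, 60, 60, 90, 60, 60, 90, 120, 60],
    "population": [700, 210, 100, 70, 180, 370, 890, 120, 420, 360, 740],
})

table = pd.pivot_table(df,
                       index=["admin0", "admin1", "admin2"],
                       columns="windspeed",
                       values="population",
                       aggfunc="sum",   # assumption: totals are wanted
                       fill_value=0)

# Count the admin2 rows per (admin0, admin1) group and keep only the
# sums of groups that contain more than one row.
grouped = table.groupby(level=["admin0", "admin1"])
sizes = grouped.size()
subtotals = grouped.sum().loc[sizes > 1]
print(subtotals)
```

The resulting subtotals frame can then be relabelled with the "_sum" suffix and concatenated exactly as in the answer above.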

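Unfortunately, working with a MultiIndex is not easy. I changed the index names by appending "_sum" so that the index sorts correctly. Is that a problem? - jezrael
Very nice, thank you. I noticed that the script also creates a sum row even when there is only one record for a given admin1 (for example, state4_sum in the example above). This clutters the table, because in my case there are many admin1 values like that. Is there a way to add a sum row only when there is actually something to sum? - Andreampa
Yes, it is possible. One question, though: are there any NaN values in the DataFrame? - jezrael
No, there are 0 values but no NaN values. - Andreampa
Great, have a look at the last edit in my answer: the groups are filtered by length and then summed. When the length is 1 you get a NaN row, so dropna is needed. - jezrael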
