How can I sort a pandas MultiIndex using a custom key function?

3

Suppose I have a DataFrame with a MultiIndex like this:

import pandas as pd
import numpy as np

my_index = pd.MultiIndex.from_product(
    [(3,1,2), ("small", "tall", "medium"), ("B", "A", "C")],
    names=["number", "size", "letter"]
)

df_0 = pd.DataFrame(np.random.rand(27, 2), columns=["x", "y"], index=my_index)

                             x         y
number size   letter
3      small  B       0.950073  0.599918
              A       0.014450  0.472736
              C       0.208064  0.778538
       tall   B       0.979631  0.367234
              A       0.832459  0.449875
              C       0.761929  0.053144
       medium B       0.460764  0.800131
              A       0.355746  0.573813
              C       0.078924  0.058865
1      small  B       0.405209  0.354636
              A       0.536242  0.012904
              C       0.458910  0.723627
       tall   B       0.859898  0.442954
              A       0.109729  0.885598
              C       0.378363  0.220695
       medium B       0.652191  0.685181
              A       0.503525  0.400973
              C       0.454671  0.188798
2      small  B       0.407654  0.168782
              A       0.393451  0.083023
              C       0.073432  0.165209
       tall   B       0.678226  0.108497
              A       0.718348  0.077935
              C       0.595500  0.146271
       medium B       0.719985  0.422167
              A       0.950950  0.532390
              C       0.687721  0.920229

Now I want to sort the index by its levels: first number, then size, and finally letter.
If I do this...
df_1 = df_0.sort_index(level=["number", "size", "letter"], inplace=False)

... the size level is, of course, sorted alphabetically.

                             x         y
number size   letter
1      medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
       small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
       tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
       small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
       tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
       small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
       tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

But I would like it to be sorted by a custom key. I know I can sort the size level with a custom key function, like this:

custom_key = np.vectorize(lambda x: {"small": 0, "medium": 1, "tall": 2}[x])

df_2 = df_0.sort_index(level=1, key=custom_key, inplace=False)

                             x         y
number size   letter
1      small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
2      small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
3      small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
1      medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
2      medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
3      medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
1      tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

But how can I sort by all levels, as for df_1, while using the custom key on the second level?

                             x         y
number size   letter
1      small  A       0.536242  0.012904
              B       0.405209  0.354636
              C       0.458910  0.723627
       medium A       0.503525  0.400973
              B       0.652191  0.685181
              C       0.454671  0.188798
       tall   A       0.109729  0.885598
              B       0.859898  0.442954
              C       0.378363  0.220695
2      small  A       0.393451  0.083023
              B       0.407654  0.168782
              C       0.073432  0.165209
       medium A       0.950950  0.532390
              B       0.719985  0.422167
              C       0.687721  0.920229
       tall   A       0.718348  0.077935
              B       0.678226  0.108497
              C       0.595500  0.146271
3      small  A       0.014450  0.472736
              B       0.950073  0.599918
              C       0.208064  0.778538
       medium A       0.355746  0.573813
              B       0.460764  0.800131
              C       0.078924  0.058865
       tall   A       0.832459  0.449875
              B       0.979631  0.367234
              C       0.761929  0.053144

And how should I define the custom key function so that I can also refer to the level by name in sort_index?
df_3 = df_0.sort_index(level="size", key=custom_key, inplace=False)

This raises KeyError: 'Level size not found'

2 Answers

3

A solution using pandas.CategoricalIndex:

categories_order = ['small', 'medium', 'tall']
categories = pd.CategoricalIndex(df_0.index.levels[1].values,
                                 categories=categories_order,
                                 ordered=True)

df_0.index = df_0.index.set_levels(categories, level=1)

out = df_0.sort_index()

# Output:

print(out)

                             x         y
number size   letter                    
1      small  A       0.994537  0.984409
              B       0.153568  0.529762
              C       0.025857  0.750508
       medium A       0.977558  0.488506
              B       0.727121  0.189481
              C       0.465571  0.606248
       tall   A       0.638590  0.530700
              B       0.181290  0.963911
              C       0.068123  0.911834
2      small  A       0.183484  0.459770
              B       0.587073  0.935522
              C       0.068233  0.135137
       medium A       0.315199  0.047945
              B       0.568500  0.756994
              C       0.581868  0.725463
       tall   A       0.881304  0.739188
              B       0.566962  0.855521
              C       0.677635  0.851877
3      small  A       0.497159  0.804575
              B       0.290587  0.104613
              C       0.343545  0.468724
       medium A       0.868754  0.526125
              B       0.759151  0.439089
              C       0.515183  0.266913
       tall   A       0.486962  0.178156
              B       0.902421  0.133138
              C       0.613683  0.827321
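
Note that assigning to df_0.index as above changes df_0 itself. Below is a minimal sketch of the same idea that leaves the original frame untouched. It assumes the df_0 from the question, rebuilt here with a fixed seed (the seed and the names size_dtype and df_sorted are my additions), and uses pd.CategoricalDtype rather than constructing the CategoricalIndex by hand:

```python
import numpy as np
import pandas as pd

my_index = pd.MultiIndex.from_product(
    [(3, 1, 2), ("small", "tall", "medium"), ("B", "A", "C")],
    names=["number", "size", "letter"],
)
rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
df_0 = pd.DataFrame(rng.random((27, 2)), columns=["x", "y"], index=my_index)

# Ordered categorical dtype that encodes the desired size order.
size_dtype = pd.CategoricalDtype(["small", "medium", "tall"], ordered=True)

# Work on a copy so that df_0 keeps its plain string level.
df_sorted = df_0.copy()
df_sorted.index = df_sorted.index.set_levels(
    df_sorted.index.levels[1].astype(size_dtype), level="size"
)

# With an ordered categorical level, a plain sort_index() follows the category order.
df_sorted = df_sorted.sort_index()
print(df_sorted.head(9))
```

The trade-off is that the size level of df_sorted is now categorical, which may or may not be what later code expects.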

I upvoted your answer, but which do you think is the better choice: index.levels[1].values or index.get_level_values(1)? - william_grisaitis
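
On the comment above: the two accessors return different things, which may help in choosing between them. A short illustration, assuming the index from the question (the variable names are mine):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [(3, 1, 2), ("small", "tall", "medium"), ("B", "A", "C")],
    names=["number", "size", "letter"],
)

unique_sizes = idx.levels[1]         # one entry per distinct size
row_sizes = idx.get_level_values(1)  # one entry per row of the index

print(len(unique_sizes), list(unique_sizes))
print(len(row_sizes), list(row_sizes[:4]))
```

Since set_levels swaps out the distinct values of a level, levels[1] looks like the natural input there; get_level_values(1) is more useful when a label per row is needed.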

2
Ideally, use ordered Categorical data.

Otherwise, you can use a custom mapper based on the level name:

# define here custom sorters
# all other levels will be sorted by default order
order = {'size': ['small', 'medium', 'tall']}

def sorter(s):
    if s.name in order:
        return s.map({k:v for v,k in enumerate(order[s.name])})
    return s

out = df_0.sort_index(level=["number", "size", "letter"], key=sorter)

Output:

                             x         y
number size   letter                    
1      small  A       0.530753  0.687982
              B       0.722848  0.974920
              C       0.174058  0.695016
       medium A       0.397016  0.550404
              B       0.426989  0.843007
              C       0.929218  0.497728
       tall   A       0.159078  0.005675
              B       0.917871  0.384265
              C       0.685435  0.585242
2      small  A       0.423254  0.838356
              B       0.342158  0.209632
...
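
For the last part of the question (sorting a single level by name), a likely explanation for the KeyError is that np.vectorize returns a plain NumPy array, so the level's name is gone by the time pandas looks up "size". This is my reading of the behaviour, not something stated in the answers. A sketch of a key built with Index.map, which keeps the name, assuming pandas 1.1 or later (where the key argument was introduced); size_rank and size_key are hypothetical names:

```python
import numpy as np
import pandas as pd

my_index = pd.MultiIndex.from_product(
    [(3, 1, 2), ("small", "tall", "medium"), ("B", "A", "C")],
    names=["number", "size", "letter"],
)
rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
df_0 = pd.DataFrame(rng.random((27, 2)), columns=["x", "y"], index=my_index)

size_rank = {"small": 0, "medium": 1, "tall": 2}

def size_key(level_values):
    # level_values is a pandas Index holding one level of the MultiIndex;
    # Index.map returns a new Index that carries the same name.
    return level_values.map(size_rank)

# Only the "size" level is passed through the key; it can be addressed by name.
df_3 = df_0.sort_index(level="size", key=size_key)
print(df_3.index.get_level_values("size").unique())
```

This is the same mechanism the sorter function above relies on, restricted to a single level.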
