Pandas MultiIndex sort

Question

In Pandas 0.19 I have a large DataFrame with a MultiIndex of the following form:

          C0     C1     C2
A   B
bar one   4      2      4
    two   1      3      2
foo one   9      7      1
    two   2      1      3
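
For anyone who wants to reproduce the problem, a frame of this shape can be built roughly as follows. This is only a minimal sketch: the values are copied from the table above, and the level names A and B are taken from its header row.

```python
import pandas as pd

# MultiIndex with outer level A (bar, foo) and inner level B (one, two)
idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']],
                                 names=['A', 'B'])

# values copied from the table above
df = pd.DataFrame(
    [[4, 2, 4],
     [1, 3, 2],
     [9, 7, 1],
     [2, 1, 3]],
    index=idx,
    columns=['C0', 'C1', 'C2'],
)
print(df)
```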

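I would like to sort bar and foo (and many other two-row groups like them) by their "two" row, that is, reorder the columns within each group so that the values in its "two" row are ascending. The result should be: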

          C0     C1     C2
A   B
bar one   4      4      2
    two   1      2      3
foo one   7      9      1
    two   1      2      3

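I care about speed, because I have many columns and many pairs of rows. If rearranging the data would make the sort faster, I am happy to do that too. Many thanks.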

- hoelder

2 Answers

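Here is a solution that relies mostly on numpy and should perform well. It first selects only the 'two' rows and argsorts each of them, which gives one column order per group. That order is then assigned to every row of the original DataFrame. Next, a constant offset is added to each row's order (row number times the number of columns), so that the indices address the flattened array of original values. Finally, all original values are reordered with this flattened, offset argsort array and reshaped into a new DataFrame in the desired order.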

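import numpy as np
import pandas as pd

rows, cols = df.shape
# argsort each 'two' row to get the column order for its group
df_a = np.argsort(df.xs('two', level=1))
# repeat that order for every row of the group
order = df_a.reindex(df.index.droplevel(-1)).values
# shift each row's order by its start position in the flattened values
offset = np.arange(len(df)) * cols
order_final = order + offset[:, np.newaxis]
pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)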

Output

         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3
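
The same row-wise gather can probably be written without the manual flatten-and-offset step by using numpy.take_along_axis. This is a swapped-in variant, not the code from the answer above, and it assumes NumPy 1.15 or later and a pandas version that provides to_numpy (0.24 or later), both newer than the Pandas 0.19 mentioned in the question. The names two, order, row_order and result are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# example frame from the question
idx = pd.MultiIndex.from_product([['bar', 'foo'], ['one', 'two']],
                                 names=['A', 'B'])
df = pd.DataFrame([[4, 2, 4], [1, 3, 2], [9, 7, 1], [2, 1, 3]],
                  index=idx, columns=['C0', 'C1', 'C2'])

# column order per group, from argsorting each 'two' row
two = df.xs('two', level='B')
order = pd.DataFrame(np.argsort(two.to_numpy(), axis=1), index=two.index)

# repeat each group's order for every row that belongs to that group
row_order = order.reindex(df.index.get_level_values('A')).to_numpy()

# gather the values row by row; no manual offsets are needed
result = pd.DataFrame(np.take_along_axis(df.to_numpy(), row_order, axis=1),
                      index=df.index, columns=df.columns)
print(result)
```

As in the original approach, the column labels C0, C1, C2 are kept in place and only the values move, so after sorting a label no longer identifies the source column.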

Some speed tests

# create much larger frame
import string
idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])

#scott boston
%timeit df1.groupby(level=0).apply(sortit)
10 loops, best of 3: 199 ms per loop

#Ted
1000 loops, best of 3: 5 ms per loop
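
%timeit is an IPython magic. Outside IPython, a comparison along the same lines can be sketched with the standard-library timeit module. The wrappers sort_groupby and sort_numpy below are hypothetical names introduced here; they wrap the groupby approach and the numpy approach from this thread (the groupby version copies each group instead of modifying it in place). The sketch also checks that both return the same values before timing them. Timings depend on the machine and library versions, so none are quoted.

```python
import string
import timeit

import numpy as np
import pandas as pd

# same frame construction as in the speed test above
idx = pd.MultiIndex.from_product(
    (list(string.ascii_letters), list(string.ascii_letters) + ['two']))
df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3),
                   columns=['C0', 'C1', 'C2'])


def sort_groupby(df):
    # groupby approach; each group is copied so the input is left untouched
    def sortit(x):
        x = x.copy()
        cols = x.columns.values
        x.index = x.index.droplevel()
        x = x.sort_values(by='two', axis=1)
        x.columns = cols
        return x
    return df.groupby(level=0).apply(sortit)


def sort_numpy(df):
    # flatten-and-offset approach
    rows, cols = df.shape
    two = df.xs('two', level=1)
    order = pd.DataFrame(np.argsort(two.values, axis=1), index=two.index)
    order = order.reindex(df.index.droplevel(-1)).values
    offset = np.arange(rows) * cols
    flat = (order + offset[:, np.newaxis]).ravel()
    return pd.DataFrame(df.values.ravel()[flat].reshape(rows, cols),
                        index=df.index, columns=df.columns)


# groupby sorts the group keys, so align the row order before comparing
a = sort_groupby(df1).sort_index()
b = sort_numpy(df1).sort_index()
same = np.array_equal(a.values, b.values) and a.index.equals(b.index)
print('results match:', same)

# best of three single runs for each approach
for fn in (sort_groupby, sort_numpy):
    t = min(timeit.repeat(lambda: fn(df1), number=1, repeat=3))
    print(f'{fn.__name__}: {t * 1e3:.1f} ms')
```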

- Ted Petrou

- Scott Boston · Accepted Answer

Here is one approach, although it is a bit clunky:

Input DataFrame:

         C0  C1  C2
A   B              
bar one   4   2   4
    two   1   3   2
foo one   9   7   1
    two   2   1   3

Custom sort function:

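def sortit(x):
    # work on a copy so the group passed in by groupby is not modified
    x = x.copy()
    xcolumns = x.columns.values
    # drop level A so that 'two' is a plain row label
    x.index = x.index.droplevel()
    # reorder the columns by the values in the 'two' row
    x = x.sort_values(by='two', axis=1)
    # restore the original column labels
    x.columns = xcolumns
    return x

df.groupby(level=0).apply(sortit)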

Output:

         C0  C1  C2
A   B              
bar one   4   4   2
    two   1   2   3
foo one   7   9   1
    two   1   2   3