Sorting a Pandas DataFrame by multiple columns using the key argument

Question

9

I have a pandas DataFrame with the following columns:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

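I want to sort by column 'two' first and then by column 'one'. For the second sort I want a custom rule that orders column 'one' by its alphabetic character [A-Z] and its trailing number [0-100]. The result of the sort should therefore be: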

one two
 A1   1
 B1   1
 A2   1
 A1   2
 B1   2
 A2   2

I have previously sorted a list of strings like those in column 'one' using a sort key such as:

def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)

When I try to apply this rule through pandas sorting, I run into several problems, including:

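DataFrame.sort_values() accepts a key for sorting, just as sort() does, but the key function should be vectorized (according to the pandas documentation). If I apply the sort key to column 'one' only, I get the error "TypeError: cannot convert the series to <class 'int'>".
DataFrame.sort_values() applies the sort key to every column you pass in. That does not work here, because I want column 'two' to be sorted first using its native numeric ordering.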
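The first problem can be reproduced with a short sketch. The names scalar_key, recording_key, and received are illustrative only; the point is that pandas hands the key function a whole column (a Series) rather than one string at a time, so int() is asked to convert a multi-row Series.

```python
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])


def scalar_key(value):
    # Written for a single string such as 'A2', as in list.sort(key=...)
    return (value[0], int(value[1:]))


received = []  # illustrative: records what pandas passes to the key


def recording_key(col):
    received.append((col.name, type(col).__name__))
    return scalar_key(col)


try:
    df.sort_values(by='one', key=recording_key)
    failed = False
except TypeError:
    # value[1:] is a slice of the Series, which int() cannot convert
    failed = True

print(received, failed)
```

A key that works with sort_values therefore has to accept a Series and return a Series (or array-like) of the same length.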

How can I sort the DataFrame as described above?

- user11058068

4 Answers

1

One solution is to convert both columns to pd.Categorical and pass the expected order as the "categories" argument.

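However, I have a requirement that unknown/unexpected values must not be coerced, and unfortunately that is exactly what pd.Categorical does. In addition, None is not supported as a category and is coerced automatically.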
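For reference, the pd.Categorical approach might look roughly like the sketch below. The order list is an assumption chosen to match the ordering requested in the question, and only column 'one' is converted here. The last lines show the coercion mentioned above: a value missing from the categories becomes NaN.

```python
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])

# Assumed custom order for column 'one' (trailing number first, then letter)
order = ['A1', 'B1', 'A2']

# Convert 'one' to an ordered categorical so sorting follows `order`
as_cat = df.assign(
    one=pd.Categorical(df['one'], categories=order, ordered=True))
result = as_cat.sort_values(['two', 'one'])
print(result)

# A value that is not listed in `categories` ('C9') is turned into NaN
coerced = pd.Categorical(['A1', 'C9'], categories=order, ordered=True)
print(coerced.isna())
```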

So my solution is to use a key to sort multiple columns in a custom order:

import pandas as pd


df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

Note that this solution can be slow.
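If speed matters, one possible alternative is a key built only from vectorized string methods. This is a sketch, not part of the answer above: padded_key is a hypothetical name, and it assumes every value in 'one' is a single letter followed by a number of at most three digits. Zero-padding the number makes plain string comparison agree with the (letter, number) ordering of custom_sort. The extra 'A10' row is added here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1],
     ['B1', 2], ['A1', 1], ['A10', 1]],  # 'A10' row added for illustration
    columns=['one', 'two'])


def padded_key(col: pd.Series) -> pd.Series:
    """Hypothetical vectorized key; other columns pass through unchanged."""
    if col.name != 'one':
        return col
    # 'A2' -> 'A002', 'A10' -> 'A010': string order now matches (letter, int)
    return col.str[0] + col.str[1:].str.zfill(3)


sorted_df = df.sort_values(by=['two', 'one'], key=padded_key)
print(sorted_df)
```

Without the padding, 'A10' would sort before 'A2' because the comparison would be character by character.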

- Ievgen

In my case, the suggestion to use pd.Categorical columns was very useful. Thanks. - bli

1

Use str.extract to create temporary columns based on 1) the letters ([a-zA-Z]+) and 2) the number (\d+), sort on them, and then drop them.

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

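# expand=False returns a Series; raw strings avoid invalid-escape warnings
df['one-letter'] = df['one'].str.extract(r'([a-zA-Z]+)', expand=False)
# cast to int so that e.g. '10' sorts after '2'
df['one-number'] = df['one'].str.extract(r'(\d+)', expand=False).astype(int)
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]: 
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2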

- David Erickson

0

With pandas >= 1.1.0 and natsort, you can now also do this:

import natsort

sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())

- Akaisteph7

- Alexander · Accepted Answer

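You can split column one into its component parts, add them to the DataFrame as columns, and then sort by column two followed by those parts. Finally, drop the temporary columns.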

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
       .sort_values(['two', 'rhs', 'lhs'])
       .drop(columns=['lhs', 'rhs']))
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2
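The same idea can be wrapped in a small helper. The sketch below is an assumption-laden variant rather than the answer's own code: sort_two_then_one is a hypothetical name, and it uses a regular expression so that prefixes longer than one letter are also handled. Every value in 'one' is assumed to match letters followed by digits; a non-matching value would produce NaN and make astype(int) fail.

```python
import pandas as pd


def sort_two_then_one(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sort by 'two', then by number and letters of 'one'."""
    # Two capture groups give a two-column frame: 0 = letters, 1 = digits
    parts = frame['one'].str.extract(r'^([A-Za-z]+)(\d+)$')
    return (frame.assign(_lhs=parts[0], _rhs=parts[1].astype(int))
                 .sort_values(['two', '_rhs', '_lhs'])
                 .drop(columns=['_lhs', '_rhs']))


df = pd.DataFrame(
    [['A2', 2], ['B1', 1], ['A1', 2], ['A2', 1], ['B1', 2], ['A1', 1]],
    columns=['one', 'two'])

result = sort_two_then_one(df)
print(result)
```

The original DataFrame is left unchanged because assign returns a new object.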