Sorting a pandas DataFrame by multiple columns using a custom comparison function

Question

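I want to sort a pandas DataFrame by multiple columns. For some of those columns (here "col2" and "col3") I want to use a custom comparison function that takes two elements.

For example: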

>>> df = pd.DataFrame({"col1": [1,2,3], "col2": [[2], [], [1]], "col3": [[1,0,1], [2,2,2], [3]]})

>>> df
   col1 col2       col3
0     1  [2]  [1, 0, 1]
1     2   []  [2, 2, 2]
2     3  [1]        [3]

def compare_fn(l1, l2): #list 1 and list 2 
    if len(l1) < len(l2):
        return -1 # l1 is of smaller value than l2
    if len(l1) > len(l2):
        return 1 # l2 is of smaller value than l1
    else:
        for i in range(len(l1)):
            if l1[i] < l2[i]:
                return -1
            elif l1[i] > l2[i]:
                return 1
    return 0 # l1 and l2 have same value

Now I want to sort by all three columns, using my custom function to compare pairs of elements in col2 and col3. (col1 is sorted in the ordinary way.)

I have tried:

df.sort_values(["col1", "col2", "col3"], key=[None, compare_fn, compare_fn]), which raises a 'list' object is not callable error.

from functools import cmp_to_key; df.sort_values(["col1", "col2", "col3"], key=[None, cmp_to_key(compare_fn), cmp_to_key(compare_fn)]), which raises the same 'list' object is not callable error.

I even tried dropping the first column entirely and passing a single function as the key:

df[["col2", "col3"]].sort_values(["col2", "col3"], key=cmp_to_key(compare_fn)) raises TypeError: object of type 'functools.KeyWrapper' has no len()

and

df[["col2", "col3"]].sort_values(["col2", "col3"], key=compare_fn) raises TypeError: compare_fn() missing 1 required positional argument: 'l2'.

So I have at least one problem: I don't know how to use a two-element comparison function to sort pandas DataFrame columns.

- WalksB

1 Answer

- Quang Hoang · Accepted Answer

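Your key function needs to take the whole Series as input. The key passed to sort_values is applied to each sorted column as a whole and must return a same-length array-like of sort keys; it is not a two-element comparator.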
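To illustrate the point, here is a minimal sketch of what pandas hands to the key function. The column values, the `seen` list and the `lowercase_key` name are made up for this illustration and are not part of the question.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"col1": [3, 1, 2], "col2": ["b", "C", "a"]})

seen = []  # records the type of each object passed to the key function


def lowercase_key(column):
    # `column` is the entire column, not a single element.
    seen.append(type(column).__name__)
    # Return one sort key per row, in the same order as the input.
    return column.str.lower()


result = df.sort_values("col2", key=lowercase_key)
print(seen)
print(result)
```

Because the function receives a Series, a pairwise comparator such as the original compare_fn(l1, l2) cannot be plugged in directly.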

Rewrite your function as follows:

def compare_fn(l):  # l is a whole column (a Series of lists)
    return [(len(x), tuple(x)) for x in l]

(df.sort_values('col1')
   .sort_values(['col2','col3'], 
                key=compare_fn, kind='mergesort')
)

Output:

   col1 col2       col3
1     2   []  [2, 2, 2]
2     3  [1]        [3]
0     1  [2]  [1, 0, 1]
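Why does (len(x), tuple(x)) stand in for the pairwise comparator? Python compares tuples element by element, so the length is compared first and the contents only break ties between lists of equal length. The sketch below checks this with plain Python on a few sample lists. The `samples` list and the `tuple_key` name are assumptions for the illustration, and the comparator is a condensed restatement of the one in the question.

```python
from functools import cmp_to_key


def compare_fn(l1, l2):
    # Pairwise comparator from the question: shorter lists come first,
    # equal-length lists are compared element by element.
    if len(l1) != len(l2):
        return -1 if len(l1) < len(l2) else 1
    for a, b in zip(l1, l2):
        if a != b:
            return -1 if a < b else 1
    return 0


def tuple_key(x):
    # Per-element key used in the answer: length first, then contents.
    return (len(x), tuple(x))


samples = [[2], [], [1], [1, 0, 1], [2, 2, 2], [3], [1, 0, 0]]

by_cmp = sorted(samples, key=cmp_to_key(compare_fn))
by_key = sorted(samples, key=tuple_key)
print(by_cmp == by_key)
```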

Update: we can also rewrite the function so that it works for the other columns too:

def compare_fn(l):  # l is a whole column (a Series)
    return ([(len(x), tuple(x)) for x in l]
                if isinstance(l.iloc[0], list)  # case list
                else l                          # case integer
           )

df.sort_values(['col1','col2','col3'], key=compare_fn)
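If the comparison cannot easily be expressed as a per-element key, one possible workaround is to turn the pairwise comparator into integer ranks and sort on those. This is a sketch under stated assumptions, not part of the answer above: `cmp_rank` and the temporary `_r2`/`_r3` columns are hypothetical names, and col1 is changed to contain a tie so that the tie-breaking on col2 is visible.

```python
from functools import cmp_to_key

import pandas as pd


def compare_fn(l1, l2):
    # Pairwise comparator from the question, condensed.
    if len(l1) != len(l2):
        return -1 if len(l1) < len(l2) else 1
    for a, b in zip(l1, l2):
        if a != b:
            return -1 if a < b else 1
    return 0


def cmp_rank(series, cmp):
    """Hypothetical helper: dense ranks of `series` under comparator `cmp`."""
    values = list(series)
    # Sort positions rather than values, so unhashable lists are no problem.
    order = sorted(range(len(values)),
                   key=cmp_to_key(lambda i, j: cmp(values[i], values[j])))
    ranks = [0] * len(values)
    rank = 0
    for pos in range(1, len(order)):
        # Elements that compare equal share a rank, so later columns break ties.
        if cmp(values[order[pos - 1]], values[order[pos]]) != 0:
            rank += 1
        ranks[order[pos]] = rank
    return pd.Series(ranks, index=series.index)


# Same shape as the question's data, but col1 has a tie (assumption).
df = pd.DataFrame({"col1": [1, 1, 2],
                   "col2": [[2], [], [1]],
                   "col3": [[1, 0, 1], [2, 2, 2], [3]]})

ranked = df.assign(_r2=cmp_rank(df["col2"], compare_fn),
                   _r3=cmp_rank(df["col3"], compare_fn))
result = (ranked.sort_values(["col1", "_r2", "_r3"])
                .drop(columns=["_r2", "_r3"]))
print(result)
```

This keeps the original two-argument function unchanged, at the cost of an extra Python-level sort per column, so it is likely to be slower than a vectorized key on large frames.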