Naturally sort a list, moving alphanumeric values to the end

Question

4

I have a list of strings that I want to sort naturally:

c = ['0', '1', '10', '11', '2', '2Y', '3', '3Y', '4', '4Y', '5', '5Y', '6', '7', '8', '9', '9Y']

In addition to natural sorting, I want to move every entry that is not a purely numeric string to the end. My expected output is:

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '2Y', '3Y', '4Y', '5Y', '9Y']

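Note that everything must be naturally sorted, including the alphanumeric strings.

I know I can use the natsort package to get what I want, but a single natsorted call is not enough. I need two sort calls: one to sort naturally, and another to move the non-numeric strings to the end.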

import natsort as ns
r = sorted(ns.natsorted(c), key=lambda x: not x.isdigit())

print(r)
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '2Y', '3Y', '4Y', '5Y', '9Y']

I would like to know whether natsort can be used cleverly to reduce this to a single sort call.

- cs95

1

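Why worry about doing a second sort? TimSort is optimized for sequences that contain already-sorted subsequences, so the second sort will be very fast. You can also improve performance by eliminating the lambda: key=str.isdigit, reverse=True. - PM 2Ring

@PM2Ring If you could elaborate on that and turn it into an answer, I would appreciate it! - cs95

3 Answers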

2

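I am grateful to Willem Van Onsem for posting his answer. However, I should point out that the original approach is considerably faster (about 3.5x in the timings below). Taking PM 2Ring's suggestion into account, here are some benchmarks comparing the two methods.

Setup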

c = \
['0',
 '1',
 '10',
 '11',
 '2',
 '2Y',
 '3',
 '3Y',
 '4',
 '4Y',
 '5',
 '5Y',
 '6',
 '7',
 '8',
 '9',
 '9Y']
d = c * (1000000 // len(c) + 1)  # approximately 1M elements

Willem's solution

%timeit sorted(d, key=lambda x: (not x.isdigit(), ns.natsort_key(x)))
1 loop, best of 3: 2.78 s per loop

Original (w/ PM 2Ring's enhancement)

%timeit sorted(ns.natsorted(d), key=str.isdigit, reverse=True)
1 loop, best of 3: 796 ms per loop

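The likely explanation for the original version's better performance is that Timsort is highly optimized for nearly sorted lists: after the natural sort, the second pass only has to move the few alphanumeric entries past the numeric ones.

Sanity check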
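The two-pass approach also depends on Python's sort being stable, including when reverse=True is used. Below is a minimal sketch of that behavior using only the standard library. The natural_key helper is a hypothetical stand-in for natsort, written for this illustration, and is not how natsort is implemented.

```python
import re

def natural_key(s):
    # Hypothetical helper (not part of natsort): split the string into
    # digit and non-digit runs and convert the digit runs to ints.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", s)]

c = ['0', '1', '10', '11', '2', '2Y', '3', '3Y', '4', '4Y',
     '5', '5Y', '6', '7', '8', '9', '9Y']

# Pass 1: natural order, with '2Y' landing right after '2'.
naturally_sorted = sorted(c, key=natural_key)

# Pass 2: a stable partition. str.isdigit is True for pure digits, and
# reverse=True puts the True group first while keeping the order that
# pass 1 established inside each group.
result = sorted(naturally_sorted, key=str.isdigit, reverse=True)
print(result)
```

Because the second pass has only two distinct key values, it rearranges very little of an input that is already naturally sorted.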

x = sorted(d, key=lambda x: (not x.isdigit(), ns.natsort_key(x)))
y = sorted(ns.natsorted(d), key=str.isdigit, reverse=True)

all(i == j for i, j in zip(x, y))
True

- cs95

2

You can do this with natsorted and the right key option.

>>> ns.natsorted(c, key=lambda x: (not x.isdigit(), x))
['0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '2Y',
 '3Y',
 '4Y',
 '5Y',
 '9Y']

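The key function returns a tuple whose second element is the original input. Purely numeric strings are placed at the front and all other strings at the back, and each subset is then sorted naturally on its own.

As an aside, Willem Van Onsem's solution uses natsort_key, which has been deprecated since natsort version 3.0.4 (you would see this if you had DeprecationWarning turned on in your interpreter, and the function is no longer documented). It is also rather inefficient. It is better to use natsort_keygen, which returns a natural sort key function. natsort_key calls natsort_keygen under the hood, so for every input you create a new function and then call it only once.

Below I repeat the tests shown in the other answer and add timings for my solution using natsorted, as well as for the other solution using natsort_keygen instead of natsort_key.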
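The cost pattern described above can be sketched without natsort. In the example below, make_natural_key is a hypothetical key factory standing in for natsort_keygen; it is an assumption made for illustration and says nothing about natsort's internals.

```python
import re

def make_natural_key():
    # Hypothetical factory (stand-in for a keygen-style API): does its
    # setup once and returns a reusable key function.
    splitter = re.compile(r"(\d+)")

    def key(s):
        return [int(p) if p.isdigit() else p for p in splitter.split(s)]

    return key

c = ['0', '1', '10', '11', '2', '2Y', '3', '3Y', '4', '4Y',
     '5', '5Y', '6', '7', '8', '9', '9Y']

# Wasteful pattern: a fresh key function is built for every element
# and then called exactly once.
per_call = sorted(c, key=lambda x: (not x.isdigit(), make_natural_key()(x)))

# Preferred pattern: build the key function once and reuse it.
natural_key = make_natural_key()
reused = sorted(c, key=lambda x: (not x.isdigit(), natural_key(x)))

print(per_call == reused)
```

Both calls produce the same ordering; the difference is only in how much setup work is repeated per element.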

In [13]: %timeit sorted(d, key=lambda x: (not x.isdigit(), ns.natsort_key(x)))
1 loop, best of 3: 33.3 s per loop

In [14]: natsort_key = ns.natsort_keygen()

In [15]: %timeit sorted(d, key=lambda x: (not x.isdigit(), natsort_key(x)))
1 loop, best of 3: 11.2 s per loop

In [16]: %timeit sorted(ns.natsorted(d), key=str.isdigit, reverse=True)
1 loop, best of 3: 9.77 s per loop

In [17]: %timeit ns.natsorted(d, key=lambda x: (not x.isdigit(), x))
1 loop, best of 3: 23.8 s per loop

- SethMMorton

- Willem Van Onsem · Accepted Answer

natsort provides a function, natsort_key, that converts an item into a tuple on which the sort is based.

So you can use it like this:

sorted(c, key=lambda x: (not x.isdigit(), *ns.natsort_key(x)))

This produces the following result:

>>> sorted(c, key=lambda x: (not x.isdigit(), *ns.natsort_key(x)))
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '2Y', '3Y', '4Y', '5Y', '9Y']

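It also works without iterable unpacking. In that case each key is a 2-tuple, and when two keys tie on the first item, the results of the natsort_key calls are compared to break the tie: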

sorted(c, key=lambda x: (not x.isdigit(), ns.natsort_key(x)))
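To make the tie-breaking concrete, here is a small sketch of how Python compares such 2-tuple keys. It assumes a hand-written natural_key helper in place of ns.natsort_key, so the exact key values differ from what natsort produces.

```python
import re

def natural_key(s):
    # Hypothetical stand-in for ns.natsort_key: digit runs become ints.
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", s)]

def combined_key(s):
    # First item: False for pure digits, True otherwise (False < True).
    # Second item: the natural key, consulted only when the flags tie.
    return (not s.isdigit(), natural_key(s))

print(combined_key('9'))   # (False, ['', 9, ''])
print(combined_key('10'))  # (False, ['', 10, ''])
print(combined_key('2Y'))  # (True, ['', 2, 'Y'])

# '9' and '10' tie on the flag, so 9 < 10 decides their order.
# '2Y' sorts after '10' because its flag is True, even though 2 < 10.
ordered = sorted(['2Y', '10', '9'], key=combined_key)
print(ordered)  # ['9', '10', '2Y']
```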