GNU sort produces different results on macOS and Linux

Question

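I'm trying to get identical GNU sort output on MacOS Mavericks and Linux CentOS 6.5. I've installed the latest gsort on MacOS via 'brew'. When sorting the exact same file, the two platforms produce different orderings, specifically in how the '#' character is handled. Here are the first few lines of the sorted file, where the difference is clearly visible: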

MacOS brew gsort:

SENT_ID1###de peu ||| and gustav stresemann ||| 1.0
SENT_ID1###en compagnie d' aristide briand ||| only just missed achieving ||| 1.0
SENT_ID1###et de gustav stresemann ||| their aim ||| 1.0
SENT_ID1###il a manqué cet objectif ||| he and aristide briand ||| 1.0
SENT_ID10###dans le même esprit ||| still ||| 1.0
SENT_ID10###de comblement ||| with the same aim of making good ||| 1.0
SENT_ID10###de nos institutions européennes ||| of institutional democracy ||| 1.0

The same file sorted with the 'sort' command on CentOS:

SENT_ID10000###car il constitue l' ||| as it constitutes ||| 1.0
SENT_ID10000###de ce débat ||| of this debate ||| 1.0
SENT_ID10000###nous pensons ||| we think ||| 1.0
SENT_ID10000###que ce paragraphe aurait mérité ||| that this section would have merited ||| 1.0
SENT_ID10000###un des défis majeurs ||| one of the major challenges ||| 1.0
SENT_ID10000###un plus ample développement ||| further development ||| 1.0
SENT_ID10001###à aucune règle si ce n' est celle du marché ||| only to market rules ||| 1.0
SENT_ID10001###ces systèmes complémentaires ||| these supplementary systems ||| 1.0
SENT_ID10001###en augmentation ||| which are increasing ||| 1.0
SENT_ID10001###ne sont soumis ||| are subject ||| 1.0

And so on.

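On CentOS, the '0' character sorts before the '#' character, so the resulting order is completely different. The order I want is the one gsort produces on MacOS. Can anyone tell me why the CentOS sort order is wrong and how to correct it?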

- user2615484

Can you run set | grep LC on each system and post the results? - John Zwinck

Hi John, nothing on either system: no LC environment variables are set on my MacOS machine or on my CentOS machine. - user2615484

Check the output of locale on both systems. - user2845360

1 Answer

- David Moles · Accepted Answer

I've found that sort behaves differently on macOS and Linux even when the locales on the two systems appear to be the same.

If you use LC_COLLATE=C or LC_COLLATE=POSIX, the output should at least be consistent across the two operating systems.
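Applied to data shaped like the question's, this might look like the following minimal sketch. The three sample lines are shortened from the question for illustration, and the variable names are hypothetical. LC_ALL is used instead of LC_COLLATE because LC_ALL, when set, overrides LC_COLLATE, so this form also works if LC_ALL happens to be inherited from the environment.

```shell
# Sample lines modeled on the question's data (shortened, illustrative only).
input='SENT_ID10###de comblement
SENT_ID10000###nous pensons
SENT_ID1###de peu'

# Force byte-wise (C locale) collation for this one sort invocation only.
# In byte order '#' (0x23) sorts before '0' (0x30).
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)

printf '%s\n' "$sorted"
```

Setting the variable on the command itself, instead of exporting it, keeps the rest of the shell session in its normal locale.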

macOS, fr_FR.UTF-8:

$ (export LC_COLLATE=fr_FR.UTF-8; echo -e 'a\nA\na.1\nA.1\na1\nA1\na#1\nA#1' | gsort)
A
A#1
A.1
A1
a
a#1
a.1
a1

Note that uppercase sorts before lowercase, # sorts before ., and . sorts before 1.

Linux (RH6), fr_FR.utf8:

$ (export LC_COLLATE=fr_FR.utf8; echo -e 'a\nA\na.1\nA.1\na1\nA1\na#1\nA#1' | sort)
a
A
a1
a.1
a#1
A1
A.1
A#1

Note that uppercase sorts after lowercase, 1 sorts before ., and . sorts before #.

Now with POSIX:

macOS:

$ (export LC_COLLATE=POSIX; echo -e 'a\nA\na.1\nA.1\na1\nA1\na#1\nA#1' | gsort)
A
A#1
A.1
A1
a
a#1
a.1
a1

Linux:

$ (export LC_COLLATE=POSIX; echo -e 'a\nA\na.1\nA.1\na1\nA1\na#1\nA#1' | sort)
A
A#1
A.1
A1
a
a#1
a.1
a1
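
To confirm on either machine that a given ordering really is byte order, sort's -c option (check only) can be combined with the C locale. This is a sketch under the assumption of a POSIX shell; the variable names are hypothetical, and the eight strings are the ones from the examples above.

```shell
# The eight sample strings in the order shown for LC_COLLATE=POSIX.
expected='A
A#1
A.1
A1
a
a#1
a.1
a1'

# sort -c only checks the input: it exits 0 if the lines are already
# in order under the active collation, and non-zero otherwise.
if printf '%s\n' "$expected" | LC_ALL=C sort -c; then
  status="sorted"
else
  status="unsorted"
fi

echo "$status"
```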

Interestingly, the default sort order on macOS appears to be closer to the POSIX standard. I wonder what the history behind that is.