Using the Linux commands "sort -f | uniq -i" together to ignore case

Question

18

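I am trying to find the unique and the duplicated entries in a list that has two columns of data. I only want to compare the data in the first column.

The data might look like this (tab-separated):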

What are you doing?     Che cosa stai facendo?
WHAT ARE YOU DOING?     Che diavolo stai facendo?
what are you doing?     Qual è il tuo problema amico?

Here is what I have tried so far:

1. Sorting WITHOUT ignoring case (plain "sort", no -f option) gives me fewer duplicates:

gawk 'BEGIN { FS = "\t" } { print $1 }' EN-IT_Corpus.txt | sort | uniq -i -D > dupes

2. Sorting WITH case ignored ("sort -f") gives me more duplicates:

gawk 'BEGIN { FS = "\t" } { print $1 }' EN-IT_Corpus.txt | sort -f | uniq -i -D > dupes

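If I want to find duplicates regardless of case, is #2 the more accurate approach, since it first sorts case-insensitively and then looks for duplicates in the sorted data?

As far as I know, I can't fold the sort and uniq steps into a single command, because sort has no option for displaying duplicates.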
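The difference between the two runs can be reproduced on a tiny input. uniq only compares neighbouring lines, so it can only spot case variants that the sort has already placed next to each other. The sketch below assumes GNU coreutils (uniq -i and -D are GNU extensions) and pins LC_ALL=C so that plain sort orders by byte value; the four-line input and the variable names are invented for illustration.

```shell
# Illustrative input only: two keys, each in an upper- and a lower-case variant.
input=$(printf '%s\n' b A a B)

# Plain sort in the C locale puts the upper-case lines first (A B a b),
# so no two adjacent lines are equal, even when case is ignored.
plain=$(printf '%s\n' "$input" | LC_ALL=C sort | LC_ALL=C uniq -i -D | wc -l)

# sort -f folds case while sorting (A a B b), so the variants become adjacent
# and uniq -i -D reports every one of them.
folded=$(printf '%s\n' "$input" | LC_ALL=C sort -f | LC_ALL=C uniq -i -D | wc -l)

echo "lines reported after sort:    $plain"
echo "lines reported after sort -f: $folded"
```

In locales other than C, plain sort may already place case variants side by side, so the size of the gap between the two counts depends on the locale in use.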

Thanks, Steve

- Steve3p0

1

What output do you want from your sample data? - Jonathan Leffler

3 Answers

9

I think the key is to preprocess the data:

file="EN-IT_Corpus.txt"
dups="dupes.$$"
sed 's/\t.*//' "$file" | sort -f | uniq -i -D > "$dups"
grep -F -i -f "$dups" "$file"

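The sed command reduces each line to the English phrase only; these phrases are sorted case-insensitively and then run through uniq, which also ignores case and prints only the entries that are duplicated. The data file is then processed a second time, using grep -F (historically also spelled fgrep) to find the lines that contain one of the duplicated keys; the patterns to look for are read from the file named by -f $dups. The character in front of .* in the sed command has to be a tab, which may display as a wide gap; depending on your shell and sed, you may be able to write it as \t (GNU sed accepts this).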
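The steps just described can be tried end to end with the sketch below. It is not the exact command from this answer: it swaps sed for cut -f1, whose default delimiter is a tab, so no literal tab has to be typed. It assumes GNU uniq and GNU grep (for -i, -D and reading patterns from standard input with -f -); the temporary file and the variable names are hypothetical.

```shell
# Hypothetical scratch file; the real corpus would be EN-IT_Corpus.txt.
file=$(mktemp)
printf '%s\t%s\n' \
  'What a surprise?'    'Vous etes surpris?' \
  'What are you doing?' 'Che cosa stai facendo?' \
  'WHAT ARE YOU DOING?' 'Che diavolo stai facendo?' \
  'Ambiguous'           'Ambiguere' > "$file"

# cut -f1 keeps the text before the first tab; sort -f makes case variants
# adjacent; uniq -i -D keeps only keys that occur more than once; grep -F -i
# then pulls the full lines for those keys back out of the file.
result=$(cut -f1 "$file" | sort -f | uniq -i -D | grep -F -i -f - "$file")

printf '%s\n' "$result"
rm -f "$file"
```

Note that grep -F matches the key anywhere in the line, so a key that happens to appear inside another phrase or in the second column would also be reported.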

In fact, with GNU grep, you can do it without the temporary file:

sed 's/\t.*//' "$file" |
sort -f |
uniq -i -D |
grep -F -i -f - "$file"

If the number of duplicates is really large, you can squash them down with:

sed 's/\t.*//' "$file" |
sort -f |
uniq -i -D |
sort -f -u |
grep -F -i -f - "$file"
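To see what the extra sort -f -u stage changes, here is a small sketch, assuming GNU coreutils and using invented sample keys. uniq -i -D emits every member of each duplicate group, so a key that occurs three times produces three patterns that differ only in case; sort -f -u keeps one line per group, which shortens the pattern list handed to grep without changing which lines match.

```shell
# Invented keys: "hello" appears three times in different case, "world" once.
keys=$(printf '%s\n' hello HELLO Hello world)

# Every copy of a duplicated key is printed by uniq -i -D.
all_copies=$(printf '%s\n' "$keys" | sort -f | uniq -i -D | wc -l)

# sort -f -u collapses each case-insensitive group to a single line.
squashed=$(printf '%s\n' "$keys" | sort -f | uniq -i -D | sort -f -u | wc -l)

echo "patterns without squashing: $all_copies"
echo "patterns after sort -f -u:  $squashed"
```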

Given the input data:

What a surprise?        Vous etes surpris?
What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
Provacation         Provacatore
what are you doing?        Qual è il tuo problema amico?
Ambiguous        Ambiguere

The output from all of these is:

What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
what are you doing?        Qual è il tuo problema amico?

- Jonathan Leffler

5

Or this:

Unique:

awk -F'\t' '!arr[tolower($1)]++' inputfile > unique.txt

Duplicates:

awk -F'\t' '{ arr[tolower($1)]++ }
END { for (i in arr) if (arr[i] > 1) print i, "count:", arr[i] }' inputfile > dup.txt
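A quick way to check what the two awk commands do is to run them on a few lines held in a variable. This is only a sketch: the sample rows and the variable names are invented, and -F'\t' is used so that $1 is the whole first column rather than just its first word.

```shell
# Invented tab-separated rows; the first two share a key apart from case.
sample=$(printf '%s\t%s\n' \
  'What are you doing?' 'uno' \
  'WHAT ARE YOU DOING?' 'due' \
  'Ambiguous'           'tre')

# Prints a line only the first time its lower-cased key is seen.
first_seen=$(printf '%s\n' "$sample" | awk -F'\t' '!seen[tolower($1)]++')

# Counts each lower-cased key and lists those seen more than once.
dup_keys=$(printf '%s\n' "$sample" |
  awk -F'\t' '{ n[tolower($1)]++ } END { for (k in n) if (n[k] > 1) print k }')

printf 'first occurrences:\n%s\n' "$first_seen"
printf 'duplicated keys:\n%s\n' "$dup_keys"
```

Unlike the sort-based pipelines, this approach needs no sorting, but it reports duplicated keys in lower case and in whatever order awk iterates over the array.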

- jim mcnamara

- stefansson · Accepted Answer

You can keep it simple:

sort -uf
# where sort -u = keep only one line from each group of equal lines
#       sort -f = ignore case when comparing
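One caveat, sketched below under the assumption of GNU sort: without a key, sort -uf compares whole lines, so two rows whose first columns match but whose translations differ are both kept. Restricting the key to the first tab-separated field with -t and -k1,1 is one possible way to compare only that column. The sample rows and variable names are invented, and this only yields one row per key; it does not list the duplicates.

```shell
tab=$(printf '\t')
sample=$(printf '%s\t%s\n' \
  'What are you doing?' 'Che cosa stai facendo?' \
  'WHAT ARE YOU DOING?' 'Che diavolo stai facendo?' \
  'Ambiguous'           'Ambiguere')

# Whole-line comparison: the second columns differ, so nothing is collapsed.
whole_line=$(printf '%s\n' "$sample" | sort -uf | wc -l)

# Key limited to field 1; the global -f option applies to that key.
first_col=$(printf '%s\n' "$sample" | sort -t "$tab" -k1,1 -f -u | wc -l)

echo "lines kept by sort -uf:                $whole_line"
echo "lines kept when keyed on first column: $first_col"
```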