You can deduplicate by sorting the file with sort and then applying uniq.
It looks like the file is already nicely sorted:
$ cat test.csv
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
$ sort test.csv | uniq
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
overflow@domain2.example,2009-11-27 00:58:29.793000000,xx3.net,255.255.255.0
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
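As an aside, sort can do both steps in one pass with its -u flag, which is equivalent to sort | uniq for whole-line duplicates. A minimal sketch (the tiny sample file here is made up, not the test.csv from above):

```shell
# Build a small file with a duplicate line, then deduplicate it in one
# pass: sort -u sorts and suppresses repeated identical lines.
printf 'b,1\na,2\nb,1\n' > /tmp/dedup_demo.csv
sort -u /tmp/dedup_demo.csv
# The duplicate "b,1" appears only once in the sorted output.
```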
You can also use some AWK magic:
$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
stack2@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack4@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
stack3@domain.example,2009-11-27 01:05:47.893000000,xx2.net,127.0.0.1
overflow@domain2.example,2009-11-27 00:58:29.646465785,2x3.net,256.255.255.0
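Note that the AWK command above keys on the first field ($1), so for each e-mail address only the last line read survives, and the for (l in lines) loop emits rows in arbitrary order. If you instead want whole-line deduplication that preserves the original input order, a widely used idiom (not part of the answer above) is:

```shell
# Print each line only the first time it is seen: seen[$0]++ evaluates
# to 0 (false) on the first occurrence, so !seen[$0]++ is true and the
# line is printed; on later occurrences it is non-zero and suppressed.
printf 'b\na\nb\na\n' | awk '!seen[$0]++'
# Output keeps input order: b first, then a.
```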
This is explained in the sort man page (man sort); it stands for the start and end positions of the sort. - Serrano
As sort's manual puts it: "-u, --unique: with -c, check for strict ordering; without -c, output only the first of an equal run." So it is indeed "the first occurrence of the duplicate, before sorting". - Geremia
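That "first of an equal run" behaviour is easiest to see with a key-based sort: with GNU sort, -u combined with a key keeps the first input line of each run of key-equal lines, because -u disables the last-resort whole-line comparison and the underlying sort is stable. A small sketch with made-up data:

```shell
# Two lines share the key "b" in field 1. With -t, -k1,1 -u, GNU sort
# compares only field 1, and of the equal run (b,2 and b,3) it keeps
# the line that appeared first in the input: b,2.
printf 'b,2\na,1\nb,3\n' | sort -t, -k1,1 -u
```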