Comparing two files in the Linux terminal

Question

linux terminal diff file-comparison

201

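There are two files, "a.txt" and "b.txt", each containing a list of words. I want to find out which words in "a.txt" are extra, that is, not present in "b.txt".

I need an efficient algorithm because I have to compare two dictionaries.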

- Ali Imran

37

Isn't diff a.txt b.txt enough? - ThanksForAllTheFish

Can a word occur more than once in each file? Can you sort the files? - Basile Starynkevitch

All I need are the words that are present in "a.txt" but not in "b.txt". - Ali Imran

12 Answers

- Ali Imran · Answer 1

This is my solution:

# Create the working directories under ~ so they match the paths used below
mkdir -p ~/temp ~/results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
# Count the words in each dictionary
wc -l < ~/temp/american-english-dictionary > ~/results/count-american-english-dictionary
wc -l < ~/temp/british-english-dictionary > ~/results/count-british-english-dictionary
# Words present in both dictionaries (-F fixed strings, -x whole lines, -f patterns from file)
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
# Words unique to each dictionary (-v inverts the match)
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
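
For the a.txt and b.txt from the question, the same grep flags can be applied in one step, without the intermediate list of common words. The following is only a sketch under assumptions: one word per line, and the sample contents, the temporary directory, and the variable names are made up for illustration.

```shell
# Hypothetical sample data in a throwaway directory
workdir=$(mktemp -d)
printf '%s\n' one two three four four > "$workdir/a.txt"
printf '%s\n' three two one > "$workdir/b.txt"

# -F: fixed strings, -x: match whole lines, -v: invert the match,
# -f: read the patterns from b.txt.
# Prints the lines of a.txt that do not occur in b.txt (duplicates are kept).
only_in_a=$(grep -Fxvf "$workdir/b.txt" "$workdir/a.txt")
printf '%s\n' "$only_in_a"
```

With this sample data, the only lines of a.txt that are missing from b.txt are the two occurrences of four.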

- James Brown · Answer 2

Using awk. Test files:

$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one

awk:

$ awk '
NR==FNR {                    # first file (b.txt)
    seen[$0]                 # record each word as a key in seen
    next                     # skip to the next line of b.txt
}                            # remaining files (a.txt)
!($0 in seen)' b.txt a.txt   # print words that are not keys in seen

The output contains duplicates:

four
four

To avoid duplicates, add each newly encountered a.txt word to the seen hash as well:

$ awk '
NR==FNR {
    seen[$0]
    next
}
!($0 in seen) {              # if word is not hashed to seen
    seen[$0]                 # hash unseen a.txt words to seen to avoid duplicates 
    print                    # and output it
}' b.txt a.txt

Output:

four
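
The de-duplicating version can also be written more compactly by storing a counter in the array instead of only a key. This is a sketch rather than the original answer's code, and the sample files, the temporary directory, and the variable names are assumptions made for illustration.

```shell
# Hypothetical sample data in a throwaway directory
workdir=$(mktemp -d)
printf '%s\n' one two three four four > "$workdir/a.txt"
printf '%s\n' three two one > "$workdir/b.txt"

# Every b.txt word gets the value 1. For an a.txt word that has not been
# seen yet the value is 0, so !seen[$0]++ is true only on its first
# occurrence; the post-increment then suppresses later repeats.
unique_only_in_a=$(awk 'NR==FNR { seen[$0] = 1; next } !seen[$0]++' "$workdir/b.txt" "$workdir/a.txt")
printf '%s\n' "$unique_only_in_a"
```

Because it also relies on NR==FNR to recognise the first file, this variant assumes that b.txt is not empty.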

If the word lists are comma-separated, like this:

$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three

you need a few extra laps (for loops):

awk -F, '                    # comma-separated input
NR==FNR {
    for(i=1;i<=NF;i++)       # loop all comma-separated fields
        seen[$i]
    next
}
{
    for(i=1;i<=NF;i++)
        if(!($i in seen)) {
             seen[$i]        # this time we buffer output (below):
             buffer=buffer (buffer==""?"":",") $i
        }
    if(buffer!="") {         # output non-empty buffers after each record in a.txt
        print buffer
        buffer=""
    }
}' b.txt a.txt

Output this time:

four
five,six
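
If the grouping by input line does not need to be preserved, one alternative is to split the comma-separated lists into one word per line first and then reuse the line-based awk program. This is a sketch under that assumption; the .words file names, the temporary directory, and the variable names are hypothetical.

```shell
# Hypothetical sample data in a throwaway directory
workdir=$(mktemp -d)
printf '%s\n' 'four,four,three,three,two,one' 'five,six' > "$workdir/a.txt"
printf '%s\n' 'one,two,three' > "$workdir/b.txt"

# Turn every comma into a newline so each word sits on its own line
tr ',' '\n' < "$workdir/a.txt" > "$workdir/a.words"
tr ',' '\n' < "$workdir/b.txt" > "$workdir/b.words"

# Same logic as the line-based version: remember the b words, then print
# each a word that is new, recording it so that repeats are skipped
flat_only_in_a=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen) { seen[$0]; print }' "$workdir/b.words" "$workdir/a.words")
printf '%s\n' "$flat_only_in_a"
```

Unlike the buffered version, this prints one word per line, so five and six appear on separate lines.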