Match words from a word list against a text and count the occurrences

Question

4

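I have an ordinary text file containing fairly arbitrary text. I also have a word list, and I want to compare it against the text file and count how often each word from the list occurs in the text.

For example, my word list might contain the following: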

good
bad 
cupid
banana
apple

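I then want to compare these individual words against my text file, which might look like this:

Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.

I want the output to show how many times each word from the list occurs. I have a way to do this with awk and a for-loop, but I would really rather avoid a for-loop, because my real word list has about 10000 words.

So in this case my output should be (I think) 9, since it counts the total number of occurrences of the words in the list.

By the way, the paragraph is completely random.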

- CrudeCoder

4 Answers

2

An Awk solution:

awk -f cnt.awk words.txt input.txt

where cnt.awk is:

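# First file (words.txt): register each word with a count of zero.
FNR==NR {
    word[$1]=0
    next
}
# Second file (input.txt): collect the whole text into one string.
{
    str=str $0 RS
}
END{
    for (i in word) {
        stri=str
        # Each word is used as a regex, so it also matches inside longer words.
        while(match(stri,i)) {
           stri=substr(stri,RSTART+RLENGTH)
           word[i]++
        }
    }
    for (i in word)
        print i, word[i]
}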

- Håkon Hægland

2

If you don't need a detailed report, this is a faster version of @hek2mgl's answer:

while read word; do
    grep -o "$word" input.txt
done < words.txt | wc -l

If you do need a detailed report, here is another version:

while read word; do
    grep -o "$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

Finally, if you want to match whole words only, you need a stricter pattern in grep:

while read word; do
    grep -o "\<$word\>" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

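However, this way the pattern banana will not match bananas in the text. If you want banana to match bananas, you can anchor the pattern at the start of the word only, like this: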

while read word; do
    grep -o "\<$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'
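The difference between the two anchored patterns can be seen on a single line of text. The following is only a small sketch: the sample string is made up for illustration, and it assumes a grep that supports the \< and \> word anchors (GNU grep does).

```shell
# Hypothetical sample line, not the input file from the question.
sample='a banana, two bananas, and a bandana'

# Whole-word pattern: only the standalone "banana" is counted.
whole=$(printf '%s\n' "$sample" | grep -o '\<banana\>' | wc -l)

# Word-start pattern: the "banana" at the start of "bananas" is counted too.
prefix=$(printf '%s\n' "$sample" | grep -o '\<banana' | wc -l)

echo "whole-word: $whole, word-start: $prefix"
```

Neither pattern matches bandana, because the text after the word start differs from the pattern.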

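I'm not sure whether it would be faster to call grep with several words at once:

paste -d'|' - - - < words.txt | sed -e 's/ //g' -e 's/|*$//' | while read words; do
    grep -oE "\<($words)\>" input.txt
done

This searches for 3 words per grep call. You can try adding more - arguments to paste to match more words at once, for example: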

paste -d'|' - - - - - - - - - - < words.txt | ...

In any case, I'd like to know which solution is the fastest, this one or the awk solution by @HakonHægland.
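Taking the batching idea to its limit, grep can also read the entire list in one call through -f, which removes the shell loop altogether. This is a sketch rather than a measured answer: it assumes GNU grep, a word list with no stray whitespace around the words, and whole-word matching. The scratch directory and the awk formatting are choices made for this illustration only.

```shell
# Work in a scratch directory so the sketch does not touch real files.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' good bad cupid banana apple > words.txt
cat <<'EOF' > input.txt
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

# -f reads every pattern from words.txt, -F treats them as fixed strings,
# -w keeps only whole-word matches, -o prints each match on its own line.
report=$(grep -owF -f words.txt input.txt | sort | uniq -c |
    awk '{ total += $1; print $2 ": " $1 } END { print "total: " total }')
echo "$report"
```

With the sample text from this thread, whole-word matching gives a total of 7 rather than 9, because apples and bananas no longer count, and words with no match at all are simply absent from the report.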

- janos

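Hi :) How would you output a detailed report for each word in the word list? - hek2mgl

You're right, it doesn't do that. But does he need a detailed report? - janos

Yes, it seems so. I have a big problem with my answer: for example, it matches bad even when the word is badest. It doesn't actually match whole words, it matches patterns. - hek2mgl

I noticed, I'm fixing it ;-) - janos

\< is the pattern for the start of a word, \> is the pattern for the end of a word ;-) - janos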

2

For any longer text, I would definitely use this:

perl -nE'BEGIN{open my$fh,"<",shift;my@a=map lc,map/(\w+)/g,<$fh>;@h{@a}=(0)x@a;close$fh}exists$h{$_}and$h{$_}++for map lc,/(\w+)/g}{for(keys%h){say"$_: $h{$_}";$s+=$h{$_}}say"Total: $s"' word.list input.txt

- Hynek -Pichi- Vychodil

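Is there a way to add these numbers up into a total? - CrudeCoder

I used grep and awk to add up all the numbers... I was wondering whether you could do it in one line with the perl command. I currently have this: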

perl -nE'BEGIN{open my$fh,"<",shift;my@a=<$fh>;chomp@a;@h{@a}=(0)x@a;close$fh}exists$h{$_}and$h{$_}++for/(\w+)/g}{say"$_: $h{$_}"for keys%h' word1.list word2.list | grep -o '[0-9]*' | awk '{ sum += $1 } END { print sum }' > temp.txt

- CrudeCoder

Hi, is there a way to make your code ignore case, so that capital letters and mixed-case spellings don't matter? - CrudeCoder
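One way to ignore case is to lower-case both the word list and the text before comparing them. The following is an awk sketch of that idea, not a change to the Perl one-liner above. The sample files are made up for illustration, and it counts whole words only, so apples does not count as apple.

```shell
# Work in a scratch directory so the sketch does not touch real files.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf '%s\n' good bad apple > word.list
printf '%s\n' 'Good apples are GOOD; a Bad apple is bad.' > input.txt

# First file: remember every listed word in lower case.
# Second file: lower-case each line, split it on non-word characters, and
# count the pieces that appear in the list.
counts=$(awk '
    FNR == NR { if ($1 != "") want[tolower($1)] = 1; next }
    {
        n = split(tolower($0), piece, /[^a-z0-9_]+/)
        for (i = 1; i <= n; i++)
            if (piece[i] in want) { count[piece[i]]++; total++ }
    }
    END {
        for (w in want) print w ": " (count[w] + 0)
        print "Total: " (total + 0)
    }
' word.list input.txt)
echo "$counts"
```

Each token is looked up in the array directly, so the text is read once regardless of how long the word list is. The order of the per-word lines is whatever awk's for-in loop produces, so pipe them through sort if a stable order matters.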

- hek2mgl · Accepted Answer

For small to medium-sized texts, you can use grep in combination with wc:

cat <<EOF > word.list
good
bad 
cupid
banana
apple
EOF

cat <<EOF > input.txt
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

while read search ; do
    echo "$search: $(grep -o "$search" input.txt | wc -l)"
done < word.list | awk '{total += $2; print}END{printf "total: %s\n", total}'

Output:

good: 3
bad: 2
cupid: 1
banana: 1
apple: 2
total: 9