How to find the count of multiple words in a text file?

Question

8

I can find the number of occurrences of a single word in a text file. On Linux, for example, we can use:

cat filename|grep -c tom

My question is how to find the count of multiple words, such as "Tom" and "Joe", in a text file.

- Rakesh

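grep counts lines, not words. If a line contains tomtom, should that count as one or two? - tchrist

What exactly do you want? One count per specified word? A total across all specified words? And what do you mean by "word"? As tchrist already mentioned, your example counts lines that match the regular expression, not words. - GreyCat

9 Answers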

3

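Since you have several names, a regular expression is the way to go here. At first I thought a grep count on a regex for joe or tom would be enough, but that does not account for tom and joe appearing on the same line (or two toms on one line).

test.txt: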

tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

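As test.txt shows, 2 is the wrong answer, so we need to account for several names on the same line.

I then used grep -o, which prints only the matched part of each matching line, so every occurrence of tom or joe in the file ends up on its own line. Piping the result to wc -l gives the count.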

$ grep -o '\(joe\|tom\)' test.txt|wc -l
       3

3... the correct answer! Hope this helps.
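If a separate count per name is wanted rather than one total, the same grep -o output can be passed through sort and uniq -c. This is only a minimal sketch: it assumes a grep that supports -o (GNU and BSD grep do, POSIX does not require it), and the function name count_each is made up for illustration.

```shell
# count_each FILE REGEX  (hypothetical helper name)
# Prints how often each distinct match of REGEX occurs in FILE.
count_each() {
    # -o puts every match on its own line, -E enables the 'a|b' syntax;
    # sort groups identical matches so that uniq -c can count them.
    grep -oE "$2" "$1" | sort | uniq -c
}

# Example call on the sample file from above:
# count_each test.txt 'tom|joe'
```

On the sample file this reports 2 for tom and 1 for joe.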

- Travis Nelson

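I modified the regex slightly to handle the TomTom case. Nice test case... thanks for pointing it out. - Travis Nelson

The really hard test cases involve overlapping matches on the original words. For example, if the words you want to count are cure, core, rely, lysis, island, land, and dish, then you get 2 hits in words such as insecurely and outlandish, and 3 hits in words such as islandish and corelysis. A naive approach would count each of those only once. That is no fun with a single regex, but it is easy with N regexes, one per word. - tchrist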
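The "one search per word" idea from the comment above could look roughly like this. It is a sketch, not tchrist's code: count_overlapping is a hypothetical name, and it again assumes a grep with the -o option.

```shell
# count_overlapping FILE WORD...  (hypothetical helper name)
# Runs one fixed-string search per word, so a hit for one word
# cannot hide an overlapping hit for a different word.
count_overlapping() {
    file=$1
    shift
    for word in "$@"; do
        # grep -o prints each occurrence on its own line; wc -l counts them.
        n=$(grep -oF -- "$word" "$file" | wc -l)
        # $((n)) strips the padding that some wc implementations add.
        echo "$word $((n))"
    done
}

# Example call:
# count_overlapping words.txt cure core rely lysis island land dish
```

Each search still counts only non-overlapping occurrences of its own word, which is enough for the examples in the comment.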

1

Using awk (saved here as w.awk):

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}

This produces a complete word-frequency count for the input. Pipe the output to grep to pick out the entries you want.

awk -f w.awk input | grep -E 'tom|joe'
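Note that the grep above matches substrings, so an entry such as "3 tomorrow" would pass as well. A possible variant keeps the filtering inside awk and reports exact words only. This is a sketch: the function name count_exact and the comma-separated word list are assumptions made for illustration.

```shell
# count_exact FILE WORDS  (hypothetical helper; WORDS is a comma-separated list)
count_exact() {
    awk -v want="$2" '
    BEGIN {
        # Turn the comma-separated list into a lookup table.
        n = split(want, list, ",")
        for (k = 1; k <= n; k++) wanted[list[k]] = 1
    }
    {
        # Count only fields that are exactly one of the wanted words.
        for (i = 1; i <= NF; i++)
            if ($i in wanted) count[$i]++
    }
    END {
        # "+ 0" prints 0 instead of an empty string for words never seen.
        for (k = 1; k <= n; k++) print count[list[k]] + 0, list[k]
    }' "$1"
}

# Example call:
# count_exact input 'tom,joe'
```

As in the script above, fields are split on whitespace only, so a word with punctuation attached (for example "tom,") is not counted.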

By the way, you don't need cat in your example. Most programs that act as filters can take a filename as an argument, so it is better to write:

grep -c tom filename

If you don't, there is a good chance that people will start throwing Useless Use of Cat Awards at you ;-)

- Fredrik Pihl

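Most programs that act as filters can take a filename as an argument, and even when they cannot, you can use input redirection (e.g. grep -c tom < filename). - Jan Hudec

grep -c does not look for whole words, so you would have to handle that separately. - Foo Bah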

0

Here is one:

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

Update

A shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list stores pairs: index k holds a line number and index k+1
# holds the number of times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" from the line and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo

- phoxis

0

You can use a regular expression:

 cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"

- Kimvais

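Your solution even accounts for Joe and Tom on the same line. Nice! - Travis Nelson

@Travis: However, it wrongly counts "tomtom" only once, when even my grandfather can see that there are two "tom"s in it. - tchrist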

0

I completely forgot about grep -f:

cat filename | grep -c -f names

AWK solution:

Assuming the names are in a file called names:

cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -

Note that your original grep does not search for words. For example:

$ echo tomorrow | grep -c tom
1

You need to use grep -w.
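A quick way to see the difference, assuming a grep that supports both -w and -o (GNU and BSD grep do): -w rejects the "tom" inside "tomorrow", and -o lists each remaining hit on its own line so that wc -l can count occurrences instead of lines.

```shell
# Counts 2: the two standalone "tom" words, not the one inside "tomorrow".
printf 'tom tomorrow tom\n' | grep -ow tom | wc -l
```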

- Foo Bah

0

gawk -vRS='[^[:alpha:]]+' '{print}' | grep -cE '^(tom|joe|bob|sue)$'

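The gawk program sets the record separator to any run of non-alphabetic characters, so every word ends up on its own line. grep then counts the lines that exactly match one of the words you want.

We use gawk because POSIX awk does not allow a regular expression as the record separator.

For brevity, you can replace '{print}' with 1. Either way it is an Awk program that simply prints every input record ("Is 1 true? Yes? Then perform the default action, which is {print}.")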
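Where gawk is not available, tr can do the word splitting instead. The following is a sketch under that assumption, not part of the original answer, and count_listed is a hypothetical function name:

```shell
# count_listed FILE REGEX  (hypothetical helper name)
# REGEX is an extended-regex alternation such as 'tom|joe'.
count_listed() {
    # tr -cs turns every run of non-letters into a single newline,
    # leaving one word per line; grep -x then counts only the lines
    # that consist entirely of one of the wanted words.
    tr -cs '[:alpha:]' '\n' < "$1" | grep -cxE "$2"
}

# Example call:
# count_listed filename 'tom|joe|bob|sue'
```

Like the gawk version, this treats "tomtom" and "atom" as different words and does not count them.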

- hemflit

0

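The example you gave does not search for the word "tom". It would also count "atom", "bottom", and many other words.
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is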
```
\<\(tom\|joe\)\>
```

- Jan Hudec

0

Find all hits across all lines:

echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3

This counts "tomtom" as 2 hits.
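The question writes the names as "Tom" and "Joe", so a case-insensitive count may be what is wanted. One possible sketch lower-cases a copy of each line before calling gsub; the function name count_ci is hypothetical, and the two names are hard-coded only to mirror the example above.

```shell
# count_ci FILE  (hypothetical helper name)
# Counts every occurrence of tom or joe in FILE, ignoring case.
count_ci() {
    awk '{
        line = tolower($0)                      # work on a lower-cased copy
        total += gsub(/tom|joe/, "", line)      # gsub returns the number of hits
    }
    END { print total + 0 }' "$1"
}

# Example call:
# count_ci filename
```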

- Jotne

- carlpett · Accepted Answer

OK, so first split the file into words, then sort and count the duplicates with uniq:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c

You can use the uniq command:

sort filename | uniq -c