BASH: counting identical lines

4

I have a file with the following content:

VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoicemailButtonTest
VoiceMailConfig60CharsTest
VoicemailDefaultTypeTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoiceMailIconSelectableTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsFromMessageModeScreenTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest
VoicemailSettingsTest

How can I replace the repeated lines with a count, like this:

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

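I would like to put these pairs into an associative array. I tried using "read" inside a while loop, but the array is lost afterwards. Here is my attempt: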

unset line
tests=$(cat file.log)
echo "$tests" | 
    while read l; do 
        if [ "$l" == "${line}" ]; then
            let cnt++;
        else
            echo "${line} (${cnt})"
            line=${l}
            cnt=1
        fi
        export run_suites
    done

1
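You're taking entirely the wrong approach. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and search for UUOC. Also, never use the letter l as a variable name: it looks too much like the number 1 and makes your code hard to read. - Ed Morton
Not accepting an answer is rather rude; please accept one or explain why these answers are not good enough. - emilBeBri
6 Answers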

9
Assuming the output format does not have to match this exactly:
VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

you can simply use:
sort input_file | uniq -c

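If you need each output line formatted exactly as you posted, you can use the following. The order of a for-in loop in awk is unspecified, so pipe the result through sort if the order matters:

awk '{duplicates[$1]++} END{for (ind in duplicates) {print ind,"("duplicates[ind]")"}}' input_file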
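
As a hedged alternative, the plain sort | uniq -c output can also be reshaped into the requested format. The sketch below assumes every line is a single token without spaces, as in the question; count_lines is a hypothetical helper name, and the printf data merely stands in for the real file.

```shell
# count_lines is a hypothetical helper, not part of the original answers.
count_lines() {
    # uniq -c prints "<count> <line>"; awk swaps that into "<line> (<count>)".
    # Assumes each input line is one whitespace-free token.
    sort | uniq -c | awk '{print $2, "(" $1 ")"}'
}

# Small sample standing in for the real input file.
printf '%s\n' VoicemailButtonTest VoicemailButtonTest VoicemailDefaultTypeTest | count_lines
# VoicemailButtonTest (2)
# VoicemailDefaultTypeTest (1)
```

Because the data passes through sort, the result comes out in sort order rather than in awk's unspecified for-in order.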

Edit: posted after anubhava's answer... but I am keeping it (unless someone suggests I delete it) because it adds the sort command.


I'd leave it; I had the same thought about my own answer, which you beat by 12 seconds. - chepner

4
If you don't care about the exact output format, just use sort and uniq:
$ sort file.log | uniq -c
5 VoicemailButtonTest
1 VoiceMailConfig60CharsTest
1 VoicemailDefaultTypeTest
5 VoiceMailIconSelectableTest
2 VoicemailSettingsFromMessageModeScreenTest
7 VoicemailSettingsTest

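The sort is unnecessary if the file is already sorted, as it is in your question. If the file is not sorted, uniq -c still works, but it only treats a line as a duplicate when it is identical to the line immediately before it: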

$ printf 'a\nb\na' | uniq -c
1 a
1 b
1 a
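
For comparison, here is a minimal sketch of the same three lines run through sort first, so that equal lines become adjacent before uniq -c sees them. The variable name counts is only illustrative.

```shell
# Sorting first makes equal lines adjacent, so uniq -c can merge them.
counts=$(printf 'a\nb\na\n' | sort | uniq -c)
printf '%s\n' "$counts"
```

This reports a count of 2 for a and 1 for b; the exact leading padding depends on the uniq implementation.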

3
You can use this simple awk script to get the counts:
awk '{freq[$1]++} END{for (i in freq) print i, "(" freq[i] ")"}' file

VoiceMailConfig60CharsTest (1)
VoicemailSettingsFromMessageModeScreenTest (2)
VoiceMailIconSelectableTest (5)
VoicemailButtonTest (5)
VoicemailDefaultTypeTest (1)
VoicemailSettingsTest (7)

If you want to preserve the input order, use this instead:
awk '!freq[$1]++{order[++k]=$1} END{
    for (i=1; i<=k; i++) print order[i], "(" freq[order[i]] ")"}' file

VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

1
Thanks for the good suggestion, Ed. I forgot that it is a built-in function in gnu-awk. - anubhava

3

Without awk. This keeps the keys in order of first appearance, and the input does not need to be sorted or grouped.

cat -n file      |     # add line numbers for order
sort -k2 -k1,1n  |     # sort by key, then by line no so the first occurrence leads
uniq -f1 -c      |     # count keys, ignoring line no
sort -k2,2n      |     # sort by line no to recover initial order
sed -r 's/^\s*(\S+)\s+(\S+)\s+(\S+)/\3 (\1)/'     # format output as "key (count)"

1
$ awk '$1 != prev{if (NR>1) print prev, "("cnt")"; prev=$1; cnt=0} {cnt++} END{print prev, "("cnt")"}' file
VoicemailButtonTest (5)
VoiceMailConfig60CharsTest (1)
VoicemailDefaultTypeTest (1)
VoiceMailIconSelectableTest (5)
VoicemailSettingsFromMessageModeScreenTest (2)
VoicemailSettingsTest (7)

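The above preserves your input order and stores almost nothing in memory. It does not care whether your input is sorted, as long as all occurrences of each key are contiguous in the input file, as they are in your example.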
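
To illustrate the contiguity requirement, here is a small sketch that wraps the same awk program in a hypothetical function, count_runs, and feeds it input in which a reappears after b. Each run is then counted separately:

```shell
# count_runs is a hypothetical wrapper around the awk program shown above.
count_runs() {
    awk '$1 != prev{if (NR>1) print prev, "("cnt")"; prev=$1; cnt=0} {cnt++} END{print prev, "("cnt")"}'
}

# "a" is not contiguous here, so it is reported as two separate runs.
printf 'a\na\nb\na\n' | count_runs
# a (2)
# b (1)
# a (1)
```

If the keys in your file can be scattered like this, sort the input first or use one of the array-based approaches.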

0

Using a Bash array

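unset tab
declare -A tab                                # associative array: line -> count
while IFS= read -r line; do                   # read each line verbatim
  tab[$line]=$(( ${tab[$line]:-0} + 1 ))      # increment the count for this line
done < infile
for i in "${!tab[@]}"; do                     # quote the keys to avoid word splitting
  echo "$i (${tab[$i]})"
done | sort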
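
The loop above reads from a redirection rather than from a pipe, which is probably why the array survives here but was lost in the question: Bash normally runs each part of a pipeline in a subshell, so variables set inside a piped while loop disappear when the loop ends. When the data comes from a command instead of a file, process substitution keeps the loop in the current shell. A minimal sketch, assuming Bash 4 or newer; the names counts and name are illustrative, and the printf data stands in for the real input:

```shell
#!/usr/bin/env bash
# Assumes Bash 4+ (associative arrays). Sample data stands in for file.log.
declare -A counts=()

# Process substitution instead of a pipe: the loop runs in the current
# shell, so assignments to counts are still visible after the loop.
while IFS= read -r name; do
    counts[$name]=$(( ${counts[$name]:-0} + 1 ))
done < <(printf '%s\n' VoicemailButtonTest VoicemailButtonTest VoicemailDefaultTypeTest)

echo "VoicemailButtonTest (${counts[VoicemailButtonTest]})"
# VoicemailButtonTest (2)
```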
