How do I remove duplicate words from a string in a Bash script?

Question

7

I have a string that contains duplicate words, for example:

abc, def, abc, def

How can I remove the duplicates? The string I need is:

abc, def

- Thanh Tran

Are they all comma-separated? - fedorqui

@fedorqui: I changed my input string and ran the edit command, as described in the last comment. It works fine! Thanks. - Thanh Tran

5 Answers

3

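You can do this with awk. The example below relies on RT, which is a GNU awk extension.

Example:

#!/bin/bash
string="abc, def, abc, def"
# Split on runs of commas/whitespace; print each first-seen word followed by
# the separator that terminated it (RT).
string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
# Remove a trailing ", " (left behind when the last word was a duplicate).
string="${string%, }"
echo "$string"

Output: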

abc, def
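
For awk implementations without RT, the same first-occurrence idea can be sketched with a plain field loop. This is only an illustration under the assumption that the separator is always exactly ", "; the helper name dedupe_awk is made up for this example and is not part of the answer above.

```shell
#!/bin/bash
# dedupe_awk is a hypothetical helper name used only for this sketch.
# Assumes the words are separated by exactly ", ".
dedupe_awk() {
  printf '%s\n' "$1" | awk -F', ' '{
    out = ""
    for (i = 1; i <= NF; i++) {
      # seen[] counts occurrences; only the first occurrence is appended.
      if (!seen[$i]++) {
        out = (out == "" ? $i : out ", " $i)
      }
    }
    print out
  }'
}

string="abc, def, abc, def"
string=$(dedupe_awk "$string")
echo "$string"
```

Because the output is rebuilt word by word, no trailing separator has to be stripped afterwards, and the original word order is kept.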

- Jahid

2

This can also be done in pure Bash:

#!/bin/bash

string="abc, def, abc, def"

declare -A words

IFS=", "
for w in $string; do
  words+=( [$w]="" )
done

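echo "${!words[@]}"

Output

def abc

Explanation

words is an associative array (declare -A words), and each word is added to it as a key:

words+=( [${w}]="" )

The value is not needed, so I used "" as the value.

The list of unique words is the list of keys (${!words[@]}).

Note that the output is not separated by ", ". (You would need to iterate over the keys again. IFS is only used for ${words[*]}, and even then only its first character is used.) The order of the keys is also not guaranteed to match the input order.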
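
As a rough sketch of the extra iteration mentioned in the note, the loop below builds the ", "-separated result while it scans the words, so the input order is kept and the global IFS is left alone. It assumes Bash 4 or later (for associative arrays); the function name dedupe_words and its variable names are invented for this illustration.

```shell
#!/bin/bash
# dedupe_words is a hypothetical helper name used only for this sketch.
dedupe_words() {
  local input=$1 w result=""
  local -A seen=()
  local -a parts
  # IFS is changed only for this read, so the global IFS is untouched.
  IFS=', ' read -r -a parts <<< "$input"
  for w in "${parts[@]}"; do
    # Skip empty fields and words that were already emitted.
    [[ -z $w || -n ${seen[$w]:-} ]] && continue
    seen[$w]=1
    # Prepend ", " only when result already holds a word.
    result+="${result:+, }$w"
  done
  printf '%s\n' "$result"
}

dedupe_words "abc, def, abc, def"
```

The associative array is used only as a "seen" set here; the ordered result lives in a separate string, which is why the key order no longer matters.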

- Micha Wiedenmann

1

I have another way to solve this problem. I changed my input string as shown below and ran this command to edit it:

#string="abc def abc def"
$ echo "abc def abc def" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"
abc, def
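
If the input has to stay comma-separated, a similar pipeline might look like the sketch below. It is an assumption-laden variant rather than the command used above: tr turns both commas and spaces into newlines, and paste joins the sorted words again. As with the original command, the result is sorted, so the input order is not kept.

```shell
#!/bin/bash
string="abc, def, abc, def"

# 1. tr:    turn every comma and space into a newline, squeezing repeats
# 2. sort:  sort the words and drop duplicates
# 3. paste: join the lines back together with commas
# 4. sed:   add a space after each comma
result=$(printf '%s\n' "$string" \
  | tr -s ', ' '\n\n' \
  | sort -u \
  | paste -s -d, - \
  | sed 's/,/, /g')

echo "$result"
```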

Thanks for all the support!

- Thanh Tran

0

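The problem with using the keys of an associative array, or xargs and sort, as in the other examples is that the original word order is not preserved. My solution simply skips words that have already been processed. The associative array map keeps track of them.

Bash function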

function uniq_words() {

  local string="$1"
  local delimiter=", "  
  local words=""

  declare -A map

  while read -r word; do
    # skip already processed words
    if [ ! -z "${map[$word]}" ]; then
      continue
    fi

    # mark the found word
    map[$word]=1

    # don't add a delimiter, if it is the first word
    if [ -z "$words" ]; then
      words=$word
      continue
    fi

    # add a delimiter and the word
    words="$words$delimiter$word"

  # split the string into lines so that we don't have
  # to overwrite the $IFS system field separator
  # (quoted so the newlines survive; \n in the replacement assumes GNU sed)
  done <<< "$(sed -e "s/$delimiter/\n/g" <<< "$string")"

  echo "${words}"
}

Example 1

uniq_words "abc, def, abc, def"

Output:

abc, def

Example 2

uniq_words "1, 2, 3, 2, 1, 0"

Output:

1, 2, 3, 0

Example with xargs and sort

In this example, the output ends up sorted.

echo "1 2 3 2 1 0" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"

Output:

0, 1, 2, 3

- Markus D.

- John1024 · Accepted Answer

Suppose we have this test file:

$ cat file
abc, def, abc, def

To remove the duplicate words:

$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def

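How it works

:a

This defines a label a.

s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g

This looks for a duplicated word made of alphanumeric characters and removes its repeated occurrence.

ta

If the last substitution command made a change, jump back to label a and try again.

This way, the code keeps looking for duplicated words until none are left.

s/(, )+/, /g; s/, *$//

These two substitution commands clean up any leftover comma-space combinations.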
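
To see why the cleanup substitutions are needed, the two stages can be run separately, as in the sketch below. It assumes GNU sed, since \b is a GNU extension; the variable names are invented for this illustration.

```shell
#!/bin/bash
input='abc, def, abc, def'

# Stage 1: only the loop that deletes repeated words.
# The separators that surrounded the deleted words are still present.
after_loop=$(printf '%s\n' "$input" \
  | sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta')

# Stage 2: collapse runs of ", " and drop a trailing separator.
cleaned=$(printf '%s\n' "$after_loop" \
  | sed -E -e 's/(, )+/, /g' -e 's/, *$//')

# Brackets make leftover trailing spaces visible.
printf 'after loop: [%s]\n' "$after_loop"
printf 'cleaned:    [%s]\n' "$cleaned"
```

The first line printed should still show stray separators where the duplicates used to be; the second shows the result after cleanup.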

Mac OSX or other BSD systems

On Mac OSX or another BSD system, try:

sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

Using a string instead of a file

sed can just as easily take its input from a file, as shown above, or from a shell string, as shown below:

$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef