Using sed to remove the words in a stop-word list (feeding sed a list of arguments to delete from a text file)

Question

3

We all know sed is great at finding and replacing every occurrence of a word in a file:

sed -i 's/original_word/new_word/g' file.txt

But could someone show me how to feed sed a list of "original_words" from a file (similar to grep -f)? I just want to replace all of them with '' (that is, delete them).

The original-word list file is just a set of stop words, one per line (wordlist.txt):

a
about
above
according
across
after
afterwards

This is meant as a simple way to strip a stop-word list from a corpus (for data cleaning). file.txt looks like this:

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3

- Chris J. Vargo

Please check this link: http://theunixshell.blogspot.com/2013/01/perls-equivalent-of-grep-f.html - Vijay

5 Answers

1

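First, not every sed supports -i, but that option is not essential, because the same functionality can be provided generically. One simple option (assuming a shell outside the csh family):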

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

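Then, to do the replacement (you did not say how word separators should be handled, so if "foo" is blacklisted, "bar foo baz" ends up with two spaces between "bar" and "baz"), awk or perl makes it straightforward: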

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
perl -wne 'if( @ARGV ){ chomp; push @no, $_; next }
    foreach $x( @no ) { s/$x//g } print' original-words file.txt

If you are happy with the result, you can use perl with -i (not every sed supports -i, but every perl version after 5.0 does), or you can modify the file in place with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either solution is faster than invoking sed once for every word in the blacklist.
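
As a complement to the commands above (and not part of the original answer), here is a minimal sketch of the same two-file awk idea that compares whole whitespace-delimited tokens instead of running a regex per stop word. The function name `remove_stopwords` is hypothetical. It assumes exact, case-sensitive matches are wanted, so "about?" with attached punctuation is left alone, and it rebuilds each line with single spaces, so the original spacing is not preserved.

```shell
# remove_stopwords WORDLIST FILE   (hypothetical helper name)
remove_stopwords() {
    awk '
        NR == FNR { stop[$0]; next }        # first file: load the stop words
        {
            out = ""
            for (i = 1; i <= NF; i++)       # keep only tokens not in the set
                if (!($i in stop))
                    out = out (out == "" ? "" : " ") $i
            print out                       # line is rebuilt with single spaces
        }
    ' "$1" "$2"
}
```

It could be combined with the inline function from this answer, for example `inline file.txt remove_stopwords original-words -`.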

- William Pursell

1

Here is one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

Contents of file:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Results:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.

- Steve

Thanks for the comment! For me, it doesn't like -r: sed: illegal option -- r - Chris J. Vargo

1

Unfortunately, it looks like you are not using GNU sed. You are probably on OSX with BSD sed. If you drop the -r flag, you will need to remove the word boundaries (\b). - Steve

Indeed. Thank you. This is a nice solution for anyone running GNU. - Chris J. Vargo

Invoking sed once per stop word is extremely inefficient, especially when both the stop-word list and the file are large. - Jonathan Leffler
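
One possible way to address the efficiency concern while staying with sed is to join the list into a single alternation and run sed once. This is only a sketch: the function name `strip_stopwords` is hypothetical, it assumes GNU sed (for \b), and it assumes the word list has no blank lines and no characters that are special in a regex or in the s/// delimiter.

```shell
# strip_stopwords WORDLIST FILE   (hypothetical helper; assumes GNU sed)
strip_stopwords() {
    # Join the list into one alternation: a|about|above|...
    pattern=$(paste -s -d '|' "$1")
    # -E enables ( | ) without backslashes; \b is a GNU extension
    sed -E "s/\\b($pattern)\\b//g" "$2"
}
```

Unlike the loop, this reads the file once and writes to standard output, so the result can be inspected before overwriting anything.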

0

Maybe something like this:

#!/bin/sh
while IFS= read -r k
do
  sed -i "s/$k//g" file.txt
done < dict.txt

- Zombo

sed: 1: "file.txt": invalid command code S. Not sure whether it likes $k. - Chris J. Vargo

Sorry, I've added it. - Chris J. Vargo
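
An error of this kind is typical of BSD sed, where -i expects a backup suffix and so treats the script as that suffix and the file name as the script (that this is the cause here is an assumption based on the message). Gluing a suffix directly onto -i is accepted by both GNU and BSD sed. A minimal sketch follows; the wrapper name `remove_listed` is hypothetical, and the loop still runs sed once per word and still matches parts of words.

```shell
# remove_listed DICT FILE   (hypothetical wrapper around the loop above)
remove_listed() {
    while IFS= read -r k; do
        [ -n "$k" ] || continue           # skip blank lines in the list
        sed -i.bak "s/$k//g" "$2"         # suffix attached to -i: GNU and BSD
    done < "$1"
    rm -f "$2.bak"                        # drop the backup once finished
}
```

Usage with the file names from this answer would be `remove_listed dict.txt file.txt`.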

-1

grep -vf wordlist.txt file.txt

- alemol

This deletes every line that contains any stop word, rather than deleting just the stop words. - Jonathan Leffler

It also matches parts of words, so if a is a stop word, there won't be much left... - alexis

I'm not sure, but maybe -x would fix that. - Sridhar Sarnobat
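
For context on the -x suggestion: -x makes grep match whole lines only, so on its own it would drop just the lines that consist of a single stop word. It becomes useful if the text is first split into one token per line. The sketch below shows that idea; the function name `drop_tokens` is hypothetical, and the approach assumes it is acceptable to lose the original line structure and to leave tokens with attached punctuation untouched.

```shell
# drop_tokens WORDLIST FILE   (hypothetical helper)
drop_tokens() {
    # Put each whitespace-separated token on its own line, then drop the
    # lines that exactly equal a stop word.
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    tr -s '[:space:]' '\n' < "$2" | grep -vxFf "$1"
}
```

The output is one surviving token per line, which may already be a convenient shape for later corpus processing.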

- Thor · Accepted Answer

You could also let sed write the sed script for you (tested with GNU sed):

<stopwords sed 's:.*:s/\\b&\\b//g:' | sed -f - file.txt

Output:

05ricardo   RT @shakira: Immigration reform isn't  politics. It's  people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
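
To see what the first sed hands to the second, it can help to print the generated script on its own. The sketch below assumes GNU sed and a word list free of regex metacharacters, `/` and `&`. The names `gen_script` and `apply_script` are hypothetical, and the g flag is placed inside each generated command so that every occurrence on a line is removed.

```shell
# gen_script WORDLIST   (hypothetical helper): one s/// command per stop word
gen_script() {
    sed 's:.*:s/\\b&\\b//g:' "$1"
}

# apply_script WORDLIST FILE   (hypothetical helper): run the generated script
apply_script() {
    gen_script "$1" | sed -f /dev/stdin "$2"
}
```

For a list containing a and about, `gen_script` prints `s/\ba\b//g` and `s/\babout\b//g`, which is the script the second sed then applies to the file.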