Adding custom stopwords in R tm

Question

17

I have a corpus in R built with the tm package, and I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

- Brian

6 Answers

4

Save your custom stop words in a CSV file (for example, word.csv), one word per line.

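library(tm)
# Read the file; with header = FALSE the single column is named V1
custom_stopwords <- read.csv("word.csv", header = FALSE, stringsAsFactors = FALSE)
custom_stopwords <- as.character(custom_stopwords$V1)
# Append tm's built-in list (stopwords() defaults to English)
all_stopwords <- c(custom_stopwords, stopwords())

Then you can apply the combined list to your text.

# `text` is assumed to be a character vector of documents
text <- VectorSource(text)
text <- VCorpus(text)
# Lower-case first: removeWords is case-sensitive
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, all_stopwords)
text <- tm_map(text, stripWhitespace)

# Inspect the first cleaned document
text[[1]]$content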

- Reza Rahimi

Please indent code blocks with 4 spaces (rather than wrapping them in backticks) - YakovL

2

You can also use the textProcessor function from the stm package. It works quite well:

textProcessor(documents, 
  removestopwords = TRUE, customstopwords = NULL)

- Henryk Borzymowski

How do you modify the stop words in the textProcessor function? - nak5120
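One way to approach the question above, as far as I can tell, is to pass a character vector through the customstopwords argument. This is only a minimal sketch: the sample documents and the word "quick" are made up for illustration, and stem = FALSE is set just to keep the output easy to read.

```r
library(stm)

# Hypothetical sample documents, for illustration only
documents <- c("the quick brown fox jumps", "a quick recovery is expected")

# Words listed in customstopwords are removed in addition to the
# built-in stop words (removestopwords = TRUE)
out <- textProcessor(documents,
                     removestopwords = TRUE,
                     customstopwords = c("quick"),
                     stem = FALSE,
                     verbose = FALSE)

# The remaining vocabulary should no longer contain "quick"
print(out$vocab)
```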

2

You can create a vector containing your custom stop words and use a statement like this:

tm_map(abs, removeWords, c(stopwords("english"), myStopWords))

- Jeff J.

Should myStopWords be a list or a character vector? Could you provide the command for creating myStopWords? Would this work: myStopWords <- read.csv('mystop.csv')? - harsha
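On the question above: removeWords expects a character vector, while read.csv returns a data frame, so the column has to be extracted first. Here is a rough sketch using base R only. The helper name read_stop_words and the temporary file standing in for mystop.csv are assumptions for illustration, as is the one-word-per-line, no-header layout.

```r
# Option 1: type the words directly as a character vector
myStopWords <- c("foo", "bar")

# Option 2: read them from a file. read.csv() returns a data frame,
# so pull out the first column as a character vector.
read_stop_words <- function(path) {
  df <- read.csv(path, header = FALSE, colClasses = "character")
  as.character(df[[1]])
}

# A temporary file stands in for a hypothetical mystop.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("foo", "bar"), tmp)

myStopWords <- read_stop_words(tmp)
print(is.character(myStopWords))
print(myStopWords)
```

Either result can then be passed as myStopWords in the tm_map call.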

1

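I am using the stop_words data frame (which ships with the tidytext package) rather than the tm library. I decided to post my solution here in case someone needs it.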

library(tidytext)  # provides the stop_words data frame (columns: word, lexicon)

# Create the custom stop words that should be added
word <- c("quick", "recovery")
lexicon <- rep("custom", times = length(word))

# Create a data frame from the two vectors above
mystopwords <- data.frame(word, lexicon, stringsAsFactors = FALSE)

# Append the custom rows to the stop_words data frame
stop_words <- dplyr::bind_rows(stop_words, mystopwords)
View(stop_words)

- Confusion Matrix

1

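You can add your own stop words to the default stop word list that is installed with tm. The tm package ships with a number of data files, including stop word lists for many languages. You can add, remove, or update words in the english.dat file in the stopwords directory.
The easiest way to find the stopwords directory is to search your system for a directory named "stopwords" using your file browser. There you should find english.dat along with the files for many other languages. Open english.dat in RStudio to edit it; you can add your own words or remove existing ones as needed. The same process applies if you want to edit the stop words for another language.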
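Instead of searching with the file browser, the directory can probably be located from R itself. This is a small sketch that assumes tm is installed and keeps its lists in a stopwords folder inside the package directory. Note that edits made there would likely be overwritten when tm is reinstalled or updated.

```r
# Locate the stop word files that ship with the installed tm package
stopword_dir <- system.file("stopwords", package = "tm")
print(stopword_dir)

# List a few of the per-language files
print(head(list.files(stopword_dir)))

# Full path of the English list mentioned above
english_file <- file.path(stopword_dir, "english.dat")
print(file.exists(english_file))
```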

- BMALURU

- James · Accepted Answer

stopwords just gives you a vector of words, so simply combine your own words with that vector using c().

tm_map(abs, removeWords, c(stopwords("english"),"my","custom","words"))
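For anyone who wants to try this end to end, here is a minimal sketch. The sample documents and the names docs, corpus, all_stops, and cleaned are made up for illustration; only the removeWords call with the combined vector comes from the answer above.

```r
library(tm)

# Hypothetical sample documents, for illustration only
docs <- c("the quick brown fox", "my custom words are gone")
corpus <- VCorpus(VectorSource(docs))

# Built-in English list plus extra words (lower case, matching the text)
all_stops <- c(stopwords("english"), "my", "custom", "words")

corpus <- tm_map(corpus, removeWords, all_stops)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- vapply(corpus, function(d) trimws(content(d)), character(1))
print(cleaned)
```

Since removeWords is case-sensitive, it may be worth lower-casing the corpus first if the text contains capitalised words.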