Removing stop words with tidytext

Question

3

Using tidytext, I have this code:

data(stop_words)
tidy_documents <- tidy_documents %>%
      anti_join(stop_words)

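I want it to take a data frame named tidy_documents and write it back to a data frame of the same name, using the stop words built into the package, but with every word that appears in the stop word list removed.

I get the following error:

Error: No common variables. Please specify `by` param. Traceback: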

1. tidy_documents %>% anti_join(stop_words)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(expr, envir, enclos)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. anti_join(., stop_words)
10. anti_join.tbl_df(., stop_words)
11. common_by(by, x, y)
12. stop("No common variables. Please specify `by` param.", call. = FALSE)

- Simon Lindgren

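Apparently tidy_documents and stop_words do not share any variable names, so you need to match the two datasets with the by argument. - Axeman

The stop_words column is called word, so either give your column that name or use the by argument of anti_join. - alistaire

What are the column names in tidy_documents? If you share that, we can tell you specifically how to set up the join. - Julia Silge

@JuliaSilge The columns in tidy_documents are author; date; word. - Simon Lindgren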

1

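@textnet Hmm, that looks strange. If you have a word column in your main dataset, I would expect anti_join() to know to match it with the word column in the stop_words dataset. Can you try to make a reproducible example with your data? - Julia Silge

@JuliaSilge Thanks, but I think I got it to work, like this: data(stop_words) tidy_base <- anti_join(tidy_base, stop_words, by="word"). Does that look reasonable? - Simon Lindgren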
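
The two options raised in the comments (rename the column, or map the names through by) can be sketched as follows for the case where the word column really does have a different name. This is a minimal sketch, not code from the thread: the data frame docs and its column term are hypothetical, and it assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data frame (columns: word, lexicon)

# Hypothetical data frame whose word column is called `term`, not `word`
docs <- tibble(
  author = c("a", "a", "b"),
  term   = c("the", "whale", "and")
)

# Option 1: map the differing names in `by`
# (left-hand name belongs to docs, right-hand name to stop_words)
kept_by <- docs %>%
  anti_join(stop_words, by = c("term" = "word"))

# Option 2: rename the column first, then join on the shared name
kept_renamed <- docs %>%
  rename(word = term) %>%
  anti_join(stop_words, by = "word")
```

Both variants drop the rows whose term is found in stop_words; they differ only in whether the result keeps the original column name.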

2 Answers

13

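Both tidy_documents and stop_words hold their words in a column named word. The columns sit in different positions, though: in stop_words it is the first column, while in your dataset it comes after the other columns. anti_join() is meant to match on column names rather than positions, but here it failed to pick up the shared column automatically. Naming the key explicitly avoids relying on that detection. Try the following:

tidy_documents <- tidy_documents %>%
      anti_join(stop_words, by = c("word" = "word"))

The by argument forces the join to compare the columns named word, regardless of their position.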

- Vale Baia
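
For readers who want to try the explicit-key join on something small, here is a minimal reproducible sketch. It is not part of the original answer: the toy rows are hypothetical, the column layout (author, date, word) follows what the asker described, and it assumes dplyr and tidytext are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data using the column layout the asker described
tidy_documents <- tibble(
  author = "a",
  date   = as.Date("2020-01-01"),  # placeholder date, illustration only
  word   = c("the", "whale", "of", "harpoon")
)

data(stop_words)  # data frame with columns `word` and `lexicon`

# Keep only rows whose `word` has no match in stop_words$word.
# by = "word" is shorthand for by = c("word" = "word").
cleaned <- tidy_documents %>%
  anti_join(stop_words, by = "word")

print(cleaned)
```

Because the key is named explicitly, the join does not depend on where the word column sits in either table.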

- Rohit · Accepted Answer

You can avoid the confusing anti_join() function by using the simpler filter(), like this:

tidy_documents <- tidy_documents %>%
  filter(!word %in% stop_words$word)
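
To check that the filter() form removes the same rows as the join, the two can be run side by side on a small example. This is a sketch under assumptions rather than a benchmark: the toy data frame is hypothetical, and dplyr and tidytext are assumed to be installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: two stop words and two content words
tidy_documents <- tibble(
  author = c("a", "a", "b", "b"),
  word   = c("the", "whale", "of", "harpoon")
)

# Join-based removal, with the key column named explicitly
via_join <- tidy_documents %>%
  anti_join(stop_words, by = "word")

# Filter-based removal: keep rows whose word is not in the stop word vector
via_filter <- tidy_documents %>%
  filter(!word %in% stop_words$word)

# Compare the surviving words from both approaches
all.equal(via_join$word, via_filter$word)
```

The filter() version only needs the stop_words$word vector, so it sidesteps the question of which columns the two tables share.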