R tm removeWords function not removing words

9

I am trying to remove some words from a corpus I have built, but it doesn't seem to be working. I first run through everything and create a data frame that lists my words in order of their frequency. I use this list to identify the words I am not interested in, and then try to create a new corpus with those words removed. However, the words remain in my data set. I am wondering what I am doing wrong and why these words are not being removed. I have included the full code below:

install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("stringr")
library(stringr) 
library(tm) 
library(SnowballC) 
library(rvest)

# Pull in the data I have been using. 
paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
paperURLs <- paperList %>%
  html_nodes(xpath="//*[@class='search-results-title']/a") %>%
  html_attr("href")
paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
paper_html <- sapply(1:length(paperURLs), function(x) html(paperURLs[x]))

paperText <- sapply(1:length(paper_html), function(x) paper_html[[x]] %>%
                      html_nodes(xpath="//*[@class='article-content']") %>%
                      html_text() %>%
                      str_trim(.))
# Create corpus
paperCorp <- Corpus(VectorSource(paperText))
for(j in seq(paperCorp))
{
  paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
  paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)

paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))

paperCorp <- tm_map(paperCorp, stemDocument)

paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

dtm <- DocumentTermMatrix(paperCorpPTD)

termFreq <- colSums(as.matrix(dtm))
head(termFreq)

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

# After having identified words I am not interested in
# create new corpus with these words removed.
paperCorp1 <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                              "download", "google", "figure",
                                              "fig", "groups","Google", "however",
                                              "high", "human", "levels",
                                              "larger", "may", "number",
                                              "shown", "study", "studies", "this",
                                              "using", "two", "the", "Scholar",
                                              "pubmedncbi", "PubMedNCBI",
                                              "view", "View", "the", "biol",
                                              "via", "image", "doi", "one", 
                                              "analysis"))

paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
dtm1 <- DocumentTermMatrix(paperCorpPTD1)
termFreq1 <- colSums(as.matrix(dtm1))
tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
tf1 <- tf1[order(-tf1[,2]),]
head(tf1, 100)

If you look at tf1 you will see that many of the words specified for removal have not actually been removed. I am wondering what I am doing wrong, and how I can remove these words from my data? NOTE: removeWords is doing something, because the outputs of head(tf, 100) and head(tf1, 100) are not identical. So removeWords seems to remove some instances of the words I am trying to get rid of, but not all of them.

2
There is a typo in your code: paperCorp1 <- tm_map(paperCorp, removeWords, c("the")) should be paperCorp1 <- tm_map(paperCorp1, removeWords, c("the")). - phiver
Hi @phiver, thanks for spotting that. I accidentally left that line in while trying to work out the problem. With it removed I am still facing the same issue: many of the words I am trying to remove from tf1, including "the", are still there. - Adam
It could be because of the capital letters. Try: paperCorp <- tm_map(paperCorp, tolower) - scoa
2 Answers

15

I changed the code a little and added tolower. The stopwords are all lower case, so you need to lower-case the text before removing the stopwords.

paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
# added tolower
paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
# moved stripWhitespace
paperCorp <- tm_map(paperCorp, stripWhitespace)

paperCorp <- tm_map(paperCorp, stemDocument)

Since everything has been converted to lower case, the capitalized words are no longer needed; you can drop them from the removeWords list (see the sketch after the block below).

paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "Article", 
                                               "download", "google", "figure",
                                               "fig", "groups","Google", "however",
                                               "high", "human", "levels",
                                               "larger", "may", "number",
                                               "shown", "study", "studies", "this",
                                               "using", "two", "the", "Scholar",
                                               "pubmedncbi", "PubMedNCBI",
                                               "view", "View", "the", "biol",
                                               "via", "image", "doi", "one", 
                                               "analysis"))

paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)

dtm <- DocumentTermMatrix(paperCorpPTD)

termFreq <- colSums(as.matrix(dtm))
head(termFreq)

tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)

           term  freq
fatty     fatty 29568
pparα     ppara 23232
acids     acids 22848
gene       gene 15360
dietary dietary 12864
scholar scholar 11904

tf[tf$term == "study"]


data frame with 0 columns and 1659 rows

As you can see, study is no longer in the corpus. The other words are gone as well.
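
A side note on that check: with single-bracket indexing and no comma, tf[tf$term == "study"] selects columns (and since every comparison is FALSE it selects none, hence the "0 columns"). A row-wise lookup would look like this and likewise comes back empty once the word has been removed:

# Row-wise lookup of the term; returns no rows because "study" was removed.
tf[tf$term == "study", ]
# or equivalently
subset(tf, term == "study")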


6
If anyone runs into the same error as I did and the above solution still doesn't fix it, try paperCorp <- tm_map(paperCorp, content_transformer(tolower)) instead of paperCorp <- tm_map(paperCorp, tolower). Because tolower() is a base R function, it returns a different structure (it changes the class of the result), so afterwards you cannot use, for example, paperCorp[[j]]$content, only paperCorp[[j]]. This is a bit off topic, but it may help someone.
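
A minimal sketch of the difference, using a small toy VCorpus rather than the question's data:

library(tm)

# Toy corpus for illustration only.
docs <- VCorpus(VectorSource(c("Some TEXT", "More Text")))

# Base tolower() would replace each document with a bare character vector,
# after which docs[[1]]$content no longer works:
# docs <- tm_map(docs, tolower)

# Wrapping it in content_transformer() keeps the TextDocument structure,
# so $content stays accessible and a DocumentTermMatrix can still be built.
docs <- tm_map(docs, content_transformer(tolower))
docs[[1]]$content   # "some text"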
