How do I remove words from a word cloud?

I am creating a word cloud using the wordcloud package in R, with the help of "Word Cloud in R".
I can do this easily enough, but I want to remove certain words from the cloud. I have a file (an Excel file, actually, but I can change that) containing several hundred words, all of which I would like to exclude. Any suggestions?
require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)

# build a corpus from the sixth column of the merged data frame
ap.corpus <- Corpus(VectorSource(as.character(data.merged2[, 6])))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
# content_transformer() keeps the corpus valid in current versions of tm
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))

# term frequencies, sorted, as a data frame for wordcloud()
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
table(ap.d$freq)

You don't have to stop at stopwords("english"): add the words from your Excel file as well. Combine the two word vectors into a single stopword vector, and those words will be excluded from the word cloud. - Tyler Rinker
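A minimal sketch of that suggestion, assuming the custom words have already been read into a character vector (the `my.stops` list here is made up for illustration):

```r
library(tm)

my.stops  <- c("peanut", "cashew", "walnut")     # hypothetical custom list
all.stops <- c(stopwords("english"), my.stops)   # one combined stopword vector

# removeWords() also works directly on character vectors,
# so the combined list can be tried out before touching the corpus
removeWords("the peanut and the cashew", all.stops)
```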
2 Answers


@Tyler Rinker has already given the answer: just add one more removeWords() line. But here it is in a little more detail.

Suppose your Excel file is called nuts.xls and it contains a single column of words, like so:

stopwords
peanut
cashew
walnut
almond
macadamia

In R you can proceed as follows:
     library(gdata) # package with xls import function
     library(tm)
     # now load the excel file with the custom stoplist, note a few of the arguments here 
     # to clean the data by removing spaces that excel seems to insert and prevent it from 
     # importing the characters as factors. You can use any args from read.table(), which is
     # handy
     nuts <- read.xls("nuts.xls", header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

     # now make some words to build a corpus to test for a two-step stopword removal process...
     words1<- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
     words2<- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
     words3<- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
     words.all<-data.frame(rbind(words1,words2,words3))
     words.corpus<-Corpus(DataframeSource((words.all)))

     # now remove the standard list of stopwords, like you've already worked out
     words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
     # now remove the second set of stopwords, this time your custom set from the excel file, 
     # note that it has to be a reference to a character vector containing the custom stopwords
     words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)

     # have a look to see if it worked
     inspect(words.corpus.nostopwords)
     A corpus with 3 text documents

     The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
          create_date creator 
     Available variables in the data frame are:
          MetaID 

     $words1
        , , , , apple, pear, orange, lime, mandarin, , , 

     $words2
        , , , , apple, pear, orange, lime, mandarin, , , 

     $words3
        , , , , apple, pear, orange, lime, mandarin, , , 

Success! The standard stopwords are gone, and so are the words from the custom list in the Excel file. No doubt there are other ways to do this.
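From here, the word cloud itself is drawn the same way as in the question. A self-contained sketch with a tiny already-cleaned toy corpus (the fruit words are placeholders):

```r
library(tm)
library(wordcloud)

# toy corpus standing in for words.corpus.nostopwords above
docs <- c("apple pear orange", "apple lime mandarin", "pear orange lime")
corp <- Corpus(VectorSource(docs))

# term frequencies, sorted, as a data frame for wordcloud()
tdm <- TermDocumentMatrix(corp)
v   <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
d   <- data.frame(word = names(v), freq = v)

wordcloud(d$word, d$freq, min.freq = 1)
```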


Thanks Ben and the Tin Man, some combination of the two of you helped me. I had trouble loading the xls file with gdata because much of it was masked; my problem turned out to be extra spaces in Excel and cells containing multiple words. Regardless, much appreciated. Thanks! - user1108155


Convert the data you want to build the word cloud from into a data frame. Create a CSV file containing the words you want to eliminate and read it in as a data frame. Then you can do an anti_join:

library(dplyr)  # anti_join() comes from dplyr

allWords <- as.data.frame(table(bigWords$Words))
names(allWords) <- c("Words", "Freq")  # table() names its columns Var1/Freq

wordsToAvoid <- read.csv("wordsToDrop.csv")

finalWords <- anti_join(allWords, wordsToAvoid, by = "Words")
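A self-contained sketch of the same approach with inline data instead of the CSV file (the column name "Words" and the sample words are assumptions for illustration):

```r
library(dplyr)

# stand-in for the raw word list
bigWords <- data.frame(Words = c("peanut", "apple", "peanut", "pear", "cashew"))

# frequency table; table() names its columns Var1/Freq, so rename them
allWords <- as.data.frame(table(bigWords$Words))
names(allWords) <- c("Words", "Freq")
allWords$Words <- as.character(allWords$Words)  # avoid factor/character join issues

# stand-in for read.csv("wordsToDrop.csv")
wordsToAvoid <- data.frame(Words = c("peanut", "cashew"))

# keep only rows of allWords with no match in wordsToAvoid
finalWords <- anti_join(allWords, wordsToAvoid, by = "Words")
finalWords  # only apple and pear remain
```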
