R - Sentiment analysis - How to remove specific words

Question

3

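I have the following code to produce clean text for Twitter sentiment analysis. I would like to add another line that removes certain words I don't want included in this analysis, such as "crap", "sick", etc. Could someone tell me how to do this?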

tweets <- searchTwitter("iPhone", n=1500, lang="en")
txt <- sapply(tweets, function(x) x$getText())
txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)
txt <- gsub("@\\w+", "", txt)
txt <- gsub("[[:punct:]]", "", txt)
txt <- gsub("[[:digit:]]", "", txt)
txt <- gsub("http\\w+", "", txt)
txt <- gsub("[ \t]{2,}", "", txt)
txt <- gsub("^\\s+|\\s+$", "", txt)

- Ryo

Ryo, I guess you may have already read this blog post: https://mkmanu.wordpress.com/2014/08/05/sentiment-analysis-on-twitter-data-text-analytics-tutorial/

You can vectorize gsub. See the answer to "Replace multiple arguments with gsub". It would also simplify your code.

1 Answer

- Manoj Kumar · Accepted Answer

2

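With the latest version of the "tm" package in R, you can remove words.

library(tm)
myCorpOld <- Corpus(VectorSource(YourFirstDFonTweet$text))

Note that, for building the corpus, "YourFirstDFonTweet" is the data frame you would have created from the downloaded tweets.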

# words to remove: "crap" and "sick"
wordsToRemove <- c("crap", "sick")

# remove these from the corpus
myCorpUpdate <- tm_map(myCorpOld, removeWords, wordsToRemove)

Hope this gives you an idea of how to approach the problem.

- Manoj Kumar

Is there another way to remove these two words using gsub?

With a single-word pattern, gsub removes one word per call. For example, say you have the tweet data <- c("This is an example tweet. Here is my crap email : emailaddress@try.com. So many crap things here.") and you want to remove the word "crap". Calling gsub("crap", "", data) gives: "This is an example tweet. Here is my  email : emailaddress@try.com. So many  things here." (note the doubled spaces where the word used to be).
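For several words in one pass, a regex alternation is one option. This is only a minimal sketch: the sample tweet and the name words_to_remove are made up for illustration, and the \\b word boundaries are an added assumption so that words such as "scrapbook" are left alone.

```r
# Hypothetical sample tweet and word list (not data from the thread)
data <- c("This is an example tweet. Here is my crap email. So many sick things here.")
words_to_remove <- c("crap", "sick")

# Build a single pattern, \\b(crap|sick)\\b, that matches either word as a whole word
pattern <- paste0("\\b(", paste(words_to_remove, collapse = "|"), ")\\b")

# One gsub call removes every listed word; ignore.case also catches "Crap", "SICK", ...
cleaned <- gsub(pattern, "", data, ignore.case = TRUE, perl = TRUE)
print(cleaned)
```

The removed words leave doubled spaces behind, so a whitespace cleanup step would normally follow.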

Thank you very much, Manoj!

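@Ryo One thing I forgot: removing words with gsub can leave extra spaces behind. If those spaces affect your sentiment score, you can strip them with gsub as well, although they shouldn't have any effect.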

I think txt <- gsub("[ \t]{2,}", "", txt) and txt <- gsub("^\\s+|\\s+$", "", txt) are there to remove spaces? Do I need anything else to remove whitespace?
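One thing worth checking here: replacing a run of spaces with "" glues the neighbouring words together, so a single space is probably the safer replacement. A small sketch under that assumption, using a made-up txt vector rather than real tweets:

```r
# Hypothetical input showing the gaps that word removal leaves behind
txt <- c("  Here is my  email   address  ", "So many\t\tthings here")

# Collapse every run of whitespace into ONE space (replacement is " ", not "")
txt <- gsub("\\s+", " ", txt)

# Then trim whatever whitespace is left at the start or end of each string
txt <- gsub("^\\s+|\\s+$", "", txt)
print(txt)
```

With these two steps in this order, no further whitespace handling should be needed for plain ASCII spaces and tabs.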

I tried your answer, but I got an error: no applicable method for 'tm_map' applied to an object of class "list". Do you have a solution?
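The failing call isn't shown, so this is a guess: that error usually means the first argument to tm_map was a plain list rather than a corpus. A hedged sketch of the corpus route, assuming the tm package is installed and using a made-up txt vector in place of the real tweets:

```r
library(tm)

# Hypothetical stand-in for the cleaned tweet text from the question
txt <- c("my crap phone is great", "feeling sick of this update")

# tm_map() works on a corpus, so wrap the character vector first
corp <- VCorpus(VectorSource(txt))

# removeWords is tm's transformation for dropping whole words
corp <- tm_map(corp, removeWords, c("crap", "sick"))

# Pull the text back out as a plain character vector
cleaned <- vapply(seq_len(length(corp)),
                  function(i) paste(content(corp[[i]]), collapse = " "),
                  character(1))
print(cleaned)
```

If the tweets are still a list (for example straight from searchTwitter), converting them to a character vector first, as the sapply line in the question does, should avoid the error.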