R: no applicable method for 'meta' applied to an object of class "character"


I am trying to run this code (Ubuntu 12.04, R 3.1.1):

# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)

# Place Enron email snippets into a single vector.
text <- c(
  "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
  "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
  "you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
  "This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
  "Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
  "Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
  "We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
  "The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
  "In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)

# Prepare mini-Enron corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus # check corpus

# Mini-Enron corpus with 9 text documents

# Compute a term-document matrix that contains occurrences of terms in each email
# Compute distance between pairs of documents and scale the multidimensional semantic space (MDS) onto two dimensions
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
dist.mat  # check distance matrix

# Compute distance between pairs of documents and scale the multidimensional semantic space onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) +
  geom_point(data = points, aes(x = x, y = y, color = df$view)) +
  geom_text(data = points, aes(x = x, y = y - 0.2, label = row.names(df)))

However, when I run it, I get this error (at the line td.mat <- as.matrix(TermDocumentMatrix(corpus))):

Error in UseMethod("meta", x) : 
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

I am not sure what to look at; all the packages are loaded.


I can't reproduce this. Could it be that you don't have the latest versions of the packages (particularly tm) installed? - David Robinson
@DavidRobinson Which version of tm did you test with? As far as I can tell, 0.6 is the latest. - MrFlick
@MrFlick: My mistake: I installed it last night with install.packages and got tm_0.5-10, but I now realize that is because I am on R 3.0.1 (time to upgrade), and the latest tm requires >= 3.1.0 - David Robinson
4 Answers


The latest version of tm (0.60) no longer lets you use functions that operate on plain character values with tm_map. The problem is therefore your tolower step, since it is not a "canonical" transformation (see getTransformations()). Just replace it with

corpus <- tm_map(corpus, content_transformer(tolower))

The content_transformer function wrapper converts everything back to the correct data type within the corpus. You can use content_transformer with any function that is intended to operate on character vectors so that it works in a tm_map pipeline.
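For instance, here is a minimal sketch (assuming tm >= 0.6; collapseWhitespace is an illustrative helper, not part of tm) of lifting plain character-vector functions into the pipeline:

```r
library(tm)

corpus <- Corpus(VectorSource(c("First DOC here.", "Second DOC!")))

# tolower() operates on plain character vectors, so it must be wrapped:
corpus <- tm_map(corpus, content_transformer(tolower))

# The same wrapper works for any custom character-level function,
# e.g. a gsub-based cleanup (hypothetical helper for illustration):
collapseWhitespace <- function(x) gsub("\\s+", " ", x)
corpus <- tm_map(corpus, content_transformer(collapseWhitespace))
```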


Thanks, but how is this achieved in the newer version? corpus <- tm_map(corpus, stemDocument, language = "english") @MrFlick - Vladimir Stazhilov
@VladimirStazhilov That line should work fine without modification. If you are having trouble, consider opening a new question with a reproducible error. - MrFlick
This also works for me even when I use a custom function that produces some processed plain strings. I just use texts = tm_map(texts, content_transformer(custom_func)) - Rafs


This may be a bit dated, but for future Google searches: there is an alternative. After running corpus <- tm_map(corpus, tolower), you can use corpus <- tm_map(corpus, PlainTextDocument), which converts the data back to the correct type.
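As I read this answer, the call order would look like the following sketch (assuming tm 0.6 and the question's df; the PlainTextDocument step must come after any transformation that returns plain characters):

```r
library(tm)

corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, tolower)            # leaves plain character content...
corpus <- tm_map(corpus, PlainTextDocument)  # ...this restores the document class
td.mat <- as.matrix(TermDocumentMatrix(corpus))
```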


You are a legend, sir!!! Once again I saved a day's work by not ignoring the comments on Stack Overflow :) - Scott85044

I ran into the same problem and eventually found a workaround:
The meta information inside the corpus object appears to get corrupted after transformations are applied to it.
What I do is re-create the corpus at the end of the whole process. To work around other issues, I also wrote a loop that copies the text back into a data frame:
a <- list()
for (i in seq_along(corpus)) {
    a[i] <- gettext(corpus[[i]][[1]])  # Do not use $content here!
}

df$text <- unlist(a)
corpus <- Corpus(VectorSource(df$text))  # This action restores the corpus.
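The loop above can also be condensed into a single pass; a sketch assuming tm >= 0.6, where content() is assumed to return the character text of each document:

```r
library(tm)

# Extract each document's text, then rebuild the corpus so that
# its meta data is created fresh.
df$text <- sapply(corpus, function(d) paste(content(d), collapse = " "))
corpus <- Corpus(VectorSource(df$text))
```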


The order of text manipulations matters. You should remove stopwords before removing punctuation; otherwise contractions such as "can't" lose their apostrophe before the stopword filter can match them.

I use the following to prepare the text. My text is contained in cleanData$LikeMost.

Sometimes, depending on the source, you first need the following:

textData$LikeMost <- iconv(textData$LikeMost, to = "utf-8")

Some stopwords are important, so you can create a revised set.

# Create revised stopwords list
newWords <- stopwords("english")
keep <- c("no", "more", "not", "can't", "cannot", "isn't", "aren't", "wasn't",
          "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't")

newWords <- newWords[!newWords %in% keep]

Then, you can run your tm functions:

like <- Corpus(VectorSource(cleanData$LikeMost))
like <- tm_map(like, PlainTextDocument)
like <- tm_map(like, removeWords, newWords)
like <- tm_map(like, removePunctuation)
like <- tm_map(like, removeNumbers)
like <- tm_map(like, stripWhitespace)
