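Given three term-document matrices (built from text1, text2 and text3 below), I want to compute the frequency of each word in every matrix and combine the results into a single data frame. There are only three samples here, but in practice I have hundreds, so I need to write a function that does this.
For a single term-document matrix, computing word frequencies is easy: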
apply(x, 1, sum)
or
rowSums(as.matrix(x))
I want to build a list of TDMs:
tdm_list <- Filter(function(x) is(x, "TermDocumentMatrix"), mget(ls()))
then compute the frequency of each word and put it into a data frame:
data.frame(lapply(tdm_list, sum)) # This is wrong: it sums the frequencies of all words instead of giving a frequency per word.
and then bind them all together:
do.call(rbind, df_list)
I don't know how to use lapply to compute word frequencies across the TDMs. Here is some sample data:
require(tm)
text1 <- c("apple" , "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats" , "crazy", "cool", "bananas", "functions", "apple")
text3 <- c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")
tdm1 <- TermDocumentMatrix(Corpus(VectorSource(text1)))
tdm2 <- TermDocumentMatrix(Corpus(VectorSource(text2)))
tdm3 <- TermDocumentMatrix(Corpus(VectorSource(text3)))
lapply(tdm_list, rowSums) works. - Rich Scriven
Error in FUN(X[[1L]], ...) : 'x' must be an array of at least two dimensions. I tried this! - vagabond
for (i in c(tdm1, tdm2, tdm3)) { apply(i, 1, sum) } returns Error in apply(i, 1, sum) : dim(X) must have a positive length. - vagabond
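The error reported in the comments suggests that base rowSums is being handed the sparse TDM object directly rather than a dense matrix. A minimal sketch of one possible workaround, assuming every TDM is small enough to convert with as.matrix, is shown below. The helper name term_freq and the column names source, term and freq are hypothetical, chosen only for illustration.

```r
library(tm)

# Sample texts from the question, kept in a named list so the same
# code scales from three inputs to hundreds.
texts <- list(
  text1 = c("apple", "love", "crazy", "peaches", "cool", "coke", "batman", "joker"),
  text2 = c("omg", "#rstats", "crazy", "cool", "bananas", "functions", "apple"),
  text3 = c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")
)

# Build one TermDocumentMatrix per text vector.
tdm_list <- lapply(texts, function(txt) {
  TermDocumentMatrix(Corpus(VectorSource(txt)))
})

# Hypothetical helper: per-term frequency of one TDM as a long data frame.
# as.matrix() densifies the sparse TDM so that base rowSums() accepts it.
term_freq <- function(tdm, source) {
  freq <- rowSums(as.matrix(tdm))
  data.frame(
    source = source,
    term = names(freq),
    freq = unname(freq),
    stringsAsFactors = FALSE
  )
}

# One data frame per TDM, tagged with the name of the TDM it came from.
df_list <- Map(term_freq, tdm_list, names(tdm_list))

# Stack them into a single data frame and drop the generated row names.
all_freq <- do.call(rbind, df_list)
rownames(all_freq) <- NULL
```

The result is in long format, one row per (source, term) pair, which sidesteps the fact that the TDMs do not share the same vocabulary. For very large matrices, densifying with as.matrix may use a lot of memory, so a sparse row-sum routine might be preferable there.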