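The text displays perfectly in the inspect function of the tm package. However, when I look up word frequencies, everything displays incorrectly:
The problem is that some characters are shown as escape codes (such as <U+04D9>) instead of letters, so the words are unreadable. The standard Cyrillic characters are displayed correctly. As a result, the word cloud turns into a complete mess.
Is there any way to assign an encoding to the tm functions? I tried this, but the text itself is fine; the problem only appears when using the tm package.
Let's take a sample text: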
One dream was to create an independent state that would stand equal with other countries of the world, taking its rightful place on the world map - a state that would be a source of pride for its people, with a rich history and a bright future. We have realized this dream. We have created the sovereign state of Kazakhstan. We have defined our national idea as "Mangilik El" (Eternal Nation). This is an idea that unites us, helps us to consolidate our nation, and directs us towards great goals. Together with independence, we have achieved the realization of the eternal aspirations of our people.
My simple code is as follows (based on the tutorial from onertipaday.blogspot.com):
require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
 1  2
44  4
findFreqTerms(ap.tdm, lowfreq=2)
[1] "<U+04D9>лем" "арман" "еді"
[4] "м<U+04D9><U+04A3>гілік"
These words should be: "Әлем", "арман", "еді", "мәңгілік". They are displayed correctly in the output of inspect(ap.corpus).
Any help is greatly appreciated! :)