R tm package: invalid input in 'utf8towcs'

Question

36

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <- tm_map(dataSet, tolower)
Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger (Alt-)�lteppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'

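The problem is that some characters are not valid. I would like to exclude the invalid characters from the analysis, either from within R or before importing the files for processing. I tried using iconv to convert all files to UTF-8 and to drop anything that can't be converted, as follows: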

find . -type f -exec iconv -t utf-8 "{}" -c -o tmpConverted/"{}" \;

as pointed out here: Batch convert Latin-1 files to UTF-8 using iconv

But I still get the same error.

I'd appreciate any help.

- maiaini

14 Answers

25

This is from the tm FAQ:

It will replace the non-convertible bytes in yourCorpus with strings showing their hex codes.

I hope this helps; it works for me.

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

http://tm.r-forge.r-project.org/faq.html
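
To see what `sub = "byte"` does outside of tm, here is a minimal base R sketch. The sample string and the variable names `bad` and `fixed` are made up for illustration, and the explicit `from`/`to` arguments are an assumption (the FAQ line relies on the defaults); behaviour can also differ between iconv implementations.

```r
# A string holding one byte (0xD6, "Ö" in Latin-1) that is not valid UTF-8.
bad <- "(Alt-)\xd6lteppich"
Encoding(bad) <- "UTF-8"   # wrongly declared as UTF-8, as in the corpus

# tolower(bad) would stop with the "invalid input ... in 'utf8towcs'" error.

# Replace each inconvertible byte with its hex code in angle brackets.
fixed <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = "byte")

print(fixed)
print(tolower(fixed))   # no error any more: the string is plain ASCII now
```

The text stays readable and the offending byte remains visible as a hex code, which makes it easier to work out which source encoding the file really had.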

- user1374611

14

I think it is clear by now that the problem is caused by emojis that tolower is not able to understand.

#to remove emojis
dataSet <- iconv(dataSet, 'UTF-8', 'ASCII')
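
One caveat worth checking on a plain character vector: without a `sub` argument, iconv returns NA for the whole element when any character cannot be converted. A small sketch, assuming made-up sample tweets (`tweets`, `strict` and `lenient` are illustrative names, not from the answer):

```r
tweets <- c("plain ascii text", "oil spill \U0001F622 in the Gulf")

# Without `sub`, an element with an inconvertible character becomes NA.
strict <- iconv(tweets, from = "UTF-8", to = "ASCII")

# With sub = "", only the offending bytes are dropped; the rest is kept.
lenient <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

print(strict)
print(lenient)
```

If losing whole tweets is not acceptable, the `sub = ""` form is probably the safer variant of this approach.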

- Saurabh Yadav

10

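I've just run into this problem. By any chance, are you using a machine running OSX? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ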

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform

I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

This forces iconv to use the UTF-8 encoding on the Macintosh and works fine without the need to recompile.

- Kenton

8

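I've often run into this issue, and this Stack Overflow post is always what comes up first. I've used the top solution before, but it can strip out characters and replace them with garbage (like converting it’s to itâ€™s).

I have found that there is actually a much better solution for this! If you install the stringi package, you can replace tolower() with stri_trans_tolower() and then everything should work fine.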

- Jacqueline Nolis

4

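I have been running this on a Mac and, to my frustration, I had to identify the offending record (these were tweets) to resolve it. Since there is no guarantee that the same record will be the problem next time, I used the following function: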

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

as suggested above.

It worked like a charm.
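
For anyone who does want to find the offending records first, here is a minimal sketch using base R's validUTF8(), which is available in reasonably recent versions of R. The sample vector `records` and the name `bad_idx` are made up for illustration.

```r
# Three sample records; the second contains a raw Latin-1 byte (0xD6)
# that is not valid UTF-8.
records <- c("first tweet", "broken \xd6l tweet", "third tweet")

# validUTF8() returns TRUE/FALSE per element; keep the indices that fail.
bad_idx <- which(!validUTF8(records))

print(bad_idx)
```

The indices can then be used to inspect, re-encode, or drop just those records before building the corpus.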

- Krishna Vedula

2

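This is a common issue with the tm package (1, 2, 3).

One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or to use gsub in R). For example, you'd search for and replace all instances of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but if you have thousands of individual text files, this is obviously no good.

For an R solution, I found that using `VectorSource` instead of `DirSource` seems to solve the problem: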

# I put your example text in a file and tested it with both ANSI and 
# UTF-8 encodings, both enabled me to reproduce your problem
#
tmp <- Corpus(DirSource('C:\\...\\tmp/'))
tmp <- tm_map(tmp, tolower)
Error in FUN(X[[1L]], ...) : 
  invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
# quite similar error to what you got, both from ANSI and UTF-8 encodings
#
# Now try VectorSource instead of DirSource
tmp <- readLines('C:\\...\\tmp.txt') 
tmp
[1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
# looks ok so far
tmp <- Corpus(VectorSource(tmp))
tmp <- tm_map(tmp, tolower)
tmp[[1]]
rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
# seems like it's worked just fine. It worked best for ANSI encoding. 
# There was no error with UTF-8 encoding, but the Ö was returned 
# as ã– which is not good

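But this seems like a bit of a lucky coincidence. There must be a more direct way to go about it. Do let us know what works for you!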

- Ben

1

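Thanks for your reply, Ben! For some reason, the same line of code that failed for me before works now. I don't know if this is another lucky coincidence :) I didn't change anything, I just re-ran it, and this time it works without any problem. - maiaini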

2

The previous suggestions didn't work for me. I investigated further and found an approach that worked at the following URL: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/

#Create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Apply it for substituting the regular expression given in one of the former answers by " "
your_corpus<- tm_map(your_corpus,toSpace,"[^[:graph:]]")

# the tolower transformation worked!
your_corpus <- tm_map(your_corpus, content_transformer(tolower))
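
To illustrate what the "[^[:graph:]]" pattern matches, here is a small base R sketch on a plain ASCII string (the sample text and the names `txt` and `cleaned` are made up). Note that the space character itself is not in [:graph:], so spaces are simply replaced by spaces, while tabs and newlines are turned into spaces too.

```r
# Sample text with a tab, a double space and a newline.
txt <- "RT @user\tOil  spill\nnews"

# Replace every character that is not a visible (graphical) character
# with a single space.
cleaned <- gsub("[^[:graph:]]", " ", txt)

print(cleaned)
```

How this class treats non-ASCII letters can depend on the locale and the regex engine, so it is worth testing on a sample of the real data before applying it to the whole corpus.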

- vicarizmendi

1

If it's all right to ignore invalid inputs, you could use R's error handling. For example:

  dataSet <- Corpus(DirSource('tmp/'))
  dataSet <- tm_map(dataSet, function(data) {
     #ERROR HANDLING
     possibleError <- tryCatch(
         tolower(data),
         error=function(e) e
     )

     # if(!inherits(possibleError, "error")){
     #   REAL WORK. Could do more work on your data here,
     #   because you know the input is valid.
     #   useful(data); fun(data); good(data);
     # }
  })
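
The same idea can be tried on a plain character vector. This is only a sketch: the helper `safe_tolower` and the sample vector `docs` are hypothetical names, and returning NA for elements that fail is an assumption about what "ignoring" should mean.

```r
# Lower-case each element on its own; if tolower() fails on an element,
# return NA for that element instead of stopping.
safe_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

docs <- c("Hello World", "broken \xd6l tweet")
Encoding(docs) <- "UTF-8"   # the second element is not valid UTF-8

result <- safe_tolower(docs)
print(result)
```

Elements that come back as NA can then be dropped or inspected separately.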

There's an additional example here: http://gastonsanchez.wordpress.com/2012/05/29/catching-errors-when-using-tolower/

- Rose Perrone

1

The official FAQ doesn't seem to work in my situation:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))

Finally, I got it working with a for loop and the Encoding function:

for (i in seq_along(dataSet))
{
  # mark each document in dataSet as UTF-8
  Encoding(dataSet[[i]]) <- "UTF-8"
}
corpus <- tm_map(dataSet, tolower)

- pudding

- David · Accepted Answer

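None of the above answers worked for me. The only way to work around this problem was to remove all non-graphical characters (http://stat.ethz.ch/R-manual/R-patched/library/base/html/regex.html).

The code is this simple: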

library(stringr)  # provides str_replace_all()
usableText <- str_replace_all(tweets$text, "[^[:graph:]]", " ")