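I am trying to import a CSV file encoded in OEM-866 (a Cyrillic character set) into R on Windows. I also have a copy that has been converted to UTF-8 without a BOM. Every other application on my system can read both files once the encoding is specified.
Moreover, R on Linux reads these files fine with the encodings specified. On Windows I can also read the CSV files if I do not specify the "fileEncoding" parameter, but the result is unreadable text. When I do specify the file encoding on Windows, I always get the following errors, for both the OEM and the Unicode file: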
Original OEM file import:
> oem.csv <- read.table("~/csv1.csv", sep=";", dec=",", quote="",fileEncoding="cp866") #result: failure to import all rows
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
invalid input found on input connection '~/Revolution/RProject1/csv1.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
number of items read is not a multiple of the number of columns
UTF-8 without BOM file import:
> unicode.csv <- read.table("~/csv1a.csv", sep=";", dec=",", quote="",fileEncoding="UTF-8") #result: failure to import all rows
Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
invalid input found on input connection '~/Revolution/RProject1/csv1a.csv'
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
number of items read is not a multiple of the number of columns
Locale info:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
What is it about R on Windows that causes this? I have tried nearly everything I can think of, short of giving up on Windows.
Thank you.
(Other failed attempts):
> Sys.setlocale("LC_ALL", "en_US.UTF-8") #OS reports request to set locale to "en_US.UTF-8" cannot be honored
> options(encoding="UTF-8") #now nothing can be imported
> noarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="") #result: mangled cyrillic
> encarg.unicode.csv <- read.table("~/Revolution/RProject1/csv1a.csv", sep=";", dec=",", quote="",encoding="UTF-8") #result: mangled cyrillic
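[...] head command, the output is garbled. This makes me think the problem lies in how the console displays non-Latin characters rather than in the import itself. - Alex Popov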
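
One way to test that suspicion, as a minimal sketch: inspect the imported strings by code point instead of trusting how the console prints them. The sample bytes and the temporary file below are made up for illustration and only stand in for csv1a.csv; the idea is that readLines() with encoding="UTF-8" does not re-encode the input, it only marks the strings as UTF-8.

```r
# Hypothetical sample data: the UTF-8 bytes of the Cyrillic word "тест".
# In practice, point readLines() at the real file instead.
bytes <- as.raw(c(0xd1, 0x82, 0xd0, 0xb5, 0xd1, 0x81, 0xd1, 0x82))
tmp <- tempfile(fileext = ".csv")
writeBin(bytes, tmp)

# Read the bytes as-is and mark the resulting strings as UTF-8.
# No conversion to the native (e.g. 1252) encoding happens here.
x <- readLines(tmp, encoding = "UTF-8", warn = FALSE)

# How R has marked the string, and whether its bytes are valid UTF-8.
print(Encoding(x))
print(validUTF8(x))

# Unicode code points of the characters; these do not depend on the console.
code_points <- utf8ToInt(x)
print(code_points)

unlink(tmp)
```

If the code points fall in the Cyrillic block (U+0400 to U+04FF, that is 1024 to 1279) while the printed text still looks mangled, the import is probably fine and the display is the likely culprit.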