Accented characters in a data frame

Question

6

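I'm confused about why certain characters (e.g. "Ě", "Č", and "ŝ") lose their diacritical marks in a data frame, while others (e.g. "Š" and "š") do not. My OS is Windows 10, by the way. In the sample code below, the vector czechvec holds 11 single-character strings, all Slavic accented characters. R displays those characters properly. Then a data frame, mydf, is created with czechvec as the second column (wrapped in I() so it is not converted to a factor). But when R displays mydf, or any row of mydf, it converts most of these characters to their plain-ASCII equivalents; e.g. mydf[3,] shows the character as "E" instead of "Ě". Subscripting by row and column, as in mydf[3,2], displays the accented character ("Ě") correctly. Why should it matter whether R displays a whole row or a single cell? And why are some characters, like "Š", completely unaffected? Also, when I write this data frame to a file, the accents are lost entirely, even though I specify fileEncoding="UTF-8".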

> charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
> hexvals  <- as.hexmode(charvals)
> czechvec <- unlist(strsplit(intToUtf8(charvals), ""))
> czechvec
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"
> 
> mydf = data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
> mydf
   dec char  hex
1  193    Á 00C1
2  269    c 010D
3  282    E 011A
4  268    C 010C
5  262    C 0106
6  263    c 0107
7  348    S 015C
8  349    s 015D
9  350    S 015E
10 352    Š 0160
11 353    š 0161
> mydf[3,2]
[1] "Ě"
> mydf[3,]
  dec char  hex
3 282    E 011A
> 
> write.table(mydf, file="myfile.txt", fileEncoding="UTF-8")
> 
> df2 <- read.table("myfile.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")
> df2[3,2]
[1] "E"

Edited to add: per Ernest A's answer, this behaviour cannot be reproduced on Linux. It must be a Windows issue. (I'm using R 3.4.1 for Windows.)
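One way to separate a display problem from a data problem is to read the code points back out of the column without going through the data frame print method. The following is a minimal sketch using only base R; it rebuilds the vector from the question with `intToUtf8(..., multiple = TRUE)` (a shortcut not used in the original code) and compares what is stored against the original code points.

```r
# Rebuild the question's vector; multiple = TRUE returns one string per code point.
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
czechvec <- intToUtf8(charvals, multiple = TRUE)
mydf <- data.frame(dec = charvals, char = I(czechvec))

# Read the code points back out of the data frame column, bypassing print().
stored <- vapply(unclass(mydf$char), utf8ToInt, integer(1), USE.NAMES = FALSE)

# TRUE here means the strings held in the data frame are intact,
# so any missing accents are introduced when the values are formatted.
identical(stored, as.integer(charvals))

# Shows how R has marked each string ("UTF-8", "latin1" or "unknown").
Encoding(unclass(mydf$char))
```

If the comparison is TRUE while the printed data frame still shows plain letters, the loss is happening during formatting for output rather than in storage, which matches the observation that mydf[3,2] displays correctly.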

- Montgomery Clift

3 Answers

1

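Thanks to Ernest A for his answer, which confirms that the strange behaviour I observed does not occur on Linux. I googled "R WINDOWS UTF-8 BUG", which led me to this article by Ista Zahn: Escaping from character encoding hell in R on Windows.

The article confirms that there is a bug in the data.frame print method on Windows, and gives some workarounds. (However, the article does not mention the problem with write.table on Windows for data frames containing foreign-language text.)

One workaround mentioned by Zahn is to change the locale to suit the particular language we are working with: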

Sys.setlocale(category = "LC_CTYPE", locale = "czech")
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
hexvals  <- format(as.hexmode(charvals), width=4, upper.case=TRUE)
df1      <- data.frame(dec=charvals, char=I(unlist(strsplit(intToUtf8(charvals), ""))), hex=I(hexvals))

print.listof(df1)

dec :
 [1] 193 269 282 268 262 263 348 349 350 352 353

char :
 [1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"

hex :
 [1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"

df1
   dec char  hex
1  193    Á 00C1
2  269    č 010D
3  282    Ě 011A
4  268    Č 010C
5  262    Ć 0106
6  263    ć 0107
7  348    S 015C
8  349    s 015D
9  350    Ş 015E
10 352    Š 0160
11 353    š 0161

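Notice that the Czech characters are now displayed correctly, but "Ŝ" and "ŝ" (Unicode U+015C and U+015D), which apparently are used in Esperanto, are not. With the print.listof command, however, all the characters are displayed correctly. (By the way, dput(df1) lists the Esperanto characters incorrectly, as "S" and "s".)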
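Because Sys.setlocale changes the locale for the whole R session, it can be useful to switch only for the duration of one call and then restore the previous setting. The sketch below assumes a hypothetical helper named `with_ctype` (not part of base R or of Zahn's article). The locale name "czech" is the Windows name used above and may not exist on other systems, in which case the helper prints a message and evaluates the expression under the current locale.

```r
# Hypothetical helper: evaluate `expr` with LC_CTYPE temporarily set to `locale`.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale returns "" (with a warning) if the request cannot be honoured.
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(res, "")) {
    message("Locale '", locale, "' is not available here; keeping ", old)
  }
  # `expr` is a lazily evaluated argument, so it is only run at this point.
  force(expr)
}

demo_df <- data.frame(dec = c(282, 268),
                      char = I(intToUtf8(c(282, 268), multiple = TRUE)))
with_ctype("czech", print(demo_df))
```

This does not help with characters outside the chosen locale's code page, such as the Esperanto letters above.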

write.table(df1, file="special characters example.txt", fileEncoding="UTF-8")
df2 <- read.table("special characters example.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")

print.listof(df2)
dec :
 [1] 193 269 282 268 262 263 348 349 350 352 353

char :
 [1] "Á" "č" "Ě" "Č" "Ć" "ć" "S" "s" "Ş" "Š" "š"

hex :
 [1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"

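When I write df1 to a file with write.table and then read it back as df2 with read.table, the "Ŝ" and "ŝ" characters have lost their circumflex. This must be a problem with the write.table command, as confirmed when I open the file with a different application such as OpenOffice Writer. The Czech characters are all there correctly, but "Ŝ" and "ŝ" have been changed to "S" and "s".

For now, the best workaround for my purposes is to record each character's Unicode value in the data frame instead of the character itself, use write.table, and then use the UNICHAR function in OpenOffice Calc to add the character itself to the file. But this is inconvenient.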
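The file contents can also be checked from within R, without opening another application, by reading the lines back and listing the code points that are actually present. This is a small sketch using a temporary file and two of the characters from the example (U+011A and U+015C); the variable names are illustrative only.

```r
# Write a two-row data frame the same way as in the example above.
path  <- tempfile(fileext = ".txt")
chars <- intToUtf8(c(282, 348), multiple = TRUE)
df    <- data.frame(char = I(chars))
write.table(df, file = path, fileEncoding = "UTF-8")

# Read the raw lines back, treating them as UTF-8, and decode to code points.
lines <- readLines(path, encoding = "UTF-8")
found <- utf8ToInt(paste(lines, collapse = ""))

# For each expected code point, report whether it survived the round trip.
c(282, 348) %in% found
```

A FALSE in the result means that character was already altered when the file was written, so the problem lies on the write.table side rather than in read.table.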

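I believe this bug is related to this question: How to read data in utf-8 format in R? Edited to add: I have now found other similar questions on Stack Overflow:

Why do some Unicode characters display in matrices, but not data frames in R?

UTF-8 file output in R

Write UTF-8 files from R

I found a workaround by Peter Meissner here: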

http://r.789695.n4.nabble.com/Unicode-display-problem-with-data-frames-under-Windows-tp4707639p4707667.html

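It involves defining your own class unicode_df and a print function print.unicode_df.

This still does not solve the problem I have with write.table: writing a data frame with columns of text in a variety of European languages to a file that can be imported into a spreadsheet or any arbitrary application. But perhaps Meissner's solution can be adapted to work with write.table.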
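To illustrate the general idea of a custom class with its own print method, here is a rough sketch. It is not Meissner's code: the constructor name `as_unicode_df` and the body of `print.unicode_df` are assumptions made for illustration. It builds a character matrix by hand and prints that instead of going through the data frame formatter, on the assumption (taken from the linked question about matrices) that matrix printing is not affected.

```r
# Hypothetical constructor: tag a data frame with an extra class.
as_unicode_df <- function(df) {
  class(df) <- c("unicode_df", class(df))
  df
}

# Hypothetical print method: convert each column to plain character,
# bind the columns into a matrix, and print the matrix unquoted.
print.unicode_df <- function(x, ...) {
  cols <- lapply(unclass(x), function(col) as.character(unclass(col)))
  m <- do.call(cbind, cols)
  rownames(m) <- row.names(x)
  print(m, quote = FALSE, ...)
  invisible(x)
}

udf <- as_unicode_df(data.frame(dec  = c(282, 348),
                                char = I(intToUtf8(c(282, 348), multiple = TRUE))))
print(udf)
```

Numeric columns are shown as plain strings here, so alignment and number formatting differ from the usual data frame output.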

- Montgomery Clift

0

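Here is a function, write.unicode.csv, that uses paste and writeLines (with useBytes=TRUE) to export a data frame containing foreign-language characters (encoded in UTF-8) to a csv file. All cells of the data frame will be enclosed in quotes in the csv file.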

#function that will create a CSV file for a data frame containing Unicode text
#this can be used instead of write.csv in R for Windows
#source: https://dev59.com/hqXja4cB1Zd3GeqPLAva
#this is not elegant, and probably not robust

write.unicode.csv <- function(mydf, filename="") {  #mydf can be a data frame or a matrix
   linestowrite <- character( length = 1+nrow(mydf) )
   linestowrite[1] <- paste('"","', paste(colnames(mydf), collapse='","'), '"', sep="") #first line will have the column names
   if(nrow(mydf)<1 || ncol(mydf)<1) stop("mydf must have at least one row and one column.")  #stop early: the loops below assume non-empty input
   for(k1 in 1:nrow(mydf)) {
     r <- paste('"', k1, '"', sep="") #each row will begin with the row number in quotes
     for(k2 in 1:ncol(mydf)) {r <- paste(r, paste('"', mydf[k1, k2], '"', sep=""), sep=",")}
     linestowrite[1+k1] <- r
     }
   writeLines(linestowrite, con=filename, useBytes=TRUE)
   } #end of function

Sys.setlocale(category = "LC_CTYPE", locale = "usa")
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
hexvals  <- format(as.hexmode(charvals), width=4, upper.case=TRUE)
df1      <- data.frame(dec=charvals, char=I(unlist(strsplit(intToUtf8(charvals), ""))), hex=I(hexvals))

print.listof(df1)

write.csv(df1, file="test1.csv")
write.csv(df1, file="test2.csv", fileEncoding="UTF-8")
write.unicode.csv(df1, filename="test3.csv")

dftest1 <- read.csv(file="test1.csv", encoding="UTF-8", colClasses="character")
dftest2 <- read.csv(file="test2.csv", encoding="UTF-8", colClasses="character")
dftest3 <- read.csv(file="test3.csv", encoding="UTF-8", colClasses="character")

print("CSV file written using write.csv with no fileEncoding parameter:")
print.listof(dftest1)

print('CSV file written using write.csv with fileEncoding="UTF-8":')
print.listof(dftest2)

print("CSV file written using write.unicode.csv:")
print.listof(dftest3)
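The same approach can be written without the nested loops. The following is a sketch under a few assumptions: the function name `write_utf8_csv` is made up for illustration, every column is converted with as.character, and embedded double quotes are escaped by doubling them (which the function above does not do). It relies on the same writeLines(useBytes = TRUE) technique so that the UTF-8 bytes are written as they are.

```r
# Hypothetical vectorised variant of write.unicode.csv.
write_utf8_csv <- function(df, path) {
  # Convert a column to UTF-8 text and wrap each value in double quotes,
  # doubling any embedded quote characters.
  quote_field <- function(x) {
    x <- enc2utf8(as.character(unclass(x)))
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
  }
  header <- paste(quote_field(c("", colnames(df))), collapse = ",")
  # First field of each row is the row name, followed by every column.
  cols <- c(list(quote_field(rownames(df))), unname(lapply(df, quote_field)))
  body <- do.call(paste, c(cols, sep = ","))
  con <- file(path, open = "wb")
  on.exit(close(con), add = TRUE)
  # useBytes = TRUE writes the strings without re-encoding them.
  writeLines(c(header, body), con = con, useBytes = TRUE)
  invisible(path)
}

demo_df  <- data.frame(dec  = c(282, 348),
                       char = I(intToUtf8(c(282, 348), multiple = TRUE)))
out_path <- tempfile(fileext = ".csv")
write_utf8_csv(demo_df, out_path)
readLines(out_path, encoding = "UTF-8")
```

The file is opened in binary mode so that line endings are written as "\n" on every platform; open it in text mode instead if Windows-style line endings are wanted.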

- Montgomery Clift

- Ernest A · Accepted Answer

I can't reproduce this behaviour, using R version 3.3.3 (Linux).

> data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
   dec char  hex
1  193    Á 00C1
2  269    č 010D
3  282    Ě 011A
4  268    Č 010C
5  262    Ć 0106
6  263    ć 0107
7  348    Ŝ 015C
8  349    ŝ 015D
9  350    Ş 015E
10 352    Š 0160
11 353    š 0161
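When comparing results across machines like this, it helps to record the R version, platform and character-type locale alongside the output, since the behaviour appears to depend on them. A minimal sketch using base R only (the list name `env_report` is just for illustration):

```r
# Collect the settings most relevant to how non-ASCII text is handled.
env_report <- list(
  r_version = R.version.string,
  platform  = R.version$platform,
  ctype     = Sys.getlocale("LC_CTYPE"),
  # TRUE when the current locale uses UTF-8 as its native encoding.
  utf8      = isTRUE(l10n_info()[["UTF-8"]])
)
str(env_report)
```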