How to detect the right encoding for read.csv?

Question

65

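I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv, but I cannot work out the correct encoding. It seems to be some kind of UTF-8 encoding. I am using R 2.12.1 on a Windows XP machine. Any help?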

- Alex

7 Answers

58

The readr package (https://cran.r-project.org/web/packages/readr/readr.pdf) includes a function, guess_encoding, that estimates the probability of a file being encoded in each of several encodings:

guess_encoding("your_file", n_max = 1000)
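As a minimal sketch of how the result might be used: guess_encoding returns a tibble with encoding and confidence columns, most likely guess first, and the top guess can be passed to read_csv through locale(). The temporary file and its contents below are made up for illustration, and the result is only a guess, so it is worth inspecting the data afterwards.

```r
library(readr)

# Hypothetical sample file: a few accented names written as Latin-1 bytes
txt <- "name,city\nJos\u00e9,M\u00e1laga\nZo\u00eb,K\u00f6ln\nFran\u00e7ois,Besan\u00e7on\n"
path <- tempfile(fileext = ".csv")
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], path)

# One row per candidate encoding, ordered by confidence
guesses <- guess_encoding(path, n_max = 1000)
print(guesses)

# Read the file with the top guess, if there is one
if (nrow(guesses) > 0) {
  df <- read_csv(path,
                 locale = locale(encoding = guesses$encoding[[1]]),
                 show_col_types = FALSE)
  print(df)
}
```

When several rows have similar confidence values, reading a few lines with each candidate and comparing them by eye is a reasonable next step.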

- Enrique Pérez Herrero

3

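This option works great and is easy to use. - MadmanLee

Sometimes guess_encoding does not give a clear-cut result. I ran it on 11 csv files, and for 2 of them the result was split across several encodings. - Dutschke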

1

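This solved my problem: I have a directory with a large number of shapefiles that use different encodings, and st_read did not recognize them correctly on a Mac.

library(sf)

# Guess the encoding from the first 1000 lines of the file
check_encoding <- readr::guess_encoding('shapefile path', n_max = 1000)

if (nrow(check_encoding) > 0) {
    # Pass the top guess to the driver through the ENCODING option
    tmp_shp <- st_read('shapefile path', options = paste0("ENCODING=", toupper(check_encoding$encoding[[1]])))
} else {
    # No guess available: read the file with the default settings
    tmp_shp <- st_read(here::here('shapefile path'))
}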

- undefined

7

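First of all, you need to determine the file's encoding, which may not be possible in R (at least as far as I know). You can use external tools for this, e.g. Perl, Python, or the file utility under Linux/UNIX.

As @ssmir suggested, you have a UTF-16LE (Unicode) encoding here, so load the file with this encoding and use readLines to look at the first 10 lines: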

> f <- file('encoding.asc', open="r", encoding="UTF-16LE")   # UTF-16LE, which is "called" Unicode in Windows
> readLines(f,10)
 [1] "\tFe 2\tZn\tO\tC\tSi\tMn\tP\tS\tAl\tN\tCr\tNi\tMo\tCu\tV\tNb 2\tTi\tB\tZr\tCa\tH\tCo\tMg\tPb 2\tW\tCl\tNa 3\tAr"                                                                                                                          
 [2] ""                                                                                                                                                                                                                                         
 [3] "0\t0,003128\t3,82E-05\t0,0004196\t0\t0,001869\t0,005836\t0,004463\t0,002861\t0,02148\t0\t0,004768\t0,0003052\t0\t0,0037\t0,0391\t0,06409\t0,1157\t0,004654\t0\t0\t0\t0,00824\t7,63E-05\t0,003891\t0,004501\t0\t0,001335\t0,01175"         
 [4] "0,0005\t0,003265\t3,05E-05\t0,0003662\t0\t0,001709\t0,005798\t0,004395\t0,002808\t0,02155\t0\t0,004578\t0,0002441\t0\t0,003601\t0,03897\t0,06406\t0,1158\t0,0047\t0\t0\t0\t0,008026\t6,10E-05\t0,003876\t0,004425\t0\t0,001343\t0,01157"  
 [5] "0,001\t0,003332\t2,54E-05\t0,0003052\t0\t0,001704\t0,005671\t0,0044\t0,002823\t0,02164\t0\t0,004603\t0,0003306\t0\t0,003611\t0,03886\t0,06406\t0,1159\t0,004705\t0\t0\t0\t0,008036\t5,09E-05\t0,003815\t0,004501\t0\t0,001246\t0,01155"   
 [6] "0,0015\t0,003313\t2,18E-05\t0,0002616\t0\t0,001678\t0,005689\t0,004447\t0,002921\t0,02171\t0\t0,004621\t0,0003488\t0\t0,003597\t0,03889\t0,06404\t0,1158\t0,004752\t0\t0\t0\t0,008022\t4,36E-05\t0,003815\t0,004578\t0\t0,001264\t0,01144"
 [7] "0,002\t0,003313\t2,18E-05\t0,0002834\t0\t0,001591\t0,005646\t0,00436\t0,003008\t0,0218\t0\t0,004643\t0,0003488\t0\t0,003619\t0,03895\t0,06383\t0,1159\t0,004752\t0\t0\t0\t0,008\t4,36E-05\t0,003771\t0,004643\t0\t0,001351\t0,01142"      
 [8] "0,0025\t0,003488\t2,18E-05\t0,000218\t0\t0,001657\t0,00558\t0,004338\t0,002986\t0,02175\t0\t0,004469\t0,0002616\t0\t0,00351\t0,03889\t0,06374\t0,1159\t0,004621\t0\t0\t0\t0,008131\t4,36E-05\t0,003771\t0,004708\t0\t0,001243\t0,01125"   
 [9] "0,003\t0,003619\t0\t0,0001526\t0\t0,001591\t0,005668\t0,004207\t0,00303\t0,02169\t0\t0,00449\t0,0002834\t0\t0,00351\t0,03874\t0,06383\t0,116\t0,004665\t0\t0\t0\t0,007956\t0\t0,003749\t0,004796\t0\t0,001286\t0,01125"                   
[10] "0,0035\t0,003422\t0\t4,36E-05\t0\t0,001482\t0,005711\t0,004185\t0,003292\t0,02156\t0\t0,004665\t0,0003488\t0\t0,003553\t0,03852\t0,06391\t0,1158\t0,004708\t0\t0\t0\t0,007717\t0\t0,003597\t0,004905\t0\t0,00133\t0,01136"

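From this you can see that we have a header line and an empty second line (which read.table skips by default), that the separator is \t, and that the decimal mark is ,.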

> f <- file('encoding.asc', open="r", encoding="UTF-16LE")
> df <- read.table(f, sep='\t', dec=',', header=TRUE)

Let's see what we have:

> head(df)
       X     Fe.2       Zn         O C       Si       Mn        P        S
1 0.0000 0.003128 3.82e-05 0.0004196 0 0.001869 0.005836 0.004463 0.002861
2 0.0005 0.003265 3.05e-05 0.0003662 0 0.001709 0.005798 0.004395 0.002808
3 0.0010 0.003332 2.54e-05 0.0003052 0 0.001704 0.005671 0.004400 0.002823
4 0.0015 0.003313 2.18e-05 0.0002616 0 0.001678 0.005689 0.004447 0.002921
5 0.0020 0.003313 2.18e-05 0.0002834 0 0.001591 0.005646 0.004360 0.003008
6 0.0025 0.003488 2.18e-05 0.0002180 0 0.001657 0.005580 0.004338 0.002986
       Al N       Cr        Ni Mo       Cu       V    Nb.2     Ti        B Zr
1 0.02148 0 0.004768 0.0003052  0 0.003700 0.03910 0.06409 0.1157 0.004654  0
2 0.02155 0 0.004578 0.0002441  0 0.003601 0.03897 0.06406 0.1158 0.004700  0
3 0.02164 0 0.004603 0.0003306  0 0.003611 0.03886 0.06406 0.1159 0.004705  0
4 0.02171 0 0.004621 0.0003488  0 0.003597 0.03889 0.06404 0.1158 0.004752  0
5 0.02180 0 0.004643 0.0003488  0 0.003619 0.03895 0.06383 0.1159 0.004752  0
6 0.02175 0 0.004469 0.0002616  0 0.003510 0.03889 0.06374 0.1159 0.004621  0
  Ca H       Co       Mg     Pb.2        W Cl     Na.3      Ar
1  0 0 0.008240 7.63e-05 0.003891 0.004501  0 0.001335 0.01175
2  0 0 0.008026 6.10e-05 0.003876 0.004425  0 0.001343 0.01157
3  0 0 0.008036 5.09e-05 0.003815 0.004501  0 0.001246 0.01155
4  0 0 0.008022 4.36e-05 0.003815 0.004578  0 0.001264 0.01144
5  0 0 0.008000 4.36e-05 0.003771 0.004643  0 0.001351 0.01142
6  0 0 0.008131 4.36e-05 0.003771 0.004708  0 0.001243 0.01125

- daroczig

1

Thanks, it works. But why do I have to skip the first two lines? And why doesn't this work directly with read.csv? - Alex

2

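@user590885: you are right, skip=2 can be omitted (I have edited my answer accordingly), since the empty second line is skipped anyway. You can also read this file with read.csv (using the same parameters), but as your file is separated by tabs rather than commas, I don't think that would look nice. See ?read.table for details on how the functions are related (they differ only in their defaults). - daroczig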
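To illustrate the point about defaults, here is a small sketch using a made-up file with the same layout as the one discussed above (empty first header cell, blank second line, tab separator, decimal comma). read.delim2 is the read.table variant whose defaults already match this layout, so it needs no extra arguments.

```r
# Hypothetical sample mirroring the layout above, written as UTF-16LE
txt <- "\tFe 2\tZn\n\n0\t0,003128\t3,82E-05\n0,0005\t0,003265\t3,05E-05\n"
path <- tempfile(fileext = ".asc")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# read.table needs the separator, decimal mark and header spelled out
df_table <- read.table(file(path, encoding = "UTF-16LE"),
                       sep = "\t", dec = ",", header = TRUE)

# read.delim2 defaults to sep = "\t", dec = "," and header = TRUE
df_delim <- read.delim2(file(path, encoding = "UTF-16LE"))

names(df_table)
all.equal(df_table, df_delim)
```

In both calls the blank second line is dropped because blank.lines.skip defaults to TRUE.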

2

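@enrique-pérez-herrero's answer is great. guess_encoding("your_file", n_max = 1000) gives you the most likely encoding, and you can then read the file with it: readr::read_csv(file_path, locale = locale(encoding = "ENCODING_CODE")).

For completeness, here is a function that tries to read the file with every potential encoding and returns a list with two data frames:

A nested data frame with the file contents for every encoding that worked
The data frame read with the most likely encoding.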

detect_file_encoding <- function(file_path) {
  
  library(cli)
  library(dplyr)
  library(purrr)
  library(readr)
  library(stringi)
  
  
  # Read file in UTF-8 and detect encodings present
  file_raw = readr::read_file(file_path, locale = locale(encoding = "UTF-8"))
  encodings_found = stringi::stri_enc_detect(file_raw)
  
  # Function to read the file using all the encodings found
  try_all_encodings <- function(file_path, ENCODING) {
    
    FILE = read_file(file_path, locale = locale(encoding = ENCODING))
    HAS_BAD_CHARS = grepl("\u0086", FILE)
    
    if (!HAS_BAD_CHARS) {
      tibble(encoding = ENCODING, 
            content_file = list(FILE))
    } else {
      tibble(encoding = ENCODING, 
             content_file = list("BAD_CHARS detected"))
    }

  }

  # Safe version of function  
  try_all_encodings_safely = safely(try_all_encodings)
  
  # Loop through all the encodings
  OUT = 1:length(encodings_found[[1]]$Encoding) %>% 
    purrr::map(~ try_all_encodings_safely(file_path, encodings_found[[1]]$Encoding[.x]))

  # Create nested clean tibble with all the working encodings and contents 
  OUT_clean = 1:length(OUT) %>% purrr::map(~ OUT[[.x]]$result) %>% dplyr::bind_rows() %>% dplyr::left_join(encodings_found[[1]] %>% dplyr::as_tibble(), by = c("encoding" = "Encoding"))
   
  # Read file with the most likely working encoding
  # (skip = 12 looks specific to the original author's file; adjust or drop it for other files)
  DF_proper_encoding = suppressMessages(readr::read_csv(file_path, skip = 12, locale = locale(encoding = encodings_found[[1]]$Encoding[1]), show_col_types = FALSE, name_repair = "unique"))

  # Output list
  OUT_final = list(OUT_clean = OUT_clean,
                   DF_proper_encoding = DF_proper_encoding)
  
  # Output message
  cli::cli_alert_info("Found {nrow(OUT_clean)} potential encodings: {paste(OUT_clean$encoding)} \n - DF_proper_encoding stored using {OUT_clean$encoding[1]}")
  
  return(OUT_final)
}

- Gorka

2

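Besides the readr package, you can also use stringi::stri_enc_detect2. This function is especially effective if the locale is known and you are dealing with some form of UTF or ASCII: "...experience shows that stri_enc_detect2 works better than the ICU-based approach [stringi::stri_enc_detect, which guess_encoding uses] if UTF-* text is provided."

For details on stringi::stri_enc_detect, see here.

For details on stringi::stri_enc_detect2, see here. For the change request for guess_encoding, see here.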
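A minimal sketch of calling the ICU-based detector directly, assuming the stringi package is installed. It uses stri_enc_detect only, and the sample bytes are made up for illustration. The function accepts a raw vector and returns a list with one data frame of candidates per input.

```r
library(stringi)

# Hypothetical sample: a short German sentence as UTF-8 bytes.
# For a real file, the bytes could come from
# readBin(path, what = "raw", n = file.size(path)).
bytes <- charToRaw(enc2utf8("Gr\u00fc\u00dfe aus K\u00f6ln, sch\u00f6nes Wetter heute"))

# Data frame with Encoding, Language and Confidence columns
candidates <- stri_enc_detect(bytes)[[1]]
candidates
```

Short inputs give the detector little to work with, so confidence values are more meaningful on longer samples.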

- ElToro1966

1

The file is encoded in UTF-16LE with a BOM (byte order mark). You should use encoding = "UTF-16LE"
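A byte order mark can be checked without any extra packages by reading the first few bytes of the file. The helper below, detect_bom, is a hypothetical name introduced for this sketch. It only recognizes the common Unicode marks and returns NA when none is present, which says nothing about the encoding of files without a BOM.

```r
# Return the encoding implied by a byte order mark, or NA if none is found
detect_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4)
  starts_with <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == as.raw(sig))
  }
  # Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
  # because FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE)
  if (starts_with(c(0xFF, 0xFE, 0x00, 0x00))) return("UTF-32LE")
  if (starts_with(c(0x00, 0x00, 0xFE, 0xFF))) return("UTF-32BE")
  if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8")
  if (starts_with(c(0xFF, 0xFE))) return("UTF-16LE")
  if (starts_with(c(0xFE, 0xFF))) return("UTF-16BE")
  NA_character_
}

# Hypothetical sample: a BOM followed by "a\tb\n" encoded as UTF-16LE
path <- tempfile()
writeBin(as.raw(c(0xFF, 0xFE, 0x61, 0x00, 0x09, 0x00, 0x62, 0x00, 0x0A, 0x00)), path)
detect_bom(path)
```

One known ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is rare in text files.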

- ssmir

4

For completeness: in read.table, the correct argument is fileEncoding. - Marek
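A short sketch of the fileEncoding form, which takes a file path directly instead of a connection. The sample file is made up for illustration: two columns, tab-separated, decimal comma, written as UTF-16LE with a BOM.

```r
# Hypothetical sample data written as UTF-16LE, preceded by a BOM
txt <- "Fe\tZn\n0,003128\t3,82E-05\n0,003265\t3,05E-05\n"
path <- tempfile(fileext = ".asc")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xFF, 0xFE)), con)
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], con)
close(con)

# fileEncoding declares the encoding of the file on disk
df <- read.table(path, fileEncoding = "UTF-16LE",
                 sep = "\t", dec = ",", header = TRUE)

# read.csv accepts the same argument, with header = TRUE as its default
df_csv <- read.csv(path, fileEncoding = "UTF-16LE", sep = "\t", dec = ",")

str(df)
```

The encoding argument of read.table is different: it only marks how the resulting strings should be interpreted and does not re-encode the input.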

1

I tidied up and updated @marek's solution, as I ran into the same problem in 2020:

#Libraries
library(magrittr)
library(purrr)

#Make a vector of all the encodings supported by R
encodings <- set_names(iconvlist(), iconvlist())
#Make a simple reader function
reader <- function(encoding, file) {
  read.csv(file, fileEncoding = encoding, nrows = 3, header = TRUE)
}
#Create a "safe" version so we only get warnings, but errors don't stop it
# (May not always be necessary)
safe_reader <- safely(reader)

#Use the safe function with the encodings and the file being interrogated
map(encodings, safe_reader, "<TEST FILE LOCATION GOES HERE>") %>%
  #Return just the results
  map("result") %>%
  #Keep only results that are dataframes
  keep(is.data.frame) %>%
  #Keep only results with more than one column
    #This predicate will need to change with the data
    #I knew this would work, because I could open in a text editor
  keep(~ ncol(.x) > 1) %>%
  #Return the names of the encodings
  names()

- Jason Mercer

- Marek · Accepted Answer

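First of all, based on a more general question on StackOverflow (How can I detect the encoding/codepage of a text file?), it is not possible to detect a file's encoding with 100% certainty.

I have struggled with this problem myself and ended up with a non-automatic solution:

Use iconvlist to get all possible encodings: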

codepages <- setNames(iconvlist(), iconvlist())

Then read the data using each of them.

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warning here

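The important things are to know the structure of the file (separator, header), to set the encoding with the fileEncoding argument, and to read only a few rows.
Now you can look at the results: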

unique(do.call(rbind, sapply(x, dim)))
#        [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

It looks like the correct result is the one with 3 rows and 29 columns, so let's look at those:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
#    CP1200    UCS-2LE     UTF-16   UTF-16LE      UTF16    UTF16LE 
#  "CP1200"  "UCS-2LE"   "UTF-16" "UTF-16LE"    "UTF16"  "UTF16LE"

You can also look at the data.

x[maybe_ok]

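For your file, all of these encodings return identical data (partly because, as you can see, there is some redundancy among them).

If you don't know the specifics of the file, you need to change the workflow somewhat, e.g. use readLines (you can't use fileEncoding, you must use length instead of dim, and you need to do more work to find the right encodings).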
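One possible shape for the readLines variant is sketched below, under stated assumptions: the sample file, the short candidate list and the "every line contains a tab" check are all made up for illustration. Because readLines has no fileEncoding argument, the encoding is set on the file() connection, and the result is a character vector that is checked with length() rather than dim().

```r
# Hypothetical sample file: UTF-16LE, tab-separated, three lines
path <- tempfile()
txt <- "a\tb\tc\n1\t2\t3\n4\t5\t6\n"
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# A short candidate list for illustration; iconvlist() would give all of them
candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")

# Read up to three lines with one encoding, returning NULL on any error
try_read <- function(enc) {
  con <- tryCatch(file(path, open = "r", encoding = enc),
                  error = function(e) NULL)
  if (is.null(con)) return(NULL)
  on.exit(close(con))
  tryCatch(suppressWarnings(readLines(con, n = 3)),
           error = function(e) NULL)
}

res <- lapply(setNames(candidates, candidates), try_read)

# Keep encodings that yield three lines, each containing the separator
looks_ok <- vapply(res, function(x) {
  length(x) == 3 && all(grepl("\t", x, fixed = TRUE))
}, logical(1))

names(looks_ok)[looks_ok]
```

The predicate is the part that has to change with the data: without any knowledge of the expected line count or separator, the surviving candidates still need to be inspected by eye.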