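CSV is in theory a simple format (tabular data delimited by commas), but unfortunately there is no formal specification, so many subtly different implementations exist. That means imports and exports need some care. I will quote RFC 4180 for the "common" implementation: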
2. Definition of the CSV Format
While there are various specifications and implementations for the
CSV format (for ex. [4], [5], [6] and [7]), there is no formal
specification in existence, which allows for a wide variety of
interpretations of CSV files. This section documents the format that
seems to be followed by most implementations:
1. Each record is located on a separate line, delimited by a line
break (CRLF). For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
2. The last record in the file may or may not have an ending line
break. For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
3. There maybe an optional header line appearing as the first line
of the file with the same format as normal record lines. This
header will contain names corresponding to the fields in the file
and should contain the same number of fields as the records in
the rest of the file (the presence or absence of the header line
should be indicated via the optional "header" parameter of this
MIME type). For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
4. Within the header and each record, there may be one or more
fields, separated by commas. Each line should contain the same
number of fields throughout the file. Spaces are considered part
of a field and should not be ignored. The last field in the
record must not be followed by a comma. For example:
aaa,bbb,ccc
5. Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
6. Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
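In general:
- A field may or may not be enclosed in double quotes. (RFC 4180, from 2005, says Excel does not use double quotes, but I tested Excel 2016 and it does.)
- Fields containing line breaks (CRLF), double quotes, or commas should be enclosed in double quotes. (In particular, a single row of data may span several lines when the file is viewed in a text editor.)
- If double quotes are used to enclose fields, a double quote appearing inside a field must be escaped by preceding it with another double quote.
- So in a raw CSV field, `""` represents the empty string, and `""""` represents a single double-quote character `"`.
(Usually not a problem: CRLF (Windows-style) versus LF (Unix-style) line breaks, and whether or not the last line ends with a line break.)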
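As a rough illustration of the quoting rules above, here is a minimal sketch using Python's standard `csv` module, whose default dialect follows the same conventions (comma delimiter, double-quote quoting, doubled quotes as the escape). The sample rows are made up for the example.

```python
import csv
import io

# Hypothetical sample rows: one field with a comma, one with a line break,
# one with double quotes, and one empty field.
rows = [
    ["aaa", "b,bb", "ccc"],
    ["line1\r\nline2", 'say "hi"', ""],
]

# Write with the default dialect: fields are quoted only when they need to be.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
raw = buf.getvalue()
print(raw)  # the field  say "hi"  appears as  "say ""hi"""

# Reading the text back recovers the original values, including the
# line break embedded in a quoted field.
parsed = list(csv.reader(io.StringIO(raw, newline="")))

# In raw CSV, "" is the empty string and """" is one double-quote character.
empty_and_quote = next(csv.reader(['"",""""']))
```

Here the second record occupies two physical lines in `raw` but still parses as one row, which is the multi-line case mentioned above.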
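However, you may encounter data that uses an escape character such as `\` to escape quotes or other characters (the delimiter, line breaks, the escape character itself). In readr's `read_csv()`, for example, this is controlled by the `escape_double` and `escape_backslash` parameters. Some unusual data may also use a comment character such as `#` (the default in R's `read.table`, but not in `read.csv`).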
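For comparison, a sketch of the same two situations in Python. The standard `csv` module accepts `escapechar` and `doublequote` options for backslash-escaped data, but it has no comment-character option, so the `skip_comments` helper below is a hypothetical pre-filter written for this example. It is deliberately naive: it would also drop a continuation line of a multi-line quoted field that happens to start with `#`. The sample lines are made up.

```python
import csv

# Hypothetical input: a comment line, a header, and backslash-escaped fields.
raw_lines = [
    "# exported by a hypothetical tool",
    "id,text",
    '1,"He said \\"hi\\""',
    "2,comma\\, inside",
]

def skip_comments(lines, marker="#"):
    """Yield only the lines that do not start with the comment marker."""
    for line in lines:
        if not line.startswith(marker):
            yield line

# doublequote=False: a doubled quote is not treated as an escaped quote.
# escapechar="\\": a backslash makes the following character literal.
reader = csv.reader(skip_comments(raw_lines), doublequote=False, escapechar="\\")
records = list(reader)
```

With these settings the escaped quotes and the escaped comma end up as literal characters inside their fields, rather than closing the quoted field or splitting it in two.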
"world,"",hello"
- user4035