How to format a date field in a CSV file whose string fields contain multiple commas

Question

3

I have a .CSV file named file.csv in which every value is enclosed in double quotes. A sample of the file looks like this:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

The 9th field is a date in "DD-MMM-YY" format. I need to convert it to YYYY/MM/DD. I tried the code below, but it did not work.

awk -F, '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 print
 }' OFS="," file.csv > temp_file.csv

After running the code above, temp_file.csv contains the following output.

column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

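As I understand it, the problem is the commas inside the double-quoted strings, because my code treats them as field separators too... Please answer the following questions:

1) Does enclosing every value in double quotes make any difference? If it does, how can I strip the double quotes from all values except the strings that contain commas?
2) Does my code need changes in order to convert the 9th field from "DD-MMM-YY" to YYYY/MM/DD?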

- Dhruuv

You can count from the end: NF-1 - kev

I would consider using a program built specifically for manipulating CSV files, perhaps csvfix. It has built-in date manipulation functions. - Jonathan Leffler

@kev How do I use **NF-1** in the code above? I am new to both Linux and awk. @JonathanLeffler I tried man csvfix, but there is no manual page... Please explain in detail how to get it working... - Dhruuv
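
To illustrate the comment about counting from the end, here is a minimal sketch that splits two of the sample rows naively on commas. The function name `count_fields` is made up for this example. It assumes that the last two fields never contain a comma, which holds for the sample data but is not guaranteed in general.

```shell
# Split on every comma, including those inside quoted strings, and print
# the resulting field count together with the second-to-last field.
count_fields() {
    awk -F, '{ print NF, $(NF-1) }'
}

count_fields <<'EOF'
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
EOF
```

The two rows come out with 13 and 14 fields, so `$9` points at a different piece of text on each line, while `$(NF-1)` is the date on both.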

3 Answers

1

You can try the following awk script, which locates the date by counting fields from the end of the line:

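awk '
BEGIN {
    FS = OFS = ","
    # Map month abbreviations to their numbers: JAN -> 1 ... DEC -> 12
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
    for (i = 1; i <= 12; i++) {
        mm[month[i]] = i
    }
}
NR > 1 {
    # The date is the second-to-last field; count from the end because
    # commas inside earlier quoted fields shift the field numbers.
    gsub(/"/, "", $(NF-1))
    split($(NF-1), d, /-/)
    # Zero-pad the month so that JAN becomes 01 rather than 1
    $(NF-1) = sprintf("%s20%s/%02d/%s%s", q, d[3], mm[d[2]], d[1], q)
}
1' q='"' file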

Output:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"

- jaypal singh

1

You can try this awk command.

awk -F"\"" 'BEGIN { OFS="\"" }{ "date -d "$18" +%Y/%m/%d" | getline $18; print $0; }' yourfile.txt

Output:

"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2,name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,name","333","22","2011/10/13","232"

- sat

Could you explain this code? What does "$18" stand for here? When I run your code I get the error "sh: +%Y/%m/%d: No such file or directory", and it prints the same input file contents I mentioned in the question. - Dhruuv
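
On the `$18` question: when the double quote itself is the field separator, a fully quoted row breaks into pieces in which every even-numbered piece is a value and every odd-numbered piece is the comma between two values (the first and last pieces are empty). The n-th CSV value is therefore piece 2n, and the 9th value is `$18`. The sketch below relies on that idea but swaps the external `date -d` call for a month lookup table, so it is a variation on the answer above rather than the same method. The name `convert_date` is hypothetical. It assumes every value in a data row is quoted, no value contains an embedded double quote, and all two-digit years fall in the 2000s.

```shell
# One sample row from the question.
row='"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

convert_date() {
    awk -F'"' '
    BEGIN {
        OFS = FS
        # Map month abbreviations to their numbers: JAN -> 1 ... DEC -> 12
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
        for (i = 1; i <= 12; i++) { mm[month[i]] = i }
    }
    # Only touch rows whose 18th piece looks like DD-MMM-YY.
    # An unquoted header line has no 18th piece and passes through unchanged.
    $18 ~ /^[0-9][0-9]-[A-Z][A-Z][A-Z]-[0-9][0-9]$/ {
        split($18, d, "-")
        if (d[2] in mm) {
            $18 = sprintf("20%s/%02d/%s", d[3], mm[d[2]], d[1])
        }
    }
    { print }
    '
}

printf '%s\n' "$row" | convert_date
```

Because the output separator is also the double quote, the row is reassembled with its original quoting and only the date changes.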

- Hynek -Pichi- Vychodil · Accepted Answer

I strongly recommend using a proper CSV parser. For example, Text::CSV_XS in Perl does this job correctly and sanely, as in the following one-liner:

perl -MText::CSV_XS -E'$csv=Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1});@m=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);@m{@m}=(1 .. @m);while(my $row=$csv->getline(ARGV)){($d,$m,$y)=split("-",$row->[8]);$row->[8]=sprintf"%04d/%02d/%02d",2000+$y,$m{$m},$d if $m{$m};$csv->print(STDOUT, $row)}' file.csv > temp_file.csv
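
To make concrete what quote-aware parsing involves, here is a rough sketch in POSIX awk, the tool used elsewhere in this thread. It is not a replacement for Text::CSV_XS. The names `reformat_csv` and `parse_csv_line` are invented for this illustration. It assumes one record per line (no newlines inside quoted values) and two-digit years in the 2000s, and it writes every value back out in double quotes, including the header.

```shell
reformat_csv() {
    awk '
    # Split one CSV record into out[1..n], honouring double-quoted values
    # and doubled quotes ("") inside them. Returns the number of values.
    function parse_csv_line(line, out,    n, i, c, inq, cur, len) {
        n = 0; cur = ""; inq = 0; len = length(line)
        for (i = 1; i <= len; i++) {
            c = substr(line, i, 1)
            if (inq) {
                if (c != "\"") {
                    cur = cur c                     # ordinary character inside quotes
                } else if (substr(line, i + 1, 1) == "\"") {
                    cur = cur "\""; i++             # doubled quote means a literal quote
                } else {
                    inq = 0                         # closing quote
                }
            } else if (c == "\"") {
                inq = 1                             # opening quote
            } else if (c == ",") {
                out[++n] = cur; cur = ""            # separator outside quotes
            } else {
                cur = cur c
            }
        }
        out[++n] = cur
        return n
    }
    BEGIN {
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
        for (i = 1; i <= 12; i++) { mm[month[i]] = i }
    }
    {
        n = parse_csv_line($0, f)
        # Rewrite the 9th value only when it looks like DD-MMM-YY.
        if (split(f[9], d, "-") == 3 && (d[2] in mm)) {
            f[9] = sprintf("20%s/%02d/%s", d[3], mm[d[2]], d[1])
        }
        line = ""
        for (i = 1; i <= n; i++) {
            v = f[i]
            gsub(/"/, "\"\"", v)                    # re-escape embedded quotes
            line = line (i > 1 ? "," : "") "\"" v "\""
        }
        print line
    }
    '
}

printf '%s\n' '"11","B000QRIGJ4","77","string with, commas","12","author3, name","333","22","13-OCT-11","232"' | reformat_csv
```

Unlike the approaches that count from the end or split on quotes, this one would still find the 9th value if later columns contained commas or if some values were unquoted.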