How to format a date field in a CSV file whose string fields contain multiple commas

3

I have a CSV file named file.csv in which every data value is enclosed in double quotes. A sample of the file looks like this:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

The 9th field is a date field in "DD-MMM-YY" format. I need to convert it to YYYY/MM/DD format. I tried the code below, but it did not work.
awk -F, '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 print
 }' OFS="," file.csv > temp_file.csv

After running the code above, temp_file.csv contains the following:
column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

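As far as I can tell, the problem is the commas inside the double quotes, because my code treats them as field separators too. Please answer the following questions:
1) Does it make any difference that every value is enclosed in double quotes? If it does, how can I remove the double quotes from all values except the strings that contain commas?
2) Does my code need to be modified to convert the 9th field from "DD-MMM-YY" to YYYY/MM/DD?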

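You can count from the end: NF-1 - kev
I would consider using a program designed for manipulating CSV files, perhaps csvfix. It has built-in date manipulation functions. - Jonathan Leffler
@kev How do I use NF-1 in the code above? I am very new to both Linux and awk. @JonathanLeffler I tried man csvfix, but there is no manual page for it... Please explain in detail how to get it working... - Dhruuv
3 Answers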
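
To illustrate the NF-1 suggestion, here is a minimal sketch using one sample row from the question. The variable name row is only for illustration. It shows that splitting on every comma inflates the field count, while counting from the end still finds the date, on the assumption that no field after the date contains a comma.

```shell
# One data row copied from the question (row is a hypothetical variable name).
row='"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

# With -F, the commas inside quoted strings also split fields:
# NF is 13 instead of 10, and $9 is a fragment of the author field.
printf '%s\n' "$row" | awk -F, '{ print NF; print $9 }'

# The date is still the second-to-last field, because the fields
# after it contain no commas. Prints "11-OCT-11" (quotes included).
printf '%s\n' "$row" | awk -F, '{ print $(NF-1) }'
```

Counting from the end only helps while the trailing fields stay comma-free; a real CSV parser does not depend on that.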

2

I strongly recommend using a proper CSV parser. For example, Text::CSV_XS in Perl does this job correctly and sanely. The following one-liner writes the date as YYYY/MM/DD and assumes that all two-digit years fall in the 2000s:

perl -MText::CSV_XS -E'$csv=Text::CSV_XS->new({eol=>"\n", allow_whitespace=>1});@m=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC);@m{@m}=(1 .. @m);while(my $row=$csv->getline(ARGV)){($d,$m,$y)=split("-",$row->[8]);$row->[8]=sprintf"%04d/%02d/%02d",2000+$y,$m{$m},$d if $m{$m};$csv->print(STDOUT, $row)}' file.csv > temp_file.csv

1
You can try the following awk script:

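awk '
BEGIN {
    FS = OFS = ","
    # Map month abbreviations to month numbers (JAN -> 1, ..., DEC -> 12)
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, / /)
    for (i=1; i<=12; i++) {
        mm[month[i]]=i
    }
}
# Skip the header; the date is the second-to-last comma-separated field
NR>1 {
    gsub(/"/, "", $(NF-1))
    split($(NF-1), d, /-/)
    # Assumes two-digit years are in the 2000s; %02d zero-pads the month
    $(NF-1) = sprintf("%s20%s/%02d/%s%s", q, d[3], mm[d[2]], d[1], q)
}
1' q='"' file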

Output:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"

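A variation on the same idea, offered as a sketch rather than a tested solution for every file: because every data value is quoted, the three-character sequence "," can serve as the field separator, which makes the date field 9 directly. The function name convert_dates is hypothetical, and the code assumes two-digit years belong to the 2000s and that no field contains the sequence "," itself.

```shell
# convert_dates is a hypothetical helper; it reads the files given as
# arguments, or standard input when there are none.
convert_dates() {
  awk '
  BEGIN {
      # Split and rejoin on the sequence "," between quoted values
      FS = OFS = "\",\""
      n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
      for (i = 1; i <= n; i++) mm[month[i]] = i
  }
  # Leave the header alone; rewrite only fields that look like DD-MMM-YY
  NR > 1 && split($9, d, "-") == 3 && (d[2] in mm) {
      # Assumption: two-digit years are in the 2000s
      $9 = sprintf("20%02d/%02d/%02d", d[3], mm[d[2]], d[1])
  }
  { print }
  ' "$@"
}
```

It could be used as convert_dates file.csv > temp_file.csv. Rows whose 9th field does not match the expected pattern pass through unchanged.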

1
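You can try this awk command (it relies on GNU date for the -d option):
awk -F'"' 'BEGIN { OFS = FS } $18 != "" { cmd = "date -d \"" $18 "\" +%Y/%m/%d"; if ((cmd | getline out) > 0) $18 = out; close(cmd) } { print }' yourfile.txt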

Output:

"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2,name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,name","333","22","2011/10/13","232"

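Could you please explain this code? What does "$18" stand for here? When I ran your code I got the error "sh: +%Y/%m/%d: No such file or directory", and it printed the same input file contents I showed in the question. - Dhruuv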
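
On what "$18" refers to: with the double quote as the field separator, the quoted values land in the even-numbered fields and the commas between them in the odd-numbered ones, so the 9th value is field 18. The sketch below prints that numbering for one sample row from the question; the variable name row is only for illustration.

```shell
# One data row copied from the question (row is a hypothetical variable name).
row='"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

# Print every even-numbered field with its index: $2, $4, ..., $20.
printf '%s\n' "$row" | awk -F'"' '{ for (i = 2; i <= NF; i += 2) printf "$%d = %s\n", i, $i }'

# The 9th quoted value is therefore $18. Prints 11-OCT-11
printf '%s\n' "$row" | awk -F'"' '{ print $18 }'
```

The header line in the sample file has no double quotes, so it has no field 18 at all, and any command built from $18 on that line receives an empty date string.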
