Reading a CSV file into a dictionary / associative array in Bash

Question

excel bash csv associative-array carriage-return

I am trying to read a CSV file into a Bash associative array, but the results are not what I expect.

I am using Bash 5.0.18:

Bellum:fox3-api rocky$ bash --version
GNU bash, version 5.0.18(1)-release (x86_64-apple-darwin19.5.0)

Contents of foobar.csv:

Bellum:scripts rocky$ cat ./foobar.csv
foo-1,bar-1
foo-2,bar-2
foo-3,bar-3

Contents of problem.sh:

#!/usr/bin/env bash

declare -A descriptions
while IFS=, read name title; do
      echo "I got:$name|$title"
      descriptions[$name]=$title
done < foobar.csv

echo ${descriptions["foo-1"]}
echo ${descriptions["foo-2"]}
echo ${descriptions["foo-3"]}

Actual output of problem.sh:

Bellum:scripts rocky$ ./problem.sh
I got:foo-1|bar-1
I got:foo-2|bar-2

bar-2

Bellum:scripts rocky$

Expected output:

I got:foo-1|bar-1
I got:foo-2|bar-2
I got:foo-3|bar-3    
bar-1
bar-2
bar-3

Output requested in the comments:

    Bellum:scripts rocky$ head -n 1 ./foobar.csv | hexdump -C
    00000000  ef bb bf 66 6f 6f 2d 31  2c 62 61 72 2d 31 0d 0a  |...foo-1,bar-1..|
    00000010

    Bellum:scripts rocky$ od -c foobar.csv
    0000000  357 273 277   f   o   o   -   1   ,   b   a   r   -   1  \r  \n
    0000020    f   o   o   -   2   ,   b   a   r   -   2  \r  \n   f   o   o
    0000040    -   3   ,   b   a   r   -   3
    0000050

Cyrus's dos2unix modification:

    #!/usr/bin/env bash
    
    declare -A descriptions
    dos2unix < foobar.csv | while IFS=, read name title; do
          echo "I got:$name|$title"
          descriptions[$name]=$title
    done
    
    echo ${descriptions["foo-1"]}
    echo ${descriptions["foo-2"]}
    echo ${descriptions["foo-3"]}

Output with Cyrus's dos2unix change:

    Bellum:scripts rocky$ ./problem.sh
    I got:foo-1|bar-1
    I got:foo-2|bar-2
    
    
    
    
    Bellum:scripts rocky$
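
A likely contributor to the empty lookups above: in Bash, each command of a pipeline normally runs in a subshell, so assignments made inside `dos2unix ... | while ...` are gone once the loop ends. The following is a minimal sketch of that effect and is not part of the original thread. It uses `tr -d '\r'` as a stand-in for dos2unix and inline sample data (both are assumptions for illustration), and compares the pipeline form with process substitution.

```shell
#!/usr/bin/env bash
# Sketch only: array names and sample data are hypothetical.
declare -A in_pipe in_procsub

# Pipeline: the while loop runs in a subshell, so the assignment is lost.
printf 'foo-1,bar-1\r\n' | tr -d '\r' | while IFS=, read -r name title; do
  in_pipe[$name]=$title
done

# Process substitution: the loop runs in the current shell and the
# assignment is still there afterwards.
while IFS=, read -r name title; do
  in_procsub[$name]=$title
done < <(printf 'foo-1,bar-1\r\n' | tr -d '\r')

echo "pipeline: ${#in_pipe[@]} entries"
echo "process substitution: ${#in_procsub[@]} entries"
```

With default shell options, the pipeline version should report 0 entries and the process-substitution version 1.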

The CSV file was created on a Mac by saving as CSV from Microsoft Excel. Thanks in advance for any insight.

Combined solution

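For future readers: there were actually two problems here. First, the CSV file was saved from a Microsoft Excel for Mac workbook. For "Save As" I chose the "CSV UTF-8" format (the first CSV format listed in Excel's dropdown). This prepends extra bytes (a UTF-8 byte order mark) that end up in the first field read by Bash's read command, so the first key is no longer plain foo-1. Interestingly, cat does not display these bytes (see the problem description in the original post). Choosing "Comma Separated Values" in Excel instead (further down the format dropdown) solves the first problem.

Second, @Léa Gris and @glenn jackman showed me script modifications that handle the newline and carriage-return characters present in the Excel file.

Thanks, everyone. I spent an entire day on this problem. Lesson learned: I should have asked on Stack Overflow sooner.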

- dmjones

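Your code works for me; I'm curious what is actually in the array => add typeset -p descriptions after the while loop to see the full array definition; also verify the contents of the data file => od -c foobar.csv, and check the output for non-printing characters other than \n. - markp-fuso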

Add the output of head -n 1 ./foobar.csv | hexdump -C to your question (not as a comment). - Cyrus

I have added the output requested by markp-fuso and Cyrus, along with a description of how the CSV file was created. - dmjones

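FWIW, a search for 357 273 277 turned up some results: it looks like the "UTF-8 byte order mark"; if you can't eliminate it when saving the file from Excel, there are a few ideas for removing it: this and this. - markp-fuso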

See: How can I remove the BOM from a UTF-8 file? - Léa Gris

3 Answers

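This works on your input file whether it is in Unix or DOS format, whether or not a UTF-8 BOM is present, and whether or not the last line ends with a newline before the end of the file.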

#!/usr/bin/env bash

declare -A descriptions
# IFS=$',\r\n' handles both Unix and DOS line endings
# read -r keeps backslashes literal instead of treating them as escapes
# || [ "$name" ] makes sure the last line is captured
# even if it does not end with a newline
while IFS=$',\r\n' read -r name title || [ "$name" ]; do
      echo "I got:$name|$title"
      descriptions[$name]=$title
done < <(
  # Filter out the UTF-8 BOM (bytes 357 273 277 in octal), if any
  sed $'1s/^\357\273\277//' foobar.csv
)

echo "${descriptions["foo-1"]}"
echo "${descriptions["foo-2"]}"
echo "${descriptions["foo-3"]}"

# For debugging, a shorter option is to dump the variable as a declaration
typeset -p descriptions

There is also a very compact way to convert the CSV into an associative array in one go.

#!/usr/bin/env bash

# shellcheck disable=SC2155 # Safe generated assignment with printf %q
declare -A descriptions="($(
  # Collect all values from file into an array
  IFS=$'\r\n,' read -r -d '' -a elements < <(
    # Discard the UTF-8 BOM (bytes 357 273 277 in octal) from the input file, if any
    sed $'1s/^\357\273\277//' foobar.csv
  )
  # Format the elements into an Associative array declaration [key]=value 
  printf '[%q]=%q ' "${elements[@]}"
))"

echo "${descriptions["foo-1"]}"
echo "${descriptions["foo-2"]}"
echo "${descriptions["foo-3"]}"

# For debugging, a shorter option is to dump the variable as a declaration
typeset -p descriptions

- Léa Gris

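This is very helpful. It solved half of the problem, and I appreciate you showing a complete working script. The only part it didn't solve was the input-file issue from Excel (see the edit at the bottom of the original post). Thanks a lot! - dmjones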

@dmjones I have added automatic removal of the BOM, so you don't have to worry about it being created. - Léa Gris

You are awesome. - dmjones

The problem is the first three bytes; you can remove them with:

dd bs=1 skip=3 if=foobar.csv of=foobar2.csv

and then try with foobar2.csv
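
One caveat: `dd ... skip=3` drops the first three bytes unconditionally, so on a file without a BOM it would cut off "foo". As a hedged sketch (not from the original answer), the check below strips the bytes only when they are actually EF BB BF. The temporary files and sample line are created purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical input and output files, created so the sketch is self-contained.
src=$(mktemp)
dst=$(mktemp)
printf '\357\273\277foo-1,bar-1\n' > "$src"

# Render the first three bytes as hex and compare against the UTF-8 BOM.
if [[ $(head -c 3 "$src" | od -An -tx1 | tr -d ' \n') == efbbbf ]]; then
  # BOM found: copy everything from the fourth byte onward.
  tail -c +4 "$src" > "$dst"
else
  # No BOM: copy the file unchanged.
  cp "$src" "$dst"
fi

first_line=$(head -n 1 "$dst")
echo "$first_line"
rm -f "$src" "$dst"
```

Running the same check on a file that has no BOM leaves its content untouched.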

- Philippe

Your comment about the first three bytes was correct. I later determined what was causing them. Thanks for your help. - dmjones

- glenn jackman · Accepted Answer

Here is why you are not getting the expected output:

    Bellum:scripts rocky$ od -c foobar.csv
    0000000  357 273 277   f   o   o   -   1   ,   b   a   r   -   1  \r  \n
    0000020    f   o   o   -   2   ,   b   a   r   -   2  \r  \n   f   o   o
    0000040    -   3   ,   b   a   r   -   3
    0000050

- The name on the first line does not contain just "foo-1": there are extra characters in front of it.
  - They can be removed with "${name#$'\357\273\277'}"
- The last line does not end with a newline, so the while-read loop only iterates twice.
  - read returns non-zero if it can't read a whole line, even if it reads some characters.
  - Since read returns "false", the while loop ends.
  - This can be worked around by using:
    ```
    while IFS=, read -r name title || [[ -n $title ]]; do ...
    #............................. ^^^^^^^^^^^^^^^^^^
    ```
  - Or, just fix the file.
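
The read behavior described in the list above can be observed with a small experiment. This is only an illustrative sketch: the two-record input is inline sample data (an assumption, not the asker's file), fed through printf so that the last record has no trailing newline.

```shell
#!/usr/bin/env bash
# Count iterations of a plain while-read loop.
plain=0
while IFS=, read -r name title; do
  plain=$((plain + 1))
done < <(printf 'foo-1,bar-1\nfoo-2,bar-2')

# Same input, but keep going when read fails yet still filled $title.
guarded=0
while IFS=, read -r name title || [[ -n $title ]]; do
  guarded=$((guarded + 1))
done < <(printf 'foo-1,bar-1\nfoo-2,bar-2')

echo "plain loop: $plain, guarded loop: $guarded"
# prints: plain loop: 1, guarded loop: 2
```

Process substitution with printf is used here rather than a here-string because `<<<` appends a newline, which would hide the effect.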

Result:

BOM=$'\357\273\277'
CR=$'\r'

declare -A descriptions
while IFS=, read -r name title || [[ $title ]]; do
  descriptions["${name#$BOM}"]=${title%$CR}
done < foobar.csv

declare -p descriptions
echo "${descriptions["foo-1"]}"
echo "${descriptions["foo-2"]}"
echo "${descriptions["foo-3"]}"

declare -A descriptions=([foo-1]="bar-1" [foo-2]="bar-2" [foo-3]="bar-3" )
bar-1
bar-2
bar-3