Python 3: processing CSV files inside a tar file

4

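I am trying to process CSV files contained in a tar.gz file, and I am having trouble passing the correct data/object to the csv module.

Suppose I have a tar.gz file containing several CSV files in the following format.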

1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38

I want to access each CSV file in memory, without extracting every CSV file from the tar archive and writing it to disk.

For example:

import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")

for member in tar.getmembers():
    f = tar.extractfile(member).read()
    content = csv.reader(f)
    for row in content:
        print(row)
tar.close()

This produces the following error.

    for row in content:
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)

I also tried parsing f as a string, as described in the csv module documentation.
content = csv.reader([f])

The code above produces the same error.
I then tried decoding the contents of f as ASCII.
f = tar.extractfile(member).read().decode('ascii')

But this iterates over every single character of the CSV data instead of over rows that each contain a list of fields.

['1']
['0']
['7']
['9']
['', '']
['S']
['A']
['M']
['P']
['L']
['E']
['_']
['A']
['', '']
['G']
['R']

snip...

['2']
['0']
['1']
['7']
['/']
['0']
['2']
['/']
['1']
['5']
[' ']
['2']
['2']
[':']
['5']
['7']
[':']
['3']
['8']
[]
[]

Trying to decode f as ASCII and read it as a string

f = tar.extractfile(member).read().decode('ascii')
content = csv.reader([f])

produces the following output
    for row in content:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

To show the different outputs, I used the following code.
import tarfile
import csv

tar = tarfile.open("tar-file.tar.gz")

for member in tar.getmembers():
    f = tar.extractfile(member).read()
    print(member.name)
    print('Raw :', type(f))
    print(f)
    print()
    f = f.decode('ascii')
    print('ASCII:', type(f))
    print(f)
tar.close()

This produces the following output (for this example, every CSV file contains the same data).
./raw_data/csv-file1.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38


./raw_data/csv-file2.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38


./raw_data/csv-file3.csv
Raw : <class 'bytes'>
b'1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31\n1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38\n\n'

ASCII: <class 'str'>
1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30
1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26
1077,SAMPLE_C,GROUP,005,,2017/02/15 22:57:31
1079,SAMPLE_A,GROUP,128,,2017/02/15 22:57:38

How can I get the csv module to correctly read the in-memory file provided by the tarfile module? Thanks.

2 Answers

5

You just need to use io.StringIO() to create a file-like object for the csv library to use. For example:

import tarfile
import csv
import io

with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():      # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            csv_file = io.StringIO(tar.extractfile(member).read().decode('ascii'))

            for row in csv.reader(csv_file):
                print(row)
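
To see why the attempts in the question failed, it may help to look at what csv.reader actually iterates over. The sketch below is only an illustration using two rows of the sample data (the variable names are made up for this example): iterating bytes yields integers, iterating a str yields single characters, and a file-like object or a list of lines yields whole lines.

```python
import csv
import io

raw = (b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
       b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n")

# Iterating bytes yields ints, which is what triggers
# "iterator should return strings, not int".
first_item_of_bytes = next(iter(raw))

text = raw.decode("ascii")

# Iterating a str yields one character at a time, so csv.reader
# treats every character as its own "line".
rows_from_str = list(csv.reader(text))

# A file-like object (or a list of lines) yields whole lines instead.
rows_from_stringio = list(csv.reader(io.StringIO(text, newline="")))
rows_from_lines = list(csv.reader(text.splitlines()))

print(rows_from_stringio)
```

Passing text.splitlines() is a lighter alternative to io.StringIO when the data is already fully in memory, though it would split quoted fields that contain embedded newlines.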

0

It has been almost 3 years since this question was asked. Note that a better solution can be found, after a short discussion, in "python: use CSV reader with single file extracted from tarfile":


import tarfile
import csv
import io

with tarfile.open('tar-file.tar.gz') as tar:
    for member in tar:
        if member.isreg():      # Is it a regular file?
            print("{} - {} bytes".format(member.name, member.size))
            csv_file = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")

            for row in csv.reader(csv_file):
                print(row)

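For larger files, TextIOWrapper performs better because it does not need to read the whole file at once. In contrast, calling tar.extractfile(member).read() loads the complete member file into memory.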
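
A self-contained way to try the TextIOWrapper approach without a real archive on disk is sketched below. It is not part of the original answer: the helper names build_sample_archive and read_csv_members are hypothetical, the archive is built in memory from two sample rows, and the .csv suffix filter and newline="" argument are assumptions added for robustness.

```python
import csv
import io
import tarfile


def build_sample_archive():
    """Build a small tar.gz in memory (hypothetical sample data)."""
    data = (b"1079,SAMPLE_A,GROUP,001,,2017/02/15 22:57:30\n"
            b"1041,SAMPLE_B,GROUP,023,,2017/02/15 22:57:26\n")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in ("raw_data/csv-file1.csv", "raw_data/csv-file2.csv"):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def read_csv_members(fileobj):
    """Return {member name: list of rows} for every regular .csv member."""
    result = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not (member.isreg() and member.name.endswith(".csv")):
                continue
            binary = tar.extractfile(member)
            # newline="" lets the csv module handle line endings itself.
            with io.TextIOWrapper(binary, encoding="utf-8", newline="") as text:
                result[member.name] = list(csv.reader(text))
    return result


rows_by_file = read_csv_members(build_sample_archive())
for name, rows in rows_by_file.items():
    print(name, len(rows))
```

The rows are collected into lists here only so the result can be inspected; for large members you would process each row inside the loop instead, which keeps the memory benefit described above.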

