This code is a simplified version of code in a Django app that receives an uploaded zip file via HTTP multipart POST and does read-only processing of the data inside:

    #!/usr/bin/env python

    import csv, sys, StringIO, traceback, zipfile
    try:
        import io
    except ImportError:
        sys.stderr.write('Could not import the `io` module.\n')

    def get_zip_file(filename, method):
        if method == 'direct':
            return zipfile.ZipFile(filename)
        elif method == 'StringIO':
            data = file(filename).read()
            return zipfile.ZipFile(StringIO.StringIO(data))
        elif method == 'BytesIO':
            data = file(filename).read()
            return zipfile.ZipFile(io.BytesIO(data))


    def process_zip_file(filename, method, open_defaults_file):
        zip_file = get_zip_file(filename, method)
        items_file = zip_file.open('items.csv')
        csv_file = csv.DictReader(items_file)

        try:
            for idx, row in enumerate(csv_file):
                image_filename = row['image1']

                if open_defaults_file:
                    z = zip_file.open('defaults.csv')
                    z.close()

            sys.stdout.write('Processed %d items.\n' % idx)
        except zipfile.BadZipfile:
            sys.stderr.write('Processing failed on item %d\n\n%s'
                             % (idx, traceback.format_exc()))


    process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

Pretty simple. We open the zip file and one or two CSV files inside it.
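What's weird is that if I run this with a large zip file (about 13 MB) and have it instantiate the `ZipFile` from a `StringIO.StringIO` or an `io.BytesIO` (perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a `ZipFile` from a `TemporaryUploadedFile`, or from a file object created by calling `os.tmpfile()` and `shutil.copyfileobj()`) and have it open two CSV files rather than just one, then it fails towards the end of processing. Here's the output that I see on a Linux system: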

    $ ./test_zip_file.py ~/data.zip direct 1
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip StringIO 1
    Processing failed on item 242

    Traceback (most recent call last):
      File "./test_zip_file.py", line 26, in process_zip_file
        for idx, row in enumerate(csv_file):
      File ".../python2.7/csv.py", line 104, in next
        row = self.reader.next()
      File ".../python2.7/zipfile.py", line 523, in readline
        return io.BufferedIOBase.readline(self, limit)
      File ".../python2.7/zipfile.py", line 561, in peek
        chunk = self.read(n)
      File ".../python2.7/zipfile.py", line 581, in read
        data = self.read1(n - len(buf))
      File ".../python2.7/zipfile.py", line 641, in read1
        self._update_crc(data, eof=eof)
      File ".../python2.7/zipfile.py", line 596, in _update_crc
        raise BadZipfile("Bad CRC-32 for file %r" % self.name)
    BadZipfile: Bad CRC-32 for file 'items.csv'

    $ ./test_zip_file.py ~/data.zip BytesIO 1
    Processing failed on item 242

    Traceback (most recent call last):
      File "./test_zip_file.py", line 26, in process_zip_file
        for idx, row in enumerate(csv_file):
      File ".../python2.7/csv.py", line 104, in next
        row = self.reader.next()
      File ".../python2.7/zipfile.py", line 523, in readline
        return io.BufferedIOBase.readline(self, limit)
      File ".../python2.7/zipfile.py", line 561, in peek
        chunk = self.read(n)
      File ".../python2.7/zipfile.py", line 581, in read
        data = self.read1(n - len(buf))
      File ".../python2.7/zipfile.py", line 641, in read1
        self._update_crc(data, eof=eof)
      File ".../python2.7/zipfile.py", line 596, in _update_crc
        raise BadZipfile("Bad CRC-32 for file %r" % self.name)
    BadZipfile: Bad CRC-32 for file 'items.csv'

    $ ./test_zip_file.py ~/data.zip StringIO 0
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip BytesIO 0
    Processed 250 items.

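Incidentally, the code fails under the same conditions but in a different way on my OS X system. Instead of raising the `BadZipfile` exception, it seems to read corrupted data and gets very confused.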
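To check whether the bytes read from `items.csv` actually differ when another member is opened in between, a comparison along these lines might help. This is only a sketch under assumptions: it is written for Python 3 rather than the 2.7 setup above, `make_archive`, `read_interleaved` and `read_whole` are names made up for illustration, and the synthetic archive is far smaller than my real data. Whether it shows any difference on a given interpreter is not something I have verified.

```python
import io
import zipfile


def make_archive(num_rows=250, row_padding=2000):
    """Build a synthetic zip as bytes (hypothetical stand-in for data.zip).

    Only the member names and the image1 column mirror the script above;
    the row contents and padding size are invented.
    """
    lines = ["image1,padding"]
    for i in range(num_rows):
        lines.append("image%d.jpg,%s" % (i, "x" * row_padding))
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("items.csv", "\n".join(lines) + "\n")
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def read_interleaved(archive, chunk_size=4096):
    """Read items.csv in chunks, opening and closing defaults.csv between chunks."""
    parts = []
    with zipfile.ZipFile(archive) as zf:
        with zf.open("items.csv") as items:
            while True:
                chunk = items.read(chunk_size)
                if not chunk:
                    break
                parts.append(chunk)
                # Same pattern as the script: open a second member, then close it.
                with zf.open("defaults.csv"):
                    pass
    return b"".join(parts)


def read_whole(archive):
    """Baseline: read items.csv in one call, with no other member open."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read("items.csv")


if __name__ == "__main__":
    data = make_archive()
    same = read_interleaved(io.BytesIO(data)) == read_whole(io.BytesIO(data))
    print("interleaved read matches baseline: %s" % same)
```

If the two reads disagree, or the interleaved one raises, that would point at the interleaved `open` calls rather than at the archive itself.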
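This all suggests to me that I am doing something in this code that I am not supposed to do, such as calling `zipfile.open` on a file while another file within the same zip file object is already open. This doesn't seem to be a problem when using `ZipFile(filename)`, but perhaps it is problematic when passing `ZipFile` a file-like object, because of some implementation details in the `zipfile` module?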
Perhaps I missed something in the `zipfile` docs? Or maybe it's not documented yet? Or (least likely) a bug in the `zipfile` module?
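If opening a second member while the first is still open does turn out to be the trigger, one way to sidestep it would be to never hold two members open at once. Below is a minimal sketch of that idea, not a confirmed fix: it assumes Python 3, assumes the CSV files are UTF-8 and small enough to hold in memory, and `process_zip_bytes` and `make_demo_archive` are hypothetical names.

```python
import csv
import io
import zipfile


def make_demo_archive():
    """Build a tiny in-memory zip with the two member names used above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("items.csv", "image1\na.jpg\nb.jpg\nc.jpg\n")
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    return buf.getvalue()


def process_zip_bytes(data):
    """Return (row count of items.csv, parsed defaults.csv rows).

    Each member is read completely with ZipFile.read(), which opens the
    member, reads it, and closes it again, so the two reads never overlap.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        items_text = zf.read("items.csv").decode("utf-8")        # assumed encoding
        defaults_text = zf.read("defaults.csv").decode("utf-8")  # assumed encoding

    defaults = [dict(row) for row in csv.DictReader(io.StringIO(defaults_text))]

    count = 0
    for row in csv.DictReader(io.StringIO(items_text)):
        image_filename = row["image1"]  # same column the script above reads
        count += 1
    return count, defaults


if __name__ == "__main__":
    count, defaults = process_zip_bytes(make_demo_archive())
    print("Processed %d items." % count)
```

The trade-off is memory: both CSV files are held in full instead of being streamed, which seems acceptable for files of this size but may not be for much larger members.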