Unzipping nested zip files in Python

Question

python zip

14

I'm looking for a way to unzip nested zip files in Python. For example, consider the following structure (hypothetical names for ease):

Folder
- ZipfileA.zip
  - ZipfileA1.zip
  - ZipfileA2.zip
- ZipfileB.zip
  - ZipfileB1.zip
  - ZipfileB2.zip

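...and so on. I am trying to access text files that are inside the second-level zips. I certainly don't want to extract everything, as the sheer numbers would crash the computer (there are several hundred zips in the first layer, and almost 10,000 zips inside each of those in the second layer).

I have been playing around with the 'zipfile' module, and I am able to open the first level of zip files. For example: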

zipfile_obj = zipfile.ZipFile("/Folder/ZipfileA.zip")
next_layer_zip = zipfile_obj.open("ZipfileA1.zip")

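However, this returns a "ZipExtFile" instance (not a file or zipfile instance), and I can't go on to open this particular data type. So I can't do this:

data = next_layer_zip.open("data.txt")

However, I can "read" this zip file with: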

next_layer_zip.read()

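But this is totally useless! (It only gives me compressed data/gibberish.)

Does anyone have any ideas (without using ZipFile.extract)?

I came across this: http://pypi.python.org/pypi/zip_open/, which looks like it does exactly what I want, but it doesn't seem to work for me (I keep getting "[Errno 2] No such file or directory:" for the files I am trying to process with that module).

Any ideas would be much appreciated! Thanks in advance.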

- djmac

7 Answers

8

I am using Python 3.7.3.

import zipfile
import io
with zipfile.ZipFile('all.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata = io.BytesIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print(nested_zip.open('readme.md').read())

- yutaka Kajiwara
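
Given the scale mentioned in the question (hundreds of outer zips, each with thousands of inner zips), the same BytesIO idea can be wrapped in a generator so that only one inner archive sits in memory at a time. This is only a sketch under my own assumptions: the function name `iter_nested_members` and the `member_suffix` parameter are made up for illustration and are not part of the zipfile module.

```python
import io
import zipfile


def iter_nested_members(outer_path, member_suffix=".txt"):
    """Yield (inner_zip_name, member_name, data) for matching files.

    Only one inner archive is held in memory at a time, and nothing is
    written to disk. The names used here are hypothetical.
    """
    with zipfile.ZipFile(outer_path) as outer:
        for inner_name in outer.namelist():
            if not inner_name.lower().endswith(".zip"):
                continue
            # Load the inner zip into a seekable in-memory buffer.
            buffer = io.BytesIO(outer.read(inner_name))
            with zipfile.ZipFile(buffer) as inner:
                for member in inner.namelist():
                    if member.endswith(member_suffix):
                        yield inner_name, member, inner.read(member)
```

Usage would look something like `for inner_name, member, data in iter_nested_members("/Folder/ZipfileA.zip"): ...`, where `data` is the raw bytes of each text file.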

8

For those looking for a function that extracts a nested zip file (any level of nesting) and cleans up the original zip files:

import zipfile, re, os

def extract_nested_zip(zippedFile, toFolder):
    """ Unzip a zip file and its contents, including nested zip files
        Delete the zip file(s) after extraction
    """
    with zipfile.ZipFile(zippedFile, 'r') as zfile:
        zfile.extractall(path=toFolder)
    os.remove(zippedFile)
    for root, dirs, files in os.walk(toFolder):
        for filename in files:
            if re.search(r'\.zip$', filename):
                fileSpec = os.path.join(root, filename)
                extract_nested_zip(fileSpec, root)

- ronnydw

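I ran into a problem where the os.remove call caused an error: [WinError 32] The process cannot access the file because it is being used by another process: 'zipfile.zip'. Moving the os.remove call to after the loop, and only calling it in the recursive calls, solved my problem. - Izaak Cornelis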
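
One possible way to sidestep both the re-walking of the target folder and the deletion of the top-level archive is to drive the extraction from each archive's own name list. This is a rough sketch, not the answer author's code: `extract_nested_zips` and its `remove_nested` flag are names I made up, and it assumes the archives come from a trusted source.

```python
import os
import zipfile


def extract_nested_zips(zip_path, to_folder, remove_nested=True):
    """Extract zip_path into to_folder, then expand any zips found inside.

    The top-level archive is left in place. Nested archives are deleted
    after they have been expanded when remove_nested is true.
    """
    pending = [(zip_path, to_folder, False)]
    while pending:
        current, target, is_nested = pending.pop()
        # The with-block closes the file handle before os.remove runs.
        with zipfile.ZipFile(current) as archive:
            archive.extractall(path=target)
            names = archive.namelist()
        if is_nested and remove_nested:
            os.remove(current)
        for name in names:
            if name.lower().endswith(".zip"):
                extracted = os.path.normpath(os.path.join(target, name))
                # Expand each nested zip next to where it was extracted.
                pending.append((extracted, os.path.dirname(extracted), True))
```

Because each archive is closed before it is removed, and each nested zip is visited exactly once, no file should be opened again after it has been deleted.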

7

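Unfortunately, decompressing zip files requires random access to the archive, and the ZipFile methods (not to mention the DEFLATE algorithm itself) only provide streams. Therefore, it is impossible to decompress nested zip files without extracting them.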

- Ignacio Vazquez-Abrams
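
For what it's worth, the random-access requirement can be met without writing anything to disk: an in-memory buffer is seekable, and on newer Python 3 releases (3.7 and later, as far as I know) the object returned by ZipFile.open reports itself as seekable too. Here is a sketch that checks `seekable()` and falls back to a buffer otherwise. The helper name `read_nested_member` is my own, and seeking backwards inside a compressed member may be slow because the data has to be decompressed again.

```python
import io
import zipfile


def read_nested_member(outer_file, inner_name, member_name):
    """Return the bytes of member_name stored in inner_name inside outer_file."""
    with zipfile.ZipFile(outer_file) as outer:
        with outer.open(inner_name) as inner_stream:
            if inner_stream.seekable():
                # ZipFile can seek within the member stream directly.
                source = inner_stream
            else:
                # Fallback: copy the member into a seekable buffer.
                source = io.BytesIO(inner_stream.read())
            with zipfile.ZipFile(source) as inner:
                return inner.read(member_name)
```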

6

Here's a function I came up with.

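import os
import zipfile
from StringIO import StringIO  # Python 2 module; on Python 3 use io.BytesIO instead

def extract_nested_zipfile(path, parent_zip=None):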
    """Returns a ZipFile specified by path, even if the path contains
    intermediary ZipFiles.  For example, /root/gparent.zip/parent.zip/child.zip
    will return a ZipFile that represents child.zip
    """

    def extract_inner_zipfile(parent_zip, child_zip_path):
        """Returns a ZipFile specified by child_zip_path that exists inside
        parent_zip.
        """
        memory_zip = StringIO()
        memory_zip.write(parent_zip.open(child_zip_path).read())
        return zipfile.ZipFile(memory_zip)

    if ('.zip' + os.sep) in path:
        (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
            '.zip' + os.sep, 1)
        parent_zip_path += '.zip'

        if not parent_zip:
            # This is the top-level, so read from disk
            parent_zip = zipfile.ZipFile(parent_zip_path)
        else:
            # We're already in a zip, so pull it out and recurse
            parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)

        return extract_nested_zipfile(child_zip_path, parent_zip)
    else:
        if parent_zip:
            return extract_inner_zipfile(parent_zip, path)
        else:
            # If there is no nesting, it's easy!
            return zipfile.ZipFile(path)

Here's how I tested it:

echo hello world > hi.txt
zip wrap1.zip hi.txt
zip wrap2.zip wrap1.zip
zip wrap3.zip wrap2.zip

print extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read()
print extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read()
print extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read()

- Matt Faus

1

For those using Python 3.3, to save you some time: the line memory_zip.write(parent_zip.open(child_zip_path).read()) raises "TypeError: string argument expected, got 'bytes'". I'm not sure of the fix yet. - user25064
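
The TypeError quoted above most likely comes from the text-only StringIO in Python 3: a zip member is read as bytes, so the in-memory buffer has to be io.BytesIO. A minimal sketch of the inner helper ported to Python 3, keeping the original function name (the rest of the answer's logic is left as it is):

```python
import io
import zipfile


def extract_inner_zipfile(parent_zip, child_zip_path):
    """Return a ZipFile for child_zip_path stored inside parent_zip."""
    # ZipFile.read returns bytes, so the buffer must be io.BytesIO;
    # io.StringIO only accepts str, which is what triggers the TypeError.
    memory_zip = io.BytesIO(parent_zip.read(child_zip_path))
    return zipfile.ZipFile(memory_zip)
```

On Python 3 the test lines would also need print(...) as a function call rather than a print statement.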

4

This works for me. Just place this script in the same directory as the nested zip. It also counts the total number of files inside the nested zip.

import os

from zipfile import ZipFile


def unzip(path, total_count):
    for root, dirs, files in os.walk(path):
        for file in files:
            file_name = os.path.join(root, file)
            if not file_name.endswith('.zip'):
                total_count += 1
            else:
                currentdir = file_name[:-4]
                if not os.path.exists(currentdir):
                    os.makedirs(currentdir)
                with ZipFile(file_name) as zipObj:
                    zipObj.extractall(currentdir)
                os.remove(file_name)
                total_count = unzip(currentdir, total_count)
    return total_count

total_count = unzip('.', 0)
print(total_count)

- Anqi777
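
If only the file count is needed, the same recursion could be done in memory, without extracting or deleting anything on disk. A sketch under that assumption; `count_nested_files` is a name I made up, and each nested zip is loaded fully into memory while it is being counted:

```python
import io
import zipfile


def count_nested_files(zip_source):
    """Count non-zip files in zip_source, descending into nested zips."""
    total = 0
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries are not files
            if info.filename.lower().endswith(".zip"):
                # Recurse into the nested archive via an in-memory buffer.
                total += count_nested_files(io.BytesIO(archive.read(info)))
            else:
                total += 1
    return total
```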

0

Here is my approach to this kind of problem; it also assigns each file to its own variable automatically:

import os
import re 
import zipfile
import pandas as pd
# import numpy as np
path = r'G:\Important\Data\EKATTE'

# DESCRIBE
archives = os.listdir(path)
archives = [ar for ar in archives if ar.endswith(".zip")]
contents = pd.DataFrame({'elec_date':[],'files':[]})
for a in archives:
    archive = zipfile.ZipFile( path+'\\'+a )
    filelist = archive.namelist()
    # archive.infolist()
    for i in archive.namelist():
        if re.match('.*zip', i):
            sub_arch = zipfile.ZipFile(archive.open(i))
            sub_names = [x for x in sub_arch.namelist()]
            for s in sub_names:
                exec(f"{s.split('.')[0]} = pd.read_excel(sub_arch.open(s), squeeze=True)")

The archive can be found on the website of the National Statistical Institute of Bulgaria (direct link): https://www.nsi.bg/sites/default/files/files/EKATTE/Ekatte.zip

- Julian
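
As an alternative to creating variables with exec, the loaded objects could be collected in a dictionary keyed by file name. This is a sketch, not the answer author's code: `load_nested_members` and its `loader` parameter are hypothetical, and the default loader simply returns raw bytes so the example does not depend on pandas.

```python
import io
import os
import zipfile


def load_nested_members(outer_path, loader=lambda stream: stream.read()):
    """Map each nested member's base name (no extension) to loader(stream)."""
    loaded = {}
    with zipfile.ZipFile(outer_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".zip"):
                continue
            # Read the nested zip into memory so it is seekable.
            with zipfile.ZipFile(io.BytesIO(archive.read(name))) as sub_arch:
                for member in sub_arch.namelist():
                    if member.endswith("/"):
                        continue  # skip directory entries
                    key = os.path.splitext(os.path.basename(member))[0]
                    with sub_arch.open(member) as stream:
                        loaded[key] = loader(stream)
    return loaded
```

With pandas installed, passing `loader=pd.read_excel` should give a dictionary of DataFrames, assuming read_excel accepts the opened member stream the way it does in the code above.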

- Daniel W. Steinbrook · Accepted Answer

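ZipFile needs a file-like object, so you can use StringIO to turn the data you read from the nested zip into such an object. The caveat is that you'll be loading the full (still compressed) inner zip into memory.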

import zipfile
import cStringIO  # Python 2 only; on Python 3 use io.BytesIO instead

with zipfile.ZipFile('foo.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata = cStringIO.StringIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print nested_zip.open('data.txt').read()
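
Since the whole inner zip ends up in memory, it may be worth checking its size before reading it. The sketch below uses ZipInfo.file_size, which for a nested archive is the size of the inner zip file itself. The helper name `open_nested_zip` and the 100 MiB default limit are arbitrary choices of mine, not anything from the zipfile module.

```python
import io
import zipfile


def open_nested_zip(outer, inner_name, max_bytes=100 * 1024 * 1024):
    """Open inner_name from the ZipFile `outer` in memory, refusing large members."""
    info = outer.getinfo(inner_name)
    # file_size is how many bytes the inner zip occupies once read out
    # of the outer archive, which is what will be held in memory.
    if info.file_size > max_bytes:
        raise ValueError(
            "%s is %d bytes, over the %d byte limit"
            % (inner_name, info.file_size, max_bytes)
        )
    return zipfile.ZipFile(io.BytesIO(outer.read(info)))
```

An inner zip that exceeds the limit could then be extracted to a temporary file instead, at the cost of some disk I/O.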