Compressing files faster with Python: can gzip be used?

Question

3

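I am trying to compress files faster with Python, because some of my files are only 30 MB while others are as large as 4 GB.

Is there a more efficient way to create a gzip file than the one below? Can it be optimized so that, if the file is small enough to fit in memory, it reads the whole file in one chunk instead of reading it line by line?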

with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        f_out.writelines(f_in)

- godzilla

3 Answers

0

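Here are two nearly identical ways to read a gzip file:

A.) Load everything into memory. This may be a poor choice for very large files (several GB), because you may run out of memory.
B.) Do not load everything into memory at once; read line by line. This works for large files.

Adapted from https://codebright.wordpress.com/2011/03/25/139/ and https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/ http://pastebin.com/dcEJRs1i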

import subprocess  # needed by both versions below
import sys
if sys.version_info[0] >= 3:
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

A.)

def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line

B.)

def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line

- tryptofame

Cool, do you have any timings? :) - Roelant

@Roelant No, but I assume version A is faster. - tryptofame
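
Since no timings were posted, here is a minimal sketch of how one might measure them. It is not from the answer: `time_reader` and `yield_line_gzip_module` are hypothetical names, and the second function is a pure standard-library reader included only so the example runs without `gzcat` installed. Either version of `yield_line_gz_file` above could be passed in instead.

```python
import gzip
import time


def yield_line_gzip_module(fn):
    """Yield lines (as bytes) from a gzip file using only the gzip module."""
    with gzip.open(fn, "rb") as fh:
        for line in fh:
            yield line


def time_reader(reader, fn):
    """Return (line_count, elapsed_seconds) for one full pass of reader(fn)."""
    start = time.perf_counter()
    count = sum(1 for _ in reader(fn))
    return count, time.perf_counter() - start
```

A single pass is noisy, so repeating the measurement a few times on a realistic file would give a fairer comparison.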

0

There is no need to read line by line; you can read the file all at once. For example:

import gzip
with open(j, 'rb') as f_in:
    content = f_in.read()
with gzip.open(j + '.gz', 'wb') as f:
    f.write(content)

- Amit

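"Will f.write(file_content) be able to create the compressed file?" - godzilla

Sorry, I misread your question. You can read the file into content in one go and then write it all at once. - Amit

But if j is large, you may run out of memory. 4 GB is a heavy load for RAM. - Davidmh

That works better. How about reading chunks in a loop and writing them to the gzip file? - godzilla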
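
The chunked idea raised in the last comment could be sketched roughly as follows. This is an illustration rather than code from the thread: the function name `gzip_file`, the 16 MiB chunk size, and the 256 MiB "small file" cutoff are all assumptions to be tuned for the machine at hand.

```python
import gzip
import os

CHUNK_SIZE = 16 * 2**20          # 16 MiB per read (assumed, tunable)
SMALL_FILE_LIMIT = 256 * 2**20   # hypothetical cutoff for "fits in memory"


def gzip_file(path, chunk_size=CHUNK_SIZE, small_file_limit=SMALL_FILE_LIMIT):
    """Compress path to path + '.gz' and return the output path."""
    out_path = path + ".gz"
    with open(path, "rb") as f_in, gzip.open(out_path, "wb") as f_out:
        if os.path.getsize(path) <= small_file_limit:
            # Small file: a single read and a single write.
            f_out.write(f_in.read())
        else:
            # Large file: fixed-size chunks keep memory use bounded.
            while True:
                chunk = f_in.read(chunk_size)
                if not chunk:
                    break
                f_out.write(chunk)
    return out_path
```

With a large enough chunk size the two branches are likely to perform similarly, so the chunked loop alone may be sufficient in practice.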

- tdelaney · Accepted Answer

Use shutil.copyfileobj() to copy the file with a larger chunk size. In this example I use 16 MiB chunks, which is quite reasonable.

import gzip
import shutil

MEG = 2**20  # one mebibyte
with open(j, 'rb') as f_in:  # j is the input path, as in the question
    with gzip.open(j + ".gz", 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16*MEG)

You may find that calling out to gzip is faster for large files, especially if you plan to compress several files in parallel.
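
A minimal sketch of that last suggestion, assuming a `gzip` executable is available on the PATH. The names `gzip_external` and `gzip_many` and the default of 4 workers are hypothetical, not part of the answer. Threads are used only to wait on the external processes, which do the actual compression.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def gzip_external(path):
    """Compress path to path + '.gz' with the gzip command-line tool."""
    out_path = path + ".gz"
    with open(out_path, "wb") as f_out:
        # -c writes the compressed stream to stdout and keeps the input file.
        subprocess.run(["gzip", "-c", path], stdout=f_out, check=True)
    return out_path


def gzip_many(paths, workers=4):
    """Compress several files concurrently; returns output paths in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip_external, paths))
```

Whether this beats the in-process version depends on file sizes and the number of cores, so it is worth measuring on real data.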