How do I gzip-compress a string in Python?

97

How do I gzip-compress a string in Python?

gzip.GzipFile exists, but that only works with file objects. What about plain strings?


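1
@KevinDTimm, that documentation only mentions StringIO without really explaining how to use it, so I think asking here is perfectly reasonable. That said, it would have been better to try a few things first and tell us the results. - Alfe
@Alfe - This question was closed 4 years ago for a reason similar to my comment: the OP did not search first. - KevinDTimm
You're right, of course, @KevinDTimm. - Alfe
4
How is this off-topic? - user636044
2
This question is now the top Google result for "python gzip string", and I think it is perfectly reasonable. It should be reopened. - Garrett
2
As said above, this question is one of the top Google results, and one of the answers is correct. It really does look like it should not have been closed. - darkdan21
6 Answers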

167

If you want to produce a complete gzip-compatible binary string, including the header and so on, you can use gzip.GzipFile together with StringIO:

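# Note: this runs as written on Python 2.7. On Python 3, GzipFile writes bytes,
# so io.StringIO raises a TypeError; use io.BytesIO and write bytes there instead.
try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO  # Python 3.x (see the note above)
import gzip
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write("This is mike number one, isn't this a lot of fun?")
out.getvalue()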

# returns '\x1f\x8b\x08\x00\xbd\xbe\xe8N\x02\xff\x0b\xc9\xc8,V\x00\xa2\xdc\xcc\xecT\x85\xbc\xd2\xdc\xa4\xd4"\x85\xfc\xbcT\x1d\xa0X\x9ez\x89B\tH:Q!\'\xbfD!?M!\xad4\xcf\x1e\x00w\xd4\xea\xf41\x00\x00\x00'

2
The inverse operation is: def gunzip_text(text): infile = StringIO.StringIO() infile.write(text) with gzip.GzipFile(fileobj=infile, mode="r") as f: f.rewind() out = f.read() return out - fastmultiplication
5
@fastmultiplication's code can be shortened to: f = gzip.GzipFile(fileobj=StringIO.StringIO(text)); result = f.read(); f.close(); return result. It decompresses the gzipped string and returns the result. - Alfe
3
Sorry, the question was closed so I can't post a new answer, but here is how to do it in Python 3. - Garrett
Possibly unrelated: would it be faster to compress in-memory data first and then write it to local disk? - user3226167
1
In Python 3: import zlib; my_string = "hello world"; my_bytes = zlib.compress(my_string.encode('utf-8')); my_hex = my_bytes.hex(); my_bytes2 = bytes.fromhex(my_hex); my_string2 = zlib.decompress(my_bytes2).decode('utf-8'); assert my_string == my_string2; - ostrokach
Copying and pasting this into IPython on 3.7 gives "TypeError: string argument expected, got 'bytes'". - Chazt3n
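
For readers on Python 3, the inverse operation discussed in the comments above might look like the following. This is only a minimal sketch, not the commenters' code: it assumes io.BytesIO in place of StringIO, and the helper name gzip_text, the encoding parameter, and the UTF-8 default are illustrative assumptions.

```python
import gzip
from io import BytesIO


def gzip_text(text, encoding="utf-8"):
    """Compress a str into gzip-format bytes, entirely in memory."""
    out = BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as f:
        f.write(text.encode(encoding))
    # GzipFile.close() does not close a fileobj it was handed,
    # so the buffer is still readable here.
    return out.getvalue()


def gunzip_text(data, encoding="utf-8"):
    """Decompress gzip-format bytes back into a str."""
    with gzip.GzipFile(fileobj=BytesIO(data), mode="rb") as f:
        return f.read().decode(encoding)


original = "This is mike number one, isn't this a lot of fun?"
compressed = gzip_text(original)
restored = gunzip_text(compressed)
```

The compressed value is a bytes object that starts with the gzip magic number, and restored should equal original.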

72

The easiest way is the zlib encoding:

compressed_value = s.encode("zlib")  # Python 2 only

Then you decompress it with:

plain_string_again = compressed_value.decode("zlib")

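1
@Daniel: Yes, s is an object of type str in Python 2.x. - Sven Marnach
2
See Standard Encodings for where he got this from (scroll down to "codecs"). Also available: s.encode('rot13') and s.encode('base64'). - bobobobo
13
Note that this method is not compatible with the gzip command-line utility: gzip includes a header and checksum, while this mechanism only compresses the content. - tylerl
9
Python 3 is stricter about the distinction between Unicode strings (type "str" in Python 3) and byte strings (type "bytes" in Python 3). str objects have an encode() method that returns a bytes object, and bytes objects have a decode() method that returns a str object. The zlib codec is a special case: it converts from bytes to bytes, so it doesn't fit into this structure. You can use codecs.encode(b, "zlib") and codecs.decode(b, "zlib") instead, where b is a bytes object. - Sven Marnach
1
Note: this answer is wrong. It does not compress to the gzip format, as the question asked. - Mark Adler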
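
The points raised in these comments can be checked with a short Python 3 sketch. It assumes the bytes-to-bytes "zlib" codec described above and the standard gzip module; the variable names are illustrative only.

```python
import codecs
import gzip
import zlib

data = "a long string of characters".encode("utf-8")

# Python 3 replacement for the Python 2 idiom s.encode("zlib"):
# a bytes-to-bytes transform through the codecs module.
zlib_bytes = codecs.encode(data, "zlib")
restored = codecs.decode(zlib_bytes, "zlib")

# The codec output is an ordinary zlib stream, so zlib can read it directly.
restored_direct = zlib.decompress(zlib_bytes)

# A real gzip stream starts with the two magic bytes 0x1f 0x8b.
gzip_bytes = gzip.compress(data)
gzip_magic = gzip_bytes[:2]
zlib_first_byte = zlib_bytes[:1]

# The zlib stream is not a gzip stream, so gzip refuses to read it.
try:
    gzip.decompress(zlib_bytes)
    zlib_is_gzip = True
except OSError:  # gzip.BadGzipFile (an OSError subclass) on Python 3.8+
    zlib_is_gzip = False
```

Here zlib_is_gzip ends up False, which is the incompatibility with the gzip utility that the comments warn about.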

45

Python 3 version of Sven Marnach's 2011 answer:

import gzip
# Long, highly repetitive sample text (the repeat counts here are illustrative)
exampleString = ('abcdefghijklmnopqrstuv' * 100
                 + 'abcdefghijmortenpunnerudengelstadrocksklmnopqrstuv'
                 + 'abcdefghijklmnopqrstuv' * 80
                 + '123')
compressed_value = gzip.compress(bytes(exampleString, 'utf-8'))
plain_string_again = gzip.decompress(compressed_value).decode('utf-8')

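3
In Python 3, zlib is still used, and gzip actually uses zlib under the hood, see: https://docs.python.org/3/library/zlib.html and https://docs.python.org/3/library/gzip.html#module-gzip. - gitaarik
1
My original answer used zlib. I changed it to gzip because that is what the original question asked about. You can easily swap gzip for zlib in my example (search and replace) and it will still work. - Punnerud
2
gzip.decompress returns bytes, so call plain_string_again.decode('utf-8') to get a str object. - milan
Unlike Sven Marnach's answer, this answer is correct, because it produces the gzip format. - Mark Adler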
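
To illustrate the remark that gzip is built on zlib, here is a minimal sketch showing that zlib itself can emit the gzip container when asked to. It relies on the wbits parameter of zlib.compressobj (a value of 16 + zlib.MAX_WBITS selects a gzip header and trailer); the sample data and variable names are assumptions for illustration.

```python
import gzip
import zlib

data = b"abcdefghijklmnopqrstuv" * 50

# wbits = 16 + MAX_WBITS tells zlib to wrap the deflate stream
# in a gzip header and trailer instead of the zlib ones.
GZIP_WBITS = 16 + zlib.MAX_WBITS
compressor = zlib.compressobj(wbits=GZIP_WBITS)
gzip_via_zlib = compressor.compress(data) + compressor.flush()

# The result can be read by the gzip module...
roundtrip_gzip = gzip.decompress(gzip_via_zlib)
# ...and by zlib, as long as it is given the same wbits value.
roundtrip_zlib = zlib.decompress(gzip_via_zlib, wbits=GZIP_WBITS)
```

For everyday use, gzip.compress and gzip.decompress as shown in the answer are simpler; the zlib route is mainly useful when finer control over the stream is needed.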

3

For those who want to compress a Pandas DataFrame as JSON:

Tested with Python 3.6 and Pandas 0.23.

import sys
import zlib, lzma, bz2
import math
import pandas as pd  # needed for read_csv / read_json below

def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

dataframe = pd.read_csv('...') # your CSV file
dataframe_json = dataframe.to_json(orient='split')
data = dataframe_json.encode()
compressed_data = bz2.compress(data)
decompressed_data = bz2.decompress(compressed_data).decode()
dataframe_aux = pd.read_json(decompressed_data, orient='split')

#Original data size:  10982455 10.47 MB
#Encoded data size:  10982439 10.47 MB
#Compressed data size:  1276457 1.22 MB (lzma, slow), 2087131 1.99 MB (zlib, fast), 1410908 1.35 MB (bz2, fast)
#Decompressed data size:  10982455 10.47 MB
print('Original data size: ', sys.getsizeof(dataframe_json), convert_size(sys.getsizeof(dataframe_json)))
print('Encoded data size: ', sys.getsizeof(data), convert_size(sys.getsizeof(data)))
print('Compressed data size: ', sys.getsizeof(compressed_data), convert_size(sys.getsizeof(compressed_data)))
print('Decompressed data size: ', sys.getsizeof(decompressed_data), convert_size(sys.getsizeof(decompressed_data)))

print(dataframe.head())
print(dataframe_aux.head())
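
The same comparison can be tried without Pandas. The sketch below is an assumption-laden stand-in, not the answer's code: the records dictionary is a made-up substitute for dataframe.to_json(orient='split'), and sizes are measured with len() on the bytes, which counts only the payload, whereas sys.getsizeof also includes the Python object overhead.

```python
import bz2
import json
import lzma
import zlib

# Hypothetical stand-in for dataframe.to_json(orient='split')
records = {
    "columns": ["id", "name"],
    "data": [[i, "row %d" % i] for i in range(1000)],
}
data = json.dumps(records).encode("utf-8")

sizes = {"raw": len(data)}
for name, module in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    compressed = module.compress(data)
    # Each module's decompress() must restore the exact input bytes.
    assert module.decompress(compressed) == data
    sizes[name] = len(compressed)

for name, size in sizes.items():
    print(name, size)
```

Which compressor wins depends on the data, so it is worth measuring on a representative sample before choosing one.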

1
Martin Thoma's answer almost works: I had to use BytesIO, as mentioned in this answer.
from io import BytesIO  # Python 3.x, haven't tested 2.7
import gzip
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    # GzipFile expects bytes on Python 3, so encode the str first
    f.write("This is mike number one, isn't this a lot of fun?".encode("utf-8"))
out.getvalue()

The original code produced a "TypeError: string argument expected, got 'bytes'" error.

-4
import gzip

s = "a long string of characters"

g = gzip.open('gzipfilename.gz', 'wb', 5) # ('filename', 'read/write mode', compression level)
g.write(s.encode('utf-8')) # binary mode expects bytes on Python 3
g.close()

6
I guess the question is about compressing a string in memory without having to write it to disk. Otherwise, your answer is entirely correct. - Alfe
