Calculating the CRC of a file in Python

Question

35

I want to calculate the cyclic redundancy check (CRC) of a file and get output like E45A12AC. Here is my code:

#!/usr/bin/env python 
import os, sys
import zlib

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    for eachLine in content:
        zlib.crc32(eachLine)

for eachFile in sys.argv[1:]:
    crc(eachFile)

This code computes a separate CRC for each line, but its output (for example -1767935985) is not what I want.

Hashlib works the way I want, but it computes md5:

import hashlib
m = hashlib.md5()
for line in open('data.txt', 'rb'):
    m.update(line)
print m.hexdigest()

Is it possible to get something similar using zlib.crc32?

- user203547

10 Answers

22

A modification of kobor42's answer that improves performance 2-3 times by reading fixed-size chunks instead of "lines":

import zlib

def crc32(fileName):
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

The returned string includes leading zeros, so it is always eight hex digits long.
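To illustrate why the field width matters, here is a small sketch. The helper name `crc_hex` is made up for this example, and the integer is an arbitrary value chosen to have zero high bits, not a real checksum.

```python
def crc_hex(value, pad=True):
    """Format the low 32 bits of value as uppercase hex.

    crc_hex is a hypothetical helper for this example, not part of zlib.
    """
    fmt = "%08X" if pad else "%X"
    return fmt % (value & 0xFFFFFFFF)

small = 0x00ABCDEF                 # arbitrary example value, not a real CRC
padded = crc_hex(small)            # always eight digits
unpadded = crc_hex(small, False)   # leading zeros are dropped
print(padded, unpadded)
```

With "%X" the width varies with the value, which can matter when the result is compared against a fixed-width checksum string.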

- CrouZ

15

A hashlib-compatible interface for CRC-32:

import zlib

class crc32(object):
    name = 'crc32'
    digest_size = 4
    block_size = 1

    def __init__(self, arg=b''):
        self.__digest = 0
        self.update(arg)

    def copy(self):
        copy = super(self.__class__, self).__new__(self.__class__)
        copy.__digest = self.__digest
        return copy

    def digest(self):
        return self.__digest

    def hexdigest(self):
        return '{:08x}'.format(self.__digest)

    def update(self, arg):
        self.__digest = zlib.crc32(arg, self.__digest) & 0xffffffff

# Now you can define hashlib.crc32 = crc32
import hashlib
hashlib.crc32 = crc32
# Python > 2.7: hashlib.algorithms += ('crc32',)
# Python > 3.2: hashlib.algorithms_available.add('crc32')
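For comparison, a minimal Python 3 sketch of the same idea. The class name `Crc32` and the choice of big-endian bytes for `digest()` are assumptions made for this example; hashlib objects return bytes from `digest()`, so this variant does too.

```python
import zlib

class Crc32:
    """Minimal hashlib-style wrapper around zlib.crc32 (illustrative sketch)."""
    name = "crc32"
    digest_size = 4
    block_size = 1

    def __init__(self, data=b""):
        self._crc = zlib.crc32(data)

    def update(self, data):
        # Feed the previous value back in so the checksum keeps accumulating.
        self._crc = zlib.crc32(data, self._crc)

    def digest(self):
        # Assumption: big-endian, so digest().hex() matches hexdigest().
        return self._crc.to_bytes(4, "big")

    def hexdigest(self):
        return "%08x" % self._crc

    def copy(self):
        clone = Crc32()
        clone._crc = self._crc
        return clone

h = Crc32()
h.update(b"1234")
h.update(b"56789")
print(h.hexdigest())
```

The string "123456789" is the usual CRC-32 check input, so the two `update` calls together should give the same value as a single call over the whole string.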

- Paulo Freitas

9

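To display the lowest 32 bits of any integer as 8 hexadecimal digits (without a sign), bitwise-AND the value with a mask of 32 one-bits and then apply formatting. For example: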

>>> x = -1767935985
>>> format(x & 0xFFFFFFFF, '08x')
'969f700f'

Whether the integer you are formatting comes from zlib.crc32 or from any other computation is not really relevant.
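The masking step above can be sketched as follows, using the negative value from the question. The helper name `to_unsigned32` is hypothetical.

```python
def to_unsigned32(x):
    """Map a possibly negative integer onto its unsigned 32-bit value."""
    return x & 0xFFFFFFFF

signed = -1767935985              # the value reported in the question
unsigned = to_unsigned32(signed)  # same 32 bits, read as unsigned

print(unsigned)
print(format(unsigned, "08x"))    # lowercase, zero-padded
print("%08X" % unsigned)          # uppercase, zero-padded
```

For a negative 32-bit value the mask is equivalent to adding 2**32, which is why the hex digits come out the same as the two's-complement bit pattern.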

- Alex Martelli

1

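Good point about formatting, but it looks like his code does not compute what he wants either. There are really two problems here: 1) computing the CRC of a file, and 2) displaying the CRC value as hex. - Jason Sundram

Not only that, but format is slower than the "%X"%(x & 0xFFFFFFFF) given in kobor42's answer. Still, it is nice to see another approach; I had never used format before. - leetNightshade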

9

Python 3.8+, using the walrus operator:

import zlib

def crc32(filename, chunksize=65536):
    """Compute the CRC-32 checksum of the contents of the given filename"""
    with open(filename, "rb") as f:
        checksum = 0
        while (chunk := f.read(chunksize)) :
            checksum = zlib.crc32(chunk, checksum)
        return checksum

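chunksize is the number of bytes read from the file at a time. Whatever you set chunksize to (it must be greater than 0), you will get the same CRC for the same file, but setting it too low may make the code slow, and setting it too high may use too much memory.

The result is a 32-bit integer. The CRC-32 checksum of an empty file is 0.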
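A quick way to convince yourself of both claims is a sketch like the one below. It assumes Python 3.8+; `crc32_file` mirrors the function above and `crc32_of_temp` is a throwaway helper invented for this demo.

```python
import os
import tempfile
import zlib

def crc32_file(path, chunksize=65536):
    """Chunked CRC-32 of a file, same approach as the answer above."""
    checksum = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunksize):
            checksum = zlib.crc32(chunk, checksum)
    return checksum

def crc32_of_temp(data, chunksize):
    """Write data to a temporary file and checksum it (demo helper only)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return crc32_file(path, chunksize)
    finally:
        os.remove(path)

data = b"123456789" * 1000
# Collect the checksums for several chunk sizes into a set.
results = {crc32_of_temp(data, size) for size in (1, 7, 4096, 65536)}
empty = crc32_of_temp(b"", 65536)
print(results, empty)
```

If the chunk size did affect the result, the set would hold more than one value.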

- user3064538

4

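Edited to include Altren's solution below.

This is a modified version of CrouZ's answer that uses a for loop and file buffering; it is more compact and slightly faster: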

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

Results on a machine with a 6700k and an HDD:

(Note: retested several times; the speeds were consistent.)

Warming up the machine...
Finished.

Beginning tests...
File size: 90288KB
Test cycles: 500

With for loop and buffer.
Result 45.24728019630359 

CrouZ solution
Result 45.433838356097894 

kobor42 solution
Result 104.16215688703986 

Altren solution
Result 101.7247863946586

Tested on Python 3.6.4 x64 with the following script:

import os, timeit, zlib, random, binascii

def forLoopCrc(fpath):
    """With for loop and buffer."""
    crc = 0
    with open(fpath, 'rb', 65536) as ins:
        for x in range(int((os.stat(fpath).st_size / 65536)) + 1):
            crc = zlib.crc32(ins.read(65536), crc)
    return '%08X' % (crc & 0xFFFFFFFF)

def crc32(fileName):
    """CrouZ solution"""
    with open(fileName, 'rb') as fh:
        hash = 0
        while True:
            s = fh.read(65536)
            if not s:
                break
            hash = zlib.crc32(s, hash)
        return "%08X" % (hash & 0xFFFFFFFF)

def crc(fileName):
    """kobor42 solution"""
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

def crc32altren(filename):
    """Altren solution"""
    buf = open(filename,'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash

fpath = r'D:\test\test.dat'
tests = {forLoopCrc: 'With for loop and buffer.', 
     crc32: 'CrouZ solution', crc: 'kobor42 solution',
         crc32altren: 'Altren solution'}
count = 500

# CPU, HDD warmup
randomItm = [x for x in tests.keys()]
random.shuffle(randomItm)
print('\nWarming up the machine...')
for c in range(count):
    randomItm[0](fpath)
print('Finished.\n')

# Begin test
print('Beginning tests...\nFile size: %dKB\nTest cycles: %d\n' % (
    os.stat(fpath).st_size/1024, count))
for x in tests:
    print(tests[x])
    start_time = timeit.default_timer()
    for c in range(count):
        x(fpath)
    print('Result', timeit.default_timer() - start_time, '\n')

It is faster because for loops are faster than while loops (sources: here and here).
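As a side note, a for loop over chunks can also be written without looking up the file size, using two-argument `iter()`. This is a different technique from the answer above and was not part of the benchmark, so no speed claim is made for it; the name `crc32_iter` is made up for this sketch.

```python
import os
import tempfile
import zlib
from functools import partial

def crc32_iter(path, chunksize=65536):
    """CRC-32 of a file, read in chunks via two-argument iter()."""
    crc = 0
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunksize)
        # until it returns b"", i.e. end of file.
        for chunk in iter(partial(f.read, chunksize), b""):
            crc = zlib.crc32(chunk, crc)
    return "%08X" % (crc & 0xFFFFFFFF)

# Small self-check on a throwaway file.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"123456789")
result = crc32_iter(tmp_path)
result_small_chunks = crc32_iter(tmp_path, chunksize=2)
os.remove(tmp_path)
print(result)
```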

- Polemos

2

Combining the two snippets above gives:

try:
    fd = open(decompressedFile,"rb")
except IOError:
    logging.error("Unable to open the file in readmode:" + decompressedFile)
    return 4
eachLine = fd.readline()
prev = 0
while eachLine:
    prev = zlib.crc32(eachLine, prev)
    eachLine = fd.readline()
fd.close()

- sunsys

2

There is a faster and more compact way to compute the CRC, using binascii:

import binascii

def crc32(filename):
    buf = open(filename,'rb').read()
    hash = binascii.crc32(buf) & 0xFFFFFFFF
    return "%08X" % hash
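Note that the function above reads the whole file into memory at once. As a hedged sketch of how `binascii.crc32` relates to `zlib.crc32`, the example below checks that both give the same value for the standard check string, and that `binascii.crc32` also accepts a running value for chunked input.

```python
import binascii
import zlib

data = b"123456789"  # the usual CRC-32 check string

whole = binascii.crc32(data) & 0xFFFFFFFF
same = zlib.crc32(data) & 0xFFFFFFFF

# binascii.crc32 takes an optional starting value, so chunked input works too.
running = 0
for i in range(0, len(data), 4):
    running = binascii.crc32(data[i:i + 4], running)
running &= 0xFFFFFFFF

print("%08X" % whole)
```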

- Altren

0

You can use base64 to get output like [ERD45FTR]. zlib.crc32 also accepts a running value, so the checksum can be updated incrementally.

import os, sys
import zlib
import base64

def crc(fileName):
    fd = open(fileName,"rb")
    content = fd.readlines()
    fd.close()
    prev = None
    for eachLine in content:
        if not prev:
            prev = zlib.crc32(eachLine)
        else:
            prev = zlib.crc32(eachLine, prev)
    return prev

for eachFile in sys.argv[1:]:
    print base64.b64encode(str(crc(eachFile)))

- bhups

Thanks for the syntax. I get LTc3NzI0ODI2, but I want E45A12AC (8 digits). I tried base32 and base16. - user203547
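Decoding the string from the comment above shows what happened: base64 was applied to the decimal text of a signed integer, not to the checksum bits. A small sketch (Python 3), with the mask-and-format step from the other answers applied to the decoded value:

```python
import base64

encoded = b"LTc3NzI0ODI2"            # output reported in the comment above
decoded = base64.b64decode(encoded)   # decimal text of a signed integer
value = int(decoded.decode("ascii"))

# Mask to 32 bits and format as eight hex digits.
hex8 = "%08X" % (value & 0xFFFFFFFF)
print(decoded, hex8)
```

The hex string is specific to whatever file produced that output; the point is only that hex formatting, not base64, yields the 8-digit form.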

0

Solution:

import os, sys
import zlib

def crc(fileName, excludeLine="", includeLine=""):
    try:
        fd = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in readmode: %s" % fileName)
        return
    eachLine = fd.readline()
    prev = 0  # start at 0 so an empty file does not fail on the mask below
    while eachLine:
        if excludeLine and eachLine.startswith(excludeLine):
            # read the next line before skipping, otherwise this loops forever
            eachLine = fd.readline()
            continue
        prev = zlib.crc32(eachLine, prev)
        eachLine = fd.readline()
    fd.close()
    return format(prev & 0xFFFFFFFF, '08x')  # returns 8 digits crc

for eachFile in sys.argv[1:]:
    print(crc(eachFile))

I am not quite sure what (excludeLine="", includeLine="") is for...
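One plausible reading, and it is only a guess, is that excludeLine is a prefix for lines that should be left out of the checksum. A sketch under that assumption, with the hypothetical names `crc_lines` and `exclude_prefix`:

```python
import zlib

def crc_lines(lines, exclude_prefix=b""):
    """CRC-32 over an iterable of byte lines, skipping lines with a prefix.

    exclude_prefix is a guess at what excludeLine was meant to do.
    """
    crc = 0
    for line in lines:
        if exclude_prefix and line.startswith(exclude_prefix):
            continue  # safe here: the for loop advances on its own
        crc = zlib.crc32(line, crc)
    return format(crc & 0xFFFFFFFF, "08x")

lines = [b"# header\n", b"alpha\n", b"# note\n", b"beta\n"]
filtered = crc_lines(lines, exclude_prefix=b"#")
expected = crc_lines([b"alpha\n", b"beta\n"])
print(filtered == expected)
```

Since a file opened in binary mode iterates over byte lines, the open file object could be passed in place of the list.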

- user203547

2

I know this is old, but I will explain anyway. I downvoted you because I do not think it is useful to post code you do not understand. - datashaman

- kobor42 · Accepted Answer

A more compact and optimized version of the code.

def crc(fileName):
    prev = 0
    for eachLine in open(fileName,"rb"):
        prev = zlib.crc32(eachLine, prev)
    return "%X"%(prev & 0xFFFFFFFF)

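PS2: The old PS has been deprecated and removed, following a suggestion in the comments. Thank you. I do not know how I missed that, but it is really good.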
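The reason the code above works is that `zlib.crc32` takes an optional starting value, so feeding each result into the next call gives the same checksum as one call over all the data. A small sketch with made-up sample bytes:

```python
import zlib

data = b"first line\nsecond line\nthird"
pieces = data.splitlines(keepends=True)

# One-shot checksum over the whole buffer.
whole = zlib.crc32(data)

# Line by line, feeding each result back in as the starting value.
running = 0
for line in pieces:
    running = zlib.crc32(line, running)

# Without the second argument every call starts over, as in the question,
# so this yields one unrelated checksum per line.
restarted = [zlib.crc32(line) for line in pieces]
print(whole == running)
```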