How can I calculate multiple hashes at the same time?

Question

3

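I want to use multiprocessing to calculate several hashes of the same file in order to save time.

From what I can see, reading a file from an SSD is relatively fast, but hashing it is almost 4 times slower. If I want to calculate 2 different hashes (md5 and sha), it is 8 times slower. I'd like to calculate the different hashes in parallel on different processor cores (up to 4, depending on the settings), but I don't understand how to get around the GIL (global interpreter lock).

Here is my current code (hash.py):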

import hashlib
from io import DEFAULT_BUFFER_SIZE

file = 'test/file.mov'  # 50 MB file

def hash_md5(file):
    md5 = hashlib.md5()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest()

def hash_sha(file):
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return sha.hexdigest()

def hash_md5_sha(file):
    md5 = hashlib.md5()
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest(), sha.hexdigest()

def read_file(file):
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return

I ran some tests, and here are the results:

>>> from hash import *
>>> from timeit import timeit
>>> timeit(stmt='read_file(file)',globals=globals(),number = 100)
1.6323043460000122
>>> timeit(stmt='hash_md5(file)',globals=globals(),number = 100)
8.137973076999998
>>> timeit(stmt='hash_sha(file)',globals=globals(),number = 100)
7.1260356809999905
>>> timeit(stmt='hash_md5_sha(file)',globals=globals(),number = 100)
13.740918666999988

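The result should be a function: the main script will iterate through a list of files and should check different hashes (from 1 to 4 of them) for different files. Any ideas how I can achieve that?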

- Andrey Valentsov

2

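You can use the ProcessPoolExecutor class from the concurrent.futures module. I believe it will help you achieve what you want. You can find more details about the library here: concurrent.futures.

See also: Is there a way to compute multiple digests (md5, sha256) at the same time?

1 Answer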

- Mihail Feraru · Answer 1

As someone mentioned in the comments, you can use concurrent.futures. I ran some benchmarks, and the most efficient approach was ProcessPoolExecutor. Here is an example:

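from concurrent.futures import ProcessPoolExecutor

# hash_function must be a top-level (picklable) function; files is an iterable of paths.
with ProcessPoolExecutor(4) as executor:
    results = list(executor.map(hash_function, files))  # one result per file, in input order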
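A slightly fuller sketch of the same idea, covering the "1 to 4 hashes per file" requirement from the question. The names `hash_file` and `hash_files` and the `algorithms` parameter are assumptions made for illustration; they are not taken from the benchmark code. Each worker process reads one file once and feeds every chunk to all requested hash objects.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from io import DEFAULT_BUFFER_SIZE


def hash_file(path, algorithms=("md5", "sha1")):
    """Read the file once and feed every chunk to each requested hash."""
    # hashlib.new() accepts algorithm names such as "md5", "sha1", "sha256".
    hashers = [hashlib.new(name) for name in algorithms]
    with open(path, mode="rb") as fl:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for chunk in iter(lambda: fl.read(DEFAULT_BUFFER_SIZE), b""):
            for hasher in hashers:
                hasher.update(chunk)
    return path, {hasher.name: hasher.hexdigest() for hasher in hashers}


def hash_files(paths, algorithms=("md5", "sha1"), max_workers=4):
    """Hash several files in parallel, one file per task (hypothetical helper)."""
    # partial() of a top-level function stays picklable, which the pool requires.
    worker = partial(hash_file, algorithms=tuple(algorithms))
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # map() yields (path, digests) pairs in input order.
        return dict(executor.map(worker, paths))
```

Call `hash_files(paths)` from under an `if __name__ == "__main__":` guard, since platforms that start workers with the spawn method re-import the main module. This parallelises across files; it does not split the hashes of a single file across cores.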

If you want to take a look at my benchmarks, you can find them here, along with the results:

Total using read_file: 10.121980099997018
Total using hash_md5_sha: 40.49621040000693
Total (multi-thread) using read_file: 6.246223400000417
Total (multi-thread) using hash_md5_sha: 19.588415799999893
Total (multi-core) using read_file: 4.099713300000076
Total (multi-core) using hash_md5_sha: 14.448464199999762

I used 40 files of 300 MiB each for the tests.
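For the question's case of several hashes over the same file, threads may already help: the hashlib documentation notes that the GIL is released while hashing data larger than 2047 bytes, which fits the multi-thread speedup in the numbers above. The following is a minimal sketch under that assumption, not the benchmark code; `hash_file_threaded` and `CHUNK_SIZE` are hypothetical names, and the 1 MiB chunk size is a guess that would need tuning.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Assumed chunk size: large chunks keep per-chunk thread hand-off overhead small.
CHUNK_SIZE = 1024 * 1024


def hash_file_threaded(path, algorithms=("md5", "sha1")):
    """Read the file once and update each hash for a chunk in its own thread."""
    hashers = [hashlib.new(name) for name in algorithms]
    with open(path, mode="rb") as fl, ThreadPoolExecutor(len(hashers)) as pool:
        for chunk in iter(lambda: fl.read(CHUNK_SIZE), b""):
            # Each hasher is touched by exactly one task per chunk, and list()
            # waits for all of them before the next chunk is read, so updates
            # to a given hasher always happen in file order.
            list(pool.map(lambda hasher: hasher.update(chunk), hashers))
    return {hasher.name: hasher.hexdigest() for hasher in hashers}
```

Whether this beats the sequential `hash_md5_sha` depends on the chunk size and the machine, so it is worth timing with `timeit` before relying on it.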