Python中内部hash()函数的复杂度

10

我知道哈希表查找的平均情况是 O(1),但这是否包括计算给定输入的哈希值本身所需的时间?我已经尝试在谷歌上寻找答案,阅读了所有相关文档,但无法找到 Python 中内部函数 hash() 的具体实现。一些网站声称计算哈希值只需要恒定的时间,但也有人说它是 O(k),其中 k 是输入长度。如果您能帮助我找到正确的答案,我将不胜感激。提前感谢:)


2
如果你懂C语言,可以看一下pyhash.c源代码 - PM 2Ring
2个回答

5

这完全取决于被哈希的类型。以下是一些在CPython 2.7.13中进行的简单测试,但这不是唯一的选项:

>>> pprint.pprint([(i, timeit.timeit('hash(n)', setup='n="a"*{}'.format(6400*i), number=1)) for i in range(16)])
[(0, 1.9073486328125e-06),
 (1, 1.6927719116210938e-05),
 (2, 3.314018249511719e-05),
 (3, 4.887580871582031e-05),
 (4, 6.4849853515625e-05),
 (5, 8.106231689453125e-05),
 (6, 9.679794311523438e-05),
 (7, 0.00011301040649414062),
 (8, 0.00012993812561035156),
 (9, 0.00014495849609375),
 (10, 0.00016188621520996094),
 (11, 0.0001780986785888672),
 (12, 0.00019288063049316406),
 (13, 0.0002090930938720703),
 (14, 0.000225067138671875),
 (15, 0.00024199485778808594)]
>>> pprint.pprint([(i, timeit.timeit('hash(n)', setup='n="a"*{}'.format(6400*i))) for i in range(16)])
[(0, 0.09920382499694824),
 (1, 0.09032988548278809),
 (2, 0.09069585800170898),
 (3, 0.09006309509277344),
 (4, 0.09059309959411621),
 (5, 0.09033513069152832),
 (6, 0.09037399291992188),
 (7, 0.09031510353088379),
 (8, 0.09110498428344727),
 (9, 0.09074902534484863),
 (10, 0.0909719467163086),
 (11, 0.09081602096557617),
 (12, 0.09112405776977539),
 (13, 0.09091711044311523),
 (14, 0.09103798866271973),
 (15, 0.09085893630981445)]

请注意,对于新创建的字符串进行哈希处理的时间复杂度为O(n),但是每个字符串都会缓存其哈希值,因此在重复使用时(number=1000000timeit的默认值),它的时间复杂度会变为常数级别。
>>> pprint.pprint([(i, timeit.timeit('hash(n)', setup='n=2**{}'.format(64*i))) for i in range(16)])
[(0, 0.09280180931091309),
 (1, 0.09100484848022461),
 (2, 0.09413909912109375),
 (3, 0.09609699249267578),
 (4, 0.10647201538085938),
 (5, 0.1146399974822998),
 (6, 0.12569880485534668),
 (7, 0.1291029453277588),
 (8, 0.13350296020507812),
 (9, 0.1369338035583496),
 (10, 0.14037799835205078),
 (11, 0.14420413970947266),
 (12, 0.1485278606414795),
 (13, 0.15162205696105957),
 (14, 0.15520405769348145),
 (15, 0.15993809700012207)]

long也是O(n),其中n是数字的宽度,因此数量级的对数。粒度是digit的粒度,通常为2 ** 30,特别是对于较小的整数可以直接用作哈希。

其他对象将具有自己的行为,例如元组将大致汇总其内容的哈希时间。


1
Python如何知道字符串的哈希值是否被缓存,而无需先计算哈希值? - Voyager
通过存储一个表示哈希值尚未计算的值;例如,-1的哈希值 - Yann Vernier

2

我做了一个小测试来验证假设。结果似乎不依赖于输入长度的长短。

import datetime

x = ['a','aa','aaaaaaaaaaaaaaaaaaaaaaaaaaaa','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa']
for i in range(len(x)):

    for j in range(len(x)):
        print "Checking for: " + x[i] + " " + x[j]
        a = datetime.datetime.now()

        h = hash((x[i],x[j])) 
        b = datetime.datetime.now()
        c = b - a
        print "Time taken : " + str(c.microseconds) 

结果

Checking for: a a
Time taken : 87
Checking for: a aa
Time taken : 10
Checking for: a aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: a aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: a aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aa a
Time taken : 9
Checking for: aa aa
Time taken : 8
Checking for: aa aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaa a
Time taken : 10
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaa aa
Time taken : 8
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 8
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 11
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a
Time taken : 10
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 8
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9
Checking for: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Time taken : 9

2
这是CPython str的一个属性,它会缓存哈希值。其他类型可能会有较慢的__hash__方法。 - Yann Vernier
我的想法是传达hash()时间复杂度不是O(k),其中k是输入的长度。我尝试使用相同类型的int输入和list输入进行了相同的尝试,并得到了类似的结果。绝对时间肯定是不同的,但是相对时间在这里应该是有意义的吧? - Vizag
long(现在是 int)取决于数字的大小(基本上是以 ULONG_MAX 为底的对数,但步长更可能受缓存影响),而 list 是可变的,因此不可哈希化。 - Yann Vernier
4
准确地计时并不容易,例如,构建元组(和垃圾回收)所需的时间可能足以淹没执行哈希所需的时间。而 datetime.datetime.now() 的设计是生成单调时间戳,而不是准确的时间间隔测量。更好的性能测试函数是 time.perf_counter。另请参阅 timeit 模块,它简化了计时测试,但请记住 Yann Vernier 关于缓存的说法。 - PM 2Ring
1
感谢 @PM2Ring 的提示。 - Vizag
谢谢大家,我现在明白了 :) - Noy

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接