我目前正尝试理解Python内置的frozenset
数据类型所定义的哈希函数机制。下面是该函数的实现方法以供参考。我特别关注的是这个散列操作的选择背后的原因:
lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167
这里的h
是每个元素的哈希值。有人知道这些哈希值从哪来的吗?也就是说,选择这些数字是否有特定原因?还是仅仅任意选择的?
下面是官方CPython实现中的代码片段:
static Py_hash_t
frozenset_hash(PyObject *self)
{
PySetObject *so = (PySetObject *)self;
Py_uhash_t h, hash = 1927868237UL;
setentry *entry;
Py_ssize_t pos = 0;
if (so->hash != -1)
return so->hash;
hash *= (Py_uhash_t)PySet_GET_SIZE(self) + 1;
while (set_next(so, &pos, &entry)) {
/* Work to increase the bit dispersion for closely spaced hash
values. The is important because some use cases have many
combinations of a small number of elements with nearby
hashes so that many distinct combinations collapse to only
a handful of distinct hash values. */
h = entry->hash;
hash ^= (h ^ (h << 16) ^ 89869747UL) * 3644798167UL;
}
hash = hash * 69069U + 907133923UL;
if (hash == -1)
hash = 590923713UL;
so->hash = hash;
return hash;
}
还有一个Python中的等效实现:
def _hash(self):
MAX = sys.maxint
MASK = 2 * MAX + 1
n = len(self)
h = 1927868237 * (n + 1)
h &= MASK
for x in self:
hx = hash(x)
h ^= (hx ^ (hx << 16) ^ 89869747) * 3644798167
h &= MASK
h = h * 69069 + 907133923
h &= MASK
if h > MAX:
h -= MASK + 1
if h == -1:
h = 590923713
return h