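What is the best 32-bit hash function for relatively short strings?
The strings are tag names that consist of English letters, digits, spaces and some additional characters (#, $, ., ...). For example: Unit testing, C# 2.0.
I am looking for "best" as in "minimal collisions"; performance is not important for my goals.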
I'm not sure whether it is the best choice, but here is a hash function for strings, from The Practice of Programming (Hash Tables, p. 57):
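/* MULTIPLIER is not defined in the snippet as quoted; 31 is assumed here (37 is the other common choice). */
#define MULTIPLIER 31

/* hash: compute hash value of string */
unsigned int hash(char *str)
{
    unsigned int h;
    unsigned char *p;

    h = 0;
    for (p = (unsigned char *)str; *p != '\0'; p++)
        h = MULTIPLIER * h + *p;
    return h; // or, h % ARRAY_SIZE;
}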
Empirically, the multiplier values 31 and 37 have proven to be good choices in hash functions for ASCII strings.
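Sorry for the very late reply on this. Earlier this year I wrote a page titled "Hashing Short Strings" which might be helpful in this discussion. In summary, I found that CRC-32 and FNV-1a are excellent choices for hashing short strings. They are efficient and, in my tests, produced widely distributed, collision-free hashes. I was surprised to find that MD5, SHA-1 and SHA-3 produced a small number of collisions when their output was folded down to 32 bits.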
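For readers who want to try the two functions named above, here is a minimal sketch in C. The function names fnv1a_32 and crc32_bitwise are my own, and the CRC variant shown is assumed to be the common reflected one (polynomial 0xEDB88320, as in zlib); the answer does not say which implementations were actually tested. The bitwise CRC loop favors clarity over speed, which matches the "performance is not important" constraint in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = 2166136261u;            /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant, as in zlib). */
uint32_t crc32_bitwise(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;          /* initial value */
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;            /* final XOR */
}
```

A tag name would then be hashed as fnv1a_32(tag, strlen(tag)) or crc32_bitwise(tag, strlen(tag)), for example with tag = "C# 2.0".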
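Use the _mm_crc32_uxx intrinsics, as they are optimal for short strings. (They are optimal for long keys too, but there it is better to use Adler's threaded version, as in zlib.)
Another option is the MaPrime2c hash function: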
static const unsigned char sTable[256] =
{
0xa3,0xd7,0x09,0x83,0xf8,0x48,0xf6,0xf4,0xb3,0x21,0x15,0x78,0x99,0xb1,0xaf,0xf9,
0xe7,0x2d,0x4d,0x8a,0xce,0x4c,0xca,0x2e,0x52,0x95,0xd9,0x1e,0x4e,0x38,0x44,0x28,
0x0a,0xdf,0x02,0xa0,0x17,0xf1,0x60,0x68,0x12,0xb7,0x7a,0xc3,0xe9,0xfa,0x3d,0x53,
0x96,0x84,0x6b,0xba,0xf2,0x63,0x9a,0x19,0x7c,0xae,0xe5,0xf5,0xf7,0x16,0x6a,0xa2,
0x39,0xb6,0x7b,0x0f,0xc1,0x93,0x81,0x1b,0xee,0xb4,0x1a,0xea,0xd0,0x91,0x2f,0xb8,
0x55,0xb9,0xda,0x85,0x3f,0x41,0xbf,0xe0,0x5a,0x58,0x80,0x5f,0x66,0x0b,0xd8,0x90,
0x35,0xd5,0xc0,0xa7,0x33,0x06,0x65,0x69,0x45,0x00,0x94,0x56,0x6d,0x98,0x9b,0x76,
0x97,0xfc,0xb2,0xc2,0xb0,0xfe,0xdb,0x20,0xe1,0xeb,0xd6,0xe4,0xdd,0x47,0x4a,0x1d,
0x42,0xed,0x9e,0x6e,0x49,0x3c,0xcd,0x43,0x27,0xd2,0x07,0xd4,0xde,0xc7,0x67,0x18,
0x89,0xcb,0x30,0x1f,0x8d,0xc6,0x8f,0xaa,0xc8,0x74,0xdc,0xc9,0x5d,0x5c,0x31,0xa4,
0x70,0x88,0x61,0x2c,0x9f,0x0d,0x2b,0x87,0x50,0x82,0x54,0x64,0x26,0x7d,0x03,0x40,
0x34,0x4b,0x1c,0x73,0xd1,0xc4,0xfd,0x3b,0xcc,0xfb,0x7f,0xab,0xe6,0x3e,0x5b,0xa5,
0xad,0x04,0x23,0x9c,0x14,0x51,0x22,0xf0,0x29,0x79,0x71,0x7e,0xff,0x8c,0x0e,0xe2,
0x0c,0xef,0xbc,0x72,0x75,0x6f,0x37,0xa1,0xec,0xd3,0x8e,0x62,0x8b,0x86,0x10,0xe8,
0x08,0x77,0x11,0xbe,0x92,0x4f,0x24,0xc5,0x32,0x36,0x9d,0xcf,0xf3,0xa6,0xbb,0xac,
0x5e,0x6c,0xa9,0x13,0x57,0x25,0xb5,0xe3,0xbd,0xa8,0x3a,0x01,0x05,0x59,0x2a,0x46
};

#define PRIME_MULT 1717

unsigned int
maPrime2cHash (unsigned char *str, unsigned int len)
{
    unsigned int hash = len, i;

    for (i = 0; i != len; i++, str++)
    {
        hash ^= sTable[( *str + i) & 255];
        hash = hash * PRIME_MULT;
    }
    return hash;
}
See www.amsoftware.narod.ru/algo2.html for tests of MaFastPrime, MaRushPrime and others.
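#include <cstdint>
#include <cstring>     // std::memcpy
#include <string_view> // C++17

uint32_t short_string_hash(std::string_view str) {
    uint32_t hash = 0;
    const uint32_t num32 = static_cast<uint32_t>(str.length() / sizeof(uint32_t));
    constexpr uint32_t magic = 37;
    // Consume the input in 4-byte blocks. std::memcpy replaces the original
    // reinterpret_cast read, which was unaligned and violated strict aliasing.
    // The block value, and so the hash, depends on the platform's byte order.
    for (uint32_t i = 0; i < num32; ++i) {
        uint32_t block;
        std::memcpy(&block, str.data() + i * sizeof(uint32_t), sizeof(block));
        hash = (magic * hash) + block;
    }
    // Fold in the remaining 0 to 3 bytes one at a time.
    str.remove_prefix(num32 * sizeof(uint32_t));
    for (const char c : str) {
        hash = (magic * hash) + c;
    }
    return hash;
}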
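I tested this hash on 15,280 unique short strings and found no collisions.
Note: this is basically just a batched version of https://dev59.com/3HE95IYBdhLWcg3wWMdQ#2351171. It groups the input into u32 blocks, then processes the remainder.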
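A collision check like the one described above can be sketched as follows. This is not the test that was actually run: the helper names hash37, cmp_u32 and count_collisions are hypothetical, and hash37 (the byte-wise multiplier-37 hash) is only a stand-in for whichever function is under test. The idea is to hash every unique key, sort the hash values, and count adjacent duplicates.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in hash under test: byte-wise multiplicative hash with multiplier 37. */
static uint32_t hash37(const char *s)
{
    uint32_t h = 0;

    for (; *s != '\0'; s++)
        h = 37u * h + (unsigned char)*s;
    return h;
}

/* qsort comparator for uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;

    return (x > y) - (x < y);
}

/*
 * Counts the keys whose hash duplicates that of another key.
 * Assumes the keys themselves are unique.
 * Returns (size_t)-1 if memory allocation fails.
 */
size_t count_collisions(const char *const *keys, size_t n)
{
    uint32_t *hashes;
    size_t i, collisions = 0;

    if (n == 0)
        return 0;
    hashes = malloc(n * sizeof *hashes);
    if (hashes == NULL)
        return (size_t)-1;

    for (i = 0; i < n; i++)
        hashes[i] = hash37(keys[i]);

    /* After sorting, equal hash values sit next to each other. */
    qsort(hashes, n, sizeof *hashes, cmp_u32);
    for (i = 1; i < n; i++)
        if (hashes[i] == hashes[i - 1])
            collisions++;

    free(hashes);
    return collisions;
}
```

With a simple multiplier, short collisions are easy to construct by hand: "aZ" and "b5" both hash to 3679 under hash37 (37 * 97 + 90 = 37 * 98 + 53), so a zero count on one data set says little about other inputs.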