一个适用于C语言的最小哈希函数?

46

我不能使用boost:hash,因为我必须坚持使用C语言,不能使用C++。

但是,我需要对大量(10K到100k)的令牌字符串(长度为5到40字节)进行哈希,以便在其中搜索最快。

MD5、SHA1或任何长哈希函数似乎对于简单的任务来说太重了,我不做加密。此外还有存储和计算成本。

因此我的问题是:

  1. 在大多数实际情况下保证避免冲突的最简单的哈希算法是什么?

  2. 应该使用多少位来进行哈希值?我正在开发32位系统。Perl/Python中的哈希算法是否也使用32位哈希?还是我必须跳到64位?

  3. 关于常见脚本语言中哈希表的实现:实现是否检查冲突,还是我可以完全避开这一部分?


你考虑过使用GLib吗?https://developer.gnome.org/glib/2.46/glib-Hash-Tables.html - Bastien Léonard
25
以下页面包含多种用C语言(以及其他许多编程语言)实现的通用哈希函数:http://partow.net/programming/hashfunctions/index.html - Matthieu N.
另请参阅:字符串的哈希函数。我在这里发布了K&R和GCC C++11的哈希函数,供一些测试使用。 - undefined
7个回答

24

3
我建议你看一下谢家华的分析中忽略了的一个选项:MurmurHash2。http://en.wikipedia.org/wiki/MurmurHash - Steven Sudit
@StevenSudit,Murmur3(Murmur的当前版本)实际上比Paul Hsieh的SuperFastHash慢一点。然而,SuperFastHash在单词和名称上有很多碰撞。而且它也不如FarmHash那样快。所以我不建议使用它。 - undefined

13
  1. 这里有一个不错的概述,介绍了一些知名的哈希函数。

  2. 32位应该能够正常工作。

  3. 除非您想编写一个有趣的哈希表,否则您总是需要检查冲突。


如果您不特别关心得到哪个答案,那么您不需要检查冲突。优点是您不必在哈希表中存储原始键,因此可以节省大量空间。 - Zan Lynx
2
嗯,这种非确定性的行为就是我所说的“有趣”。 - arul

8
一般的哈希函数用于哈希表查找。它明确指出不要用于加密目的,但由于您已经明确表示没有这样的意图,所以应该没问题。

它包括哈希函数调查可供尝试


5

如果您在一个类posix的系统上,并坚持使用纯C,我建议您直接使用系统已有的工具。man 3 hcreate提供了所有细节,或者您可以在这里找到在线版本http://linux.die.net/man/3/hcreate


2

有几行代码里有相当不错的哈希函数,但它们不如针对特定CPU系列进行优化的函数快速。因此,“最小化”的低代码函数只应在哈希性能不是主要关注点,或者在减少空间(RAM使用或代码大小)比速度更重要时使用。

话虽如此,一些最好的简单哈希函数包括FNV1aMurmurOAATRSHashSDBM。后两者是四个中最快的。然而,SDBM也很容易记住,所以它可能是其中最有名的。

SDBM是“乘法哈希函数”的一个例子,DJB2和K&Rv2也属于这类函数。这些函数只是将当前哈希值乘以一个奇数,并加上下一个字节:hash = hash * factor + data[i]。SDBM使用65599作为因子,DJB2使用33,K&Rv2使用31。

RSHash和FNV1a比SDBM更好,因为它们使用了更大的因子。而MurmurOAAT更好,因为它还使用了右移操作(除了“加或异或字节”和乘法)。这样产生的avalanching效果更好。
当你使用全部64位或者可能在64位架构上开发时,使用64位哈希是有意义的。如果你将哈希用于哈希表,使用64位哈希就没有太多意义,因为你的哈希表很可能不会有数十亿个槽位。但是,如果你将哈希用作指纹(校验和),64位哈希比32位哈希要好得多。
以下是所提到的哈希函数之间的比较,以及与一些现代优化函数(如Murmur3和FarmHash)的相对比较。还包括其他一些在其他答案中提到的哈希函数。其中大多数函数是32位的。我还包括了一些截断为32位的64位哈希函数进行比较。
(1) 碰撞和avalanching
                       |        Collisions           |  Avalanching
Hash function          | random 16B |  words + names | words + names
=====================================================================
FNV1a                  |     0.20%  |     45,  0.01% |        98.0
MurmurOAAT             |     0.20%  |     50,  0.01% |       100.0
RSHash                 |     0.20%  |     51,  0.01% |        95.5
SDBM                   |     0.20%  |     53,  0.01% |        93.1
Jenkins OAAT           |     0.19%  |     66,  0.01% |       100.0
K&R v2                 |     0.20%  |    767,  0.12% |        89.8
SuperFastHash          |     0.21%  |    808,  0.13% |       100.0
DJB2                   |     0.20%  |    856,  0.13% |        89.7
XOR_rotate             |     0.19%  |   2156,  0.34% |        89.5
DEKHash                |     0.19%  |   2312,  0.36% |        88.6
Fletcher32             |     0.19%  |   5274,  0.82% |        90.9
Adler32                |    67.44%  | 371448, 58.03% |        73.9
---------------------------------------------------------------------
MurmurHash3_x86_32     |     0.19%  |     42,  0.01% |       100.0
MurmurHash3_x64_32     |     0.20%  |     50,  0.01% |       100.0
farmhash_fingerprint32 |     0.20%  |     42,  0.01% |       100.0
farmhash_fingerprint64 |     0.19%  |     36,  0.01% |       100.0  <-- best
CRC32                  |     0.20%  |     45,  0.01% |        99.8

Adler32是唯一一个在16字节随机缓冲区上表现不佳且产生大量碰撞的算法。它是一个真正糟糕的哈希函数!
对于单词和名称的性能差是我对哈希算法质量差的一个指标。这就是为什么我不会推荐在这个数据集上有超过0.01%碰撞的任何算法。
(2) 速度
                       |                   Time, ms
Hash function          | random 16B | random 16MB | words + names 
==================================================================
FNV1a                  |     132.3  |      339.3  |         12.9
MurmurOAAT             |     170.3  |      500.7  |         11.9
RSHash                 |     122.2  |      250.5  |         10.9
SDBM                   |     116.1  |      250.8  |         11.3
Jenkins OAAT           |     159.0  |      417.8  |         11.2
K&R v2                 |     116.1  |      250.2  |         11.1
SuperFastHash          |      71.0  |      127.0  |          9.8
DJB2                   |     169.5  |      251.0  |         10.8
XOR_rotate             |     119.8  |      167.6  |         10.9
DEKHash                |     115.5  |      167.0  |         10.9
Fletcher32             |      79.0  |       52.2  |         18.5
Adler32                |     391.6  |      669.0  |         12.3
------------------------------------------------------------------
MurmurHash3_x86_32     |      93.0  |      124.3  |         10.9
MurmurHash3_x64_32     |      96.3  |       53.6  |         13.5
farmhash_fingerprint32 |      87.2  |       52.2  |          5.4
farmhash_fingerprint64 |      48.8  |       17.3  |          5.7  <-- fastest
CRC32 (bit-by-bit)     |    1189.6  |     2091.1  |         23.0
CRC32 (table-driven)   |     262.0  |      667.3  |         11.1

对于计时,我在一台Apple M1 Pro机器上编译并运行了我的代码。在Intel x86-64架构上,特别是对于针对Intel指令集进行了大量优化的FarmHash,你可能会得到不同的速度。但总体情况可能保持不变。
(3) 代码
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define SMALL   16
#define LARGE   (1 << 24)   // size of 16 MiB
#define SEED    0x12345678

uint8_t buffer[LARGE * SMALL];
uint32_t hashes[LARGE];

typedef uint32_t (*hash_function) (const uint8_t*, size_t);

uint32_t DJB2_hash(const uint8_t* buf, size_t size) {
    uint32_t hash = 5381;
    for (size_t i = 0; i < size; i++)
        hash = ((hash << 5) + hash) + buf[i]; /* hash * 33 + byte */
    return hash;
}

uint32_t SDBM_hash(const uint8_t* buf, size_t size) {
    uint32_t hash = 0;
    for (size_t i = 0; i < size; i++)
        hash = (hash << 6) + (hash << 16) - hash + buf[i]; /* hash * 65599 + byte */
    return hash;
}

uint32_t FNV1a(const uint8_t* data, size_t size) {
    uint32_t h = 2166136261UL;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 16777619;
    }
    return h;
}

uint64_t FNV1a_64(const uint8_t* data, size_t size) {
    uint64_t h = 0xcbf29ce484222325UL;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 0x00000100000001B3UL;
    }
    return h;
}

uint32_t MurmurOAAT_32(const uint8_t* data, size_t size) {
    // One-byte-at-a-time hash based on Murmur's mix
    uint32_t h = SEED;
    for (size_t i = 0; i < size; i++) {
        h ^= data[i];
        h *= 0x5bd1e995;
        h ^= h >> 15;
    }
    return h;
}

uint32_t KR_v2_hash(const uint8_t* data, size_t size) {
    // a.k.a. Java String hashCode(), a.k.a. BKDR Hash Function
    uint32_t hashval = 0;
    for (size_t i = 0; i < size; i++)
        hashval = 31*hashval + data[i];
    return hashval;
}

uint32_t Jenkins_one_at_a_time_hash(const uint8_t* data, size_t size) {
    // a.k.a. simply "One-at-a-Time Hash"
    uint32_t hash = 0;
    for (size_t i = 0; i < size; i++) {
        hash += data[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

#define POLY_CRC32C 0x82f63b78
#define POLY_CRC32  0xedb88320
uint32_t CRC32b(const uint8_t *data, size_t size) {
    unsigned int crc = 0xFFFFFFFF, mask;
    for (size_t i = 0; i < size; i++) {
        crc ^= data[i];
        for (int j = 7; j >= 0; j--) {
            mask = -(crc & 1);
            crc = (crc >> 1) ^ (POLY_CRC32 & mask);
        }
    }
    return ~crc;
}

inline uint32_t _rotl32(uint32_t x, int32_t bits) {
    return x<<bits | x>>(32-bits);      // C idiom: will be optimized to a single CPU operation
}

uint32_t XOR_rotate(const uint8_t* data, size_t size) { 
    // Source: https://dev59.com/PWsz5IYBdhLWcg3wxq4V#7666668
    uint32_t result = 0x55555555;
    for (size_t i = 0; i < size; i++) {
        result ^= data[i];
        result = _rotl32(result, 5);
    }
    return result;
}

uint32_t Adler32_slow(const uint8_t* data, size_t len) {
    // This is a slow implementation of Adler32, but it doesn't matter because
    // it's a VERY BAD hash function anyway. Don't use it!
    const uint32_t MOD_ADLER = 65521;
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % MOD_ADLER;
        b = (b + a) % MOD_ADLER;
    }
    return (b << 16) | a;
}

uint32_t RSHash(const uint8_t* data, size_t size) {
    // Introduced by Robert Sedgewick in his "Algorithms in C" (1990) book.
    // Source: http://www.partow.net/programming/hashfunctions/index.html
    // RSHash seems to outperform JSHash, PJWHash (a.k.a. ELFHash) and APHash from the same source in all respects.
    uint32_t a = 63689, b = 378551;
    uint32_t hash = 0;
    for (uint32_t i = 0; i < size; i++) {
       hash = hash * a + data[i];
       a *= b;
    }
    return hash;
}

uint32_t DEKHash(const uint8_t* str, size_t length) {
    // Introduced by Donald E. Knuth in "The Art Of Computer Programming", Volume 3.
    // Source: http://www.partow.net/programming/hashfunctions/index.html
    unsigned int hash = length;
    for (size_t i = 0; i < length; i++)
       hash = ((hash << 5) ^ (hash >> 27)) ^ str[i];
    return hash;
}

uint32_t Fletcher32(const uint8_t *data_, size_t len) {
    // Note: when `len` (size in bytes) is odd, this function will read past the last byte.
    // However, for C-strings, the next byte will be zero, which is sufficient for this test.
    // In other cases, you have to do zero-padding.
    const uint16_t* data = (const uint16_t*) data_;
    uint32_t c0, c1;
    len = (len + 1) & ~1;      /* Round up len to words */

    /* We similarly solve for n > 0 and n * (n+1) / 2 * (2^16-1) < (2^32-1) here. */
    /* On modern computers, using a 64-bit c0/c1 could allow a group size of 23726746. */
    for (c0 = c1 = 0; len > 0; ) {
        size_t blocklen = len;
        if (blocklen > 360*2) {
            blocklen = 360*2;
        }
        len -= blocklen;
        do {
            c0 = c0 + *data++;
            c1 = c1 + c0;
        } while ((blocklen -= 2));
        c0 = c0 % 65535;
        c1 = c1 % 65535;
    }
    return (c1 << 16 | c0);
}

uint32_t Fletcher32_wrap(const uint8_t* data, size_t size) {
    if (size & 1) {
        // Add zero-padding for odd sizes. (This is, actually, unnecessary for
        // C-strings, because they are terminated by a nul-byte by definition.)
        uint8_t* buf = malloc(size + 1);
        memcpy(buf, data, size);
        buf[size] = 0;
        uint32_t res = Fletcher32(buf, size + 1);
        free(buf);
        return res;
    }
    return Fletcher32(data, size);
}

struct timeval stop, start;
void timing_start() {
    gettimeofday(&start, NULL);
}

void timing_end() {
    gettimeofday(&stop, NULL);
    printf("took %lu us, ", (stop.tv_sec - start.tv_sec) * 1000000 + stop.tv_usec - start.tv_usec);
}

#define ALGO_COUNT 12
hash_function get_hash_function(int algo) {
    hash_function res;
    switch (algo) {
        case 1:  printf("FNV1a                 "); res = FNV1a; break;
        case 2:  printf("MurmurOAAT            "); res = MurmurOAAT_32; break;
        case 3:  printf("RSHash                "); res = RSHash; break;
        case 4:  printf("SDBM                  "); res = SDBM_hash; break;
        case 5:  printf("Jenkins OAAT          "); res = Jenkins_one_at_a_time_hash; break;
        case 6:  printf("K&R v2                "); res = KR_v2_hash; break;
        case 7:  printf("DJB2                  "); res = DJB2_hash; break;
        case 8:  printf("XOR_rotate            "); res = XOR_rotate; break;
        case 9:  printf("DEKHash               "); res = DEKHash; break;
        case 10: printf("Fletcher32            "); res = Fletcher32_wrap; break;
        case 11: printf("Adler32               "); res = Adler32_slow; break;
        case 12: printf("CRC32 (bit-by-bit)    "); res = CRC32b; break;
        default: printf("ERROR: unknown algo \n"); exit(-1);
    }
    printf("\t");
    return res;
}

void fill_random(uint32_t* buffer, size_t count) {
    for (size_t i = 0; i < count; i++)
        buffer[i] = random();
}

void show_sequence() {
    printf("Generated random sequence:");
    for (int i = 0; i < SMALL; i++) {
        printf(" %02x", buffer[i]);
    }
    printf("\n");
}

void count_collisions(const uint32_t* hashes, size_t count) {
    size_t slots = count * 32;
    uint32_t* cnt_table = calloc(slots, sizeof(uint32_t));
    size_t collisions = 0;
    for (size_t i = 0; i < count; i++) {
        size_t j = hashes[i] % slots;
        while (cnt_table[j] != 0 && cnt_table[j] != hashes[i]) {
            j = (j + 1) % slots;
        }
        if (cnt_table[j] == 0) {
            cnt_table[j] = hashes[i];
        } else {
            collisions++;
        }
    }
    free(cnt_table);
    double share = 100.0*(double)collisions/count;
    printf("%lu collisions out of %lu, %0.2f\n", collisions, count, share);
}

void evaluate_avalanching(const uint32_t* hashes, size_t count) {
    // Looks at hashes of consecutive sorted strings and assesses their
    // similarity score. The lower the total similarity, the worse avalanching.
    uint32_t score = 0;
    char last[9], current[9];
    for (size_t i = 0; i < count; i++) {
        sprintf(current, "%08X", hashes[i]);
        if (i) {
            for (int j = 0; j < 8; j++)
                if (last[j] == current[j]) score++;
        }
        strcpy(last, current);
    }
    float similarity = 100.*score/count/8;
    float avalanching = similarity <= 6.25 ? 100 : (100 - similarity) / (100 - 6.25) * 100;
    printf("score: %u, similarity: %.2f, avalanching: %.1f\n", score, similarity, avalanching);
}

void calc_buffer_hashes(char* type, const uint8_t* buffer, size_t length, size_t count) {
    printf("Calculating hashes of %s buffers...\n", type);
    for (int algo = 1; algo <= ALGO_COUNT; algo++) {
        hash_function func = get_hash_function(algo);
        timing_start();
        for (size_t i = 0; i < count; i++) {
            hashes[i] = func(buffer + length*i, length);
        }
        timing_end();
        count_collisions(hashes, count);
    }
}

void calc_word_hashes(const char* fname, char* buffer) {
#define MAX_WORD_LENGTH 50
    printf("Calculating hashes of English words... (please, provide name of the file with words as first argument)\n");

    char line[MAX_WORD_LENGTH + 1];
    size_t word_count = 0;

    // Open file
    FILE* f = fopen(fname, "r");

    // Read file line by line
    while (fgets(line, sizeof(line), f)) {
        line[strcspn(line, "\n")] = '\0';   // strip newline
        strcpy(buffer + word_count*MAX_WORD_LENGTH, line);
        word_count++;
    }
    fclose(f);

    // Calculate hashes
    for (int algo = 1; algo <= ALGO_COUNT; algo++) {
        hash_function func = get_hash_function(algo);
        timing_start();
        for (size_t i = 0; i < word_count; i++) {
            char* word = buffer + MAX_WORD_LENGTH*i;
            size_t len = strlen(word);
            hashes[i] = func((uint8_t*)word, len);
        }
        timing_end();
        count_collisions(hashes, word_count);
        evaluate_avalanching(hashes, word_count);
    }
}

int main(int argc, char* argv[]) {
    srandom(SEED);
    fill_random((uint32_t*) buffer, LARGE * SMALL / sizeof(uint32_t));
    show_sequence();
    calc_buffer_hashes("small", buffer, SMALL, LARGE);
    calc_buffer_hashes("large", buffer, LARGE, SMALL);
    calc_word_hashes(argv[1], (char*)buffer);
    return 0;
}

如何编译和运行:
gcc -std=c99 -O2 minimal_hash.c -o minimal_hash
./minimal_hash words.txt

- 其中 words.txt 是 所有英文单词人名 的组合(共640143个唯一的排序字符串)。
- 在 此存储库 中查看完整代码,包括高性能函数。

1

xxhash 是一个相当快速和简单的选择。一个简单的代码可以使用 XXH32 函数:

unsigned int XXH32 (const void* input, int len, unsigned int seed);

这是一个32位的哈希值。由于“len”是“int”类型,对于超过2^31-1字节的大数据,请使用以下内容:
void*         XXH32_init   (unsigned int seed);
XXH_errorcode XXH32_update (void* state, const void* input, int len);
unsigned int  XXH32_digest (void* state);

1

对于长字符串,可以尝试使用Adler32算法,对于短字符串,可以使用Murmur2算法。


3
Adler32并不是一个很好的散列函数,实际上,作为一个散列函数,它甚至比CRC-32还要差。另一方面,Murmur2是一个分布很好且最坏情况下表现出色的非常快速的散列函数,因此没有理由将其用于短字符串。我并不真正理解你建议的基础。 - Steven Sudit
CRC-32作为哈希函数实际上还不错,只是速度较慢。但是Adler32真的很糟糕! - undefined

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接