Hashing an array of integers

Question


c# arrays hash integer

9

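I am using a hash set that stores arrays of (32-bit) integers. That means I need an algorithm to hash an array of integers, one that produces a 32-bit integer (C# int) hash.

I have tried and edited two existing algorithms; four versions of them, including their benchmarks, are at the bottom.

My questions are as follows:

1. Do you think the algorithms at the bottom are good for this purpose?

2. Is there a better algorithm available for this purpose?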

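Program information

Typically an array has 16 entries and the integers are smaller than 10, although both must support larger values. The largest values with any realistic chance of occurring are 200 entries and integers of value 20.
I am using the HashSet in a breadth-first search algorithm to check whether two nodes are the same. http://en.wikipedia.org/wiki/Breadth-first_search.
For this specific program I cannot use unsafe code.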
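
For readers who want to see how such a hash plugs into a HashSet, here is a minimal sketch. It assumes plain int[] nodes rather than the Coordinates2D arrays used in the benchmarks, and the names IntArrayComparer and VisitedDemo are made up for illustration. It uses only safe code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two int[] are equal when their contents match.
public sealed class IntArrayComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] a, int[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; ++i)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public int GetHashCode(int[] values)
    {
        unchecked // let the multiplication wrap around instead of throwing
        {
            int hash = 17;
            for (int i = 0; i < values.Length; ++i)
            {
                hash = hash * 31 + values[i];
            }
            return hash;
        }
    }
}

public static class VisitedDemo
{
    // Counts how many distinct nodes (by content) appear in the sequence,
    // the way a breadth-first search would track visited states.
    public static int CountDistinct(IEnumerable<int[]> nodes)
    {
        var visited = new HashSet<int[]>(new IntArrayComparer());
        int count = 0;
        foreach (int[] node in nodes)
        {
            if (visited.Add(node)) ++count; // Add returns false for a repeat
        }
        return count;
    }
}
```

Without a comparer like this, HashSet<int[]> compares arrays by reference, so two nodes with identical contents would both be stored.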

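Benchmarks and code

Below are my benchmarks and code, ordered from worst to best performance in my program.

Coordinates2D is a struct containing an int x and an int y.
The total number of entries in the HashSet at the end of the run is 356525.
I was unable to determine the number of collisions exactly. The number given is the number of times objects were actually compared and found unequal (same hash, different object). This can happen multiple times between the same objects, though. The value varies a bit per execution because the program is multithreaded.
The MurMurHash3 seed is const uint seed = 144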

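MurMurHash3 using bytes retrieved directly from the coordinates

The code is the same as https://gist.github.com/automatonic/3725443. The byte array is retrieved using the following code: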

int size = Marshal.SizeOf(typeof(Coordinates2D));
int length = carCoords.Length;
Byte[] bytes = new Byte[size * length];
for (int i = 0; i < length; ++i)
{
    GCHandle pinStructure = GCHandle.Alloc(carCoords[i], GCHandleType.Pinned);
    Marshal.Copy(pinStructure.AddrOfPinnedObject(), bytes, i*size, size);
    pinStructure.Free();
}

// Hash the byte array
return MurMurHash3.Hash(new System.IO.MemoryStream(bytes));

This is very inefficient because of the copy.

Performance: 40880 ms
Collisions: < 84

MurMurHash3 using bytes retrieved from the integers in the objects

public static int Hash2(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        // Do it for X
        byte[] chunk = BitConverter.GetBytes(coords[i].x);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;


        // Do it for y
        chunk = BitConverter.GetBytes(coords[i].y);

        /* Get four bytes from the input into an uint */
        k1 = (uint)
           (chunk[0]
          | chunk[1] << 8
          | chunk[2] << 16
          | chunk[3] << 24);

        /* bitmagic hash */
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}

This is a lot more efficient now that the copy is gone.

Performance: 16640 ms
Collisions: < 92

MurMurHash3 using integers

public static int Hash(RushHourPathLengthNode.Coordinates2D[] coords)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;

    uint h1 = seed;
    uint k1 = 0;
    uint streamLength = (uint)coords.Length * 2;

    for (int i = 0, l = coords.Length; i < l; ++i)
    {
        k1 = (uint)coords[i].x;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;

        k1 = (uint)coords[i].y;

        //bitmagic hash
        k1 *= c1;
        k1 = rotl32(k1, 15);
        k1 *= c2;

        h1 ^= k1;
        h1 = rotl32(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // finalization, magic chants to wrap it all up
    h1 ^= streamLength;
    h1 = fmix(h1);

    unchecked //ignore overflow
    {
        return (int)h1;
    }
}
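
The versions above call rotl32 and fmix, which are not shown here (they come with the linked gist). As a rough sketch, these are the standard MurmurHash3 32-bit rotate and finalizer steps. The class name MurmurHelpers and the int type of the rotation count are assumptions made for this illustration, not copied from the gist.

```csharp
public static class MurmurHelpers
{
    // Rotate the 32-bit value x left by r bits (intended for 0 < r < 32).
    public static uint rotl32(uint x, int r)
    {
        return (x << r) | (x >> (32 - r));
    }

    // MurmurHash3 32-bit finalization mix: spreads the influence of
    // every input bit across the whole result.
    public static uint fmix(uint h)
    {
        unchecked // the multiplications are meant to wrap around
        {
            h ^= h >> 16;
            h *= 0x85ebca6b;
            h ^= h >> 13;
            h *= 0xc2b2ae35;
            h ^= h >> 16;
            return h;
        }
    }
}
```

Every step of fmix is reversible on 32-bit values, so distinct inputs produce distinct outputs; in particular fmix(0) is 0.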

Performance: 13027 ms
Collisions: < 95

Hash using integer addition and multiplication

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 31 + carCoords[i].x;
    hash = hash * 31 + carCoords[i].y;
}
return hash;

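Performance: 4564 ms
Collisions: < 44

As you can see, this one is far more efficient. It works well with any prime number. As far as I understand, there is no scientific proof that this works, which I do not like very much.

According to Michal B., a version using bit shifting might be faster. However, testing shows that this is not a successful hash: the problem takes much longer to run (it did not finish within 5 minutes). Bit shifting might be fine, but the 31 (a prime number) seems to be crucial.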

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash << 5 - carCoords[i].x;
    hash = hash << 5 - carCoords[i].y;
}
return hash;
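
One factor that may contribute to the poor result, offered as an observation rather than a reproduced benchmark: in C#, subtraction binds more tightly than the shift operator, so hash << 5 - x is evaluated as hash << (5 - x) and not as (hash << 5) - x. The sketch below contrasts the expression as written with a shift-based form of multiplying by 31. The class and method names are made up for illustration.

```csharp
public static class ShiftDemo
{
    // hash * 31 written with a shift: (hash << 5) is hash * 32,
    // and subtracting hash once leaves hash * 31.
    public static int Times31(int hash)
    {
        unchecked { return (hash << 5) - hash; }
    }

    // The expression from the snippet above. The subtraction runs first,
    // so x only changes the shift count and is never mixed into the bits.
    public static int AsWritten(int hash, int x)
    {
        return hash << 5 - x;
    }

    // One step of the multiply-by-31 hash using the shift form.
    public static int Step(int hash, int x)
    {
        unchecked { return (hash << 5) - hash + x; }
    }
}
```

Under these definitions, Step(17, 3) equals 17 * 31 + 3, while AsWritten(17, 5) is simply 17, because the shift count becomes 0.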

- Aart Stuurman

1

I guess you don't want to use unsafe to treat the int[] as a byte[]? - rene

For this specific program I cannot use unsafe code. Thanks for the suggestion, though. - Aart Stuurman

2

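I think that is adequate for a hash code. It might not be good enough for a CRC check, but for a hash code you only need a reasonable distribution, and this provides one. See also this thread: https://dev59.com/c3A75IYBdhLWcg3wPmZ8 - Matthew Watson

Thank you. I had already read that thread. Doc Brown's answer is similar to what I provide here. I cannot say anything about whether the algorithm BlueMonkMN provided is any good. Maybe you can? - Aart Stuurman

What are the maximum values of X and Y? - Fredou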


2 Answers

3

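Have you considered using a space-filling curve to generate the hash? This would minimize (or eliminate) collisions for a chosen resolution (maxX, maxY).

Here are two SO questions, with answers, that use this approach.

Hope this helps!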
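
To make the idea concrete, here is a minimal sketch of one space-filling curve, the Z-order (Morton) curve, which interleaves the bits of x and y. It assumes both coordinates fit in 16 bits; the class and method names are made up for illustration and are not taken from the linked questions.

```csharp
public static class ZOrder
{
    // Spread the low 16 bits of v so they occupy the even bit positions
    // of the result (bit i moves to bit 2 * i).
    private static uint Part1By1(uint v)
    {
        v &= 0x0000ffff;
        v = (v | (v << 8)) & 0x00ff00ff;
        v = (v | (v << 4)) & 0x0f0f0f0f;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Interleave x (even bits) and y (odd bits) into one Morton code.
    public static uint Encode(ushort x, ushort y)
    {
        return Part1By1(x) | (Part1By1(y) << 1);
    }
}
```

For 16-bit inputs the mapping is one-to-one, so a single (x, y) pair never collides. It produces one value per coordinate, though, so an array of coordinates still has to be combined into a single hash afterwards.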

- Ani

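This looks very promising. I will look into it. - Aart Stuurman

I have tried this, as well as the Z-order curve. Both are nice, but they do not apply to this problem. Mapping a position to an int works, but that still leaves me with an array of integers. I could repeat the process or do something similar, but in the end MurMurHash3 is faster. - Aart Stuurman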


- Aart Stuurman · Accepted Answer

In the end I went with the last algorithm.

int hash = 17;
for (int i = 0, l = carCoords.Length; i < l; ++i)
{
    hash = hash * 19 + carCoords[i].x;
    hash = hash * 19 + carCoords[i].y;
}
return hash;

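This is very fast to compute, and for the (small) numbers I am using, the hash works very well.

If you are going to use this, make sure the numbers you use are prime. Because of this, you cannot use bit shifting to optimize it.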