Hash function in Spark SQL: different strings produce the same hash value

3
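I want to generate a different hash value for each email address, but I found that the same hash value is produced for different emails, for example: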
select hash('pipohecho@hotmail.com'),
       hash('rozas_huertas@hotmail.com'),
       hash('miguelilloooooooooouu@hotmail.com'),
       hash('rjdzpmsyi@hotmail.com'),
       hash('pepe@hotmail.com')


In these cases, hash('pipohecho@hotmail.com'), hash('rozas_huertas@hotmail.com'), hash('miguelilloooooooooouu@hotmail.com') and hash('rjdzpmsyi@hotmail.com') all produce the same hash value, -1517714944, so I have two questions:

  1. How is this possible?
  2. How can I generate a unique hash value for each email using Spark SQL?

Thanks

2 Answers

4

There is an article on hash collision probabilities that can be found here.
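
To get a rough feel for collision probabilities, the sketch below applies the standard birthday approximation to a 32-bit hash (the width of the integer that hash returns) and a 64-bit hash (the width of xxhash64). It assumes an ideal, uniformly distributed hash, which real functions only approximate, and the function name collision_probability is made up for this illustration.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value,
    assuming an ideal hash that is uniform over 2**hash_bits outputs."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

if __name__ == "__main__":
    for n in (1_000, 100_000, 10_000_000):
        p32 = collision_probability(n, 32)
        p64 = collision_probability(n, 64)
        print(f"{n:>10} items: 32-bit p={p32:.3g}, 64-bit p={p64:.3g}")
```

Under this assumption, a 32-bit hash is more likely than not to contain at least one collision once there are around 100,000 distinct values, while a 64-bit hash stays far below that for the same input size.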


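Try the xxhash64 (available since Spark 3), md5, or sha2 functions instead. Their outputs are wider (64, 128 and up to 512 bits), so collisions are far less likely.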

For example:

spark.sql("""select xxhash64('pipohecho@hotmail.com'),
       xxhash64('rozas_huertas@hotmail.com'),
       xxhash64('miguelilloooooooooouu@hotmail.com'),
       xxhash64('rjdzpmsyi@hotmail.com'),
       xxhash64('pepe@hotmail.com')""").show()

#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+
#|xxhash64(pipohecho@hotmail.com)|xxhash64(rozas_huertas@hotmail.com)|xxhash64(miguelilloooooooooouu@hotmail.com)|xxhash64(rjdzpmsyi@hotmail.com)|xxhash64(pepe@hotmail.com)|
#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+
#|6332927369894443419            |-8140372026824474906               |-9124920009896762502                       |1936246589584419991            |954028670536665140        |
#+-------------------------------+-----------------------------------+-------------------------------------------+-------------------------------+--------------------------+


spark.sql("""select md5('pipohecho@hotmail.com'),
       md5('rozas_huertas@hotmail.com'),
       md5('miguelilloooooooooouu@hotmail.com'),
       md5('rjdzpmsyi@hotmail.com'),
       md5('pepe@hotmail.com')""").show()

#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+
#|md5(CAST(pipohecho@hotmail.com AS BINARY))|md5(CAST(rozas_huertas@hotmail.com AS BINARY))|md5(CAST(miguelilloooooooooouu@hotmail.com AS BINARY))|md5(CAST(rjdzpmsyi@hotmail.com AS BINARY))|md5(CAST(pepe@hotmail.com AS BINARY))|
#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+
#|7ce30aa0209335873f79e64c2eb465ff          |9d58c495ab87f2e3a4a9adc6c8fbbb76              |c283a7c6f09712fc5ba4ea30334e2c25                      |6766da691171aa5c56a70b89bd4590fa          |ab888b1a15b420b410d23b927a370013     |
#+------------------------------------------+----------------------------------------------+------------------------------------------------------+------------------------------------------+-------------------------------------+


spark.sql("""select sha2('pipohecho@hotmail.com',256),
       sha2('rozas_huertas@hotmail.com',256),
       sha2('miguelilloooooooooouu@hotmail.com',256),
       sha2('rjdzpmsyi@hotmail.com',256),
       sha2('pepe@hotmail.com',256)""").show()

#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
#|sha2(CAST(pipohecho@hotmail.com AS BINARY), 256)                |sha2(CAST(rozas_huertas@hotmail.com AS BINARY), 256)            |sha2(CAST(miguelilloooooooooouu@hotmail.com AS BINARY), 256)    |sha2(CAST(rjdzpmsyi@hotmail.com AS BINARY), 256)                |sha2(CAST(pepe@hotmail.com AS BINARY), 256)                     |
#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+
#|02068bc029cd26888a4ba630ecfa91b4afc2bf72c4adeabcfcd32459529c61bb|391af34e53d82ce8f12a1396d5ae74d96f3ea583cf3fd864816b29586ed002f8|fde18d7d27497717a8a77a0eace29ad5dbcb7319637be033c3e66a068a2bd983|b07300bee7e68326143c40f75b608201f5db667a18bb73b63f9f909454521753|921efc4884d3c8a32899c079024386641564ec0d0966cc059429bbd33770e421|
#+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+----------------------------------------------------------------+

4

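The hash function in Spark is currently implemented with MurmurHash, more specifically MurmurHash3. Both MurmurHash and the xxhash64 function available in Spark 3.0.0+ are non-cryptographic hash functions, which means they were not specifically designed to be hard to invert or to be free of collisions. Both MurmurHash and xxHash are designed to be fast while providing a good-enough distribution of hash values for hash-based lookups. A typical use of such hash functions is implementing hash tables, where keys are mapped to buckets and each bucket holds a linked list of key/value pairs (kvp). In that setting, collisions are not fatal: they only result in longer kvp lists that take more time to traverse. A thorough cryptanalysis of MurmurHash exists.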
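
The bucket-and-chain idea described above can be sketched as a toy hash table. This is only an illustration: the class name ChainedHashTable is hypothetical, it uses Python's built-in hash() rather than MurmurHash, and it is not how Spark stores data internally.

```python
class ChainedHashTable:
    """Toy hash table: each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Map the key's hash to one of the buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # new key (possibly a collision)

    def get(self, key):
        # Colliding keys share a bucket, so walk the chain and compare keys.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)


if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=1)  # one bucket forces collisions
    table.put("pepe@hotmail.com", 1)
    table.put("rjdzpmsyi@hotmail.com", 2)
    print(table.get("pepe@hotmail.com"), table.get("rjdzpmsyi@hotmail.com"))
```

With a single bucket every key collides, yet lookups still return the right value because the full key is compared inside the chain. This is why collisions cost only time in a hash table, whereas using the hash value itself as an identifier has no such safety net.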

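As @notNull suggests, you should use a cryptographic hash function such as SHA-2 or MD5. If the hashes are stored somewhere, avoid MD5 and salt the emails before hashing, using a fixed but randomly chosen (e.g., during deployment) salt: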

select sha2(concat('39u!6fgs3#', 'email@domain.com'), 256)
--       fixed salt ---^^^     value ---^^^

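Salting the values before hashing makes it hard, or even impossible, for someone who does not know the salt to reverse the hashes by brute force if the hashes leak.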
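
For checking values outside Spark, the same salted hash can be sketched in plain Python with hashlib. The helper name salted_sha256 is hypothetical, and the sketch assumes that Spark's sha2(..., 256) hashes the UTF-8 bytes of the string and returns a lowercase hex digest, so the results should line up with the SQL above; that is worth verifying against your own Spark version.

```python
import hashlib

def salted_sha256(email: str, salt: str) -> str:
    """Hex SHA-256 of salt + email, mirroring sha2(concat(salt, email), 256)."""
    # Prepend the salt, encode as UTF-8, and return the 64-character hex digest.
    return hashlib.sha256((salt + email).encode("utf-8")).hexdigest()

if __name__ == "__main__":
    salt = "39u!6fgs3#"  # the example salt from the answer; use your own secret
    for email in ("pipohecho@hotmail.com", "pepe@hotmail.com"):
        print(email, salted_sha256(email, salt))
```

The salt has to stay the same for every row so that equal emails still map to equal hashes, and it should be kept out of the stored data.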

