The best academic resource I've found on this topic is chapter 3 of Mining of Massive Datasets, which gives an excellent overview of locality-sensitive hashing and minhashing.
In a nutshell, the idea is to vectorize several strings and then slide a window over the resulting vectors. If two vectors have the same value at the same window position, mark them as candidates for finer-grained similarity analysis.
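The windowing idea above can be sketched in plain Python (a minimal illustration of the banding trick, not how datasketch implements it internally; the `candidate_pairs` helper and the toy signature vectors are my own):

```python
# Toy sketch of LSH banding: split each signature vector into fixed-size
# bands and bucket signatures by (band position, band contents). Any two
# signatures that share a bucket become a candidate pair.

def candidate_pairs(signatures, band_size):
    """signatures: dict of name -> list of ints (a MinHash-like vector)."""
    buckets = {}
    for name, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            key = (start, tuple(sig[start:start + band_size]))
            buckets.setdefault(key, set()).add(name)
    pairs = set()
    for names in buckets.values():
        for a in names:
            for b in names:
                if a < b:
                    pairs.add((a, b))
    return pairs

sigs = {
    'a': [1, 2, 3, 4],
    'b': [1, 2, 9, 9],  # shares the first band with 'a'
    'c': [7, 8, 9, 9],  # shares the second band with 'b'
}
print(candidate_pairs(sigs, band_size=2))  # {('a', 'b'), ('b', 'c')}
```

Note that 'a' and 'c' never collide in any band, so they are never compared; that pruning is the whole point of LSH.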
There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example showing how you can capture fuzzy string similarity:
from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
        'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
        'weights controls the relative importance between minizing false positive',
        'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
       ]

lsh = MinHashLSH(threshold=0.4, num_perm=128)

minhashes = {}
for c, i in enumerate(data):
    # Build a MinHash signature from the string's character 3-grams
    minhash = MinHash(num_perm=128)
    for d in ngrams(i, 3):
        minhash.update("".join(d).encode('utf-8'))
    lsh.insert(c, minhash)
    minhashes[c] = minhash

for i in range(len(minhashes)):
    result = lsh.query(minhashes[i])
    print("Candidates with Jaccard similarity > 0.4 for input", i, ":", result)
This returns:
Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]
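Keep in mind LSH only produces candidates; a follow-up exact check is how you confirm them. Here's a sketch of verifying a candidate pair with true Jaccard similarity over character 3-gram sets (the `jaccard` helper is my own, not part of datasketch):

```python
# Exact Jaccard similarity between two strings' character n-gram sets,
# used to verify the candidate pairs that LSH proposes.

def jaccard(s1, s2, n=3):
    g1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    g2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    return len(g1 & g2) / len(g1 | g2)

a = 'minhash is a probabilistic data structure for estimating the similarity between datasets'
b = 'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets'
print(jaccard(a, b))
```

The MinHash signatures only estimate this value, so the exact number may differ slightly from what the threshold-based query implies.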