使用局部敏感哈希的近似字符串匹配

Question

使用局部敏感哈希的近似字符串匹配

pythonstringhashlocality-sensitive-hash

27

我希望能够使用局部敏感哈希算法近似匹配字符串。我有许多含有错别字的字符串，超过10M。对于每个字符串，我想要与所有其他字符串进行比较，并选择那些编辑距离在某个阈值范围内的字符串。

也就是说，朴素的解决方案需要O(n^2)次比较。为了避免这个问题，我考虑使用局部敏感哈希算法。这样，相似的字符串将被分配到同一个哈希桶中，我只需要在每个桶内进行搜索，因此时间复杂度为O(n*C)，其中C是桶的大小。

然而，我不知道如何表示这些字符串。如果是文本，我会用向量空间来表示。我的主要问题是是否可以使用LSH以及适当的字符串向量表示来解决这个问题。

我能否使用已经实现好的库来完成这个任务？还是它取决于我的问题，所以我必须自己实现呢？是否有任何Python包可以做到这一点？

- nikosdi

5

我不明白这个问题。你说如果是“文本”，你知道该怎么做，但是你不知道如何处理字符串？在你的想法中，“文本”和“字符串”的相关区别是什么，导致这个问题无法解决？ - abarnert

1

我可以使用向量空间模型来表示文本。对于字符串而言，我没有出现次数的向量，但是有一组ASCII代码的向量，代表着不同的东西。 - nikosdi

1

我看到了问题。我熟悉的 LSH 版本（早期版本）假定欧几里得度量。对于字符串，您需要使用字符串度量...这是行不通的。您需要一个非欧几里得空间的 LSH。 - U Avalos

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- duhaime · Accepted Answer

我发现关于这个主题最好的学术资源是《大规模数据挖掘》的第三章，它提供了局部敏感哈希和最小哈希的绝佳概述。

简单来说，这个想法是将几个字符串向量化，然后在生成的向量上通过滑动窗口。如果两个向量在相同的窗口位置具有相同的值，则将它们标记为更精细的相似性分析的候选项。

Python datasketch库中有一个很好的实现(pip install datasketch)。下面是一个示例，可以展示您如何捕获模糊的字符串相似性：

from datasketch import MinHash, MinHashLSH
from nltk import ngrams

data = ['minhash is a probabilistic data structure for estimating the similarity between datasets',
  'finhash dis fa frobabilistic fata ftructure for festimating the fimilarity fetween fatasets',
  'weights controls the relative importance between minizing false positive',
  'wfights cfntrols the rflative ifportance befween minizing fflse posftive',
]

# Create an MinHashLSH index optimized for Jaccard threshold 0.5,
# that accepts MinHash objects with 128 permutations functions
lsh = MinHashLSH(threshold=0.4, num_perm=128)

# Create MinHash objects
minhashes = {}
for c, i in enumerate(data):
  minhash = MinHash(num_perm=128)
  for d in ngrams(i, 3):
    minhash.update("".join(d).encode('utf-8'))
  lsh.insert(c, minhash)
  minhashes[c] = minhash

for i in xrange(len(minhashes.keys())):
  result = lsh.query(minhashes[i])
  print "Candidates with Jaccard similarity > 0.4 for input", i, ":", result

这将返回：

Candidates with Jaccard similarity > 0.4 for input 0 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 1 : [0, 1]
Candidates with Jaccard similarity > 0.4 for input 2 : [2, 3]
Candidates with Jaccard similarity > 0.4 for input 3 : [2, 3]