Ruby中字典字符串的快速模糊/近似搜索

Question

Ruby中字典字符串的快速模糊/近似搜索

rubyperformancealgorithmlevenshtein-distancefuzzy-search

8

我有一个包含50K至100K个字符串的字典（长度可达到50+个字符），我想找出给定的字符串是否在字典中，还要考虑一些“编辑”距离。例如Levenshtein距离。在搜索之前，我可以预先计算任何类型的数据结构。

我的目标是尽可能快地对那个字典运行成千上万的字符串，并返回最接近的邻居。如果有显着更快的算法来判断给定的字符串是否在字典中，则只需获得一个布尔值即可。

为此，我首先尝试计算所有的Levenshtein距离并取最小值，但这显然非常慢。因此，我尝试实现了基于本文 http://stevehanov.ca/blog/index.php?id=114 的Levenshtein Trie。

请查看我的 gist 进行重现基准测试: https://gist.github.com/nicolasmeunier/7493947 以下是我在我的机器上得到的一些基准测试结果: Edit distance of 0 (perfect match)

Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801>

编辑距离为2时，速度会变得非常慢。

Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>

从2次编辑距离开始，速度会急剧下降（每个测试字符串平均需要1秒以上）。

我想知道如何/是否可以显着加快这个过程。如果已经在ruby/gems中实现了现有的解决方案，我也不想重复造轮子...

编辑1：在我的情况下，我希望大多数我匹配字典的字符串不在其中。因此，如果有任何算法可以快速丢弃一个字符串，那确实可以帮上大忙。

谢谢， Nicolas

- Nicolas M.

Nicolas，我犹豫是否要发表评论，因为我不熟悉这方面的文献，也不想偏离讨论主题，但我想起了很久以前遇到的一个问题，那就是如何确定一个文件是否与远程计算机上的大型“库”中的任何文件相同。我使用“文件签名”对文件进行索引，签名是一种CRC（循环冗余校验）计算的结果（一个Fixnum，但我不记得有多少字节）。它非常快速且100％准确。想法是为每个字符串计算一个签名，然后在索引中查找该签名。 - Cary Swoveland

你是在寻找完全相同的文件还是允许有一定的误差？这是你可以控制的吗？我会进一步研究。感谢你的建议。 - Nicolas M.

你试过这个和这个了吗？ - mdesantis

第一个似乎只计算两个单词之间的距离，并没有实现任何索引/树/ ngrams 数据结构来优化在字典中搜索。我尝试使用第二个时速度相当慢... - Nicolas M.

你能展示一下你的字典吗？ - fl00r

显示剩余2条评论

4个回答

5

我编写了一对宝石，fuzzily和blurrily，它们基于三元组进行模糊匹配。考虑到您的（低）数据量，Fuzzily将更容易集成，并且速度大约相同，在现代硬件上，您可以在5-10ms内获得答案。

考虑到两者都是基于三元组的（可索引的），而不是基于编辑距离的（不可索引的），您可能需要分两个步骤执行：

第一步，向其中一个宝石询问最佳匹配三元组的一组结果；

然后，使用Levenstein将结果与输入字符串进行比较；

并返回该度量的最小值。

在Ruby中（正如您所要求的），使用Fuzzily + Text gem，获取符合编辑距离阈值的记录如下：

MyRecords.find_by_fuzzy_name(input_string).select { |result|
  Text::Levenshtein.distance(input_string, result.name)] < my_distance_threshold
}

这将执行一些经过优化的数据库查询和几个操作。

注意事项：

- 如果您正在寻找的“最小”编辑距离很高，那么仍会执行大量的Levenshteins。 - 使用trigrams假定您的输入文本是拉丁文或接近拉丁文（基本上是欧洲语言）。 - 由于没有保证“匹配trigram的数量”是“编辑距离”的良好一般近似，因此可能存在一些边缘情况。

- mezis

我目前正在使用Blurrily进行这项工作，因此我具有内存索引，并使用levenshtein-ffi gem（C实现-比我自己的Ruby实现快4倍），并且在我的数据上获得1-2ms的请求，而我在纯Ruby实现中则需要50-200ms。我完全可以接受这些限制，所以这对我来说是完美的解决方案。非常感谢这些惊人的gems！ - Nicolas M.

Nicolas，非常感谢您。如果您发现任何错误，请打开Github问题，我会尽力保持它们的维护！ - mezis

3

如果你准备涉足机器学习方法，那么Geoff Hinton的这篇论文将是一个很好的起点。

http://www.cs.toronto.edu/~hinton/absps/sh.pdf

这种方法在类似Google等地方使用。

基本上，您根据相似性对字典字符串进行聚类。当查询字符串出现时，而不是针对整个数据集计算编辑距离，您只需比较该群集，从而显着减少查询时间。

附言：我做了一些谷歌搜索，在这里找到了另一种类似方法的Ruby实现，称为局部敏感哈希。

- dopplesoldner

2

这是一个原始的 Trie 实现，完全没有优化，只是一个概念证明。这是一个纯 Ruby 实现。

为了测试它，我从这里获取了 100,000 个单词：http://www.infochimps.com/datasets/word-list-100000-official-crossword-words-excel-readable/downloads/195488 以下是它的 gist：https://gist.github.com/fl00r/7542994

class TrieDict
  attr_reader :dict

  def initialize
    @dict = {}
  end

  def put(str)
    d = nil
    str.chars.each do |c|
      d && (d = (d[1][c] ||= [nil, {}])) || d = (@dict[c] ||= [nil, {}])
    end
    d[0] = true
  end

  def fetch(prefix, fuzzy = 0)
    storage = []
    str = ""
    error = 0
    recur_fetch(prefix, fuzzy, @dict, storage, str, error)
    storage
  end

  def recur_fetch(prefix, fuzzy, dict, storage, str, error)
    dict.each do |k, vals|
      e = error
      if prefix[0] != k
        e += 1
        next  if e > fuzzy
      end
      s = str + k
      storage << s  if vals[0] && (prefix.size - 1) <= (fuzzy - e)
      recur_fetch(prefix[1..-1] || "", fuzzy, vals[1], storage, s, e)
    end
  end
end

def bench
  t = Time.now.to_f
  res = nil
  10.times{ res = yield }
  e = Time.now.to_f - t
  puts "Elapsed for 10 times: #{e}"
  puts "Result: #{res}"
end

trie = TrieDict.new
File.read("/home/petr/code/tmp/words.txt").each_line do |word|
  trie.put(word.strip)
end; :ok
# Elapsed for 10 times: 0.0006465911865234375
# Result: ["hello"]
bench{ trie.fetch "hello", 1 }
# Elapsed for 10 times: 0.013643741607666016
# Result: ["cello", "hallo", "helio", "hell", "hello", "hellos", "hells", "hillo", "hollo", "hullo"]
bench{ trie.fetch "hello", 2 }
# Elapsed for 10 times: 0.08267641067504883
# Result: ["bell", "belle", "bellow", "bells", "belly", "cell", "cella", "celli", "cello", "cellos", "cells", "dell", "dells", "delly", "fell", "fella", "felloe", "fellow", "fells", "felly", "hall", "hallo", "halloa", "halloo", "hallos", "hallot", "hallow", "halls", "heal", "heals", "heel", "heels", "heil", "heils", "held", "helio", "helios", "helix", "hell", "helled", "heller", "hello", "helloed", "helloes", "hellos", "hells", "helm", "helms", "helot", "help", "helps", "helve", "herl", "herls", "hill", "hillo", "hilloa", "hillos", "hills", "hilly", "holla", "hollo", "holloa", "holloo", "hollos", "hollow", "holly", "hull", "hullo", "hulloa", "hullos", "hulls", "jell", "jells", "jelly", "mell", "mellow", "mells", "sell", "selle", "sells", "tell", "tells", "telly", "well", "wells", "yell", "yellow", "yells"]
bench{ trie.fetch "engineer", 2 }
# Elapsed for 10 times: 0.04654884338378906
# Result: ["engender", "engine", "engined", "engineer", "engineered", "engineers", "enginery", "engines"]
bench{ trie.fetch "engeneer", 1 }
# Elapsed for 10 times: 0.005484580993652344
# Result: ["engender", "engineer"]

- fl00r

有趣！比我的实现还快。您介意解释一下底层数据结构，特别是这个疯狂的put方法是做什么用的吗？ - Nicolas M.

底层数据结构是一种Trie。Put方法将字符放入其父节点的节点中。节点数组的第一个项目（布尔值）是一个标志，表示这是完整单词（从顶部到此节点）。 - fl00r

1

它可以被重写为一个带有C扩展的gem，以获得更好的性能（我相信可以提高2-5倍）。 - fl00r

阅读此内容的人请注意：该算法的准确性并不是很高。如果您尝试 trie.fetch("ello", 1)，它将不会返回"hello"，因为它没有考虑插入和删除成本，只有替换。它速度快，但在寻找可能出现实际拼写错误的相似单词时并不是非常实用。 - Nicolas M.

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- olegarch · Accepted Answer

大约15年前，我写了模糊搜索算法，可以找到N个最接近的邻居。这是对Wilbur三元算法的修改，我将其命名为“Wilbur-Khovayko算法”。

基本思路：通过三元组拆分字符串，搜索最大交集分数。

例如，我们有字符串“hello world”。该字符串生成三元组：hel ell llo“lo”，“o_w”等，并为每个单词生成特殊的前缀/后缀三元组，如$he $ wo lo$ ld$。

然后，为每个三元组构建索引，在其中出现的术语。

因此，这是每个三元组的term_ID列表。

当用户调用某些字符串时，它也会拆分为三元组，并且程序会搜索最大交集分数并生成N大小的列表。

它非常快：我记得在旧的Sun / Solaris上，256MB内存，200MHZ CPU中，在0.25秒内搜索字典中5,000,000个项目中的100个最接近的项。

您可以从以下链接获取我的旧源代码： http://olegh.ftp.sh/wilbur-khovayko.tar.gz 更新：

我创建了新的档案，其中包含适用于现代Linux / BSD make的Makefile。您可以在此处下载新版本：http://olegh.ftp.sh/wilbur-khovayko.tgz 创建一个目录，并在此处提取存档。

mkdir F2
cd F2
tar xvfz wilbur-khovayko.tgz
make

进入测试目录，复制术语列表文件（固定名称为termlist.txt），并创建索引：

 cd test/
 cp /tmp/test/termlist.txt ./termlist.txt
 ./crefdb.exe <termlist.txt

在这个测试中，我使用了约380,000个过期的域名：

wc -l termlist.txt
379430 termlist.txt

运行 findtest 应用程序：

./findtest.exe

boking  <-- this is query -- word "booking" with misspeling


0001:Query: [boking]
  1:  287890 (  3.863739) [bokintheusa.com,2009-11-20,$69]
  2:  287906 (  3.569148) [bookingseu.com,2009-11-20,$69]
  3:  257170 (  3.565942) [bokitko.com,2009-11-18,$69]
  4:  302830 (  3.413791) [bookingcenters.com,2009-11-21,$69]
  5:  274658 (  3.408325) [bookingsadept.com,2009-11-19,$69]
  6:  100438 (  3.379371) [bookingresorts.com,2009-11-09,$69]
  7:  203401 (  3.363858) [bookinginternet.com,2009-11-15,$69]
  8:  221222 (  3.361689) [bobokiosk.com,2009-11-16,$69]
  . . . . 
 97:   29035 (  2.169753) [buccupbooking.com,2009-11-05,$69]
 98:  185692 (  2.169047) [box-hosting.net,2009-11-14,$69]
 99:  345394 (  2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
100:  150134 (  2.167372) [bowlingbrain.com,2009-11-12,$69]