What are max_expansions and min_similarity in Elasticsearch fuzzy matching?

Question

elasticsearch fuzzy-search fuzzy-logic fuzzy-comparison

18

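I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same name. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.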

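As I understand it, min_similarity is the percentage by which the query string must match a string in the database. I couldn't find an exact description of how this value is calculated.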

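As I understand it, max_expansions is the Levenshtein distance within which the search is executed. If it really were the Levenshtein distance, it would be the ideal solution for me. However, it doesn't behave that way. For example, suppose I have the word "Samvel" indexed: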

queryStr      max_expansions           matches?
samvel        0                        Should not be 0. error (but Levenshtein distance can be 0!)
samvel        1                        Yes
samvvel       1                        Yes
samvvell      1                        Yes (but it shouldn't have)
samvelll      1                        Yes (but it shouldn't have)
saamvelll     1                        No (but for some weird reason it matches with Samvelian)
saamvelll     anything bigger than 1   No

The documentation says something I don't really understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

Can someone explain how exactly these parameters affect the search results?

- Yervand Aghababyan

1 Answer

- DrTech · Accepted Answer

min_similarity is a value between zero and one. From the Lucene docs:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The "edit distance" referred to here is the Levenshtein distance.
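To make the quoted rule concrete, here is a minimal Python sketch. It assumes the similarity is computed as 1 - distance / min(len(query), len(term)); that agrees with the quoted sentence for terms of equal length, but for other cases it is an assumption. The function names are made up for illustration and are not Lucene or Elasticsearch APIs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(query: str, term: str) -> float:
    # Assumed formula: 1 - distance / length of the shorter string.
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def is_similar(query: str, term: str, min_similarity: float = 0.5) -> bool:
    # Strict comparison, mirroring "edit distance ... less than length * 0.5".
    return similarity(query, term) > min_similarity
```

Under this model, "samvvell" is at distance 2 from "samvel", which gives a similarity of about 0.67. With a min_similarity of 0.5 that would count as a match whatever max_expansions is set to, which may be why some rows in the question's table matched unexpectedly.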

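The way this query works internally is:

It finds all terms that exist in the index and could match the search term, taking min_similarity into account.
Then it searches for all of those terms.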

You can imagine how heavy this query can be!

To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
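The two steps and the cap can be sketched as a toy model in Python. This is not how Lucene is implemented; the `expand_terms` helper, the sample term list, the similarity formula, and the choice to keep the closest terms first are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def expand_terms(query, index_terms, min_similarity=0.5, max_expansions=None):
    """Hypothetical helper: step 1 of the fuzzy query, capped by max_expansions."""
    scored = []
    for term in index_terms:
        shortest = min(len(query), len(term))
        if shortest == 0:
            continue
        # Assumed similarity: 1 - distance / length of the shorter string.
        sim = 1.0 - levenshtein(query, term) / shortest
        if sim > min_similarity:
            scored.append((sim, term))
    # Closest terms first; ties broken alphabetically (a choice of this sketch).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    terms = [term for _, term in scored]
    # max_expansions limits how many terms are searched, not the edit distance.
    return terms if max_expansions is None else terms[:max_expansions]


# Made-up index vocabulary for the example.
index_terms = ["samvel", "samuel", "samvelian", "sample", "daniel"]

unbounded = expand_terms("samvel", index_terms, min_similarity=0.4)
capped = expand_terms("samvel", index_terms, min_similarity=0.4, max_expansions=1)
```

In this sketch the uncapped call returns every term above the threshold, while `max_expansions=1` keeps only a single term to search. That is why changing max_expansions alters which documents come back without changing the allowed edit distance.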