Elasticsearch: get the best non-exact match

Question

elasticsearch

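Using elasticsearch-dsl, I am trying to find the closest matches to a company name while excluding exact matches.

For example, I want to search for names similar to "Greater London Authority (GLA)", but I want every exact match to be filtered out or scored significantly lower.

To clarify: I know the string "Greater London Authority" exists in my index, and I want it returned as a better result than the original string (which is also in the index).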

Currently I have:

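from elasticsearch_dsl import Q, Search

# es (an Elasticsearch client) and entity_name are assumed to be defined earlier.
# Exclude every document whose buyer matches entity_name.
mn = Q({
    "bool": {
        "must_not": [
            {"match": {"buyer": entity_name}}
        ]
    }
})

s = Search(using=es, index="ccs_notices9") \
    .query("match", buyer=entity_name) \
    .query(mn)

results = s.execute()
results.to_dict()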

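But I get no results, which makes sense because the second query is essentially the inverse of the first. I tried using "term" in place of "match" in the mn query, but that isn't allowed. I also tried the simpler: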

s = Search(using=es, index="ccs_notices9") \
          .query("match", buyer=entity_name)\
          .exclude("term", buyer=entity_name)

This does give me results, but they still include the original string above.

- ML_Engine

1 Answer

- Kamal Kunjapur · Accepted Answer

You need two different fields to achieve what you are looking for. In short, use multi-fields on buyer, as in the use case below.
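Why a single text field is not enough can be sketched in plain Python. A text field is analyzed into lowercase tokens at index time, while a term query looks up its input unchanged, so the question's `.exclude("term", buyer=entity_name)` finds nothing to exclude. The `analyze` helper below is a hypothetical, rough stand-in for the standard analyzer, not Elasticsearch's actual implementation.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

# What ends up in the inverted index for the analyzed `buyer` field.
indexed_tokens = analyze("Greater London Authority (GLA)")

# A term query does not analyze its input; it looks for this exact string.
query_term = "Greater London Authority (GLA)"

print(indexed_tokens)                # ['greater', 'london', 'authority', 'gla']
print(query_term in indexed_tokens)  # False: nothing is excluded
```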

Mapping:

PUT my_exact_match_exclude
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "buyer": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}

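Note that the mapping for buyer has a sibling field of the keyword datatype, defined using multi-fields.

Also, read up on normalizers: I apply one to the keyword sub-field so that exact matching is case-insensitive.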
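For readers who create the index from Python rather than the console, the same settings and mapping can be written as a plain dictionary. This is a minimal sketch: the helper name `build_index_body` is hypothetical, and how the body is handed to the client depends on the client version.

```python
def build_index_body(normalizer_name="my_normalizer"):
    """Return settings and mappings equivalent to the PUT request above."""
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    # Lowercases keyword values so exact matching ignores case.
                    normalizer_name: {
                        "type": "custom",
                        "char_filter": [],
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "buyer": {
                    "type": "text",  # analyzed, used for fuzzy "match" scoring
                    "fields": {
                        # Sub-field addressed as buyer.keyword, used for exact matching.
                        "keyword": {
                            "type": "keyword",
                            "normalizer": normalizer_name,
                        }
                    },
                }
            }
        },
    }

index_body = build_index_body()
```

With the official Python client this could be sent with something like `es.indices.create(index="my_exact_match_exclude", body=index_body)`; newer client versions tend to prefer separate `settings=` and `mappings=` keyword arguments, so check the version you have installed.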

Sample documents:

POST my_exact_match_exclude/_doc/1
{
  "buyer": "Greater London Authority (GLA)"
}

POST my_exact_match_exclude/_doc/2
{
  "buyer": "Greater London Authority"
}

POST my_exact_match_exclude/_doc/3
{
  "buyer": "Greater London"
}

POST my_exact_match_exclude/_doc/4
{
  "buyer": "London Authority"
}

POST my_exact_match_exclude/_doc/5
{
  "buyer": "greater london authority (GLA)"
}

Note that the first and last documents are identical if case is ignored.

Sample query:

POST my_exact_match_exclude/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "buyer": "Greater London Authority (GLA)"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "buyer.keyword": "Greater London Authority (GLA)"
          }
        }
      ]
    }
  }
}

Note that I am applying must_not on the buyer.keyword field to exclude all exact matches.
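The same request body can be built in Python for any entity name. This is a sketch under assumptions: `build_query` and its parameter names are hypothetical, and the field names simply follow the mapping in this answer.

```python
def build_query(entity_name, text_field="buyer", exact_field="buyer.keyword"):
    """Match on the analyzed field, but drop documents whose keyword value is identical."""
    return {
        "query": {
            "bool": {
                # Scores documents by how well their tokens match the name.
                "must": [{"match": {text_field: entity_name}}],
                # Removes documents whose normalized keyword equals the name.
                "must_not": [{"term": {exact_field: entity_name}}],
            }
        }
    }

body = build_query("Greater London Authority (GLA)")
```

With elasticsearch-dsl, a body like this can be loaded with `Search(using=es, index="my_exact_match_exclude").update_from_dict(body)`. The chained form from the question should also work once it targets the sub-field, for example `.query("match", buyer=entity_name).exclude("term", buyer__keyword=entity_name)`, since elasticsearch-dsl accepts a double underscore in place of the dot in a field name.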

Sample response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.66237557,
    "hits" : [
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.66237557,
        "_source" : {
          "buyer" : "Greater London Authority"
        }
      },
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.4338556,
        "_source" : {
          "buyer" : "Greater London"
        }
      },
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.4338556,
        "_source" : {
          "buyer" : "London Authority"
        }
      }
    ]
  }
}

As expected, documents 1 and 5 are not returned because they are exact matches.
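The filtering behind this response can be reproduced without a cluster. The sketch below is an approximation, not Elasticsearch itself: the hypothetical `tokens` helper stands in for the standard analyzer, `str.lower()` stands in for the lowercase normalizer, and relevance scoring is left out entirely, so only the set of returned IDs is comparable.

```python
import re

docs = {
    1: "Greater London Authority (GLA)",
    2: "Greater London Authority",
    3: "Greater London",
    4: "London Authority",
    5: "greater london authority (GLA)",
}

def tokens(text):
    """Approximate the analyzed `buyer` field: a set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def search(query):
    """Return IDs that share a token with the query but are not a case-insensitive exact match."""
    query_tokens = tokens(query)
    query_exact = query.lower()  # approximates the normalizer on buyer.keyword
    hits = []
    for doc_id, buyer in docs.items():
        matches = bool(query_tokens & tokens(buyer))  # "match" uses OR by default
        is_exact = buyer.lower() == query_exact       # "term" on buyer.keyword
        if matches and not is_exact:
            hits.append(doc_id)
    return hits

print(search("Greater London Authority (GLA)"))  # [2, 3, 4]
```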

You can use the above query in your code in a similar way.

Hope this helps!