What does the MongoDB full-text search "score" mean?

7
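I'm working on a MongoDB project for school. I have a collection of sentences, and I run a regular text search to find the most similar sentence in the collection, based on the score.
I ran this query: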
db.sentences.find({$text: {$search: "any text"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

Here are the results when I query with a sentence:
"that kicking a dog causes it pain"
----Matched With
"that kicking a dog causes it pain – is not very controversial."
----Give a Result of:
*score: 2.4*


"This sentence have nothing to do with any other"
----Matched With
"Who is the “He” in this sentence?"
----Give a result of:
*Score: 1.0* 

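What is the score value? What does it represent? What if I want to show only results with a similarity of 70% or higher?
How can I interpret the score so that I can display a similarity percentage? I'm using C# for this, but don't worry about the implementation. I don't mind a pseudocode solution!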

1
What does 70% similarity mean? What kind of score do you want to use to measure similarity? - kraskevich
1
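I'm trying to build plagiarism-detection software: you upload a document, and each sentence is compared against a collection of sentences. So when the highest-scoring sentence is 70% similar or more, plagiarism is likely. - Nasri Yatim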
1
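@NasriYatim, did you find a way? - chrizonline
Hi Nasri, I'm also new to MongoDB. I need to search for "Raja Sekar" in a name field, which I have already indexed. But my requirement is that the search should match records that are 75% similar. Can you help me? - rajsekar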
2 Answers

7

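When you use a MongoDB text index, it generates a score for every matching document. The score indicates how strongly the search string matches the document: the higher the score, the more likely the document is similar to the search text. The score is calculated as follows: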

Step 1: Let the search text = S
Step 2: Break S into tokens (if you are not doing a phrase search), say T1, T2, ..., Tn. Apply stemming to each token.
Step 3: For every search token, calculate the score per indexed field of the text index as follows:

score = (weight * data.freq * coeff * adjustment);

Where:
weight = user-defined weight for the field. The default is 1 when no weight is specified
data.freq = how frequently the search token appears in the text
coeff = (0.5 * data.count / numTokens) + 0.5
data.count = number of matching tokens
numTokens = total number of tokens in the text
adjustment = 1 by default. If the search token is exactly equal to the document field, then adjustment = 1.1
Step 4: The final score of the document is calculated by adding up the scores of all tokens per text index field
Total Score = score(T1) + score(T2) + ... + score(Tn)
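
As a rough sketch, the formula above could be written in Python like this. The names `token_score`, `total_score` and `exact_match` are made up for illustration and are not MongoDB APIs; the code only mirrors the steps as described in this answer, not the server's actual implementation.

```python
def token_score(weight, freq, count, num_tokens, exact_match=False):
    """Score of one search token in one indexed field, per the formula above."""
    coeff = (0.5 * count / num_tokens) + 0.5
    # Assumption from the answer: 1.1 only on an exact match, otherwise 1.
    adjustment = 1.1 if exact_match else 1.0
    return weight * freq * coeff * adjustment


def total_score(token_scores):
    """Final document score: the sum of the per-token scores."""
    return sum(token_scores)


# One token that occurs once (freq = 1) in a one-token field, default weight.
print(total_score([token_score(weight=1, freq=1.0, count=1, num_tokens=1)]))
```

With these inputs every factor is 1, so the sketch yields a score of 1.0.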

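So, as the above shows, the score is affected by the following factors:

  1. The number of terms that match the search text: the more matches, the higher the score
  2. The number of tokens in the document field
  3. Whether the search text exactly matches the document field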
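
To make the second factor concrete, here is a small Python sketch showing how the coefficient for a token that matches once shrinks as the field gets longer. The helper name `coeff` is simply borrowed from the formula; it is not a MongoDB function.

```python
def coeff(count, num_tokens):
    """Coefficient from the formula: (0.5 * data.count / numTokens) + 0.5."""
    return (0.5 * count / num_tokens) + 0.5


# A single matching token in fields of 1, 5 and 10 tokens.
for num_tokens in (1, 5, 10):
    print(num_tokens, round(coeff(1, num_tokens), 2))
```

The coefficient drops from 1.0 to 0.6 to 0.55, so the same single match is worth less in a longer field, but never less than 0.5.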

Here is the derivation for one of your documents:

Search String = This sentence have nothing to do with any other
Document = Who is the “He” in this sentence?

Score Calculation:
Step 1: Tokenize the search string. Apply stemming and remove stop words.
    Token 1: "sentence"
    Token 2: "nothing"
Step 2: For every search token obtained in Step 1, do steps 3-11:
        
      Step 3: Take Sample Document and Remove Stop Words
            Input Document:  Who is the “He” in this sentence?
            Document after stop word removal: "sentence"
      Step 4: Apply Stemming 
        Document in Step 3: "sentence"
        After Stemming : "sentence"
      Step 5: Calculate data.count per search token
              data.count(sentence) = 1
              data.count(nothing) = 0 ("nothing" does not occur in the document)
      Step 6: Calculate the total number of tokens in the document
              numTokens = 1
      Step 7: Calculate the coefficient per search token
              coeff = (0.5 * data.count / numTokens) + 0.5
              coeff(sentence) = 0.5*(1/1) + 0.5 = 1.0
              coeff(nothing) = 0.5*(0/1) + 0.5 = 0.5
      Step 8: Calculate the adjustment per search token (the adjustment is 1 by default; only if the search text exactly matches the raw document is adjustment = 1.1)
              adjustment(sentence) = 1
              adjustment(nothing) = 1
      Step 9: Weight of the field (1 is the default weight)
              weight = 1
      Step 10: Calculate the frequency of the search token in the document (data.freq)
              For every i-th occurrence (counting from 0), the data frequency = 1/(2^i). All occurrences are summed.
              a. data.freq(sentence) = 1/(2^0) = 1
              b. data.freq(nothing) = 0
      Step 11: Calculate the score per search token per field:
              score = (weight * data.freq * coeff * adjustment);
              score(sentence) = (1 * 1 * 1.0 * 1.0) = 1.0
              score(nothing) = (1 * 0 * 0.5 * 1.0) = 0
Step 12: Add individual score for every token of search string to get total score
Total score = score(sentence) + score(nothing) = 1.0 + 0.0 = 1.0 
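
Step 10 is probably the least obvious part, so here is a minimal Python sketch of it. It assumes, as described above, that the i-th occurrence (counting from 0) contributes 1/(2^i) and that the contributions are summed; `data_freq` is a hypothetical helper name.

```python
def data_freq(occurrences):
    """Sum of 1 / 2**i for i = 0 .. occurrences - 1."""
    return sum(1 / 2**i for i in range(occurrences))


# Frequency value for a token that occurs 0, 1, 2 or 3 times in the field.
for n in range(4):
    print(n, data_freq(n))
```

Repeated occurrences add less and less: one occurrence gives 1, two give 1.5, three give 1.75, so the value stays below 2 no matter how often the token repeats.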

Similarly, you can derive the other one.
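
For that other pair, the same steps can be sketched in Python. This is an approximation under loud assumptions: `STOP_WORDS` is a hand-picked subset that is just enough for this sentence rather than MongoDB's real English stop-word list, stemming is skipped (the matching words are spelled identically on both sides here), and `text_score` is a hypothetical helper, not a MongoDB API.

```python
import re
from collections import Counter

# Assumption: minimal stop-word subset for this example only.
STOP_WORDS = {"that", "a", "it", "is", "not", "very"}


def tokens(text):
    """Lowercase, split on non-letters, drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def text_score(query, document, weight=1.0):
    doc_tokens = tokens(document)
    counts = Counter(doc_tokens)
    num_tokens = len(doc_tokens)
    # Adjustment is 1.1 only when the query equals the raw field text.
    adjustment = 1.1 if query == document else 1.0
    score = 0.0
    for term in set(tokens(query)):
        count = counts[term]
        if count == 0:
            continue  # a term missing from the document contributes nothing
        freq = sum(1 / 2**i for i in range(count))
        coeff = (0.5 * count / num_tokens) + 0.5
        score += weight * freq * coeff * adjustment
    return score


query = "that kicking a dog causes it pain"
document = "that kicking a dog causes it pain – is not very controversial."
print(round(text_score(query, document), 2))
```

Under these assumptions the document keeps five tokens (kicking, dog, causes, pain, controversial), each of the four matching tokens scores 0.5 * 1/5 + 0.5 = 0.6, and the total comes out at 2.4, which agrees with the score reported in the question.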

For a more detailed analysis of MongoDB scoring, see: Mongo Scoring Algorithm Blog


2

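Text search assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.

For each indexed field in the document, MongoDB multiplies the number of matches by the weight and sums the results. Using this sum, MongoDB then calculates the score for the document.

The default weight of an indexed field is 1.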
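
A toy Python illustration of the "multiply by weight and sum" step. The field names and match counts are invented, `weighted_match_sum` is a hypothetical helper, and the final step from this sum to the reported score is left out because it is not spelled out here.

```python
def weighted_match_sum(matches_per_field, weights=None):
    """Sum of (matches * weight) over the indexed fields; weight defaults to 1."""
    weights = weights or {}
    return sum(weights.get(field, 1) * n for field, n in matches_per_field.items())


# Hypothetical document: 1 match in "title" (weight 10), 3 in "body" (default 1).
print(weighted_match_sum({"title": 1, "body": 3}, {"title": 10}))
```

Here the single title match outweighs the three body matches, which is the effect of giving a field a higher weight.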

https://docs.mongodb.com/manual/tutorial/control-results-of-text-search/


6
Don't plagiarize; explaining with an example would be more helpful. - chrizonline
