How to use the gensim BM25 ranking algorithm in Python

22

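I found that gensim has a BM25 ranking function, but I can't find a tutorial on how to use it.

In my case, I have a query and several documents retrieved from a search engine. How can I use gensim BM25 ranking to compare the query with the documents and find the most similar one?

I'm new to gensim. Thanks.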

Query:

"experimental studies of creep buckling ."

Document 1:

" the 7 x 7 in . hypersonic wind tunnel at rae farnborough, part 1, design, instrumentation and flow visualization techniques . this is the first of three parts of the calibration report on the r.a.e. some details of the design and lay-out of the plant are given, together with the calculated performance figures, and the major components of the facility are briefly described . the instrumentation provided for the wind-tunnel is described in some detail, including the optical and other methods of flow visualization used in the tunnel . later parts will describe the calibration of the flow in the working-section, including temperature measurements . a discussion of the heater performance will also be included as well as the results of tests to determine starting and running pressure ratios, blockage effects, model starting loads, and humidity of the air flow ."

Document 2:

" the 7 in. x 7 in. hypersonic wind tunnel at r.a.e. farnborough part ii. heater performance . tests on the storage heater, which is cylindrical in form and mounted horizontally, show that its performance is adequate for operation at m=6.8 and probably adequate for flows at m=8.2 with the existing nozzles . in its present state, the maximum design temperature of 680 degrees centigrade for operation at m=9 cannot be realised in the tunnel because of heat loss to the outlet attachments of the heater and quick-acting valve which form, in effect, a large heat sink . because of this heat loss there is rather poor response of stagnation temperature in the working section at the start of a run . it is hoped to cure this by preheating the heater outlet cone and the quick-acting valve . at pressures greater than about 100 p.s.i.g. free convection through the fibrous thermal insulation surrounding the heated core causes the top of the heater shell to become somewhat hotter than the bottom, which results in /hogging/ distortion of the shell . this free convection cools the heater core and a vertical temperature gradient is set up across it after only a few minutes at high pressure . modifications to be incorporated in the heater to improve its performance are described ."

Document 3:

" supersonic flow at the surface of a circular cone at angle of attack . formulas for the inviscid flow properties on the surface of a cone at angle of attack are derived for use in conjunction with the m.i.t. cone tables . these formulas are based upon an entropy distribution on the cone surface which is uniform and equal to that of the shocked fluid in the windward meridian plane . they predict values for the flow variables which may differ significantly from the corresponding values obtained directly from the cone tables . the differences in the magnitudes of the flow variables computed by the two methods tend to increase with increasing free-stream mach number, cone angle and angle of attack ."

Document 4:

" theory of aircraft structural models subjected to aerodynamic heating and external loads . the problem of investigating the simultaneous effects of transient aerodynamic heating and external loads on aircraft structures for the purpose of determining the ability of the structure to withstand flight to supersonic speeds is studied . by dimensional analyses it is shown that .. constructed of the same materials as the aircraft will be thermally similar to the aircraft with respect to the flow of heat through the structure will be similar to those of the aircraft when the structural model is constructed at the same temperature as the aircraft . external loads will be similar to those of the aircraft . subjected to heating and cooling that correctly simulate the aerodynamic heating of the aircraft, except with respect to angular velocities and angular accelerations, without requiring determination of the heat flux at each point on the surface and its variation with time . acting on the aerodynamically heated structural model to those acting on the aircraft is determined for the case of zero angular velocity and zero angular acceleration, so that the structural model may be subjected to the external loads required for simultaneous simulation of stresses and deformations due to external loads ."
4 Answers

22

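Full disclosure: I have no experience with BM25 ranking, but I have quite a bit of experience with gensim's TF-IDF and LSI distributed models, as well as gensim's similarity index.

The authors have done a great job of keeping the codebase readable, so if you run into something like this again, I recommend jumping straight into the source code.

See the source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/summarization/bm25.py

So I initialized a BM25() object with the documents you pasted above.

It looks like our good old friend Radim didn't include a function to compute average_idf for us, which is no big deal, since we can just borrow line 65: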

average_idf = sum(map(lambda k: float(bm25.idf[k]), bm25.idf.keys())) / len(bm25.idf.keys())

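Then, if I understand the original intent of get_scores correctly, you should be able to get each document's BM25 score relative to your original query by doing the following: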

scores = bm25.get_scores(query_doc, average_idf)

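This returns one score per document. Then, if I understand BM25 ranking correctly from what I read at https://en.wikipedia.org/wiki/Okapi_BM25, you should be able to pick the document with the highest score like this: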

best_result = docs[scores.index(max(scores))]

So the first document should be the most relevant to your query? Anyway, I hope this is what you expected, and I hope it helps in some way. Good luck!
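
To make it clearer what a call like get_scores is computing, here is a minimal pure-Python sketch of textbook Okapi BM25 followed by the same "take the max" step. It is not gensim's implementation: the function name bm25_scores, the toy documents, the parameter values k1=1.5 and b=0.75, and the "+ 1" smoothed IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score each document against the query with textbook Okapi BM25.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from gensim.
    """
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturated by k1 and normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (made up for this sketch, not the ones from the question).
docs = [
    "wind tunnel design and flow visualization",
    "experimental studies of creep buckling in heated columns",
    "supersonic flow over a circular cone",
]
query = "experimental studies of creep buckling"

tokenized = [d.split() for d in docs]
scores = bm25_scores(query.split(), tokenized)

# Same selection step as above: the document with the highest score wins.
best_index = scores.index(max(scores))
best_result = docs[best_index]
```

A document that shares no token with the query ends up with a score of exactly 0.0 in this sketch, so only the second toy document gets a positive score.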


4
Note that the input to BM25() is "corpus = [dictionary.doc2bow(text) for text in texts]", and the "doc" argument of "get_scores(doc, avg_idf)" is the array returned by dictionary.doc2bow(word). - Lewen

9

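Since @mkerrig's answer is now outdated (2020), here is a way to use BM25 with gensim 3.8.3, assuming you have a list of documents docs. This code returns the indices of the 10 best-matching documents (sorted by ascending score, so the best match comes last).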

from gensim import corpora
from gensim.summarization import bm25

texts = [doc.split() for doc in docs]  # optionally preprocess here, e.g. remove stopwords
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
bm25_obj = bm25.BM25(corpus)
query_doc = dictionary.doc2bow(query.split())
scores = bm25_obj.get_scores(query_doc)
best_docs = sorted(range(len(scores)), key=lambda i: scores[i])[-10:]

Note that you no longer need the average_idf parameter.
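
One detail about the last line of that snippet: it takes the last ten entries of an ascending sort, so the best match ends up last in best_docs. If you would rather have the best match first, something like the following sketch should work. The helper name top_k_indices and the score values are hypothetical; only the standard library's heapq.nlargest is used.

```python
import heapq


def top_k_indices(scores, k=10):
    """Return the indices of the k highest scores, best first.

    Hypothetical helper; `scores` is any list of floats, such as the
    list of per-document BM25 scores.
    """
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


# Made-up scores, standing in for the per-document BM25 scores.
scores = [0.0, 2.5, 1.2, 0.0, 3.1]

ranked = top_k_indices(scores, k=3)

# For comparison, the ascending-sort idiom puts the best match last.
ascending_tail = sorted(range(len(scores)), key=lambda i: scores[i])[-3:]
```

Here ranked lists the best document first, while ascending_tail holds the same three indices in the opposite order.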


5
Note that it is important to install gensim==3.8.3, because gensim dropped bm25 in versions 4.0+. - xhluca

5

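The answer from @fonfonx works, but it is not the natural way to use BM25. The BM25 constructor expects an argument of type List[List[str]], which means it needs a tokenized corpus.

I think a better example would look like this: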

from gensim.summarization.bm25 import BM25
corpus = ["The little fox ran home",
          "dogs are the best ",
          "Yet another doc ",
          "I see a little fox with another small fox",
          "last doc without animals"]

def simple_tok(sent:str):
    return sent.split()

tok_corpus = [simple_tok(s) for s in corpus]
bm25 = BM25(tok_corpus)
query = simple_tok("a little fox")
scores = bm25.get_scores(query)

best_docs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
for i, b in enumerate(best_docs):
    print(f"rank {i+1}: {corpus[b]}")

Output:

>> rank 1: I see a little fox with another small fox
>> rank 2: The little fox ran home
>> rank 3: dogs are the best 
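
Note that best_docs already holds the positions of the documents in corpus, so those are the indices to use for looking things up. Also, the rank 3 line shares no token with the query, so it is most likely there only because several documents tie at a score of zero and the stable sort keeps the earliest one. Assuming that reading is right, you may want to drop non-matching documents before printing. A minimal sketch, with a hypothetical helper rank_matches and made-up score values:

```python
def rank_matches(scores, top_n=3):
    """Return (index, score) pairs for documents with a positive score, best first.

    Hypothetical helper; `scores` stands in for the list of per-document
    BM25 scores.
    """
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if s > 0][:top_n]


# Made-up scores for five documents; three of them do not match the query.
scores = [1.1, 0.0, 0.0, 1.6, 0.0]

matches = rank_matches(scores)
for rank, (index, score) in enumerate(matches, start=1):
    print(f"rank {rank}: document index {index} (score {score})")
```

With these made-up scores only two documents are reported, and each result carries its index into the original corpus.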

How do I get the index of the indexed document? - J Do

4

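I agree that the answers above are correct. Still, for the convenience of other community members, I'll add a few things. :)

The following 4 links are very useful and cover this problem thoroughly.

  1. https://github.com/nhirakawa/BM25 A Python implementation of the BM25 ranking function. Very easy to use; I use it in my own project as well. Works great! I think this one will suit your problem.

  2. https://sajalsharma.com/portfolio/cross_language_information_retrieval Shows a very detailed, step-by-step use of Okapi BM25, which can serve as a reference for your current system design task.

  3. http://lixinzhang.github.io/implementation-of-okapi-bm25-on-python.html More code, solely for Okapi BM25.

  4. https://github.com/thunlp/EntityDuetNeuralRanking Entity-Duet Neural Ranking Model. Well suited to research and academic work.

Peace!

--- Addendum: https://github.com/airalcorn2/RankNet RankNet and LambdaRank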

