Find the most similar sentences in a document using spaCy

I'm looking for a solution that works like most_similar() in Gensim, but using spaCy. I want to use NLP to find the most similar sentence within a list of sentences.

I tried looping over the sentences one by one with spaCy's similarity() (e.g. https://spacy.io/api/doc#similarity), but it takes a very long time.
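Roughly, the loop looks like this (a sketch; the sentence list and query below are placeholders, not my real data):
import spacy

nlp = spacy.load("en_core_web_lg")

# Placeholder corpus; in practice this is the sentence list from the document.
sentences = ["The cat sat on the mat.", "A dog lay on the rug.", "Stocks fell today."]
docs = [nlp(s) for s in sentences]

query = nlp("A cat is sitting on a mat.")
# One similarity() call per candidate; over all pairs this is O(n^2) and slow.
best = max(docs, key=query.similarity)
print(best.text)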

Going further:

I would like to put all of these sentences into a graph (like this) in order to find clusters of sentences.
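One way I imagine building that graph (a sketch; networkx and the threshold value are my assumptions, reusing the docs list from the snippet above): connect sentence pairs whose similarity exceeds a cutoff, then read off connected components as clusters.
import networkx as nx

THRESHOLD = 0.8  # hypothetical cutoff; tune for your corpus

G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if docs[i].similarity(docs[j]) > THRESHOLD:
            G.add_edge(i, j)

# Each connected component is one cluster of mutually similar sentences.
print([sorted(c) for c in nx.connected_components(G)])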

Any ideas?


I think what you want is clustering --> i.e. grouping similar things together https://zh.wikipedia.org/wiki/%E7%B0%87%E7%B1%BB%E5%88%86%E6%9E%90 - Petr Matuska
1 Answer

Here is a simple built-in solution you can use:
import spacy

nlp = spacy.load("en_core_web_lg")
text = (
    "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
    " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
    " The term semantic similarity is often confused with semantic relatedness."
    " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
    " My favorite fruit is apples."
)
doc = nlp(text)
max_similarity = 0.0
most_similar = None, None  # will hold the winning (sentence, sentence) pair
# Compare each unordered pair of sentences exactly once (j > i).
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        if j <= i:
            continue
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")


(Text from Wikipedia)
It produces the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551

Note the following information from spacy.io:

To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:

- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg
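A quick sanity check (my addition, not from the quoted docs) to confirm that the loaded pipeline actually ships word vectors:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("apple")
# A vector-less *_sm pipeline prints False and 0.0 here;
# en_core_web_lg prints True and a non-zero norm.
print(doc.has_vector, doc.vector_norm)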
See also Document similarity in Spacy vs Word2Vec for suggestions on how to improve the similarity scores.
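If the document has many sentences, the pairwise similarity() loop grows quadratically. A vectorized sketch (my addition, not part of the original answer; it assumes en_core_web_lg so the sentence vectors are meaningful) stacks one vector per sentence and computes all cosine similarities with NumPy in one shot:
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)  # `text` as defined in the snippet above
sents = list(doc.sents)

# One vector per sentence; normalize rows, then a single matrix product
# yields the full pairwise cosine-similarity matrix.
vecs = np.array([s.vector for s in sents])
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sim = unit @ unit.T

np.fill_diagonal(sim, -1.0)  # mask self-similarity before taking the argmax
i, j = np.unravel_index(sim.argmax(), sim.shape)
print(f"Most similar: '{sents[i]}' <-> '{sents[j]}' ({sim[i, j]:.4f})")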

That was 2 years ago and I no longer work on that. But I did something similar, and the problem may lie in the implementation: Gensim's most_similar is optimized with multithreading, I would guess. With this loop-based solution it is too linear, and for a long corpus the computation time grows very quickly. Using a lightweight model could be a good solution. - Heraknos
