Mapping a word vector to the most similar/closest word using spaCy

Question

nlp spacy word2vec word-embedding

11

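I am using spaCy as part of a topic modelling solution, and I have a situation where I need to map a derived word vector to the "closest" or "most similar" word in a vocabulary of word vectors.

I see that gensim has a function (WordEmbeddingsKeyedVectors.similar_by_vector) to calculate this, but I was wondering whether spaCy has something similar for mapping a vector to a word in its vocabulary (nlp.vocab)?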

- Eric Broda

5 Answers

12

After a bit of experimentation, I found a SciPy function (cdist in scipy.spatial.distance) that finds a vector in a vector space that is "close" to an input vector.

# Imports
import numpy as np
import spacy
from scipy.spatial import distance

# Load the spacy vocabulary
nlp = spacy.load("en_core_web_lg")

# Format the input vector for use in the distance function
# In this case we will artificially create a word vector from a real word ("frog")
# but any derived word vector could be used
input_word = "frog"
p = np.array([nlp.vocab[input_word].vector])

# Format the vocabulary for use in the distance function
ids = [x for x in nlp.vocab.vectors.keys()]
vectors = [nlp.vocab.vectors[x] for x in ids]
vectors = np.array(vectors)

# *** Find the closest word below ***
closest_index = distance.cdist(p, vectors).argmin()
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
# output_word is identical, or very close, to the input word

- Eric Broda

7

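A caveat about this answer. Traditionally, word similarity (in gensim, spacy, and nltk) uses cosine similarity, whereas SciPy's cdist uses Euclidean distance by default. You can get the cosine distance instead, which is not the same as the similarity, although the two are related. To replicate gensim's calculation, change the cdist call to the following: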

distance.cdist(p, vectors, metric='cosine').argmin()

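However, note that SciPy measures cosine distance, which is the "inverse" of cosine similarity: cosine distance = 1 - cos x, where x is the angle between the vectors. So to match gensim's numbers you must subtract your result from 1 (and, of course, take the argmax rather than the argmin, since similar vectors score closer to 1). It is a subtle difference, but it can cause a lot of confusion.

Similar vectors should have a large similarity (close to 1) and a small distance (close to zero).

Cosine similarity can be negative (meaning the vectors point in opposite directions), but the corresponding distance is still non-negative (as a distance should be).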
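
To make the difference concrete, here is a minimal sketch on made-up 2-D vectors (these are not real word embeddings, and the variable names are only illustrative). It shows that Euclidean distance and cosine distance can pick different "closest" rows, and that 1 minus the cosine distance gives the similarity:

```python
import numpy as np
from scipy.spatial import distance

# Toy 2-D vectors standing in for word vectors (made-up values)
query = np.array([[1.0, 0.0]])
candidates = np.array([
    [10.0, 0.0],   # same direction as the query, but far away
    [1.0, 0.5],    # nearby, but pointing in a slightly different direction
    [-1.0, 0.0],   # opposite direction
])

# Default metric is Euclidean: picks the nearby point
euclid_idx = int(distance.cdist(query, candidates).argmin())

# Cosine distance ignores length: picks the vector with the same direction
cos_dist = distance.cdist(query, candidates, metric="cosine")[0]
cos_idx = int(cos_dist.argmin())

# Cosine similarity = 1 - cosine distance; take argmax instead of argmin
cos_sim = 1.0 - cos_dist
sim_idx = int(cos_sim.argmax())

print(euclid_idx, cos_idx, sim_idx)
```

The opposite-direction row ends up with similarity -1 and distance 2, which matches the remark above about negative similarities.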

source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.n_similarity.html#gensim.models.Word2Vec.n_similarity

Here is how to do a similarity comparison in spaCy:

import spacy
nlp = spacy.load("en_core_web_md")
x = nlp("man")
y = nlp("king")
print(x.similarity(y))
print(x.similarity(x))

- RDS

x.similarity is fast enough for me to iterate over every word in the vocabulary, for a small number of cases. - z0r

1

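Here is an example of similarity search over 300-dimensional feature vectors (1.2 kB each as 32-bit floats).

You can store the word vectors in a geometric data structure, sklearn.neighbors.BallTree, to speed up the search significantly while avoiding the high-dimensional penalty associated with k-d trees (which give no speedup once the dimension exceeds ~100). The tree can easily be pickled and unpickled and kept in memory if you need to avoid loading spaCy. See below for implementation details.

The other answers, which use a linear search, work (though be careful with cosine similarity if any of your vectors is zero), but they will be slow for a large vocabulary. spaCy's en_core_web_lg model has about 680k words with word vectors. Each word string is typically only a few bytes, but together with the vectors this can add up to as much as a few GB of memory.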
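
As a rough illustration of the zero-vector caveat, here is a NumPy sketch of a linear cosine search that skips rows with zero norm instead of dividing by zero. The function name and the toy data are hypothetical, and assigning -inf to zero-norm rows is just one possible convention:

```python
import numpy as np

def safe_cosine_similarities(query, matrix):
    """Cosine similarity of query against each row; zero-norm rows get -inf."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Product of norms; it is zero wherever the query or the row is a zero vector
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    sims = np.full(matrix.shape[0], -np.inf)
    valid = denom > 0
    sims[valid] = (matrix[valid] @ query) / denom[valid]
    return sims

# Toy data: the first row is a zero vector (e.g. a word without a vector)
rows = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
sims = safe_cosine_similarities([2.0, 0.0], rows)
best = int(sims.argmax())
print(sims, best)
```

With this convention a zero row can never win the argmax, so it never shows up as the "most similar" word.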

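We can make the search case-insensitive and drop infrequent words using a word frequency table, which cuts the vocabulary down to about 100k words. (spaCy used to ship this table built in, but as of v3.0 it has to be loaded separately.) However, the search is still linear and may take a couple of seconds, which may not be acceptable.

There are libraries for fast similarity search, but they can be quite cumbersome and complicated to install, and they are geared towards feature vectors on the order of MB or GB, with GPU acceleration and so on.

We may not want to load the entire spaCy vocabulary every time the application runs, so we pickle/unpickle the vocabulary as needed.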

import pickle

import numpy
import sklearn.neighbors as nbs
import spacy
from spacy.lookups import load_lookups  # needs the spacy-lookups-data package

# load spaCy
nlp = spacy.load("en_core_web_lg")

# load lexeme probability table
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))

# get lowercase words with vectors whose log-probability is above the threshold (-18 here)
words = [word for word in nlp.vocab.strings
         if nlp.vocab.has_vector(word) and word.islower() and nlp.vocab[word].prob >= -18]
wordvecs = numpy.array([nlp.vocab.get_vector(word) for word in words])  # get word vectors
tree = nbs.BallTree(wordvecs)  # create the ball tree
to_vec = dict(zip(words, wordvecs))  # create word:vector dict (used by nearest_words)

After trimming the vocabulary, we can pickle the words, the dict, and the ball tree, and load them when needed without having to load spaCy again:

# pickle/unpickle the ball tree if you don't want to load spaCy
with open('balltree.pkl', 'wb') as f:
    pickle.dump(tree, f, protocol=pickle.HIGHEST_PROTOCOL)
# ...
# load word vector ball tree from pickle file
with open('./balltree.pkl', 'rb') as f:
    tree = pickle.load(f)

Given a word, get its word vector, search the tree for the indices of the nearest words, then look up the corresponding words:

# get word vector and look up nearest words
def nearest_words(word):
    # get the vector for the word (to_vec is the word:vector dict)
    try:
        vec = to_vec[word]
    # if word is not in vocab, set to zero vector
    except KeyError:
        vec = numpy.zeros(300)

    # perform nearest neighbor search of word vector vocabulary
    dist, ind = tree.query([vec], 10)

    # look up nearest words using indices from tree (same order as the words list)
    near_words = [words[i] for i in ind[0]]

    return near_words
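
For anyone who wants to try the idea without downloading a spaCy model, here is a small self-contained sketch. The toy vocabulary, the random vectors, and the names toy_words, toy_tree and nearest are all made up for illustration. It also unit-normalizes the vectors first, which is an extra step not in the code above: for unit vectors, Euclidean nearest neighbors (BallTree's default metric) rank the same way as cosine similarity. Zero vectors are not handled here.

```python
import pickle

import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical toy vocabulary with random 300-d vectors (not real embeddings)
rng = np.random.default_rng(0)
toy_words = ["frog", "toad", "king", "queen", "apple"]
toy_vecs = rng.normal(size=(len(toy_words), 300)).astype(np.float32)

# Unit-normalize so Euclidean nearest neighbors agree with cosine ranking
unit_vecs = toy_vecs / np.linalg.norm(toy_vecs, axis=1, keepdims=True)
toy_tree = BallTree(unit_vecs)

# Round-trip the word list and the tree together through pickle (in memory)
restored_words, restored_tree = pickle.loads(pickle.dumps((toy_words, toy_tree)))

def nearest(vec, k=3):
    """Return the k words whose unit vectors are closest to vec."""
    unit = vec / np.linalg.norm(vec)
    dist, ind = restored_tree.query(unit.reshape(1, -1), k=k)
    return [restored_words[i] for i in ind[0]]

neighbors = nearest(toy_vecs[0])
print(neighbors)
```

Pickling the word list together with the tree keeps the index-to-word mapping in sync with the tree that was built from it.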

- Jackson Walters

1

# python -m spacy download en_core_web_md
import spacy
nlp = spacy.load('en_core_web_md')
word = 'overflow'
nwords = 10
doc = nlp(word)
vector = doc.vector
vect2word = lambda idx: nlp.vocab.strings[idx]
print([vect2word(simword) for simword in nlp.vocab.vectors.most_similar(vector.reshape(1,-1), n=nwords)[0][0]])

- Zack Dai

- Amir · Accepted Answer

Yes, spaCy has an API method for this, just like KeyedVectors.similar_by_vector:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

your_word = "king"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # note: the third element holds similarity scores, not distances
print(words)
['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings', 'PRINCE', 'Prince', 'prince']

(The words in en_core_web_lg are not properly normalized, but you can try other models and observe more representative output.)
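
To show what a lookup like this does conceptually, here is a NumPy-only sketch of a cosine top-n search over a tiny made-up vocabulary. It is not spaCy's actual implementation; the function most_similar_words, the word list, and the 3-D vectors are assumptions for illustration only:

```python
import numpy as np

def most_similar_words(query, words, vectors, n=3):
    """Return the n (word, cosine similarity) pairs closest to query."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query      # cosine similarity per row
    top = np.argsort(-scores)[:n]           # indices of the highest scores
    return [(words[i], float(scores[i])) for i in top]

# Made-up 3-D "embeddings" for a four-word vocabulary
words = ["king", "queen", "frog", "toad"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.9],
])

# A "derived" vector that is not in the vocabulary: a blend of king and queen
derived = 0.7 * vectors[0] + 0.3 * vectors[1]
result = most_similar_words(derived, words, vectors, n=2)
print(result)
```

This is the same linear search as in the cdist answer, so it scales with vocabulary size; the point is only to show how a derived vector gets mapped back to words.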