If you're looking for the simplest/fastest solution, I'd suggest using pre-trained word embeddings (Word2Vec or GloVe) and building a simple query system on top of them. The vectors have been trained on a huge corpus and most likely contain good enough approximations for your domain data.

Here's my solution:
    import numpy as np

    data = {
        'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
        'Colors': ['yellow', 'red', 'green'],
        'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
    }

    # Invert the mapping: word -> category name
    categories = {word: key for key, words in data.items() for word in words}

    # Load the pre-trained GloVe vectors: one "word v1 v2 ... v100" line each
    embeddings_index = {}
    with open('glove.6B.100d.txt') as f:
        for line in f:
            values = line.split()
            word = values[0]
            embed = np.array(values[1:], dtype=np.float32)
            embeddings_index[word] = embed
    print('Loaded %s word vectors.' % len(embeddings_index))

    # Keep only the embeddings of the category words
    data_embeddings = {key: value for key, value in embeddings_index.items()
                       if key in categories.keys()}

    # Score a query against each category: average dot product
    # of the query vector with the vectors of the category's words
    def process(query):
        query_embed = embeddings_index[query]
        scores = {}
        for word, embed in data_embeddings.items():
            category = categories[word]
            dist = query_embed.dot(embed)
            dist /= len(data[category])
            scores[category] = scores.get(category, 0) + dist
        return scores

    # Testing
    print(process('pink'))
    print(process('frank'))
    print(process('moscow'))
To run it, you'll have to download and unpack the pre-trained GloVe data from here (careful, 800 MB!). When run, it should produce something like this:
{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}
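One caveat about the scoring: a raw dot product favors words whose vectors happen to have a larger norm. If you want scores that are more comparable across queries, normalizing to cosine similarity is a small change. A sketch (not part of the original answer; `embeddings_index`, `data_embeddings`, `categories`, and `data` are passed in explicitly here rather than read from globals):

```python
import numpy as np

def process_cosine(query, embeddings_index, data_embeddings, categories, data):
    """Like process(), but averages cosine similarity instead of raw dot products."""
    query_embed = embeddings_index[query]
    query_norm = np.linalg.norm(query_embed)
    scores = {}
    for word, embed in data_embeddings.items():
        category = categories[word]
        # Cosine similarity: dot product of the two vectors over their norms
        sim = query_embed.dot(embed) / (query_norm * np.linalg.norm(embed))
        scores[category] = scores.get(category, 0) + sim / len(data[category])
    return scores
```

With cosine scores every contribution lies in [-1, 1], so the category averages are directly interpretable regardless of how frequent (and thus how large-normed) the words are.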
... which looks quite reasonable. And that's it! If you don't need such a big model, you can filter the words in glove according to their tf-idf score. Keep in mind that the model size only depends on the data you have and the words you might want to query.
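For instance, once you have decided which vocabulary matters (hypothetically, your category words plus the query words you expect), the GloVe file can be shrunk in a single pass; a sketch, assuming the standard one-word-per-line GloVe text format:

```python
def filter_glove(src_path, dst_path, vocab):
    """Copy only the GloVe lines whose leading word is in `vocab` to a smaller file."""
    kept = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            # The word is everything before the first space on the line
            if line.split(' ', 1)[0] in vocab:
                dst.write(line)
                kept += 1
    return kept
```

Loading the filtered file afterwards works with the exact same parsing loop as above, just much faster and with a fraction of the memory.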