Using TF-IDF together with pre-trained word embeddings

I have a list of website meta descriptions (128k descriptions, 20-30 words each on average), and I am trying to build a similarity ranker (e.g.: show the 5 sites whose meta descriptions are most similar to this one). Using TF-IDF uni- and bigrams works really well, and I thought I could improve it further by adding pre-trained word embeddings (spaCy's "en_core_web_lg" to be exact). But the unexpected happened: it doesn't work at all. Not a single good guess; it suddenly returns completely random suggestions.
Here is my code. What mistake do you think I might have made? Am I overlooking something blindingly obvious?
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import sys
import pickle
import spacy
import scipy.sparse
from scipy.sparse import csr_matrix
import math
from sklearn.metrics.pairwise import linear_kernel
nlp=spacy.load('en_core_web_lg')


""" Tokenizing"""
def _keep_token(t):
    return (t.is_alpha and 
            not (t.is_space or t.is_punct or 
                 t.is_stop or t.like_num))
def _lemmatize_doc(doc):
    return [ t.lemma_ for t in doc if _keep_token(t)]

def _preprocess(doc_list):     
    return [_lemmatize_doc(nlp(doc)) for doc in doc_list]
def dummy_fun(doc):
    return doc

# Importing the list of 128,000 meta descriptions:
Web_data=open("./data/meta_descriptions","r", encoding="utf-8")
All_lines=Web_data.readlines()
# outputs a list of meta-descriptions consisting of lists of preprocessed tokens:
data=_preprocess(All_lines) 

# TF-IDF Vectorizer:    
vectorizer = TfidfVectorizer(min_df=10, tokenizer=dummy_fun, preprocessor=dummy_fun)
tfidf = vectorizer.fit_transform(data)
dictionary = vectorizer.get_feature_names()

# Retrieving Word embedding vectors:
temp_array=[nlp(dictionary[i]).vector for i in range(len(dictionary))]

# I had to build the sparse array in several steps due to RAM constraints
# (with bigrams the vocabulary gets as large as >1m):
dict_emb_sparse=scipy.sparse.csr_matrix(temp_array[0])
for start in range(1, len(temp_array), 100000):
    print(start)
    chunk = scipy.sparse.csr_matrix(temp_array[start:min(start + 100000, len(temp_array))])
    dict_emb_sparse = scipy.sparse.vstack([dict_emb_sparse, chunk])

# Multiplying the TF-IDF matrix with the Word embeddings: 
tfidf_emb_sparse=tfidf.dot(dict_emb_sparse)

# Translating the Query into the TF-IDF matrix and multiplying with the same Word Embeddings:
query_doc= vectorizer.transform(_preprocess(["World of Books is one of the largest online sellers of second-hand books in the world Our massive collection of over million cheap used books also comes with free delivery in the UK Whether it s the latest book release fiction or non-fiction we have what you are looking for"]))
query_emb_sparse=query_doc.dot(dict_emb_sparse)

# Calculating Cosine Similarities:
cosine_similarities = linear_kernel(query_emb_sparse, tfidf_emb_sparse).flatten()

related_docs_indices = cosine_similarities.argsort()[:-10:-1]

# Printing the Site descriptions with the highest match:    
for ID in related_docs_indices:
    print(All_lines[ID])

I took parts of the code/logic from this GitHub repository. Does anyone see an obvious mistake here? Thanks a lot!

Are you using the word embeddings that ship with spaCy rather than training embeddings on your meta descriptions? - Happy Boy
Yes, exactly. I was hoping that would give higher accuracy. Should I use embeddings I train myself? - benjo121212
It would be better to train the embeddings on your meta descriptions. - Happy Boy
@benjo121212, did you find a solution to the problem? - StackPancakes
1 Answer


You should try training the embeddings on your own corpus. There are plenty of packages for this, for example gensim and GloVe. You can also use embeddings from BERT, which do not need to be retrained on your own corpus.
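
As a rough sketch of the gensim route (assuming gensim >= 4.0; the hyperparameters below are only illustrative, and `data` / `dictionary` are the variables from the question's code), you could train Word2Vec on the preprocessed token lists and feed those vectors into the same TF-IDF pipeline instead of the spaCy ones:

from gensim.models import Word2Vec

# Train on the same preprocessed token lists used for the TF-IDF step.
# min_count=10 mirrors the vectorizer's min_df=10, so most dictionary entries get a vector.
w2v = Word2Vec(sentences=data, vector_size=300, window=5, min_count=10, workers=4)

# Replace the spaCy lookup; entries missing from the Word2Vec vocabulary fall back to a zero vector.
temp_array = [w2v.wv[word] if word in w2v.wv else np.zeros(w2v.vector_size)
              for word in dictionary]

The rest of the question's pipeline (building the sparse matrix, tfidf.dot(dict_emb_sparse), the cosine similarities) can stay unchanged.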

You should be aware that the probability distributions of words always differ between corpora. For example, the counts of "basketball" in posts about food are very different from the counts of that term in sports news, so the word embeddings learned for "basketball" from those corpora would be far apart.
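
For the BERT option, one common route (my assumption, not something spelled out above) is the sentence-transformers package, which embeds each meta description as a whole and skips the TF-IDF weighting entirely; `All_lines` is the raw list from the question:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained sentence model works here

# One dense vector per meta description, and one for the query:
doc_emb = model.encode(All_lines, show_progress_bar=True)
query_emb = model.encode(["World of Books is one of the largest online sellers of second-hand books in the world"])

cosine_similarities = cosine_similarity(query_emb, doc_emb).flatten()
for ID in cosine_similarities.argsort()[:-10:-1]:
    print(All_lines[ID])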

