Word2vec fine-tuning


I need to fine-tune my word2vec model. I have two datasets, data1 and data2.

What I have done so far is:

model = gensim.models.Word2Vec(
        data1,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work)
model.train(data1, total_examples=len(data1), epochs=epochs)

model.train(data2, total_examples=len(data2), epochs=epochs)

Is this correct? Do I need to store the learned weights somewhere?
I checked this answer and this one, but I couldn't understand how it is done.
Could someone explain to me which steps to follow?
3 Answers

Note that you don't need to call train() with data1 if you already provided data1 at model instantiation. In that case the model will already have done its own internal build_vocab() and train() on the supplied corpus, using the default number of epochs (5) if you didn't specify one at instantiation.
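For example, the following two forms behave essentially the same (a minimal sketch, assuming the pre-4.0 gensim API used in the question, where the constructor's epochs parameter is called iter):

# Form 1: corpus supplied at instantiation; build_vocab() and train() run internally
model = gensim.models.Word2Vec(
        data1,
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work,
        iter=epochs)

# Form 2: the same steps made explicit, with no corpus at instantiation
model = gensim.models.Word2Vec(
        size=size_v,
        window=size_w,
        min_count=min_c,
        workers=work)
model.build_vocab(data1)
model.train(data1, total_examples=model.corpus_count, epochs=epochs)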
"Fine-tuning" is not a simple process with reliable steps that is sure to improve a model; it is very error-prone.
In particular, if the words in data2 aren't already known to the model, they will be ignored. (There is an option to call build_vocab() with the parameter update=True to expand the known vocabulary, but such words aren't really on fully equal footing with the earlier words.)
And if data2 includes some words but not others, only the words in data2 get updated by the additional training, which can essentially pull those words out of comparable alignment with the other words that only appeared in data1. (Only words that are trained together, in interleaved shared training, go through the "push-pull" that ultimately leaves them in useful arrangements.)
The safest course for incremental training is to shuffle data1 and data2 together and do continued training over all of the data, so that every word gets new, interleaved training; a sketch of this follows below.
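A minimal sketch of that approach, assuming the model and variables from the question, the pre-4.0 gensim API, and that data1/data2 are lists of tokenized sentences (the update=True vocabulary expansion mentioned above is included so that words appearing only in data2 are not dropped):

import random

combined = list(data1) + list(data2)
random.shuffle(combined)

# expand the existing vocabulary with any words that are new in data2
model.build_vocab(combined, update=True)

# continued training over the shuffled, combined corpus
model.train(combined, total_examples=len(combined), epochs=model.epochs)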


Is this correct?

Yes, it is correct. You need to make sure that the words in data2 are contained in the vocabulary built from data1; if they aren't, the words that are not present in the vocabulary will be lost.
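A quick way to check which words of data2 are missing from the model's vocabulary (a minimal sketch, assuming a pre-4.0 gensim model, where the vocabulary is exposed as model.wv.vocab, and that data2 is a list of tokenized sentences):

known = set(model.wv.vocab)                                   # words the model already knows
data2_words = {word for sentence in data2 for word in sentence}
missing = data2_words - known
print("{} words from data2 are not in the vocabulary".format(len(missing)))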

Note that the weights computed by

model.train(data1, total_examples=len(data1), epochs=epochs)

model.train(data2, total_examples=len(data2), epochs=epochs)

and by

model.train(data1+data2, total_examples=len(data1+data2), epochs=epochs)

are not equivalent.

Do I need to store the learned weights somewhere?

No, you don't need to.

However, if you want, you can save the weights to a file so you can use them later.

model.save("word2vec.model")

and you can load them back with

model = Word2Vec.load("word2vec.model")

(source)

I need to fine-tune my word2vec model.

Note that "Word2vec training is an unsupervised task, there's no good way to objectively evaluate the result. Evaluation depends on your end application." (source) But there are some evaluations you can look at here (see the "How to measure quality of the word vectors" section).
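As one illustration, gensim can score a model on the analogy test set from the original word2vec paper; a minimal sketch, assuming gensim >= 3.4's evaluate_word_analogies() helper and a local copy of the standard questions-words.txt file:

# overall fraction of correctly solved analogies, plus per-section details
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
print("analogy accuracy: {:.2%}".format(score))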

Hope this helps!


When you train a w2v model with gensim, it stores the vocab and the index of each word; gensim uses this information to map a word to its vector.
If you are going to fine-tune an already existing w2v model, you need to make sure that your vocab is consistent.
See the attached piece of code.
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator

os.mkdir("model_dir")

# class EpochSaver(CallbackAny2Vec):
#     '''Callback to save model after each epoch.'''
#     def __init__(self, path_prefix):
#         self.path_prefix = path_prefix
#         self.epoch = 0

#     def on_epoch_end(self, model):
#         list_of_existing_files = os.listdir(".")
#         output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
#         try:
#             model.save(output_path)
#         except:
#             model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
#         print("number of epochs completed = {}".format(self.epoch))
#         self.epoch += 1
#         list_of_total_files = os.listdir(".")

# saver = EpochSaver("my_finetuned")





# function to load vectors from an existing model.
# I am loading GloVe vectors from a text file; a benefit of doing this is that I get the complete GloVe vocab as well.
# If you are using a previous word2vec model, I would recommend saving it in txt format.
# In case you decide not to do that, you can tweak the function to get vectors only for the words in your vocab.
def load_vectors(token2id, path,  limit=None):
    embed_shape = (len(token2id), 300)
    freqs = np.zeros((len(token2id)), dtype='f')

    vectors = np.zeros(embed_shape, dtype='f')
    i = 0
    with open(path, encoding="utf8", errors='ignore') as f:
        for o in f:
            token, *vector = o.split(' ')
            token = str.lower(token)
            if len(o) <= 100:
                continue
            if limit is not None and i > limit:
                break
            vectors[token2id[token]] = np.array(vector, 'f')
            i += 1

    return vectors


embedding_name = "glove.840B.300d.txt"
data = "<training data (newline-separated text file)>"

# Dictionary to store a unique id for each token in vocab( in my case vocab contains both my vocab and glove vocab)
token2id = {}

# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}

# Populating vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
file = open("{}".format(data),'r', encoding="utf-8")
for line in file.readlines():
    words = line.strip().split(" ")
    training_examples.append(words)
    for word in words:
        if word not in vocab_freq_dict:
            vocab_freq_dict.update({word:0})
        vocab_freq_dict[word] += 1
        if word not in token2id:
            token2id.update({word:id_})
            id_ += 1

# Populating vocab_freq_dict and token2id from glove vocab.
max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_id]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
    for o in f:
        token, *vector = o.split(' ')
        token = str.lower(token)
        if len(o) <= 100:
            continue
        if token not in token2id:
            max_token_id += 1
            token2id.update({token:max_token_id})
            vocab_freq_dict.update({token:1})

with open("vocab_freq_dict","wb") as vocab_file:
    pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
    pickle.dump(token2id, token2id_file)



# converting vectors to keyedvectors format for gensim
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)

# setting vectors(numpy_array) to None to release memory
vectors = None

params = dict(min_count=1,workers=14,iter=6,size=300)

model = Word2Vec(**params)

# using build_vocab_from_freq to build the vocab from the frequency dictionary
model.build_vocab_from_freq(vocab_freq_dict)

# using token2id to create idxmap
idxmap = np.array([token2id[w] for w in model.wv.index2entity])

# Setting hidden weights (syn0 = between input layer and hidden layer) = your vectors arranged according to ids
model.wv.vectors[:] = vec.vectors[idxmap]

# Setting output weights (syn1neg = between hidden layer and output layer) = your vectors arranged according to ids
model.trainables.syn1neg[:] = vec.vectors[idxmap]


model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
output_path = 'model_dir/final_model.model'
model.save(output_path)

If you have any doubts, please leave a comment.

