PyTorch / Gensim - How do I load pre-trained word embeddings?

Question

python pytorch neural-network gensim word-embedding

52

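I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?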

- MBT

6 Answers

4

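I think it is easy. Just copy the embedding weights from gensim into the weights of the corresponding PyTorch embedding layer.

You need to make sure two things are right: first, the weight shape must be correct; second, the weights must be converted to the PyTorch FloatTensor type.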
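
A minimal sketch of that copy, assuming the gensim vectors are available as a NumPy array of shape (vocab_size, dim). The random `pretrained` array below is only a stand-in for that matrix, and all names are illustrative:

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the gensim weight matrix: shape (vocab_size, dim), float32.
pretrained = np.random.rand(5, 3).astype(np.float32)

# 1) The layer shape must match the weight matrix.
vocab_size, dim = pretrained.shape
embedding = nn.Embedding(vocab_size, dim)

# 2) The weights must be float32 tensors (torch.FloatTensor) before copying.
with torch.no_grad():
    embedding.weight.copy_(torch.from_numpy(pretrained))
```

After the copy, `embedding.weight` holds the same values as `pretrained`, one row per vocabulary index.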

- jdhao

I didn't know there was a "_weight" parameter in the constructor, I'll give it a try - thanks! - MBT

3

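I had the same question, except that I use the torchtext library with pytorch, since it helps with padding, batching and so on. This is how I load pre-trained embeddings with torchtext 0.3.0 and pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):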

import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab

# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)

# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])

# build vocabulary
text_field.build_vocab(dataset)

# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)

# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)

# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))

# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)

- robodasha

3

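import torch
import torch.nn as nn
from gensim.models import Word2Vec

# train a gensim model (gensim >= 4.0 renamed `size` to `vector_size`)
model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)

# copy the trained vectors into a PyTorch embedding layer
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)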

- Jibin Mathew

2

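Thanks for your reply. I took a look at gensim to check your approach. See the gensim page here: https://radimrehurek.com/gensim/models/word2vec.html#usage-examples It says the Word2Vec model is meant only for training word vectors, since that format is much slower than KeyedVectors. Once training is finished, the vectors are usually saved into a KeyedVectors model. That model is designed specifically to hold pre-trained vectors, resulting in a smaller and faster object than the Word2Vec model. You can do it this way, but I don't see any benefit in doing so. - MBT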

1

Thanks, @blue-phoenox. I have read that; I wrote this code assuming the embeddings are created and used right away rather than loaded from a file. - Jibin Mathew

2

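Of course you can do that. But it means that every time you start the training process you also train the embeddings. That is a waste of computation and not really the idea of pre-trained embeddings. When I create models I usually run them many times, and I don't want to retrain my pre-trained embeddings every time I start training the model. - MBT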

2

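The main focus is on the torch part, so I left the gensim model and its loading to the reader. A developer may well be in a situation where the gensim model can be used right after it is created. - Jibin Mathew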

2

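I just wanted to point out that in this case the vectors are not actually pre-trained. Your code example does not load pre-trained vectors; it trains new word vectors. I was just wondering whether there is another use case, which is why I asked. - MBT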

1

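I had a similar question: "After training embeddings with gensim and saving them in binary format, how do I load them into torchtext?"

I simply saved the file in txt format and then followed the excellent tutorial on loading custom word embeddings.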

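import os
from os.path import basename

import torch
import torchtext.vocab as vocab
from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    # re-save a binary word2vec file in text format and return the new path
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file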

emb_txt_file = convert_bin_emb_txt(out_path,emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)

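Tested with PyTorch 1.2.0 and TorchText 0.4.0.

I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial to initialize all words that are not in the embeddings, and how to make the <unk> and <pad> vectors equal to zero.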

- Damianos P. Melidis

0

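I had quite a few problems understanding the documentation myself, and there aren't many good examples around. Hopefully this example helps others. It is a simple classifier that takes pre-trained embeddings in matrix_embeddings. By setting requires_grad to False, we make sure they are not changed.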

class InferClassifier(nn.Module):
  def __init__(self, input_dim, n_classes, matrix_embeddings):
    """initializes a 2 layer MLP for classification.
    There are no non-linearities in the original code, Katia instructed us 
    to use tanh instead"""

    super(InferClassifier, self).__init__()

    #dimensionalities
    self.input_dim = input_dim
    self.n_classes = n_classes
    self.hidden_dim = 512

    #embedding
    self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
    self.embeddings.weight.requires_grad = False

    #creates a MLP
    self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(), #not present in the original code.
            nn.Linear(self.hidden_dim, self.n_classes))

  def forward(self, sentence):
    """forward pass of the classifier
    I am not sure it is necessary to make this explicit."""

    #get the embeddings for the inputs
    u = self.embeddings(sentence)

    #forward to the classifier
    return self.classifier(x)

sentence is a vector containing indices into matrix_embeddings rather than words.
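
For readers who want something runnable, here is a hedged sketch of the same idea. The class name `PooledClassifier`, the mean pooling over the sequence, and the toy sizes are assumptions made for illustration and are not part of the code above; the looked-up embeddings `u` are what gets passed on to the classifier.

```python
import torch
import torch.nn as nn

class PooledClassifier(nn.Module):
    """Hypothetical variant: frozen pre-trained embeddings, mean-pooled, then a 2-layer MLP."""

    def __init__(self, n_classes, matrix_embeddings, hidden_dim=512):
        super().__init__()
        # freeze=True (the default) sets weight.requires_grad to False
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings, freeze=True)
        emb_dim = matrix_embeddings.shape[1]
        self.classifier = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, n_classes))

    def forward(self, sentence):
        u = self.embeddings(sentence)   # (batch, seq_len, emb_dim)
        pooled = u.mean(dim=1)          # (batch, emb_dim); pooling choice is an assumption
        return self.classifier(pooled)

# toy data: 10 "words" with 4-dimensional vectors, 2 sentences of 3 indices each
matrix_embeddings = torch.rand(10, 4)
model = PooledClassifier(n_classes=3, matrix_embeddings=matrix_embeddings)
sentence = torch.LongTensor([[1, 2, 3], [4, 5, 6]])
logits = model(sentence)
```

Here `logits` has one row per sentence and one column per class, and the embedding weights receive no gradient.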

- Victor Zuanazzi

1

Did you mean self.classifier(u)? - z.ghane

How do you get the indices for these sentences? - Daniel Wyatt
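
One common way to get such indices, sketched under assumptions: keep a word-to-row mapping in the same order as the rows of the embedding matrix and look each token up in it. In gensim 4.x a KeyedVectors object exposes this mapping as `key_to_index`; the plain dict, the `<unk>` fallback row, and the helper name below are stand-ins chosen for illustration.

```python
# Stand-in vocabulary, in the same row order as the embedding matrix
index_to_word = ["<unk>", "the", "cat", "sat"]
word_to_index = {word: i for i, word in enumerate(index_to_word)}

def sentence_to_indices(tokens, unk="<unk>"):
    """Map each token to its row index, falling back to the <unk> row."""
    return [word_to_index.get(token, word_to_index[unk]) for token in tokens]

indices = sentence_to_indices(["the", "cat", "slept"])
print(indices)  # [1, 2, 0]
```

The resulting list can be wrapped in torch.LongTensor and passed to the embedding layer.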

- MBT · Accepted Answer

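I'd like to report my findings on loading a gensim embedding with PyTorch.

Solution for PyTorch 0.4.0 and newer:

Starting with v0.4.0 there is a new function from_pretrained() that makes loading an embedding very convenient. Here is an example from the documentation.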

import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)

The weights from gensim can easily be obtained with:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated

As @Guglie pointed out, in newer gensim versions a trained Word2Vec model exposes its vectors through model.wv:

weights = torch.FloatTensor(model.wv.vectors)

Solution for PyTorch 0.3.1 and older:

I am using version 0.3.1, and from_pretrained() is not available in that version.

So I wrote my own from_pretrained function so that I can use it in 0.3.1 as well.

Code for from_pretrained for PyTorch 0.3.1 or lower:

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
         'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded like this:

embedding = from_pretrained(weights)
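
To illustrate what the freeze flag does, here is a small sketch using the built-in nn.Embedding.from_pretrained from PyTorch 0.4.0 and newer, which accepts the same flag. The toy weight values are the ones from the documentation example; the variable names are illustrative.

```python
import torch
import torch.nn as nn

weights = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])

# freeze=True is the default: the pre-trained vectors stay fixed during training
frozen = nn.Embedding.from_pretrained(weights)

# freeze=False lets the optimizer fine-tune the pre-trained vectors
trainable = nn.Embedding.from_pretrained(weights, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```

Keeping the embedding frozen matches the usual idea of pre-trained vectors; passing freeze=False is an option when the vectors should be adapted to the task.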

I hope this is helpful for someone.
