Is it possible to load pretrained word2vec vectors into Spark?

Question


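Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms, which Spark provides? Or do I need to implement the loading and the operations from scratch?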

In the post "Load Word2Vec model in Spark", Tom Lous suggests converting the bin file to txt and starting from there. I have already done that, but what comes next?

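In a question I posted yesterday, I got an answer saying that models in Parquet format can be loaded in Spark, so I am posting this question to make sure there is no other option.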

- Mike_Jr

1 Answer


- chairbender · Accepted Answer

Disclaimer: I'm pretty new to Spark, but the approach below at least works for me.

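The trick is figuring out how to construct a Word2VecModel from a set of word vectors, and handling a few gotchas that come up when creating the model this way.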

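First, load your word vectors into a Map. For example, I have saved my word vectors in parquet format (in a folder called "wordvectors.parquet"), where the "term" column holds the word as a String and the "vector" column holds the vector as an array[float]. I can load it in Java like this: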

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import com.google.common.primitives.Floats; // Guava
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import scala.Tuple2;
import scala.collection.JavaConverters;
import scala.collection.Seq;

// Loads the dataset with the "term" column holding the word and the "vector" column
// holding the vector as an array[float]. pSpark is assumed to be an existing SparkSession.
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");

// Convert the dataset to a map. Columns are looked up by name rather than by position.
Map<String, List<Float>> vectorMap = vectorModel.collectAsList().stream()
            .collect(Collectors.toMap(row -> row.<String>getAs("term"),
                                      row -> row.<Float>getList(row.fieldIndex("vector"))));

// Convert to the format that the word2vec model expects: float[] rather than List<Float>.
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, entry -> Floats.toArray(entry.getValue())));

// Need to convert to a Scala immutable map because that's what word2vec needs.
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

// Helper: copies a java.util.Map into a scala.collection.immutable.Map.
private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
        final List<Tuple2<K,V>> list = pFromMap.entrySet().stream()
                .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
                .collect(Collectors.toList());

        Seq<Tuple2<K,V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();

        return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
    }
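If the vectors are in text form rather than parquet (for example, after converting the .bin file to .txt as described in the question), the Map could be built by parsing the lines directly. The following is only a minimal sketch under stated assumptions: each line is a word followed by whitespace-separated floats, and an optional first line holding just the vocabulary size and dimension (as the word2vec text format writes, and GloVe files do not) is skipped. The class name TextVectorLoader and the method name parse are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TextVectorLoader {
    /**
     * Parses text-format word vectors: one "<word> v1 v2 ... vN" entry per line.
     * Assumption: a first line with exactly two tokens is a "<count> <dim>" header
     * and is skipped (this would misfire on 1-dimensional vectors).
     */
    static Map<String, float[]> parse(Iterable<String> lines) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        boolean first = true;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            boolean isHeader = first && parts.length == 2;
            first = false;
            if (isHeader || parts.length < 2) {
                continue; // skip the header and blank lines
            }
            float[] vector = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vector[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vector);
        }
        return vectors;
    }
}
```

The result has the same Map<String, float[]> shape as word2vecMap above, so it could be passed to toScalaImmutableMap in its place. The lines could come from something like Files.readAllLines on the converted file, bearing in mind that the whole vocabulary has to fit in driver memory either way.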

Now you can construct the model from scratch. Because of a quirk in how Word2VecModel works, you must set the vector size manually, and do so in an odd way. Otherwise it defaults to 100 and you get an error when calling .transform(). Here is a way I have found that works, though I am not sure every step is necessary:

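import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;

// Not used for fitting; only used to set the vector size param
// (not sure whether this is needed or whether result.set alone is enough).
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);

// Wrap the mllib model (built from the Scala map) in the ml Word2VecModel.
Word2VecModel result = new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);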
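Since a mismatched vector size is what triggers the error, it may be safer to derive the size from the loaded map than to hard-code 300. The following is a small sketch, not part of the original approach; the class name VectorDimensions and the method name uniformDimension are hypothetical. It also checks that every vector has the same length.

```java
import java.util.Map;

final class VectorDimensions {
    /** Returns the common length of all vectors, or throws if they differ or the map is empty. */
    static int uniformDimension(Map<String, float[]> vectors) {
        int dim = -1;
        for (Map.Entry<String, float[]> entry : vectors.entrySet()) {
            int length = entry.getValue().length;
            if (dim == -1) {
                dim = length; // the first vector fixes the expected size
            } else if (length != dim) {
                throw new IllegalArgumentException(
                        "vector for '" + entry.getKey() + "' has length " + length + ", expected " + dim);
            }
        }
        if (dim <= 0) {
            throw new IllegalArgumentException("no vectors, or zero-length vectors");
        }
        return dim;
    }
}
```

The returned value could then replace the literal 300 in both setVectorSize and result.set.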

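Now you should be able to use result.transform() just as you would with a model you trained yourself. I have not tested whether the other Word2VecModel functions work correctly; I only tested .transform().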
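Because findSynonyms is untested with this approach, one way to sanity-check the loaded vectors independently of Spark is to rank words by cosine similarity directly on the Java map, which is the similarity measure findSynonyms is based on. This is only an illustrative sketch: the class NearestWords and its methods are invented for this example, it is not Spark's implementation, and it scans the whole vocabulary on a single machine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NearestWords {
    /** Cosine similarity of two equal-length vectors (NaN if either has zero norm). */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k words most similar to the given word, excluding the word itself. */
    static List<String> nearest(Map<String, float[]> vectors, String word, int k) {
        float[] query = vectors.get(word);
        if (query == null) {
            throw new IllegalArgumentException("unknown word: " + word);
        }
        // Score every other word once, then sort by score in descending order.
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, float[]> entry : vectors.entrySet()) {
            if (!entry.getKey().equals(word)) {
                scores.put(entry.getKey(), cosine(query, entry.getValue()));
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((String w) -> scores.get(w)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

If result.findSynonyms does work on the constructed model, its top results for a given word would be expected to broadly agree with this ranking.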