Extracting the document-topic matrix from a Pyspark LDA model

Question

16

I have successfully trained an LDA model in Spark through the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)

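This works fine, but I now need the document-topic matrix of the LDA model, and as far as I can tell all I can get is the word-topic matrix, via model.topicsMatrix().

Is there some way to get the document-topic matrix from the LDA model? If not, is there an alternative way in Spark (short of implementing LDA from scratch) to run an LDA model that gives me the result I need?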

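EDIT:

After digging around a bit, I found topicDistributions() in the documentation for DistributedLDAModel in the Java API, which I think is exactly what I need (though I am not sure whether the LDAModel in Pyspark is actually a DistributedLDAModel under the hood...).

In any case, I am able to call this method indirectly like so, without any overt failures: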

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are strings telling me that each result is actually a Scala tuple (I think):

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get the actual results?

- moustachio

1

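I know the LDA functionality in Spark is still under development, but it seems strange that there is no simple way to get this information directly from the model... - moustachio

I think there is another issue here. As Jason Lenderman pointed out, Spark LDA does not implement LSA but a variant of PLSI, which makes these matrices much less useful directly. See https://dev59.com/jVwY5IYBdhLWcg3weXgb#32953813. - zero323

I see, but in that case a more or less equivalent solution would be to predict topics for the original training documents, similar to the approach described in the linked question. As far as I can tell, though, the necessary methods are not implemented in the Python API. Are they hidden somewhere, or is there some other way to do this in Pyspark? - moustachio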

1

As far as I know, it is not accessible from Python. - zero323

It looks like this merged pull request added the topicDistributions function. - Quentin Le Sceller

1

Is there an answer to this for Pyspark 2.0.0? - Hardik Gupta

3 Answers

5

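The following is an extended response for PySpark and Spark 2.0.

I hope you will excuse my posting this as a reply instead of a comment; I lack the reputation at the moment.

I am assuming you have a trained LDA model built from a corpus like so: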

from pyspark.ml.clustering import LDA  # the DataFrame-based API, not pyspark.mllib
lda = LDA(k=NUM_TOPICS, optimizer="em")
ldaModel = lda.fit(corpus)  # Where corpus is a dataframe with 'features'.

To convert documents into topic distributions, we create a dataframe holding the document ID and a vector (preferably sparse) of the words.

from pyspark.ml.linalg import Vectors
# index_of_word, count and another are placeholders for your own word indices and counts.
documents = spark.createDataFrame([
    [123, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],  # 123: your numeric document ID
    [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another: 1.0})],
], schema=["id", "features"])
transformed = ldaModel.transform(documents)
dist = transformed.take(1)
# dist[0]['topicDistribution'] is now a dense vector of our topics.
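
As a rough sketch of where the {index_of_word: count} mapping above might come from: assuming each document is already tokenized and the vocabulary is a fixed list of words, the counts can be built in plain Python. The function name to_sparse_counts and the toy vocabulary are hypothetical, made up for illustration. In a real pipeline, pyspark.ml.feature.CountVectorizer would usually produce these vectors for you.

```python
from collections import Counter

def to_sparse_counts(tokens, vocabulary):
    """Return (vocab_size, {word_index: count}) for one tokenized document.

    Tokens that are not in the vocabulary are skipped.
    """
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    counts = Counter(word_to_index[t] for t in tokens if t in word_to_index)
    # Use float counts, matching the 1.0 style used in the answer's example.
    return len(vocabulary), {i: float(c) for i, c in sorted(counts.items())}

# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = ["car", "engine", "price", "summer"]
size, index_to_count = to_sparse_counts(["car", "price", "car", "boat"], vocabulary)
print(size, index_to_count)  # 4 {0: 2.0, 2: 1.0}
```

The two returned values are the arguments that Vectors.sparse(size, index_to_count) takes.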

- Joseph Catrambone

2

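Could you share the complete code you used? I am running into problems with the transform method on Pyspark and Spark 2.0. (It says the method is not available.) - E B

What data types does the dist object returned by the transform method contain? Does it include all the Scala API outputs, such as topTopicsPerDocument and so on? Why does it seem like we are the first people trying to use Spark LDA? It might be best to avoid this code altogether... it looks alpha-grade. - Geoffrey Anderson

Apparently the sample code above works with Spark 2+. See https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA for more information. - Eb Abadi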

3

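As of Spark 2.0 you can use the transform() method of pyspark.ml.clustering.DistributedLDAModel. I just tried it on the 20 newsgroups dataset from scikit-learn and it works. See the returned row below: its topicDistribution vector is the distribution of topics in the document.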

>>> test_results = ldaModel.transform(wordVecs)
Row(filename='/home/jovyan/work/data/20news_home/20news-bydate-test/rec.autos/103343', target=7, text='I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', tokens=['little', 'confused', 'models', 'bonnevilles', 'someone', 'differences', 'features', 'performance', 'curious', 'prefereably', 'usually', 'demand', 'spring', 'summer'], vectors=SparseVector(10977, {28: 1.0, 29: 1.0, 152: 1.0, 301: 1.0, 496: 1.0, 552: 1.0, 571: 1.0, 839: 1.0, 1114: 1.0, 1281: 1.0, 1288: 1.0, 1624: 1.0}), topicDistribution=DenseVector([0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549, 0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501]))
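
To read a topicDistribution vector like the one above, it can help to rank the topics by weight. A minimal sketch, assuming the DenseVector has been converted to a plain Python list (for example with toArray().tolist()). The helper name top_topics is made up for illustration.

```python
def top_topics(distribution, n=3):
    """Return the n (topic_index, weight) pairs with the largest weights."""
    ranked = sorted(enumerate(distribution), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Values copied from the topicDistribution in the Row shown above.
topic_distribution = [
    0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549,
    0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501,
]
best = top_topics(topic_distribution)
print(best)  # [(9, 0.0549), (18, 0.0547), (4, 0.0545)]
```

For this document every weight is close to the uniform value 1/20 = 0.05, so no single topic clearly dominates.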

- Evan Zamir

Note to moderators: I deleted my other post and answered here instead. - Evan Zamir

2

What is wordVecs in your example? - Hardik Gupta

- moustachio · Accepted Answer

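After extensive research, this is not possible via the Python API in the current version of Spark (1.5.1). In Scala, however, it is fairly straightforward (given an RDD documents on which to train):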

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}

// first generate RDD of documents...

val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)

// then convert to distributed LDA model
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]

Then getting the document topic distributions is as simple as:

distLDAModel.topicDistributions
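
In the Scala API, topicDistributions yields (document ID, topic vector) pairs rather than a single matrix. As a sketch of the last step, assuming those pairs have been collected and are available in Python as plain tuples (the data and the function name below are hypothetical), a document-topic matrix is just the vectors stacked in document-ID order:

```python
def to_document_topic_matrix(pairs):
    """Turn (doc_id, topic_weights) pairs into (ids, matrix), ordered by doc_id."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    ids = [doc_id for doc_id, _ in ordered]
    matrix = [list(weights) for _, weights in ordered]
    return ids, matrix

# Hypothetical pairs for three documents and two topics, in arbitrary order.
pairs = [(2, [0.1, 0.9]), (0, [0.7, 0.3]), (1, [0.5, 0.5])]
ids, matrix = to_document_topic_matrix(pairs)
print(ids)     # [0, 1, 2]
print(matrix)  # [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]]
```

Row i of the matrix is the topic distribution of the document with ID ids[i]. Keeping the IDs alongside the rows matters because a distributed collection does not guarantee any particular order.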