Topic distribution: how do we see which document belongs to which topic after doing LDA in Python

32

gensim is a cool and simple library. Its developer, Radim, is also a nice person to ask questions about the library. Do you need something that clusters documents by topic? - alvas
3 Answers

34

Using the probabilities of the topics, you can try setting a threshold and using it as a clustering baseline, but I am sure there are better ways to cluster than this "hacky" method.

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                               update_every=1, chunksize=10000, passes=1)

# Prints the topics.
for top in lda.print_topics():
    print(top)
print()

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
print()

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

print(cluster1)
print(cluster2)
print(cluster3)

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:


# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)

The code above sums the probability scores of all topics across all documents, then normalizes the sum by the number of scores, i.e. it averages them.
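For reference (not part of the original answer), the chain() expression that several commenters ask about below is equivalent to a single flat comprehension:

# Flatten every (topic_id, score) tuple of every document into one list of scores.
scores = [score for doc in lda_corpus for topic_id, score in doc]
threshold = sum(scores) / len(scores)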


1
This looks like a good solution! Another solution I found is to run K-means clustering on the topic distributions, as in this link: https://dev59.com/Hmw15IYBdhLWcg3wo9Rx, but I am not sure how to implement it. Do you know how? - jxn
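Not part of the thread, but since the comment asks how the K-means variant might be implemented, here is a minimal sketch assuming scikit-learn is installed and the lda, mm, and documents objects from the answer above; corpus2dense fills dropped topics with zeros, so it works regardless of minimum_probability:

from gensim import matutils
from sklearn.cluster import KMeans

# Densify the sparse per-document topic distributions into a
# (num_docs x num_topics) array; each row is a document in topic space.
dense = matutils.corpus2dense(lda[mm], num_terms=lda.num_topics).T

# Cluster documents by their topic distributions; n_clusters=3 is an
# assumption mirroring num_topics above.
kmeans = KMeans(n_clusters=3, random_state=0).fit(dense)
for label, doc in zip(kmeans.labels_, documents):
    print(label, doc)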
2
I am also trying to re-implement the Brown clustering algorithm (https://dev59.com/IGEi5IYBdhLWcg3wuuYk), but given the (topic, probability) tuples, you could try this script: https://dev59.com/BnvZa4cB1Zd3GeqP-QxU. - alvas
That is the scary part: no one knows the optimal number of topics, and no one knows the optimal number of clusters either. I am not a computer scientist, but I am sure someone can work out the optimal number of topics/clusters. - alvas
2
I got better performance by removing unique words, as described in this question. - dh762
3
Could you be more specific about what this line does? scores = list(chain(*[[score for topic_id,score in topic] for topic in [doc for doc in lda_corpus]])); threshold = sum(scores)/len(scores) — as far as I can tell, it gathers every topic score of every document in lda_corpus into a single list via chain() and then averages them to obtain the threshold. - jxn

13
If you want to use this trick:
cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

from alvas' answer above, make sure to set minimum_probability=0 in the LdaModel:

gensim.models.ldamodel.LdaModel(corpus,
            num_topics=num_topics, id2word=dictionary,
            passes=2, minimum_probability=0)
Otherwise the dimensions of lda_corpus and documents may not agree, since gensim suppresses any (topic, probability) tuple whose probability is lower than minimum_probability.
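A quick sanity check for this (a sketch, assuming the lda and mm objects from the first answer): with minimum_probability=0 every document yields exactly num_topics tuples, so positional indexing such as i[0][1] stays aligned with the topic indices.

for dist in lda[mm]:
    # Holds only when minimum_probability=0; otherwise low-probability
    # topics are dropped and positional indexing can point at the wrong topic.
    assert len(dist) == lda.num_topics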
An alternative way of grouping documents into topics is to assign each document to the topic with the maximum probability:
lda_corpus = [max(prob, key=lambda y: y[1]) for prob in lda[mm]]
playlists = [[] for _ in range(num_topics)]
for i, x in enumerate(lda_corpus):
    playlists[x[0]].append(documents[i])

Note that lda[mm] is, roughly speaking, a list of lists, i.e. a 2D matrix: the number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple of the form (3, 0.82), where 3 is the topic index and 0.82 is the corresponding probability of that topic. By default minimum_probability=0.01, and any tuple whose probability is below 0.01 is omitted from lda[mm]. If you use the grouping-by-maximum-probability method, you can set it to 1/#topics.
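To see this structure directly, one can print each row of the transformed corpus (a sketch using the toy model from the first answer):

for i, doc_topics in enumerate(lda[mm]):
    # Each row is a list of (topic_index, probability) tuples,
    # e.g. 0 [(0, 0.09), (1, 0.09), (2, 0.82)]
    print(i, doc_topics)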


Yes, I also considered assigning by maximum probability! Thanks for showing the implementation. - jxn
Hey @nos, could you explain what the first part of the code does? In particular the [0][1] > threshold part. What do those numbers mean? - Economist_Ayahuasca
1
@AndresAzqueta The elements of lda_corpus take the form [(0, p0), (1, p1), ...], where the first number is the topic index and the second is the probability that the document belongs to that topic. If there are N topics, the list contains N tuples. However, if minimum_probability is not 0, tuples with probability below minimum_probability are not included in the list. - nos
Hey @nos, thanks a lot for the answer. So if I had five topics, the series would be: [0][1] > threshold, [1][1] > threshold, [2][1] > threshold, [3][1] > threshold, [4][1] > threshold? Thanks. - Economist_Ayahuasca
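For reference (not from the thread), the per-topic lines generalize to any number of topics exactly as the comment suggests. A sketch assuming the lda_corpus, documents, and threshold objects from the first answer; looking topics up via dict() avoids relying on positional order when minimum_probability drops tuples:

num_topics = 5  # hypothetical; match the value passed to LdaModel
clusters = [
    [doc for dist, doc in zip(lda_corpus, documents)
     if dict(dist).get(k, 0) > threshold]
    for k in range(num_topics)
]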

2

Each lda_corpus[i] is of the form [(0, t0), (1, t1), ..., (n, tn)], where the first term in each tuple denotes the topic index and the second term denotes the probability of that topic in document i.

