Topic distribution: how do we see which document belongs to which topic after doing LDA in Python

32

gensim is a cool and simple library. Its developer, Radim, is also a nice person to ask questions about the library. Do you need something that clusters documents by topic? - alvas
3 Answers

34

Using the probabilities of the topics, you can try setting a threshold and using it as a clustering baseline, but I am sure there are better ways to cluster than this "hacky" method.

from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3,
                               update_every=1, chunksize=10000, passes=1)

# Prints the topics.
for top in lda.print_topics():
    print(top)
print()

# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]

# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
print()

cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

print(cluster1)
print(cluster2)
print(cluster3)

[out]:

0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user

0.333333333333

['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']

Just to make it clearer:


# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)

The code above sums the probability scores of all topics across all documents, then normalizes the sum by the number of scores, i.e. it averages them.
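For reference (not part of the original answer), the chain() expression that several commenters ask about below is equivalent to a single flat comprehension:

# Flatten every (topic_id, score) tuple of every document into one list of scores.
scores = [score for doc in lda_corpus for topic_id, score in doc]
threshold = sum(scores) / len(scores)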


1
This looks like a good solution! Another solution I found is to run K-means clustering on the topic distributions, as in this link: https://dev59.com/Hmw15IYBdhLWcg3wo9Rx, but I am not sure how to implement it. Do you know how? - jxn
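Not part of the thread, but since the comment asks how the K-means variant might be implemented, here is a minimal sketch assuming scikit-learn is installed and the lda, mm, and documents objects from the answer above; corpus2dense fills dropped topics with zeros, so it works regardless of minimum_probability:

from gensim import matutils
from sklearn.cluster import KMeans

# Densify the sparse per-document topic distributions into a
# (num_docs x num_topics) array; each row is a document in topic space.
dense = matutils.corpus2dense(lda[mm], num_terms=lda.num_topics).T

# Cluster documents by their topic distributions; n_clusters=3 is an
# assumption mirroring num_topics above.
kmeans = KMeans(n_clusters=3, random_state=0).fit(dense)
for label, doc in zip(kmeans.labels_, documents):
    print(label, doc)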
2
I am also trying to re-implement the Brown clustering algorithm (https://dev59.com/IGEi5IYBdhLWcg3wuuYk), but given the (topic, probability) tuples, you could try this script: https://dev59.com/BnvZa4cB1Zd3GeqP-QxU. - alvas
That is the scary part: no one knows the optimal number of topics, and no one knows the optimal number of clusters either. I am not a computer scientist, but I am sure someone can work out the optimal number of topics/clusters. - alvas
2
I got better performance by removing unique words, as described in this question. - dh762
3
Could you be more specific about what this line does? scores = list(chain(*[[score for topic_id,score in topic] for topic in [doc for doc in lda_corpus]])); threshold = sum(scores)/len(scores) — as far as I can tell, it gathers every topic score of every document in lda_corpus into a single list via chain() and then averages them to obtain the threshold. - jxn

13
If you want to use this trick:
cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]

from alvas' answer above, make sure to set minimum_probability=0 in the LdaModel:

gensim.models.ldamodel.LdaModel(corpus,
            num_topics=num_topics, id2word=dictionary,
            passes=2, minimum_probability=0)
Otherwise the dimensions of lda_corpus and documents may not agree, since gensim suppresses any (topic, probability) tuple whose probability is lower than minimum_probability.
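A quick sanity check for this (a sketch, assuming the lda and mm objects from the first answer): with minimum_probability=0 every document yields exactly num_topics tuples, so positional indexing such as i[0][1] stays aligned with the topic indices.

for dist in lda[mm]:
    # Holds only when minimum_probability=0; otherwise low-probability
    # topics are dropped and positional indexing can point at the wrong topic.
    assert len(dist) == lda.num_topics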
An alternative way of grouping documents into topics is to assign each document to the topic with the maximum probability:
lda_corpus = [max(prob, key=lambda y: y[1]) for prob in lda[mm]]
playlists = [[] for _ in range(num_topics)]
for i, x in enumerate(lda_corpus):
    playlists[x[0]].append(documents[i])

Note that lda[mm] is, roughly speaking, a list of lists, i.e. a 2D matrix: the number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple of the form (3, 0.82), where 3 is the topic index and 0.82 is the corresponding probability of that topic. By default minimum_probability=0.01, and any tuple whose probability is below 0.01 is omitted from lda[mm]. If you use the grouping-by-maximum-probability method, you can set it to 1/#topics.
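To see this structure directly, one can print each row of the transformed corpus (a sketch using the toy model from the first answer):

for i, doc_topics in enumerate(lda[mm]):
    # Each row is a list of (topic_index, probability) tuples,
    # e.g. 0 [(0, 0.09), (1, 0.09), (2, 0.82)]
    print(i, doc_topics)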


Yes, I also considered assigning by maximum probability! Thanks for showing the implementation. - jxn
Hey @nos, could you explain what the first part of the code does? In particular the [0][1] > threshold part. What do those numbers mean? - Economist_Ayahuasca
1
@AndresAzqueta The elements of lda_corpus take the form [(0, p0), (1, p1), ...], where the first number is the topic index and the second is the probability that the document belongs to that topic. If there are N topics, the list contains N tuples. However, if minimum_probability is not 0, tuples with probability below minimum_probability are not included in the list. - nos
Hey @nos, thanks a lot for the answer. So if I had five topics, the series would be: [0][1] > threshold, [1][1] > threshold, [2][1] > threshold, [3][1] > threshold, [4][1] > threshold? Thanks. - Economist_Ayahuasca
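For reference (not from the thread), the per-topic lines generalize to any number of topics exactly as the comment suggests. A sketch assuming the lda_corpus, documents, and threshold objects from the first answer; looking topics up via dict() avoids relying on positional order when minimum_probability drops tuples:

num_topics = 5  # hypothetical; match the value passed to LdaModel
clusters = [
    [doc for dist, doc in zip(lda_corpus, documents)
     if dict(dist).get(k, 0) > threshold]
    for k in range(num_topics)
]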

2

Each lda_corpus[i] is of the form [(0, t0), (1, t1), ..., (n, tn)], where the first term in each tuple denotes the topic index and the second term denotes the probability of that topic in document i.

