I built an LDA topic model from a toy corpus as follows:
from gensim import corpora
from gensim.models import LdaModel

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']
# Lowercase each document and split it into tokens on whitespace
texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)
# Bag-of-words representation of each document, used as the training corpus
corpus = [dictionary.doc2bow(text) for text in texts]
# Map token ids back to their words
id2word = {}
for word in dictionary.token2id:
    id2word[dictionary.token2id[word]] = word
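I found that when I derive the model with a small number of topics, Gensim reports the full topic distribution over all possible topics for a test document. For example: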
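test_lda = LdaModel(corpus, num_topics=5, id2word=id2word)
# doc2bow expects a list of tokens (e.g. 'human system'.split()); newer Gensim
# versions raise a TypeError for a bare string, as passed here
test_lda[dictionary.doc2bow('human system')]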
Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]
However, when I use a large number of topics, the report is no longer complete.
test_lda = LdaModel(corpus, num_topics=100, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]
From what I can see, topics whose probability falls below some threshold (0.01, as far as I can observe) are omitted from the output.
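To make the observation concrete, here is a minimal sketch in plain Python of the cutoff I believe I am seeing. It does not call Gensim: `visible_topics` is a hypothetical helper, the 0.01 threshold is just the value I observed, and the even split of the remaining mass over the other 99 topics is an assumption made purely for illustration.

```python
# Hypothetical helper that mimics the cutoff I observe; this is NOT Gensim's code.
def visible_topics(distribution, threshold=0.01):
    """Keep only (topic_id, probability) pairs at or above the threshold."""
    return [(k, p) for k, p in enumerate(distribution) if p >= threshold]

# Assumed full distribution over 100 topics: topic 73 dominates and the
# remaining mass is split evenly (an assumption for illustration only).
num_topics = 100
dominant_id, dominant_p = 73, 0.505
rest_p = (1.0 - dominant_p) / (num_topics - 1)  # roughly 0.005 per remaining topic
full = [dominant_p if k == dominant_id else rest_p for k in range(num_topics)]

reported = visible_topics(full)
# Probability mass hidden in the topics that were filtered out
residual = 1.0 - sum(p for _, p in reported)

print(reported)            # only topic 73 survives the cutoff
print(round(residual, 3))  # mass carried by the 99 omitted topics
```

Under these assumptions nearly half of the probability mass sits in topics that never show up in the report, which is why I would like to see the full distribution.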
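Is this behavior purely cosmetic? And how can I get the distribution of the residual probability mass over all the other topics?
Thanks in advance for your kind answers!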