I was able to run the LDA code in gensim and got the top 10 topics with their respective keywords.
Now I would like to go a step further and check how accurate LDA is by seeing which documents it clusters into each topic. Does gensim's LDA support this?
Basically I want to do something like this, but in Python with gensim:
LDA with topicmodels, how can I see which topics different documents belong to?
Using the topic probabilities, you can try setting some threshold and use it as a clustering baseline, but I am sure there are better ways to cluster than this "hacky" method.
from gensim import corpora, models, similarities
from itertools import chain
""" DEMO """
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
# Create Dictionary.
id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]
# Trains the LDA models.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=3, \
update_every=1, chunksize=10000, passes=1)
# Prints the topics.
for top in lda.print_topics():
    print(top)
print()
# Assigns the topics to the documents in corpus
lda_corpus = lda[mm]
# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id,score in topic] \
for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
print()
cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]
print(cluster1)
print(cluster2)
print(cluster3)
[out]:
0.131*trees + 0.121*graph + 0.119*system + 0.115*user + 0.098*survey + 0.082*interface + 0.080*eps + 0.064*minors + 0.056*response + 0.056*computer
0.171*time + 0.171*user + 0.170*response + 0.082*survey + 0.080*computer + 0.079*system + 0.050*trees + 0.042*graph + 0.040*minors + 0.040*human
0.155*system + 0.150*human + 0.110*graph + 0.107*minors + 0.094*trees + 0.090*eps + 0.088*computer + 0.087*interface + 0.040*survey + 0.028*user
0.333333333333
['The EPS user interface management system', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors A survey']
['A survey of user opinion of computer system response time', 'Relation of user perceived response time to error measurement']
['Human machine interface for lab abc computer applications', 'System and human system engineering testing of EPS', 'Graph minors IV Widths of trees and well quasi ordering']
Just to make it clearer:
# Find the threshold, let's set the threshold to be 1/#clusters,
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = []
for doc in lda_corpus:
    for topic_id, score in doc:
        scores.append(score)
threshold = sum(scores)/len(scores)
The code above sums the scores of every topic across all documents, then normalizes the sum by the number of scores.
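Because each document's topic probabilities sum to 1, this average comes out to 1/#topics whenever the full distribution is reported. A minimal self-contained sketch, using made-up (topic_id, probability) lists in gensim's output format rather than real model output:

```python
from itertools import chain

# Hypothetical per-document topic distributions (illustrative numbers only).
mock_lda_corpus = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.2), (1, 0.2), (2, 0.6)],
]

# Flatten every probability into one list and average them.
scores = list(chain(*[[score for topic_id, score in doc] for doc in mock_lda_corpus]))
threshold = sum(scores) / len(scores)
print(threshold)  # 3 documents, each summing to 1, over 9 scores -> 1/3
```

With 3 topics the threshold is 1/3, matching the 0.333333333333 printed in the output above.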
scores = list(chain(*[[score for topic_id, score in topic] \
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
What this code does is take the scores of all documents from the LDA topic model and compute their average to determine a threshold. The first list comprehension turns each document in lda_corpus into its list of topic assignments, and the inner comprehension extracts the score from each (topic_id, score) pair. chain() then joins all these scores into the single list scores. Finally, the threshold is computed by dividing the sum of all scores by their count. - jxn
cluster1 = [j for i,j in zip(lda_corpus,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(lda_corpus,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus,documents) if i[2][1] > threshold]
Following up on alvas's answer above, make sure to set minimum_probability to 0 in LdaModel:
gensim.models.ldamodel.LdaModel(corpus,
        num_topics=num_topics, id2word=dictionary,
        passes=2, minimum_probability=0)
Otherwise the dimensions of lda_corpus and documents may not agree, since gensim suppresses any corpus entry whose probability is below minimum_probability.
lda_corpus = [max(prob, key=lambda y: y[1])
              for prob in lda[mm]]
playlists = [[] for i in range(topic_num)]
for i, x in enumerate(lda_corpus):
    playlists[x[0]].append(documents[i])
Note that lda[mm] is, roughly speaking, a list of lists, or a 2D matrix. The number of rows is the number of documents and the number of columns is the number of topics. Each matrix element is a tuple of the form (3, 0.82), where 3 is the topic index and 0.82 is the corresponding probability of that topic. By default minimum_probability=0.01, and any tuple with probability less than 0.01 is omitted from lda[mm]. If you use the maximum-probability grouping method, you can set it to 1/#topics.
Each row of lda[mm] has the form [(0, t0), (1, t1), ..., (n, tn)], where the first term denotes the topic index and the second term denotes the probability of that topic in the particular document.
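As a concrete sketch of the max-probability grouping described above, again using made-up (topic_id, probability) pairs rather than real gensim output:

```python
num_topics = 3
documents = ["doc a", "doc b", "doc c", "doc d"]

# Hypothetical per-document topic distributions (illustrative numbers only).
mock_lda_corpus = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.2), (1, 0.2), (2, 0.6)],
    [(0, 0.4), (1, 0.5), (2, 0.1)],
]

# Keep only each document's single most probable topic ...
best = [max(dist, key=lambda pair: pair[1]) for dist in mock_lda_corpus]

# ... and bucket the document texts by that topic index.
clusters = [[] for _ in range(num_topics)]
for doc_idx, (topic_id, prob) in enumerate(best):
    clusters[topic_id].append(documents[doc_idx])

print(clusters)  # [['doc a'], ['doc b', 'doc d'], ['doc c']]
```

Unlike the thresholding method, this assigns each document to exactly one cluster, which is why the two lists must stay aligned (hence minimum_probability=0).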
gensim is a cool and simple library, and the developer, Radim, is also a nice guy to ask questions about his library. Do you need something that clusters the documents by topic? - alvas