Python scikit-learn: get the documents for each LDA topic

Question

8

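I am running LDA on some text data, following the example here. My question is: how do I know which documents correspond to which topics? In other words, what are the documents for topic 1, for example, talking about? Here are my steps: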

n_features = 1000
n_topics = 8
n_top_words = 20

I read my text file line by line:

with open('dataset.txt', 'r') as data_file:
    # One document per line, with surrounding whitespace stripped
    mydata = [line.strip() for line in data_file]

A function to print the topics:

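def print_top_words(model, feature_names, n_top_words):
    # Each row of components_ holds one topic's weight for every vocabulary term
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort is ascending, so slice from the end to get the top-weighted terms
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

    print()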

Vectorize the data:

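from sklearn.feature_extraction.text import CountVectorizer

# Count tokens of 3+ word characters; drop English stop words and very common or rare terms
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, token_pattern='\\b\\w{2,}\\w+\\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)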

Initialize LDA:

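from sklearn.decomposition import LatentDirichletAllocation

# n_components was named n_topics in scikit-learn releases before 0.19
lda = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)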

Run LDA on the tf data:

lda.fit(tf)

Print the results using the function above:

print("\nTopics in LDA model:")
# get_feature_names_out() needs scikit-learn 1.0+; older releases use get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()

print_top_words(lda, tf_feature_names, n_top_words)

The printed output is:

Topics in LDA model:
Topic #0:
solar road body lamp power battery energy beacon
Topic #1:
skin cosmetic hair extract dermatological aging production active
Topic #2:
cosmetic oil water agent block emulsion ingredients mixture

- passion

2 Answers

0

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation.transform

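The transform method takes a document-word matrix X as input and returns the topic distribution of the documents that X represents.

So if you call transform on each of your documents, you can look for the documents in which the topic you are interested in makes up a high proportion (above a threshold you set yourself).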
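
The threshold idea above could be sketched roughly as follows. This is only an illustration: the helper name `docs_above_threshold` and the small `doc_topic` matrix are made up here and stand in for the output of `lda.transform(tf)`, and the sketch assumes each row is a normalized topic distribution (rows sum to 1), which is what recent scikit-learn versions return.

```python
import numpy as np

def docs_above_threshold(doc_topic, topic_idx, threshold=0.5):
    """Return indices of documents whose weight for topic_idx is >= threshold."""
    doc_topic = np.asarray(doc_topic)
    # Boolean mask over documents, converted to a plain list of row indices
    return np.flatnonzero(doc_topic[:, topic_idx] >= threshold).tolist()

# Made-up stand-in for lda.transform(tf): 4 documents x 3 topics
doc_topic = np.array([
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
])

# Documents in which topic 1 accounts for at least 30% of the distribution
topic1_docs = docs_above_threshold(doc_topic, topic_idx=1, threshold=0.3)
print(topic1_docs)
```

With a threshold instead of a single best topic, one document can show up under several topics, which may or may not be what you want.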

- Ryan Stout

- Marcel · Accepted Answer

You need to transform the data:

doc_topic = lda.transform(tf)

and then list each document with its highest-scoring topic, like this:

for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n".format(n,topic_most_pr))
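
To go from "the topic of each document" to "the documents of each topic", which is what the question asks for, the same argmax can be inverted into a grouping. The following is a minimal sketch under assumptions: `group_docs_by_topic` is a hypothetical helper, and the `doc_topic` and `docs` values are made-up stand-ins for `lda.transform(tf)` and `mydata`.

```python
from collections import defaultdict

import numpy as np

def group_docs_by_topic(doc_topic, docs):
    """Map each topic index to the documents whose highest-weight topic it is."""
    dominant = np.asarray(doc_topic).argmax(axis=1)  # best topic per document
    groups = defaultdict(list)
    for doc, topic_idx in zip(docs, dominant):
        groups[int(topic_idx)].append(doc)
    return dict(groups)

# Made-up stand-ins for lda.transform(tf) and mydata
doc_topic = np.array([
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.60, 0.30, 0.10],
])
docs = ["solar road lamp", "skin hair extract", "battery power beacon"]

groups = group_docs_by_topic(doc_topic, docs)
for topic_idx in sorted(groups):
    print("Topic #%d: %s" % (topic_idx, groups[topic_idx]))
```

Topics that are never the top choice for any document simply do not appear in the result.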