Can we train an LDA model on our own corpus using the gensim library?

Question

python lda gensim

9

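I need to apply LDA (Latent Dirichlet Allocation) to a database of 20,000 documents that I collected, in order to extract the possible topics.

How can I use these documents as the training corpus, instead of another available corpus such as the Brown Corpus or English Wikipedia?

You can refer to this page.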

- Animesh Pandey

1

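This question is a bit open-ended. If you describe more specifically what you have tried and where exactly the problem lies, you will probably get an answer more easily. - ASGM

If you don't like it, just vote to close it. - Games Brainiac

Have you looked at the import functions in the API? What format are your documents in? - ASGM

Thank you for reopening the question! - Animesh Pandey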

1 Answer

- Animesh Pandey · Accepted Answer

After reading the gensim documentation, I found that a collection of texts can be serialized as a corpus in four formats:

Matrix Market (.mm)
SVMlight (.svmlight)
Blei's LDA-C format (.lda-c)
Low format, i.e. GibbsLDA++ list-of-words (.low)
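
To give a feel for what the first of these formats holds, here is a rough pure-Python sketch of a Matrix Market coordinate file for a tiny bag-of-words corpus. It is an illustration only, not gensim's writer: `to_matrix_market` and `toy_corpus` are made-up names, and the layout (documents as rows, term ids as columns, 1-based indices) is my understanding of the file that `corpora.MmCorpus.serialize` produces.

```python
def to_matrix_market(corpus, num_terms):
    """Render a bag-of-words corpus as Matrix Market coordinate text (illustrative)."""
    entries = []
    for doc_no, bow in enumerate(corpus):
        for term_id, weight in bow:
            # Matrix Market indices are 1-based, hence the +1 on both ids.
            entries.append(f"{doc_no + 1} {term_id + 1} {weight}")
    header = "%%MatrixMarket matrix coordinate real general"
    # Size line: number of documents, number of terms, number of non-zero entries.
    sizes = f"{len(corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, sizes] + entries) + "\n"


# Hypothetical corpus: two documents over a three-term vocabulary,
# each document being a list of (term_id, count) pairs.
toy_corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
mm_text = to_matrix_market(toy_corpus, num_terms=3)
print(mm_text)
```

In practice there is no need to write this by hand; the sketch is only meant to show that the serialized corpus is a sparse document-by-term count matrix.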

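In this problem, as mentioned above, the database holds 19,188 documents in total. Each document has to be read and stripped of stopwords and punctuation, which can be done with nltk.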
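
The preprocessing step is not shown in this answer, so here is a minimal sketch of what it might look like. To keep it runnable without downloading nltk data, it uses a tiny hand-written stopword set and plain whitespace tokenization instead of nltk. `STOPWORDS`, `preprocess` and `documents` are hypothetical names. Only `final_text` matches the variable used in the answer's code.

```python
import string

# Assumption: a tiny illustrative stopword set; a real run would more likely
# use a full stopword list such as the one shipped with nltk.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "how", "do", "i"}


def preprocess(document):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = document.lower().translate(table).split()
    return [token for token in tokens if token not in STOPWORDS]


# Hypothetical stand-ins for the documents read from the database.
documents = [
    "How do I train an LDA model?",
    "The corpus is a list of documents.",
]

# One token list per document, which is the shape corpora.Dictionary expects.
final_text = [preprocess(doc) for doc in documents]
print(final_text)
```

The result is a list of token lists, one per document, which can be passed to `corpora.Dictionary` as `final_text`.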

import gensim
from gensim import corpora

# Text preprocessing with nltk happens here (not shown).
# final_text holds the tokens of all documents: one list of tokens per document.

# Build the token-to-id dictionary and save it.
dictionary = corpora.Dictionary(final_text)
dictionary.save('questions.dict')

# Convert every document to a bag-of-words vector, then serialize the corpus
# in each of the four formats listed above.
corpus = [dictionary.doc2bow(text) for text in final_text]
corpora.MmCorpus.serialize('questions.mm', corpus)
corpora.SvmLightCorpus.serialize('questions.svmlight', corpus)
corpora.BleiCorpus.serialize('questions.lda-c', corpus)
corpora.LowCorpus.serialize('questions.low', corpus)

# The dictionary and the serialized corpus can then be used to train LDA.
# update_every=0 with chunksize equal to the corpus size gives batch training:
# every pass sees all 19,188 documents at once.
mm = corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=100, update_every=0, chunksize=19188, passes=20)

With the gensim package, a dataset can thus be converted into a corpus suitable for LDA topic modeling.
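
For readers wondering what that corpus actually contains, here is a small pure-Python sketch of the bag-of-words idea behind `corpora.Dictionary` and `doc2bow`. It is a conceptual stand-in, not gensim's implementation: `build_vocabulary`, `to_bow` and `docs` are made-up names, and gensim may assign token ids in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Hypothetical preprocessed documents.
docs = [["lda", "topic", "model", "topic"], ["corpus", "topic"]]
token2id = build_vocabulary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document becomes a sparse list of (id, count) pairs. This is what LDA is trained on, so word order within a document is discarded.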