How to monitor convergence of Gensim LDA model?

16
1 Answer

16

You are right to want to plot the convergence of your model fit. Unfortunately, Gensim does not seem to make this very straightforward.

  1. Run the model so that the output of the model-fitting function is captured for later analysis. I like to set up a log file.

    import logging
    logging.basicConfig(filename='gensim.log',
                        format="%(asctime)s:%(levelname)s:%(message)s",
                        level=logging.INFO)
    
  2. Set the eval_every parameter in LdaModel. The lower this value, the better the resolution of your plot. However, computing the perplexity can slow down your fit a lot!

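    from gensim.models import LdaModel

    # corpus and id2word are assumed to have been built beforehand
    lda_model = LdaModel(corpus=corpus,
                         id2word=id2word,
                         num_topics=30,
                         eval_every=10,   # estimate perplexity every 10 updates
                         passes=40,
                         iterations=5000)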
    
  3. Parse the log file and make your plot.

    import re
    import matplotlib.pyplot as plt

    # Each evaluation logs a line containing the per-word bound and the perplexity.
    p = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")
    with open('gensim.log') as f:
        matches = [p.findall(line) for line in f]
    matches = [m for m in matches if len(m) > 0]
    tuples = [t[0] for t in matches]
    perplexity = [float(t[1]) for t in tuples]
    likelihood = [float(t[0]) for t in tuples]
    # One point per logged evaluation; the step of 10 mirrors eval_every=10.
    iterations = list(range(0, len(tuples) * 10, 10))
    plt.plot(iterations, likelihood, c="black")
    plt.ylabel("log likelihood")
    plt.xlabel("iteration")
    plt.title("Topic Model Convergence")
    plt.grid()
    plt.savefig("convergence_likelihood.pdf")
    plt.close()
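
If you would rather have a number than eyeball the plot, a rough check on the parsed per-word bounds could look like the sketch below. This is only an illustration and not something Gensim provides: `parse_bounds` and `has_converged` are hypothetical helper names, the tolerance and window are arbitrary choices, and the sample lines are made up to match the regular expression from step 3 rather than copied from a real run.

```python
import re

# Same pattern as in step 3, written as a raw string.
PATTERN = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")


def parse_bounds(lines):
    """Return the per-word bound from every line that matches PATTERN."""
    bounds = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            bounds.append(float(match.group(1)))
    return bounds


def has_converged(values, rel_tol=1e-3, window=2):
    """True if each of the last `window` changes is within rel_tol (relative)."""
    if len(values) < window + 1:
        return False
    recent = values[-(window + 1):]
    for prev, curr in zip(recent, recent[1:]):
        if abs(curr - prev) > rel_tol * abs(prev):
            return False
    return True


# Made-up lines for illustration only; real log lines may differ in detail.
sample_lines = [
    "INFO:-9.500 per-word bound, 724.1 perplexity estimate",
    "INFO:-8.200 per-word bound, 294.1 perplexity estimate",
    "INFO:-8.001 per-word bound, 256.2 perplexity estimate",
    "INFO:-8.000 per-word bound, 256.0 perplexity estimate",
    "INFO:-8.000 per-word bound, 256.0 perplexity estimate",
    "INFO:some unrelated message",
]

bounds = parse_bounds(sample_lines)
print(bounds)
print(has_converged(bounds))      # flat tail of the curve
print(has_converged(bounds[:3]))  # still changing quickly
```

In practice you would pass the lines of `gensim.log` to `parse_bounds` and tune `rel_tol` and `window` to how noisy your curve is.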
    

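As a variation on step 1, the same values can be collected in memory instead of being parsed back out of a file. The sketch below uses only the standard `logging` module; `PerplexityHandler` is a hypothetical name, and it assumes that Gensim's loggers are named after their modules (so that records propagate to a logger called `gensim`). The two `info` calls stand in for a training run and log made-up messages.

```python
import logging
import re


class PerplexityHandler(logging.Handler):
    """Collect (per-word bound, perplexity) pairs from matching log records."""

    PATTERN = re.compile(r"(-*\d+\.\d+) per-word .* (\d+\.\d+) perplexity")

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.records = []

    def emit(self, record):
        match = self.PATTERN.search(record.getMessage())
        if match:
            self.records.append((float(match.group(1)), float(match.group(2))))


handler = PerplexityHandler()
# Assumption: Gensim's loggers are children of "gensim", so attaching here sees them all.
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
gensim_logger.addHandler(handler)

# Stand-in for model training: made-up messages sent through a child logger.
child = logging.getLogger("gensim.models.ldamodel")
child.info("-8.000 per-word bound, 256.0 perplexity estimate")
child.info("unrelated message")

print(handler.records)
```

After a real fit, `handler.records` could be plotted the same way as the values parsed in step 3. The handler can be used alongside the `basicConfig` file log from step 1.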
1
Does this plot help determine the number of passes? What is the difference between passes and iterations? Thanks! - Victor Wang
1
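@VictorWang Maybe this can help you: "passes controls how often we train the model on the entire corpus. Another word for passes might be "epochs". iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of "passes" and "iterations" high enough." Reference: https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html - Ferran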
