Evaluating topic models: how should I interpret a coherence (c_v) value of 0.4? Is it good or bad?

Question

data-science lda topic-modeling

14

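I need to know whether a coherence score of 0.4 is good or bad. I am using the LDA algorithm for topic modeling.

What is a typical (average) coherence score in this situation?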

- User Mohamed

3 Answers

4

In addition to the excellent answer given by Sara: UMass coherence measures how often two words (Wi, Wj) appear together in the corpus. It is defined as:

score_UMass(Wi, Wj) = log [ (D(Wi, Wj) + EPSILON) / D(Wi) ]

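Where:

D(Wi, Wj) is the number of times words Wi and Wj appear together

D(Wi) is the number of times word Wi appears on its own in the corpus

EPSILON is a small value (such as 1e-12) added to the numerator to avoid zero values

If Wi and Wj never appear together, the result would be log(0), which causes an error. The EPSILON value is a workaround for that.

In summary, you can get anything from a very large negative number up to a value close to 0. The interpretation is the same as Sara wrote: the larger the number the better, and 0 is clearly wrong.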
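To make the formula above concrete, here is a minimal sketch in plain Python. It assumes that the counts D(...) are document counts (the number of documents containing the word or both words), and the helper names `doc_count`, `umass_pair` and `umass_topic`, the toy corpus, and the simple averaging over pairs are all illustrative assumptions rather than any library's implementation. Real implementations may differ in word ordering and smoothing.

```python
import math
from itertools import combinations

EPSILON = 1e-12  # small constant from the formula above; avoids log(0)

def doc_count(docs, *words):
    """Number of documents that contain every word in `words`."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def umass_pair(docs, wi, wj):
    """log[(D(Wi, Wj) + EPSILON) / D(Wi)] for one word pair.

    Assumes Wi occurs in at least one document, otherwise D(Wi) is 0.
    """
    return math.log((doc_count(docs, wi, wj) + EPSILON) / doc_count(docs, wi))

def umass_topic(docs, top_words):
    """Average pair score over all pairs of a topic's top words (assumed aggregation)."""
    pairs = list(combinations(top_words, 2))
    return sum(umass_pair(docs, wi, wj) for wi, wj in pairs) / len(pairs)

# Toy corpus (hypothetical data): each document is a set of tokens.
docs = [
    {"cat", "dog", "pet"},
    {"cat", "pet"},
    {"dog", "pet"},
    {"stock", "market"},
]

always_together = umass_pair(docs, "cat", "pet")   # "pet" is in every document that has "cat"
never_together = umass_pair(docs, "cat", "stock")  # the two words never co-occur
topic_score = umass_topic(docs, ["cat", "dog", "pet"])
print(always_together, never_together, topic_score)
```

Words that always co-occur score close to 0, while words that never co-occur fall back on EPSILON and produce a large negative number, which matches the range described above.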

- Muhammad Ali

1

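I would add that good or bad is relative to the corpus you are working on and to the scores of the other clusterings.

In the link Sara provided, the article shows 33 topics as optimal, with a coherence score of about 0.33, but as the author mentions there may be repeated terms within that clustering. In that case you would have to compare the terms/snippets from the optimal clustering against those from a clustering with a lower coherence score to see whether the results are more or less interpretable.

Of course you should tune your model's parameters, but the score is context dependent, and I don't think you can say that a specific coherence score means your data is clustered optimally without first understanding what the data looks like. That said, as Sara mentioned, a score of about 1 or about 0 is probably wrong.

You could compare your model against a benchmark dataset; if it has higher coherence, you have a better gauge of how well your model is working.

This paper was helpful to me: https://rb.gy/kejxkz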

- Patrick Cullinane

- Sara · Accepted Answer

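Coherence measures the relative distance between words within a topic. There are two main types, C_V and UMass; C_V typically lies between 0 and 1. As a rough guide for C_V: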

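0.3 is bad

0.4 is low

0.55 is okay

0.65 might be as good as it is going to get

0.7 is nice

0.8 is unlikely

0.9 is probably wrong

Fixes for low coherence:

Tune your parameters: alpha = .1, beta = .01 or .001, random_state = 123, etc.
Get better data
At .4 you probably have the wrong number of topics. See https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/ for what is known as the elbow method - it gives you a graph of the optimal number of topics for greatest coherence in your data set. I'm using mallet, which has pretty good coherence; here is code to check coherence for different numbers of topics: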

from pprint import pprint

import gensim
import matplotlib.pyplot as plt
from gensim.models import CoherenceModel

# Note: gensim.models.wrappers.LdaMallet is only available in gensim < 4.0.
# mallet_path (path to your local MALLET binary), id2word, corpus and
# data_lemmatized must already be defined before running this.

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the dictionary argument rather than a global id2word
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

# Can take a long time to run.
limit = 40; start = 2; step = 6
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=start, limit=limit, step=step)

# Show graph
x = range(start, limit, step)
plt.plot(x, coherence_values, label="coherence_values")
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(loc='best')
plt.show()

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

# Select the model and print the topics
# Index 3 corresponds to num_topics = 20 with the settings above; pick the index that looks best on your plot.
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
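Rather than hard-coding the index into `model_list`, one option is to look up the position of the highest coherence value. The sketch below is only an illustration: `best_num_topics` is a hypothetical helper, and the scores are made-up numbers, not results from a real run. Note that the highest score is not necessarily the elbow of the curve, so it is still worth checking the plot and the topics themselves.

```python
def best_num_topics(topic_counts, coherence_values):
    """Return (num_topics, coherence) for the highest coherence value."""
    pairs = list(zip(topic_counts, coherence_values))
    if not pairs:
        raise ValueError("no coherence values supplied")
    return max(pairs, key=lambda p: p[1])

# Hypothetical scores for illustration only, not results from a real run.
topic_counts = list(range(2, 40, 6))  # [2, 8, 14, 20, 26, 32, 38]
example_scores = [0.31, 0.42, 0.48, 0.55, 0.53, 0.50, 0.47]

num_topics, score = best_num_topics(topic_counts, example_scores)
best_index = topic_counts.index(num_topics)  # position to use with model_list
print(num_topics, score, best_index)
```

With real data you would pass `list(x)` and `coherence_values` from the code above and then use `model_list[best_index]`.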

I hope this helps :)