Extracting the most important keywords from a set of documents

Question

4

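I have a set of 3,000 text documents and want to extract the top 300 keywords (single words or multi-word phrases).

I have tried the following approaches:
RAKE: a Python-based keyword extraction library, but it performed poorly.
Tf-Idf: it gives good keywords for each document, but it cannot aggregate them into keywords that represent the whole set of documents. Also, simply picking the top k words from each document by Tf-Idf score is not enough, right?
Word2vec: I was able to do some cool things such as finding similar words, but I am not sure how to use it to find important keywords.

Can you suggest a good approach (or explain how to improve any of the three above) to solve this problem? Thanks :)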

- Vini

4 Answers

0

It is best to pick those 300 words manually (it is not that many, and you only do it once). The code is written in Python 3.

import os

files = os.listdir()
topWords = ["word1", "word2.... etc"]
wordsCount = 0
for file in files:
    if not os.path.isfile(file):
        continue  # skip sub-directories
    # Read the file and split it into whitespace-separated words
    with open(file, "r", errors="ignore") as file_opened:
        words = set(file_opened.read().split())
    for word in topWords:
        if word in words and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Stop reading files once 300 matches have been found
    if wordsCount >= 300:
        break

- ricristian

1

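This answer does not address the "automatic extraction" part of the question. Reading 3,000 documents and extracting keywords one by one would be very time-consuming. - Luca Foppiano

That is true, but as I already mentioned, if this is a one-time operation, I don't think it matters much whether the script takes 1 second or 1 minute... If my answer didn't really help you... I can delete it. @LucaFoppiano, is that OK? Thanks. - ricristian

I think your answer has a couple of problems, because knowing those 300 words in advance is the hard part. In fact, it is not clear what your script is trying to do ;-) since topWords is already known. - Luca Foppiano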

-1

import os
import operator
from collections import defaultdict

files = os.listdir()
topWords = ["word1", "word2.... etc"]
wordsCount = 0
words = defaultdict(int)
for file in files:
    if not os.path.isfile(file):
        continue  # skip sub-directories
    with open(file, "r", errors="ignore") as open_file:
        for line in open_file:
            # Count every whitespace-separated word
            for word in line.split():
                words[word] += 1
# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the first 300 entries of sorted_words; those are the words you want.

- Awaish Kumar

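Thanks @Awaish, but I have tried this approach too. The results were very poor, because the important terms occur only once or twice. If I try to sort and select Tf-idf terms by frequency, I get a lot of common and irrelevant terms. - Vini

This solution implies that you already know the words you are looking for. - Luca Foppiano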

-1

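Applying tf-idf is the simplest and most effective way to get the most important words. If you have stop words, you can filter them out before applying this code. Hope this helps.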

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  //to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents
     * @param termToCheck
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}

- shiv

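Thanks @shiv. But I have already implemented Tf-Idf using Lucene (to speed up processing). The problem is that Tf-Idf only gives the "important terms" for each document, not for the whole set of documents. - Vini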


- user12446118 · Accepted Answer

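Although latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP) are typically used to derive topics from a text corpus and then to classify individual entries with those topics, a method for deriving keywords for the whole corpus can also be built on them. This method does not rely on a second text corpus. The basic workflow is to compare these Dirichlet keywords with the most common words, to see whether LDA or HDP picks up important words that are not among the most common ones.

Before using the code below, it is generally advisable to preprocess the text:

Remove punctuation from the text (see string.punctuation)
Convert the string text to "tokens" (str.lower() and str.split(' ') to get individual lowercase words)
Remove numbers and stop words (see stopwordsiso or stop_words)
Create bigrams, combinations of words that frequently appear together in the text (see gensim.Phrases)
Lemmatize the tokens, converting words to their base forms (see spacy or NLTK)
Remove tokens that are not frequent enough (or too frequent, but in this case skip removing the too-frequent ones, as they would make good keywords)

These steps produce the variable corpus. A detailed overview and explanation of LDA can be found here.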
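As a rough illustration of the preprocessing list above, here is a minimal sketch that uses only the standard library. It covers punctuation removal, tokenizing, dropping numbers and stop words, and dropping rare tokens; bigram creation and lemmatization are left out. The function name preprocess, the min_count parameter, and the tiny STOP_WORDS set are assumptions made for this example, not part of the original answer.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real run would use a
# fuller list such as those from the packages mentioned above.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(texts, min_count=2):
    """Turn raw strings into lists of tokens, one list per document."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = []
    for text in texts:
        # Remove punctuation, lowercase, and split on whitespace
        tokens = text.translate(strip_punct).lower().split()
        # Drop pure numbers and stop words
        tokens = [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]
        tokenized.append(tokens)
    # Drop tokens that are too rare across the whole collection
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]

texts = ["The cat sat in the garden, 3 times.",
         "A cat and a dog: the garden is big!"]
corpus = preprocess(texts)
print(corpus)
```

The result is a list of token lists, which is the shape that corpora.Dictionary expects further down.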

Now for the LDA and HDP approach using gensim:

from gensim.models import LdaModel, HdpModel
from gensim import corpora

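First create a Dirichlet dictionary that maps the words in corpus to indexes, and then use it to create a bag of words in which the tokens of corpus are replaced by their indexes. This is done as follows: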

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
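To show what the dictionary and bag-of-words step amounts to, here is a plain-Python sketch on toy data. It is not gensim's implementation, and the ids gensim assigns may differ; build_vocab, to_bow, and the toy documents are names invented for this illustration.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_corpus = [["cat", "garden", "cat"], ["dog", "garden"]]
vocab = build_vocab(toy_corpus)
toy_bow = [to_bow(doc, vocab) for doc in toy_corpus]
print(toy_bow)
```

Each document becomes a list of (id, count) tuples, which matches the "list of lists (contains (id, freq) tuples)" shape described for bow_corpus in the function docstring further down.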

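For the LDA model, the optimal number of topics has to be determined, which can be done heuristically using the method in this answer. Let's assume that the optimal number of topics is 10 and that, as in the question, we want 300 keywords: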

num_topics = 10
num_keywords = 300

Create an LDA model:

dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

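Next comes a function that finds the best topics based on their average coherence across the corpus. First, an ordered list of the most important words per topic is produced; then the average coherence of each topic with the whole corpus is computed; finally, the topics are ordered by this average coherence and returned together with the list of averages for later use. The code for all of this is as follows (it includes the option of using HDP, described further down):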

def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.type_of_model
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics, ordered_topic_averages: list of lists and list
    """
    if isinstance(dirichlet_model, LdaModel):
        shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                                   num_words=num_keywords,
                                                   formatted=False)
    elif isinstance(dirichlet_model, HdpModel):
        shown_topics = dirichlet_model.show_topics(num_topics=150, # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
    else:
        raise TypeError("dirichlet_model must be an LdaModel or an HdpModel")
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) \
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]

    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # limit for HDP

    ordered_topic_averages = [topic_averages[i] for i in topic_indexes_by_avg_coherence][:num_topics] # limit for HDP
    ordered_topic_averages = [a/sum(ordered_topic_averages) for a in ordered_topic_averages] # normalize HDP values

    return ordered_topics, ordered_topic_averages
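
The averaging and ordering step of the function can be looked at in isolation. The following is a minimal sketch on made-up per-document topic weights, assuming the same (topic_id, weight) shape the model returns for each document; average_topic_weights and the toy numbers are hypothetical and only meant to show the arithmetic.

```python
def average_topic_weights(doc_topics, num_docs):
    """Average each topic's weight over all documents, highest first."""
    totals = {}
    for doc in doc_topics:
        for topic_id, weight in doc:
            totals[topic_id] = totals.get(topic_id, 0.0) + weight
    # A topic missing from a document contributes 0 for that document
    averages = {t: w / num_docs for t, w in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

# Made-up per-document (topic_id, weight) pairs
toy_doc_topics = [
    [(0, 0.75), (1, 0.25)],
    [(0, 0.25), (1, 0.75)],
    [(1, 1.0)],
]
ranked = average_topic_weights(toy_doc_topics, num_docs=len(toy_doc_topics))
print(ranked)
```

Here topic 1 averages two thirds and topic 0 one third, so topic 1 is ranked first and would contribute more keywords in the selection step.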

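Now a list of keywords, the most important words across topics, needs to be derived. This is done by taking a subset of the words (ordered by importance by default) from each of the ordered topics, based on each topic's average coherence with the whole corpus. To explain explicitly, assume there are only two topics, and that the texts are 70% coherent with the first topic and 30% coherent with the second. The keywords could then be the top 70% of the words from the first topic and the top 30% from the second, provided they have not already been selected. This is achieved as follows: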

ordered_topics, ordered_topic_averages = \
    order_subset_by_coherence(dirichlet_model=dirichlet_model,
                              bow_corpus=bow_corpus, 
                              num_topics=num_topics,
                              num_keywords=num_keywords)

ignore_words = []  # words that should never be returned as keywords
keywords = []
for i in range(len(ordered_topics)):
    # Find the number of indexes to select, which can later be extended if the word has already been selected
    selection_indexes = list(range(int(round(num_keywords * ordered_topic_averages[i]))))
    if selection_indexes == [] and len(keywords) < num_keywords:
        # Fix potential rounding error by giving this topic one selection
        selection_indexes = [0]

    for s_i in selection_indexes:
        if s_i >= len(ordered_topics[i]):
            break  # this topic has no more words to offer
        if ordered_topics[i][s_i] not in keywords and ordered_topics[i][s_i] not in ignore_words:
            keywords.append(ordered_topics[i][s_i])
        else:
            selection_indexes.append(selection_indexes[-1] + 1)

# Fix for if too many were selected
keywords = keywords[:num_keywords]

The code above also uses the variable ignore_words, which is a list of words that should not be included in the results.
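
To make the 70%/30% example concrete, here is a simplified toy version of the selection step. It is a sketch under stated assumptions rather than the code above: pick_keywords, the toy topics, and the shares are invented for illustration, and a per-topic quota with a minimum of one word stands in for the rounding fix.

```python
def pick_keywords(ordered_topics, shares, num_keywords):
    """Take words from each topic in proportion to its share, skipping repeats."""
    keywords = []
    for words, share in zip(ordered_topics, shares):
        # Every topic gets at least one slot, mirroring the rounding fix
        quota = max(1, int(round(num_keywords * share)))
        taken = 0
        for word in words:
            if taken == quota or len(keywords) == num_keywords:
                break
            if word not in keywords:
                keywords.append(word)
                taken += 1
    return keywords

toy_topics = [
    ["engine", "wheel", "brake", "road", "fuel", "tyre", "gear", "seat"],
    ["road", "city", "bridge", "engine", "map", "train", "port", "rail"],
]
toy_keywords = pick_keywords(toy_topics, shares=[0.7, 0.3], num_keywords=10)
print(toy_keywords)
```

With 10 keywords, the first topic supplies its top 7 words. The second topic supplies 3, skipping "road" and "engine" because they were already chosen and moving on to the next unused words.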

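For the HDP model, the process is similar to the above, except that num_topics and the other arguments do not need to be passed when the model is created. HDP derives the optimal topics itself, but these topics then need to be ordered and subsetted with order_subset_by_coherence to make sure the best topics are used for a limited selection. The model can be created as follows: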

dirichlet_model = HdpModel(corpus=bow_corpus, 
                           id2word=dirichlet_dict,
                           chunksize=len(bow_corpus))

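It is best to test both LDA and HDP, because LDA can do better for the problem at hand if a suitable number of topics can be found (LDA is still the more standard choice compared with HDP). Compare the Dirichlet keywords with plain word frequencies; hopefully the generated keyword list is more related to the overall theme of the texts and is not simply the most common words.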
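
One way to run that comparison is to list the keywords that a plain frequency count would have missed. The following is a minimal sketch assuming a tokenized corpus and a keyword list like the ones built above; most_common_words, not_in_frequency_list, and the toy data are hypothetical names and values chosen for this example.

```python
from collections import Counter

def most_common_words(corpus, n):
    """Return the n most frequent tokens across a tokenized corpus."""
    counts = Counter(token for doc in corpus for token in doc)
    return [word for word, _ in counts.most_common(n)]

def not_in_frequency_list(keywords, corpus, n):
    """Keywords that a plain top-n frequency count would have missed."""
    common = set(most_common_words(corpus, n))
    return [w for w in keywords if w not in common]

toy_corpus = [["cat", "cat", "garden"], ["cat", "dog", "garden"], ["cat", "vet"]]
toy_keywords = ["cat", "vet", "garden"]  # stand-in for the LDA or HDP output
print(not_in_frequency_list(toy_keywords, toy_corpus, n=2))
```

If this list comes back empty for the real corpus, the topic model has added nothing over simple counting for that run.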

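Obviously, selecting ordered words from topics according to their percentage of text coherence does not give an overall ordering of the keywords by importance, because some words that are very important within topics of lower overall coherence will only be selected later.

The process of using LDA to generate keywords for the individual texts in the corpus can be found in this answer.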