How can I use gensim to filter out words with low tf-idf values from a corpus?

9
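I am using gensim for some NLP tasks. I have created a corpus from dictionary.doc2bow, where dictionary is a corpora.Dictionary object. Now I want to filter out terms with low tf-idf values before running an LDA model. I looked through the documentation of the corpus class but could not find a way to access the terms. Any ideas? Thanks.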

See the following link: https://dev59.com/72Ij5IYBdhLWcg3wilrB - Daniel
4 Answers

7

Suppose your corpus is built as follows:

corpus = [dictionary.doc2bow(doc) for doc in documents]

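After running TF-IDF, you can collect a list of the low-value words:
from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)

low_value = 0.2
low_value_words = []
for bow in corpus:
    # Collect the ids of tokens whose tf-idf score in this document is below the threshold
    low_value_words += [token_id for token_id, value in tfidf[bow] if value < low_value]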

Then filter them out of the dictionary before running LDA:

dictionary.filter_tokens(bad_ids=low_value_words)

Recompute the corpus, now with the low-value words filtered out:
new_corpus = [dictionary.doc2bow(doc) for doc in documents]

5
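If a word's tf-idf value is above the threshold in every document except one, where it falls below the threshold, the word will still be removed from all documents. - саша
If we filter tokens by tf-idf, how is this different from dictionary.filter_extremes()? - satish silveri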
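The first comment can be illustrated without gensim. The sketch below uses made-up tf-idf vectors (the ids, scores, and variable names are hypothetical, not taken from the answer) to contrast collecting "bad" ids across the whole corpus with filtering each document on its own scores:

```python
# Hypothetical tf-idf vectors: one list of (token_id, score) pairs per document.
tfidf_docs = [
    [(0, 0.90), (1, 0.10)],  # token 1 scores low in this document ...
    [(1, 0.80), (2, 0.60)],  # ... but high in this one
]
low_value = 0.2

# Global filtering: an id is dropped everywhere if it scores low in any single document.
global_bad_ids = {tid for doc in tfidf_docs for tid, score in doc if score < low_value}
global_result = [[tid for tid, _ in doc if tid not in global_bad_ids] for doc in tfidf_docs]

# Per-document filtering: an id is dropped only in the documents where its own score is low.
per_doc_result = [[tid for tid, score in doc if score >= low_value] for doc in tfidf_docs]

print(global_result)
print(per_doc_result)
```

With global filtering, token 1 disappears from the second document even though it scores 0.80 there; per-document filtering keeps it.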

3
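This is essentially the same as the previous answer, but it also handles words that are missing from the tf-idf representation because their score is 0 (terms that appear in every document). The previous answer does not filter such words, so they still appear in the final corpus.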
from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words and also words missing from the tf-idf vector
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_ids = [token_id for token_id, value in tfidf[bow]]
    bow_ids = [token_id for token_id, count in bow]
    low_value_words = [token_id for token_id, value in tfidf[bow] if value < low_value]
    # Words with a tf-idf score of 0 are missing from the tf-idf vector
    words_missing_in_tfidf = [token_id for token_id in bow_ids if token_id not in tfidf_ids]

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

    # Reassign inside the loop so that every document is updated
    corpus[i] = new_bow
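To see why a term that occurs in every document drops out of the tf-idf vector, here is a small pure-Python sketch. It assumes the weighting that gensim's TfidfModel is understood to use by default (raw term frequency times log2(N / df), L2 normalisation, and removal of near-zero entries); the function name tfidf_vectors and the sample data are hypothetical, and this is not gensim's actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(bows, eps=1e-12):
    """Sketch of tf-idf under the assumed default weighting (not gensim's code)."""
    n_docs = len(bows)
    # Document frequency: in how many documents each token id occurs
    df = Counter(tid for bow in bows for tid, _ in bow)
    idf = {tid: math.log2(n_docs / count) for tid, count in df.items()}
    vectors = []
    for bow in bows:
        weights = [(tid, tf * idf[tid]) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(tid, w / norm) for tid, w in weights]
        # Entries whose weight is (near) zero are dropped from the vector
        vectors.append([(tid, w) for tid, w in weights if abs(w) > eps])
    return vectors

# Token 0 occurs in both documents, so its idf is log2(2 / 2) = 0.
bows = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
vectors = tfidf_vectors(bows)
missing = [
    [tid for tid, _ in bow if tid not in {t for t, _ in vec}]
    for bow, vec in zip(bows, vectors)
]
print(vectors)
print(missing)
```

Token 0 never appears in the tf-idf vectors, so a filter that only inspects tfidf[bow] cannot see it; comparing against the ids in the original bag of words, as the answer does, catches it.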

3

This is an old question, but if you want to filter at the per-document level, proceed as follows:

from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    # Ids of tokens whose tf-idf score in this document is below the threshold
    low_value_words = [token_id for token_id, value in tfidf[bow] if value < low_value]
    new_bow = [b for b in bow if b[0] not in low_value_words]

    # Reassign
    corpus[i] = new_bow

1
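Suppose you have a document tfidf_doc generated by gensim's TfidfModel() and its corresponding bag-of-words document bow_doc, and you want to remove the words whose tf-idf values fall in the lowest cut_percent share of the words in that document. You can call tfidf_filter(tfidf_doc, cut_percent), which returns a trimmed version of tfidf_doc. Note that the code treats cut_percent as a fraction between 0 and 1, and that it modifies tfidf_doc in place: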
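def tfidf_filter(tfidf_doc, cut_percent):
    # cut_percent is a fraction between 0 and 1 (e.g. 0.2 cuts roughly the lowest 20%)
    if not tfidf_doc:
        return tfidf_doc

    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    # Clamp the index so that cut_percent = 1 does not run past the end of the list
    cut_index = min(int(len(sorted_by_tfidf) * cut_percent), len(sorted_by_tfidf) - 1)
    cut_value = sorted_by_tfidf[cut_index][1]

    # Iterate backwards so that pop() does not shift the indices still to be visited
    for i in range(len(tfidf_doc) - 1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)

    return tfidf_doc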

If you then want to filter the document bow_doc according to the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc), which returns a trimmed version of bow_doc:

def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    # Both lists are assumed to be sorted by token id in ascending order
    bow_idx = len(bow_doc) - 1
    tfidf_idx = len(tfidf_doc) - 1

    while bow_idx >= 0:
        if tfidf_idx < 0 or bow_doc[bow_idx][0] > tfidf_doc[tfidf_idx][0]:
            # No matching tf-idf entry for this token: drop it from the bag of words
            bow_doc.pop(bow_idx)
            bow_idx -= 1
        elif bow_doc[bow_idx][0] == tfidf_doc[tfidf_idx][0]:
            # The token survived the tf-idf filter: keep it
            bow_idx -= 1
            tfidf_idx -= 1
        else:
            # The tf-idf id has no counterpart in the bag of words: skip it
            tfidf_idx -= 1

    return bow_doc
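Both functions above modify their arguments in place. If that is not wanted, a non-mutating variant might look like the sketch below; the names tfidf_filter_copy and filter_bow_by_ids and the sample data are hypothetical, and cut_percent is again assumed to be a fraction between 0 and 1:

```python
def tfidf_filter_copy(tfidf_doc, cut_percent):
    """Return a new list without the lowest-scoring share of terms."""
    if not tfidf_doc:
        return []
    ranked = sorted(score for _, score in tfidf_doc)
    # Clamp the index so that cut_percent = 1 stays inside the list
    cut_index = min(int(len(ranked) * cut_percent), len(ranked) - 1)
    cut_value = ranked[cut_index]
    return [(tid, score) for tid, score in tfidf_doc if score >= cut_value]

def filter_bow_by_ids(bow_doc, tfidf_doc):
    """Return a new bag of words restricted to the ids still present in tfidf_doc."""
    keep = {tid for tid, _ in tfidf_doc}
    return [(tid, count) for tid, count in bow_doc if tid in keep]

# Made-up example data: (token_id, count) and (token_id, tf-idf score) pairs
bow_doc = [(0, 1), (1, 3), (2, 1), (3, 2)]
tfidf_doc = [(0, 0.1), (1, 0.7), (2, 0.2), (3, 0.68)]

trimmed_tfidf = tfidf_filter_copy(tfidf_doc, 0.5)
trimmed_bow = filter_bow_by_ids(bow_doc, trimmed_tfidf)
print(trimmed_tfidf)
print(trimmed_bow)
```

Using a set of ids avoids the requirement that both lists be sorted by token id, at the cost of building new lists instead of trimming the existing ones.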

Content provided by Stack Overflow.