How do I get tf-idf with a pandas DataFrame?

47

I want to calculate tf-idf from the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

At first, I thought I needed the word counts for each row, so I wrote a simple function:

def word_count(sent):
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

Then I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

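But now I'm a bit lost. I know there is an easy way to compute tf-idf with Graphlab, but I want to stick with an open-source option. Both sklearn and gensim look overwhelming to me. What is the simplest way to get tf-idf?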

5 Answers

64

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the TfidfVectorizer documentation.

fit_transform returns a sparse matrix. If you want to look at the output, you can convert it with x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])

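Suppose I pass 100 to the max_features parameter and the original vocabulary of the corpus has 1000 terms. How can I get the names of the selected features and map them to the resulting matrix? - Clock Slave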
7
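v.get_feature_names() will give you the list of feature names. v.vocabulary_ will give you a dict with the feature names as keys and their column index in the resulting matrix as values. - arthur
Yes, but be careful about printing get_feature_names(). If the number of features grows, you may run into memory problems. - Ch HaXam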

6

A simple solution is to use texthero:

import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])

In [5]: df.head()
Out[5]:
   docId                         sent                                              tfidf
0      1   This is the first sentence  [0.3816141458138271, 0.6461289150464732, 0.381...
1      2  This is the second sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
2      3   This is the third sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...

1
This is probably the best and simplest way. - user1098761

0

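I found a slightly different approach using CountVectorizer from sklearn. --count vectorizer: Ultraviolet Analysis word frequencies --preprocessing/cleaning text: Usman Malik scraping tweets preprocessing. I won't cover preprocessing in this answer. Basically, you import CountVectorizer and fit your data to a CountVectorizer object. That gives you access to the .vocabulary_ attribute (for example via .vocabulary_.items()), which holds the vocabulary of your dataset: each unique term mapped to its column index, subject to any limiting parameters you pass in, such as max_features.

Then you use TfidfTransformer in a similar way to generate the tf-idf weights for the terms.

I'm using pandas and the PyCharm IDE, working in a Jupyter notebook file.

Here is the code snippet: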

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features= 5000, stop_words='english', min_df=.01, max_df=.90)

#%%
#use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
#raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
#useful debug, get an idea of the item list you generated
list(countVec.vocabulary_.items())

#%%
#convert to bag of words
#transform() returns a SciPy sparse matrix of per-document term counts
countVec_count = countVec.transform(tweetsFrameWords["Text"])

#%%
#make array from number of occurrences
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()

#make a new data frame with columns term and occurrences, meaning word and number of occurrences
#get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)

#sort by number of word occurrences, most -> least; sort_values defaults to ascending=True if the flag is omitted
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)

#%%
#now, convert to a more useful ranking system, tf-idf weights
#TfidfTransformer: scale raw word counts to tf-idf weights; docs:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()

#initial fit representation using transformer object
tweetWeights = tweetTransformer.fit_transform(countVec_count)

#follow similar process to making new data frame with word occurrences, but with term weights
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()

#now that we've done tf-idf, make a dataframe with weights and names
#get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
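
For comparison, a short sketch showing that the two-step route (CountVectorizer followed by TfidfTransformer) and the one-step TfidfVectorizer give the same weights when both use default settings. The docs list reuses the sentences from the question, since tweetsFrameWords is not available here.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

# Two steps: raw term counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step version is mainly useful when you also want the raw counts, as in the occurrences table above.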

0

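I think Christian Perone's example is the most straightforward example of using CountVectorizer and TF-IDF. It is taken directly from his web page (the quoted code uses Python 2 print statements). But I have also benefited from the answers here.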

https://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/

from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
#[[0 1 1 1]
#[0 2 1 0]]

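Now that we have the term-frequency matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which will be responsible for calculating the tf-idf weights for our term-frequency matrix: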
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511  0.        ]

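Note that I've specified the norm as L2. This is optional (the default is already the L2 norm), but I've added the parameter to make it explicit that it will use the L2 norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform freq_term_matrix to the tf-idf weight matrix: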
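
The quoted passage stops before the transform step, so here is a minimal Python 3 sketch of it. It assumes a current scikit-learn release, where the default is smooth_idf=True and idf is computed as ln((1 + n) / (1 + df)) + 1, so the idf_ values will not match the 2011 numbers printed above. The names expected_idf and row_norms are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.",
            "We can see the shining sun, the bright sun."]

# Learn the vocabulary on the training sentences, count terms in the test ones.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_set)
freq_term_matrix = count_vectorizer.transform(test_set)

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)

# Recompute the smoothed idf by hand to compare against idf_.
dense_counts = freq_term_matrix.toarray()
n_docs = dense_counts.shape[0]
doc_freq = (dense_counts > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# The transform step: counts -> L2-normalised tf-idf weights.
tf_idf_matrix = tfidf.transform(freq_term_matrix)
row_norms = np.linalg.norm(tf_idf_matrix.toarray(), axis=1)

print(tfidf.idf_)
print(tf_idf_matrix.toarray().round(3))
```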
--- I had to make the following changes for Python, and note that .vocabulary_ includes the word "the". I haven't found or built a solution for that yet... ---
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.", "We can see the shining sun, the bright sun."]
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
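print("Vocabulary:")
print(count_vectorizer.vocabulary_)
# Order the terms by column index; the dict's own key order need not match the matrix columns
Vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(Vocab)

# Vocabulary shown in the original blog post: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}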
freq_term_matrix = count_vectorizer.transform(test_set)
print (freq_term_matrix.todense())

count_array = freq_term_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=Vocab)
print(df)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print ("IDF:")
print(tfidf.idf_)
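
One possible way to keep "the" out of the vocabulary, sketched here as a suggestion rather than as what the original blog did: pass stop_words="english" so that scikit-learn's built-in English stop-word list is applied during tokenisation.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# The built-in English stop-word list drops words such as "the" and "is".
count_vectorizer = CountVectorizer(stop_words="english")
count_vectorizer.fit(train_set)

# Terms ordered by their column index.
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(vocab)
```

The column indices follow alphabetical order, so they differ from the ones printed in the blog post even though the set of terms is the same.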

0
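Two simple solutions using TfidfVectorizer from sklearn.
a) If your corpus is a pandas.Series:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
_X = vectorizer.fit_transform(corpus)
# get_feature_names_out() (scikit-learn >= 1.0) lists terms in column order; vocabulary_ key order need not match
X = pd.DataFrame(_X.toarray(), index=corpus.index, columns=vectorizer.get_feature_names_out())
X.head()

b) If your corpus is a list:
vectorizer = TfidfVectorizer()
_X = vectorizer.fit_transform(corpus)
X = pd.DataFrame(_X.toarray(), columns=vectorizer.get_feature_names_out())
X.head()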
