I am using scikit-learn to compute tf-idf values. I have a set of documents like this:
D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."
I want to create a matrix like this:
Docs blue bright sky sun
D1 tf-idf 0.0000000 tf-idf 0.0000000
D2 0.0000000 tf-idf 0.0000000 tf-idf
D3 0.0000000 tf-idf tf-idf tf-idf
So my Python code is as follows:
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')
transformer = TfidfVectorizer(stop_words=stop_words)
t1 = transformer.fit_transform(train_set).todense()
print(t1)
The resulting matrix I get is:
[[ 0.79596054 0. 0.60534851 0. ]
[ 0. 0.4472136 0. 0.89442719]
[ 0. 0.57735027 0.57735027 0.57735027]]
If I do the calculation by hand, the matrix should be:
Docs blue bright sky sun
D1 0.2385 0.0000000 0.0880 0.0000000
D2 0.0000000 0.0880 0.0000000 0.0880
D3 0.0000000 0.058 0.058 0.058
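For example, for blue I compute tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255 (base-10 logarithm), so tf-idf = tf*idf = 0.5*0.477 = 0.2385. I compute the other tf-idf values the same way. Why do I get different results in the hand-computed matrix and the matrix computed by Python? Which one is correct? Is my hand calculation wrong, or is my Python code wrong?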
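
For reference, here is a minimal sketch of the hand calculation described above, so the arithmetic can be checked. It assumes the stop words have already been removed from each document, that tf is the term count divided by the number of remaining terms, and that idf is the base-10 logarithm of (number of documents / number of documents containing the term). The names docs, vocab and manual_tfidf are made up for illustration and are not part of scikit-learn.

```python
import math

# Documents after manual stop-word removal (an assumption that mirrors
# the hand calculation, not scikit-learn's own tokenizer).
docs = {
    "D1": ["sky", "blue"],
    "D2": ["sun", "bright"],
    "D3": ["sun", "sky", "bright"],
}

# Alphabetically sorted vocabulary: blue, bright, sky, sun
vocab = sorted({term for tokens in docs.values() for term in tokens})


def manual_tfidf(docs, vocab):
    """Plain tf * idf with tf = count / doc length and idf = log10(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in tokens for tokens in docs.values()) for term in vocab}
    matrix = {}
    for name, tokens in docs.items():
        row = []
        for term in vocab:
            tf = tokens.count(term) / len(tokens)
            idf = math.log10(n_docs / df[term])
            row.append(tf * idf)
        matrix[name] = row
    return matrix


matrix = manual_tfidf(docs, vocab)
print("Docs", *vocab)
for name, row in matrix.items():
    print(name, *(f"{value:.4f}" for value in row))
```

This is only the formula I applied by hand. It uses no smoothing and no normalization of the rows, so it is not meant to reproduce whatever TfidfVectorizer does internally.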