I ran the following code to convert a list of texts into a TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a string',
        'This is another string',
        'TFIDF computation calculation',
        'TfIDF is the product of TF and IDF']
# norm=None turns off row normalization, so the matrix holds raw tf * idf values
vectorizer = TfidfVectorizer(max_df=1.0, min_df=1, stop_words='english', norm=None)
X = vectorizer.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
X_vovab = vectorizer.get_feature_names()
X_mat = X.todense()
X_idf = vectorizer.idf_
I got the following output:
X_vovab =
[u'calculation',
u'computation',
u'idf',
u'product',
u'string',
u'tf',
u'tfidf']
and X_mat =
([[ 0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [ 0.        , 0.        , 0.        , 0.        , 1.51082562, 0.        , 0.        ],
  [ 1.91629073, 1.91629073, 0.        , 0.        , 0.        , 0.        , 1.51082562],
  [ 0.        , 0.        , 1.91629073, 1.91629073, 0.        , 1.91629073, 1.51082562]])
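Now I don't understand how these scores are computed. My understanding is that for text[0], only the score for 'string' is computed, and there is indeed a single score, in the 5th column. But since TF-IDF is the product of the term frequency and the IDF (where I take the term frequency to be 2 and the IDF to be log(4/2)), the result should be 1.39 rather than the 1.51 shown in the matrix. How is the TF-IDF score calculated in scikit-learn?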
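To make the arithmetic in the question concrete, here is a minimal sketch in plain Python comparing the textbook IDF with the smoothed IDF that the scikit-learn documentation describes for the default `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper names `naive_idf` and `smoothed_idf` are hypothetical and not part of scikit-learn. The sketch assumes the term frequency is the raw count of the term within a single document (1 for 'string' in text[0]) and that no normalization is applied, as with `norm=None`.

```python
import math

def naive_idf(n_docs, doc_freq):
    # Textbook IDF: ln(N / df)
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF as documented for smooth_idf=True: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4  # four documents in `text`

# The figure expected in the question: tf = 2, idf = ln(4 / 2)
expected_in_question = 2 * naive_idf(n_docs, 2)

# 'string' occurs once in text[0] and appears in 2 of the 4 documents
score_string = 1 * smoothed_idf(n_docs, 2)

# 'calculation' occurs once in text[2] and appears in 1 of the 4 documents
score_calculation = 1 * smoothed_idf(n_docs, 1)

print(expected_in_question, score_string, score_calculation)
```

Under these assumptions the three values come out at roughly 1.39, 1.51 and 1.92, which lines up with the 1.51082562 and 1.91629073 entries in X_mat above.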