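I think Christian Perone's example is the most straightforward illustration of how to use CountVectorizer and TF-IDF. The following is taken directly from his web page, though I also benefited from the answers here.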
https://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
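Now that we have the term-frequency matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which will be responsible for calculating the tf-idf weights for our term-frequency matrix: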
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
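Note that I've specified the norm as L2. This is optional (the default is in fact the L2 norm), but I added the parameter to make it explicit that the L2 norm will be used. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform freq_term_matrix into the tf-idf weight matrix.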
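The transformation step that the passage above leads up to can be sketched as follows. This is a minimal sketch for Python 3 and a recent scikit-learn, not the blog's original listing; the names tf_idf_matrix and row_norms are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.",
            "We can see the shining sun, the bright sun."]

# Build the term-frequency matrix for the test documents.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_set)
freq_term_matrix = count_vectorizer.transform(test_set)

# Fit the idf weights, then turn the raw counts into tf-idf weights.
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
tf_idf_matrix = tfidf.transform(freq_term_matrix)  # sparse, one row per document
print(tf_idf_matrix.todense())

# With norm="l2", each non-empty row should have unit Euclidean length.
row_norms = np.linalg.norm(tf_idf_matrix.toarray(), axis=1)
print(row_norms)
```

Checking the row norms is a quick way to confirm that the norm="l2" setting took effect.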
--- I had to make the following changes for Python, and note that .vocabulary_ includes the word "the". I have not yet found or built a solution for that... but ---
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.", "We can see the shining sun, the bright sun."]

# Learn the vocabulary from the training documents.
count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_set)
print("Vocabulary:")
print(count_vectorizer.vocabulary_)

# Order the terms by column index so the labels line up with the matrix columns.
vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(vocab)

# Count the vocabulary terms in the test documents.
freq_term_matrix = count_vectorizer.transform(test_set)
print(freq_term_matrix.todense())
count_array = freq_term_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=vocab)
print(df)

# Fit the idf weights on the term-frequency matrix.
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print("IDF:")
print(tfidf.idf_)
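On the word "the" showing up in .vocabulary_: one possible fix, offered here as an assumption rather than something taken from the blog post, is to pass stop_words="english" to CountVectorizer so that common English words are filtered out using scikit-learn's built-in list. The name filtered_vectorizer is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]

# stop_words="english" drops tokens found in scikit-learn's built-in
# English stop-word list (which includes "the" and "is").
filtered_vectorizer = CountVectorizer(stop_words="english")
filtered_vectorizer.fit(train_set)
print(filtered_vectorizer.vocabulary_)
```

A custom list of words can be passed instead of "english" if the built-in list removes too much or too little for a given corpus.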
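... max_features parameter, while the original vocabulary of the corpus is 1000. How do I get the names of the selected features and map them to the resulting matrix? - Clock Slave
v.get_feature_names() will give you the list of feature names. v.vocabulary_ will give you a dictionary with the feature names as keys and their indices in the resulting matrix as values. - arthur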