How do I save a TF-IDF vectorizer in scikit-learn?

Question


I am building a spam classifier with scikit-learn.

Here is my vectorization code.

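import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF with sublinear TF scaling, capped at 10,000 features
vectorizer = TfidfVectorizer(
    analyzer='word',
    sublinear_tf=True,
    strip_accents='unicode',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1),
    max_features=10000)

# Learn the vocabulary and IDF weights, then vectorize the training text
tfidf = vectorizer.fit(data['text'])
features = vectorizer.transform(data['text'])

# Save the fitted vectorizer for reuse at prediction time
with open('tfidf.pickle', 'wb') as f:
    pickle.dump(tfidf, f)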

This is how I predict on new input:

import pickle

import joblib

# Load the trained classifier and the fitted vectorizer
model = joblib.load('model')

with open('tfidf.pickle', 'rb') as f:
    vect = pickle.load(f)

new = vect.transform(['some new text...'])

model.predict(new)

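When I load the vectorizer file (tfidf.pickle) and try to predict on new text, I get the following error:

ValueError: X.shape[1] = 7148 should be equal to 38011, the number of features at training time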

- Ishaan

1 Answer


- Wajsbrot · Accepted Answer

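The error message says that your model expects input with 38011 features, while your TF-IDF vectorizer outputs 7148-dimensional vectors. This is a mismatch between the model and the preprocessing: the model was trained on 38011-dimensional vectors, but the vectorizer you load at prediction time produces 7148-dimensional ones.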
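If you want to confirm the mismatch before calling predict, you can compare the two sizes directly. This is only a minimal sketch on toy data: the texts, labels, and variable names are made up for illustration, and it assumes a scikit-learn version recent enough to expose the n_features_in_ attribute on fitted estimators.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, purely illustrative (not from the question).
train_texts = ["win cash now", "meeting at noon", "free prize inside", "lunch tomorrow"]
train_labels = [1, 0, 1, 0]

# The vectorizer the model was actually trained with.
train_vect = TfidfVectorizer().fit(train_texts)
model = LogisticRegression().fit(train_vect.transform(train_texts), train_labels)

# A vectorizer fitted on other text, standing in for a pickle that
# does not belong to this model.
other_vect = TfidfVectorizer().fit(["hello world"])

expected = model.n_features_in_         # columns the model was trained on
produced = len(other_vect.vocabulary_)  # columns this vectorizer will emit
mismatch = expected != produced

print(f"model expects {expected} features, vectorizer produces {produced}")
```

If the two numbers differ, the pickled vectorizer is not the one that produced the training features for that model.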

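A good way to avoid this kind of preprocessing/model mismatch is to use scikit-learn pipelines. For example, you can train your model and your TF-IDF vectorizer together with the following snippet (logistic regression is used here as an example):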

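from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

vectorizer = TfidfVectorizer(...your TF-IDF arguments...)
model = LogisticRegression(...your model arguments...)
pipeline = make_pipeline(vectorizer, model)

# Fits the vectorizer and the model together on the raw text X and labels y
pipeline.fit(X, y)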

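You can then serialize and load your pipeline with pickle or joblib (for example, pipeline = pickle.load(open('spam_pipeline.pickle', 'rb'))), much as you already do. Predictions then come directly from the pipeline's predict method. Let me know if you need more details.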
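For completeness, here is a rough end-to-end sketch of that round trip using joblib. The toy texts and labels, the default TfidfVectorizer and LogisticRegression arguments, and the spam_pipeline.joblib file name are all assumptions for illustration; substitute your own data and settings.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, purely illustrative.
texts = ["win cash now", "meeting at noon", "free prize inside", "lunch tomorrow"]
labels = [1, 0, 1, 0]

# The vectorizer and the classifier are fitted together on raw text.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

# Save the whole pipeline as a single file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.joblib")
joblib.dump(pipeline, path)

# Later: load it and predict directly on raw strings.
loaded = joblib.load(path)
prediction = loaded.predict(["some new text..."])
```

Because the vectorizer and the classifier are stored in one object, they cannot drift apart between training and prediction.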