我正在尝试使用scikit-learn构建一个简单的SVM文档分类器,我正在使用以下代码:
import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file
clf=svm.SVC()
path="C:\\Python27"
f1=[]
f2=[]
data2=['omg this is not a ship lol']
f=open(path+'\\mydata\\ACQ\\acqtot','r')
f=f.read()
f1=f.split(';',1085)
for i in range(0,1086):
f2.append('acq')
f1.append('shipping ship')
f2.append('crude')
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)
x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.fit_transform(data2)
num_sample,num_features=x_train.shape
test_sample,test_features=x_test.shape
print("#samples: %d, #features: %d" % (num_sample, num_features)) #samples: 5, #features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features))#samples: 2, #features: 37
y=['acq','crude']
#print x_test.n_features
clf.fit(x_train,f2)
#den= clf.score(x_test,y)
clf.predict(x_test)
它会返回以下错误信息:
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
但我不明白的是为什么它要求特征数量相同?如果我向机器输入全新的文本数据进行预测,显然不可能每个文档都与用于训练的数据具有相同数量的特征。在这种情况下,我们是否需要明确设置测试数据的特征数量等于9451?
transform
而不是fit_transform
”。问题实际上是一样的,在测试集上使用fit_transform
。 - Fred Foo