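I'm having trouble using random forests in Python with scikit-learn. I'm using them for text classification into 3 classes (positive/negative/neutral), and the features I extract are mainly words/unigrams, so I need to convert them to numerical features. I found a way to do this with DictVectorizer's fit_transform: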
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier  # needed for rf below
vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators = 100)
trainFeatures1 = vec.fit_transform(trainFeatures)
# Fit the training data to the training output and create the decision trees
rf = rf.fit(trainFeatures1.toarray(), LabelEncoder().fit_transform(trainLabels))
testFeatures1 = vec.fit_transform(testFeatures)
# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1.toarray(), LabelEncoder().fit_transform(testLabels))
print("accuracy: " + str(Output))
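My problem is that fit_transform works on the training set, which contains about 8,000 instances, but when I try to convert the test set, which contains about 80,000 instances, to numerical features as well, I get a memory error: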
testFeatures1 = vec.fit_transform(testFeatures)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
return self.transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError
What could be causing this, and is there a workaround? Many thanks!
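Judging by the traceback, the failing call is np.zeros((len(X), len(vocab))): with sparse=False, DictVectorizer allocates a dense array with one row per instance and one column per feature, and calling fit_transform on the test set also relearns the vocabulary from all 80,000 instances. A minimal sketch of one possible workaround follows, assuming a scikit-learn version whose RandomForestClassifier accepts sparse input (0.16 or later, as far as I know). The toy feature dicts and labels are made-up stand-ins for trainFeatures, testFeatures and their labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# Toy stand-ins (assumption) for the real feature dicts and labels.
train_features = [{"good": 1, "movie": 1}, {"bad": 1, "movie": 1},
                  {"okay": 1, "film": 1}, {"good": 1, "film": 1}]
train_labels = ["positive", "negative", "neutral", "positive"]
test_features = [{"good": 1, "plot": 1}, {"bad": 1, "film": 1}]
test_labels = ["positive", "negative"]

# sparse=True (the default) returns a SciPy sparse matrix, not a dense array.
vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(train_features)  # learn the vocabulary on training data only
X_test = vec.transform(test_features)        # reuse it; features unseen in training are dropped

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, train_labels)                # no .toarray(): the sparse matrix is passed as-is
accuracy = rf.score(X_test, test_labels)
print("accuracy: " + str(accuracy))
```

Using transform rather than fit_transform on the test set keeps the test columns aligned with the columns the forest was trained on. Separately, .toarray() only exists on sparse matrices, so it would fail on the dense output of sparse=False.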
Try TfIdfVectorizer, then use TruncatedSVD to reduce the dimensionality of the feature space. - Matt
You don't need LabelEncoder. y may contain strings. - Fred Foo
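The two comments above could be combined roughly as follows. This is only a sketch under assumptions: it presumes the raw texts are available (the question only shows feature dicts), uses scikit-learn's TfidfVectorizer class, and picks n_components=2 purely to fit the toy data. The texts and labels are invented for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins (assumption) for the real documents and labels.
train_texts = ["good movie great acting", "bad movie poor plot",
               "the film was okay", "great film good plot"]
train_labels = ["positive", "negative", "neutral", "positive"]  # plain strings, no LabelEncoder
test_texts = ["good plot", "poor acting"]

model = make_pipeline(
    TfidfVectorizer(),                             # sparse TF-IDF features from raw text
    TruncatedSVD(n_components=2, random_state=0),  # project onto a few dense components
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_texts, train_labels)       # vocabulary and SVD are fitted on training data only
predictions = model.predict(test_texts)    # the same fitted transforms are applied to test data
print(predictions)
```

On real data the number of components would be much larger than 2 and is worth tuning. The output of TruncatedSVD is dense, but its width is n_components rather than the vocabulary size, which is what keeps memory use bounded.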