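I am using TfidfVectorizer from scikit-learn to build a matrix from text data. Now I need to save this object so that I can reuse it later. I tried pickle, but it gave the following error: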
loc = open('vectorizer.obj', 'wb')
pickle.dump(self.vectorizer, loc)
*** TypeError: can't pickle instancemethod objects
I also tried joblib from sklearn.externals, but it gave a similar error. Is there a way to save this object so that I can reuse it later?
Here is my full object:
import pickle

import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# StemTokenizer is defined elsewhere (not shown here)
class changeToMatrix(object):
    def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+', strip_accents='unicode', tokenizer=tokenizer)

    def load_ref_text(self, text_file):
        textfile = open(text_file, 'r')
        lines = textfile.readlines()
        textfile.close()
        lines = ' '.join(lines)
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [sent_tokenizer.tokenize(lines.strip())]
        sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is fitted in this step
        return sentences1, [chk2]

    def get_processed_data(self, data_loc):
        ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj", "wb")
        pickle.dump(self.vectorizer, loc)  # getting error here
        loc.close()
        return ref_sentences, ref_dataframes
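
A side note on the file handling in get_processed_data: pickle writes bytes, so the output file is best opened in binary mode ('wb'), and on Python 3 a text-mode handle raises a TypeError of its own. The sketch below shows only that save/load round trip. The dictionary fitted_state is a hypothetical stand-in for a fitted, picklable object, and the temporary path is an assumption made so the snippet runs anywhere.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted, picklable object such as the vectorizer.
fitted_state = {"vocabulary": {"cat": 0, "dog": 1}, "ngram_range": (1, 1)}

# Temporary location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "vectorizer.obj")

# 'wb': pickle.dump writes bytes; the with-block closes the file even on error.
with open(path, "wb") as handle:
    pickle.dump(fitted_state, handle, protocol=pickle.HIGHEST_PROTOCOL)

# 'rb': read the bytes back when the object is needed again.
with open(path, "rb") as handle:
    reloaded = pickle.load(handle)
```

This does not address the instancemethod error itself, which comes from what the object holds rather than from how the file is opened.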
When I changed self.snowball_stemmer = SnowballStemmer('english') to snowball_stemmer = SnowballStemmer('english'), the error was fixed. Basically, I removed the stemmer from the class's attributes and the error went away. - Joswin K J

SnowballStemmer('english') is an object; you need to use SnowballStemmer('english').stem to get the callable. - alvas