I wrote a lemma tokenizer for scikit-learn using spaCy, based on their example. It works fine on its own:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')

    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc
                  if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}
However, it throws an error when used inside GridSearchCV. Here is a self-contained example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target

gs_clf = gs_clf.fit(X, y)
### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'
The error does not occur when I load spaCy outside the tokenizer's constructor; GridSearchCV then runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc
                  if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc
But this means that every one of my n_jobs workers will access and call the same spacynlp object, shared between those jobs, which raises the following questions:

- Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
- Is this the right way to implement calls to spaCy inside a scikit-learn tokenizer?
I would run spaCy over the corpus once, offline, and save each document as something like [{"token": "cats", "lemma": "cat"}, {...}], which is basically a spaCy sentence serialized to JSON. Then write a pipeline step that takes this as input and has a parameter to output either tokens or lemmas; that way you have tokenization as part of the grid search. - mbatchkarov
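The comment's suggestion can be sketched as a small scikit-learn transformer that consumes documents pre-tokenized into the JSON shape above and exposes a parameter selecting tokens or lemmas, so that choice becomes grid-searchable. The class name and the use parameter are illustrative, not from the comment:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TokenOrLemmaExtractor(BaseEstimator, TransformerMixin):
    """Turns pre-tokenized documents, each a list of dicts like
    [{"token": "cats", "lemma": "cat"}, ...], back into strings of
    either surface tokens or lemmas, selected by `use`."""

    def __init__(self, use='lemma'):
        self.use = use  # 'token' or 'lemma'

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        # Join the chosen field with spaces so a downstream
        # TfidfVectorizer(tokenizer=str.split) can consume the result.
        return [' '.join(tok[self.use] for tok in doc) for doc in X]
```

In the pipeline this step would sit before the vectorizer, e.g. Pipeline([('extract', TokenOrLemmaExtractor()), ('vect', TfidfVectorizer(tokenizer=str.split)), ...]), with 'extract__use': ['token', 'lemma'] added to the parameter grid; spaCy then runs only once, outside the search.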