GridSearch for multilabel classification in scikit-learn

Question

7

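I want to run a grid search inside each fold of a 10-fold cross-validation to find the best hyperparameters. This worked fine in my earlier multiclass classification work, but it fails for this multilabel task.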

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())

C_range = 10.0 ** np.arange(-2, 9)
param_grid = dict(estimator__clf__C = C_range)

clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)

I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-dcf9c1d2e19d> in <module>()
      6 
      7 clf = GridSearchCV(clf, param_grid)
----> 8 clf.fit(X_train, y_train)

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    595 
    596         """
--> 597         return self._fit(X, y, ParameterGrid(self.param_grid))
    598 
    599 

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y,   
parameter_iterable)
    357                                  % (len(y), n_samples))
    358             y = np.asarray(y)
--> 359         cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
    360 
    361         if self.verbose > 0:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X,  
y, classifier, warn_mask)
   1365             needs_indices = None
   1366         if classifier:
-> 1367             cv = StratifiedKFold(y, cv, indices=needs_indices)
   1368         else:
   1369             if not is_sparse:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, 
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension

This seems to be caused by the dimension or format of the label indicator matrix.

print X_train.shape, y_train.shape

gives:

(147, 1024) (147, 6)

It looks like GridSearchCV uses StratifiedKFold internally. The problem arises when the stratified K-fold strategy is applied to a multilabel problem.

StratifiedKFold(y_train, 10)

gives:

ValueError                                Traceback (most recent call last)
<ipython-input-87-884ffeeef781> in <module>()
----> 1 StratifiedKFold(y_train, 10)

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self,   
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension

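For now, the conventional (unstratified) K-fold strategy works fine. But is there a way to do stratified K-fold cross-validation for multilabel classification?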

- Francis
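
For reference, the workaround mentioned at the end of the question (plain K-fold) can be sketched as follows. This is a minimal illustration, not the asker's actual code: it assumes a recent scikit-learn where the splitters live in `sklearn.model_selection`, uses synthetic data from `make_multilabel_classification` in place of the real features, and uses a shorter C range to keep the run quick. Note that with `OneVsRestClassifier(LinearSVC())` the inner parameter is addressed as `estimator__C`.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the asker's data (assumption): y is a binary
# indicator matrix with 6 label columns.
X, y = make_multilabel_classification(
    n_samples=220, n_features=20, n_classes=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

clf = OneVsRestClassifier(LinearSVC())
# The wrapped LinearSVC is the `estimator` attribute, so its C is `estimator__C`.
param_grid = {"estimator__C": 10.0 ** np.arange(-2, 3)}

# An explicit, unstratified K-fold avoids stratifying on the 2-D label matrix.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(clf, param_grid, cv=cv)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_estimator_.predict(X_test).shape)
```

The default score here is the classifier's own `score` method, which for multilabel targets is subset accuracy (every label of a sample must match), so the reported values can look low.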

3 Answers

0

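As Fred Foo pointed out, stratified cross-validation is not available for multilabel tasks. One alternative, suggested in a post linked from the original answer, is to use scikit-learn's StratifiedKFold class on a transformed label space.

Here is sample Python code.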

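from sklearn.model_selection import StratifiedKFold

# Assumed to be defined elsewhere: X, y (binary indicator matrix), classifier,
# n_splits, shuffle, and lp, a label-powerset transformer whose transform(y)
# maps each row of y (one label combination) to a single class id.
kf = StratifiedKFold(n_splits=n_splits, random_state=None, shuffle=shuffle)
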
for train_index, test_index in kf.split(X, lp.transform(y)):
    X_train = X[train_index,:]
    y_train = y[train_index,:]

    X_test = X[test_index,:]
    y_test = y[test_index,:]

    # learn the classifier
    classifier.fit(X_train, y_train)

    # predict labels for test data
    predictions = classifier.predict(X_test)

- Nikhil
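
The answer above leaves `lp` undefined. As a self-contained sketch of the same idea, the label-powerset transformation can be done by hand: each distinct row of the indicator matrix is encoded as one integer, and `StratifiedKFold` stratifies on that integer. The data here is synthetic and the variable names (`combo_id`, `splits`) are made up for illustration; this is not the code of any particular library.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic multilabel data (assumption): y has shape (n_samples, 3).
X, y = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=3, random_state=0
)

# Encode each label combination as one integer: row [1, 0, 1] -> 1*1 + 0*2 + 1*4 = 5.
combo_id = y.dot(1 << np.arange(y.shape[1]))

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Stratify on the combination id, but keep training on the original 2-D y.
splits = list(skf.split(X, combo_id))

for train_index, test_index in splits:
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
    print(len(train_index), len(test_index), y_test.mean(axis=0).round(2))
```

A list of `(train, test)` index pairs like `splits` can be passed as the `cv` argument of `GridSearchCV`. The approach degrades when there are many labels: the number of distinct combinations grows quickly, and scikit-learn warns when a combination has fewer members than `n_splits`.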

0

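Take a look at the scikit-multilearn package. The documentation isn't perfect, but one section of it demonstrates multilabel stratification. You can use the iterative_train_test_split function.

There is also the iterative-stratification package, which I believe implements the same idea.

I'm not sure, but I think both of them implement the same paper.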

- Milad Shahidi
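
Neither package is exercised here, since both are third-party installs. To give a feel for the greedy idea usually meant by "iterative stratification" (handle the rarest label first, and send each sample to the fold that still needs that label most), here is a simplified NumPy-only sketch. The function `greedy_multilabel_folds` is a hypothetical helper written for this page; it is not the code of scikit-multilearn or iterative-stratification and may differ from them in tie-breaking and other details.

```python
import numpy as np

def greedy_multilabel_folds(y, n_splits, seed=0):
    """Assign each row of a binary indicator matrix to a fold (illustrative helper)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n_samples = y.shape[0]
    folds = np.full(n_samples, -1)
    # Positives of each label, and rows overall, that each fold still "wants".
    want_label = np.tile(y.sum(axis=0) / n_splits, (n_splits, 1))
    want_total = np.full(n_splits, n_samples / n_splits)
    remaining = np.ones(n_samples, dtype=bool)

    def assign(i, f):
        folds[i] = f
        want_label[f] -= y[i]
        want_total[f] -= 1
        remaining[i] = False

    while remaining.any():
        counts = y[remaining].sum(axis=0)
        if counts.max() == 0:
            # Only rows without any positive label are left: balance fold sizes.
            for i in np.flatnonzero(remaining):
                assign(i, int(np.argmax(want_total)))
            break
        # Label with the fewest remaining positives (exhausted labels ignored).
        label = int(np.argmin(np.where(counts > 0, counts, np.inf)))
        idx = np.flatnonzero(remaining & (y[:, label] == 1))
        rng.shuffle(idx)
        for i in idx:
            need = want_label[:, label]
            tied = np.flatnonzero(need == need.max())
            # Break ties by the fold that still wants the most rows overall.
            assign(i, int(tied[np.argmax(want_total[tied])]))
    return folds

y_demo = np.array([
    [1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1],
    [1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0],
])
folds = greedy_multilabel_folds(y_demo, n_splits=3)
for f in range(3):
    # Per-fold count of positives for each label.
    print(f, y_demo[folds == f].sum(axis=0))
```

The returned array holds one fold index per sample, so test indices for fold `f` are `np.flatnonzero(folds == f)`. For real work, the maintained packages named in the answer are the safer choice.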

Content provided by Stack Overflow.

- Fred Foo · Accepted Answer

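Grid search performs stratified cross-validation for classification problems, but this is not implemented for multilabel tasks; in fact, multilabel stratification is an unsolved problem in machine learning. I ran into the same issue recently, and all the literature I could find was the method proposed in one paper, linked in the original answer (whose authors state that they could not find any other attempt at solving the problem either).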
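
The traceback in the question comes from an old release (`sklearn.grid_search`, Python 2.7). As a hedged illustration of how the choice of splitter depends on the target type, the sketch below assumes a recent scikit-learn with `sklearn.model_selection`: `type_of_target` reports how a label array is classified, and `check_cv` shows which splitter an integer `cv` resolves to for a classifier. The tiny arrays are made up for the example.

```python
import numpy as np
from sklearn.model_selection import check_cv
from sklearn.utils.multiclass import type_of_target

# Made-up targets: one class per sample vs. a binary indicator matrix.
y_multiclass = np.array([0, 1, 2, 0, 1, 2])
y_multilabel = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

kind_multiclass = type_of_target(y_multiclass)
kind_multilabel = type_of_target(y_multilabel)
print(kind_multiclass, kind_multilabel)

# With an integer cv and a classifier, stratification is only used for
# binary or multiclass targets; other target types fall back to plain KFold.
cv_multiclass = check_cv(3, y_multiclass, classifier=True)
cv_multilabel = check_cv(3, y_multilabel, classifier=True)
print(type(cv_multiclass).__name__, type(cv_multilabel).__name__)
```

So on such versions, passing an integer `cv` to `GridSearchCV` with a multilabel indicator matrix should run without the error above, but the folds are not stratified.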