sklearn AssertionError: not equal to tolerance on custom estimator

I am creating a custom classifier using the scikit-learn interface, purely for learning purposes. I came up with the following code:

import numpy as np
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, ClassifierMixin, check_X_y
from sklearn.utils.validation import check_array, check_is_fitted, check_random_state

class TemplateEstimator(BaseEstimator, ClassifierMixin):
  def __init__(self, threshold=0.5, random_state=None):
    self.threshold = threshold
    self.random_state = random_state

  def fit(self, X, y):
    self.random_state_ = check_random_state(self.random_state)
    X, y = check_X_y(X, y)
    self.classes_ = np.unique(y)
    self.fitted_ = True
    return self
  
  def predict(self, X):
    check_is_fitted(self)
    X = check_array(X)

    y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
    return y_hat

check_estimator(TemplateEstimator())

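This classifier just guesses at random. I tried my best to follow the scikit-learn documentation and guidelines for developing my own estimator. However, I get the following error: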

AssertionError: 
Arrays are not equal
Classifier cant predict when only one class is present.
Mismatched elements: 10 / 10 (100%)
Max absolute difference: 1.
Max relative difference: 1.
 x: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 y: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

I can't be sure, but I suspect the randomness (i.e. self.random_state_) is causing the error. I am using sklearn version 1.0.2.

1 Answer

First, note that you get much better output if you use parametrize_with_checks with pytest instead of check_estimator. The code looks like this:

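from sklearn.utils.estimator_checks import parametrize_with_checks

# pytest generates one test per (estimator, check) pair.
@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)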

If you run this with pytest, you get the following output for the failing tests:

FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_pipeline_consistency] - AssertionError: 
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True,X_dtype=float32)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_regression_target] - AssertionError: Did not raise: [<class 'ValueErr...
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_sample_order_invariance] - AssertionError: 
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_subset_invariance] - AssertionError: 

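Some of these tests check that the output is consistent, which is not relevant in your case because you return random values. For those, you need to set the non_deterministic estimator tag. Other tests, such as check_classifiers_regression_target, check that you do the right validation and raise the right error, which you don't. So you either need to fix that or add the no_validation tag. Another issue is that check_classifiers_train checks whether your model gives a reasonable output for a given problem. Since you return random values, those conditions are not met. You can set the poor_score estimator tag to skip that check.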
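To see why the consistency checks cannot pass, here is a minimal sketch of what happens inside predict. It assumes an integer seed, for which check_random_state returns a np.random.RandomState; the variable names are illustrative and not part of the original code.

```python
import numpy as np

classes = np.array([0, 1])

# Plays the role of self.random_state_ in TemplateEstimator: one generator
# created at fit time and reused by every later predict call.
rng = np.random.RandomState(0)
first_call = rng.choice(classes, size=100)
second_call = rng.choice(classes, size=100)

# The generator's state advances between calls, so two "predictions" for the
# same input size come from different parts of the random stream.
calls_agree = np.array_equal(first_call, second_call)

# Re-creating the generator from the same seed replays the first draw, so a
# fresh fit with a fixed random_state is reproducible.
replay = np.random.RandomState(0).choice(classes, size=100)
refit_agrees = np.array_equal(first_call, replay)

print(calls_agree, refit_agrees)
```

Fixing random_state therefore makes a fresh fit reproducible, but it does not make repeated predict calls on the same fitted estimator agree, which is what checks such as check_methods_subset_invariance appear to compare.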

You can add these tags by adding the following to your estimator:

class TemplateEstimator(BaseEstimator, ClassifierMixin):
    ...
    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
        }
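If you would rather fix the validation than set no_validation, one option is to reject regression targets in fit. The sketch below uses check_classification_targets from sklearn.utils.multiclass, which raises a ValueError starting with "Unknown label type" for continuous targets. The helper name validate_classifier_inputs is hypothetical; in the estimator, its body would go at the top of fit.

```python
import numpy as np
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_X_y


def validate_classifier_inputs(X, y):
    """Hypothetical helper: validate X and y the way a classifier's fit could."""
    X, y = check_X_y(X, y)
    # Raises ValueError("Unknown label type: ...") for continuous targets.
    check_classification_targets(y)
    return X, y, np.unique(y)


X = np.arange(8, dtype=float).reshape(4, 2)

# Discrete labels pass validation and give the class list.
_, _, classes = validate_classifier_inputs(X, [0, 1, 1, 0])

# Continuous targets are rejected with the error message described above.
try:
    validate_classifier_inputs(X, [0.5, 1.2, 3.7, 2.1])
    rejected = False
except ValueError as exc:
    rejected = "Unknown label type" in str(exc)

print(classes, rejected)
```

With a check like this in fit, the no_validation tag may no longer be needed for check_classifiers_regression_target, though other validation checks could still fail for different reasons.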

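Even so, two tests will still fail if you use the main branch or a nightly build of scikit-learn. I believe this needs a fix, so I have opened an issue for it (EDIT: the fix has now been merged upstream and will be available in the next release). You can avoid these failures by marking those tests as expected to fail. In the end, your estimator would look like this: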

import numpy as np
from sklearn.utils.estimator_checks import parametrize_with_checks
from sklearn.base import BaseEstimator, ClassifierMixin, check_X_y
from sklearn.utils.validation import check_array, check_is_fitted, check_random_state


class TemplateEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5, random_state=None):
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
        return y_hat

    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
            "_xfail_checks": {
                "check_methods_sample_order_invariance": "This test shouldn't be running at all!",
                "check_methods_subset_invariance": "This test shouldn't be running at all!",
            },
        }


@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

Wow, thanks for all the explanation and for opening the PR. Really great response. I see they are already about to merge the fix. - ronswamson
