Sklearn: How do I pass different features to each target in a MultiOutputRegressor?

Dear colleagues, I have created a scikit-learn pipeline to train and tune several HistGradientBoostingRegressor models.
from scipy.stats import loguniform
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


# df, TARGETS and target_dict are defined elsewhere (not shown in the question)
data_train, data_test, target_train, target_test = train_test_split(
    df.drop(columns=TARGETS), 
    df[target_dict], 
    random_state=42)

pipeline_hist_boost_mimo_inside = Pipeline([('scaler', StandardScaler()),
                             ('variance_selector', VarianceThreshold(threshold=0.03)), 
                             ('estimator', MultiOutputRegressor(HistGradientBoostingRegressor(loss='poisson')))])


parameters = {
    'estimator__estimator__l2_regularization': loguniform(1e-6, 1e3),
    'estimator__estimator__learning_rate': loguniform(0.001, 10),
    'estimator__estimator__max_leaf_nodes': loguniform_int(2, 256),
    'estimator__estimator__min_samples_leaf': loguniform_int(1, 100),
    'estimator__estimator__max_bins': loguniform_int(2, 255),
}

random_grid_inside = RandomizedSearchCV(estimator=pipeline_hist_boost_mimo_inside, param_distributions=parameters, random_state=0, n_iter=50,
                                       n_jobs=-1, refit=True, cv=3, verbose=True,
                                       pre_dispatch='2*n_jobs', 
                                       return_train_score=True)

results_inside_train = random_grid_inside.fit(data_train, target_train)

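However, I am now wondering whether it is possible to pass different feature names to the step pipeline_hist_boost_mimo_inside["estimator"].
I noticed that the MultiOutputRegressor documentation lists an attribute named feature_names_in_:
feature_names_in_ : ndarray of shape (n_features_in_,). Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit. New in version 1.0.
I also found the documentation for scikit-learn's column selector, which has the following parameter: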

https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector

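pattern : str, default=None. Names of columns containing this regex pattern will be included. If None, columns will not be selected based on pattern.
The problem is that this pattern would depend on the target I am fitting.
Is there an elegant way to solve this?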
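
To make the difficulty concrete, here is a small sketch of what `make_column_selector` does with a `pattern`. The DataFrame contents and the `patterns` mapping from target name to regex are hypothetical, chosen only for illustration. Each selector is a callable that returns the matching column names, so a different selector is needed for every target.

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({"feat1": [1.0, 2.0], "feat2": [47.0, 50.0], "feat3": [0.65, 0.70]})

# Hypothetical mapping from target name to a regex over feature names.
patterns = {"target1": r"feat[12]$", "target2": r"feat[13]$"}

# Calling a selector on a DataFrame returns the list of matching column names.
selected = {target: make_column_selector(pattern=p)(df) for target, p in patterns.items()}
print(selected)
```

A single selector placed in front of MultiOutputRegressor cannot express this, because the same selected columns would reach every per-target regressor.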
Edit: example of the dataset:
feat1  feat2  feat3  ...  target1  target2  target3  ...
1      47     0.65        0        0.5      0.6

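MultiOutputRegressor will fit one histogram gradient boosting regressor for each (feat1, feat2, feat3, target_n) pair. For the example table above, I would have a pipeline whose estimator step holds a list of 3 estimators, because I have 3 targets.
The question is how to pass, for example, feat1 and feat2 to target1, but feat1 and feat3 to target2.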
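
The default behaviour described here can be checked with a short sketch. It uses random data and swaps in Ridge for the histogram gradient boosting regressor purely to keep the example fast; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # stands in for feat1, feat2, feat3
y = rng.normal(size=(20, 3))   # stands in for target1, target2, target3

model = MultiOutputRegressor(Ridge()).fit(X, y)

# One fitted clone per target column, each trained on all three features.
print(len(model.estimators_))
print([est.n_features_in_ for est in model.estimators_])
```

Every fitted estimator sees the full feature matrix, which is why per-target feature subsets need either a modified wrapper or separate models.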

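Could you elaborate on what you mean by "passing different feature names"? My first impression is that you could use a transformer to drop the features you don't want, such as DropFeatures, or you could write a custom transformer that selects the features you do want. - Miguel Trejo
Hi Miguel. Can DropFeatures work according to the different target names, given that my y_train contains 12 different targets? - tfkLSTM
So you are looking for a transformer like column_selector or DropFeatures, but applied to the target variables? Also, if your task is regression with HistGradientBoostingRegressor, why does your target have only 12 distinct values? - Miguel Trejo
Hi Miguel, my targets have thousands of distinct values. What I meant is that I have 12 different targets, which is why I am using MultiOutputRegressor. - tfkLSTM
Sorry, I still don't fully understand. You have a target matrix of shape (n_samples, 12), and you want to select some of those 12 features before running the model with MultiOutputRegressor, for example a lower-dimensional matrix of shape (m, 7)? Could you give an example of your target variables? - Miguel Trejo
Hi Miguel, that is not correct. I have an X matrix of shape (n_samples, n_features) and a y matrix of shape (n_samples, n_targets). I don't want to reduce n_features; I want to feed different features to different targets. I have added an example of the dataset shape. - tfkLSTM
1 Answer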

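One solution is to modify MultiOutputRegressor so that it selects specific columns when fitting each individual target. For example, I defined a MultiOutputRegressorTargetFilter that accepts a features_in parameter: a dictionary indicating which columns to use for each target in y.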
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge

X, y = load_linnerud(return_X_y=True)

# Dictionary mapping each target column index to the feature column indices to use
features_in = {
    0: [0, 2],     # target y[:, 0] uses feature columns 0 and 2
    1: [1, 2],     # target y[:, 1] uses feature columns 1 and 2
    2: [0, 1, 2],  # target y[:, 2] uses all three feature columns
}

# MultiOutputRegressorTargetFilter is defined in the code further down
clf = MultiOutputRegressorTargetFilter(Ridge(random_state=123), features_in=features_in).fit(X, y)
clf.predict(X[[0]])

MultiOutputRegressorTargetFilter code

# Note: _MultiOutputEstimator and _check_fit_params are private scikit-learn
# helpers; their names and locations may differ between releases.
from sklearn.multioutput import _MultiOutputEstimator
from sklearn.base import RegressorMixin, clone
from sklearn.utils.validation import _check_fit_params, has_fit_parameter, check_is_fitted
from joblib import Parallel, delayed

import numpy as np

def _fit_estimator(estimator, X, y, sample_weight=None, **fit_params):
    estimator = clone(estimator)
    if sample_weight is not None:
        estimator.fit(X, y, sample_weight=sample_weight, **fit_params)
    else:
        estimator.fit(X, y, **fit_params)
    return estimator

class MultiOutputRegressorTargetFilter(RegressorMixin, _MultiOutputEstimator):
    """Multi target regression.
    This strategy consists of fitting one regressor per target. This is a
    simple strategy for extending regressors that do not natively support
    multi-target regression. This Estimator allows to select different columns
    to fit a model for each of the target values.
    .. versionadded:: 0.18
    
    Parameters
    ----------
    estimator : estimator object
        An estimator object implementing :term:`fit` and :term:`predict`.
        
    features_in : dict
        Dictionary with (key, value) pairs indicating which variables to use
        to fit model at target y.
        
    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel.
        :meth:`fit`, :meth:`predict` and :meth:`partial_fit` (if supported
        by the passed estimator) will be parallelized for each target.
        When individual estimators are fast to train or predict,
        using ``n_jobs > 1`` can result in slower performance due
        to the parallelism overhead.
        ``None`` means `1` unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all available processes / threads.
        See :term:`Glossary <n_jobs>` for more details.
        .. versionchanged:: 0.20
            `n_jobs` default changed from `1` to `None`.
    
    Attributes
    ----------
    estimators_ : list of ``n_output`` estimators
        Estimators used for predictions.
    
    n_features_in_ : int
        Number of features seen during :term:`fit`. Only defined if the
        underlying `estimator` exposes such an attribute when fit.
        .. versionadded:: 0.24
    
    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Only defined if the
        underlying estimators expose such an attribute when fit.
        .. versionadded:: 1.0
    
    See Also
    --------
    RegressorChain : A multi-label model that arranges regressions into a
        chain.
    MultiOutputClassifier : Classifies each output independently rather than
        chaining.
    
    Examples
    --------
    >>> from sklearn.datasets import load_linnerud
    >>> from sklearn.linear_model import Ridge
    >>> X, y = load_linnerud(return_X_y=True)
    >>> features_in = {0: [0, 2], 1: [1, 2], 2: [0, 1, 2]}
    >>> clf = MultiOutputRegressorTargetFilter(
    ...     Ridge(random_state=123), features_in=features_in
    ... ).fit(X, y)
    >>> clf.predict(X[[0]]).shape
    (1, 3)
    """
    
    def __init__(self, estimator, *, n_jobs=None, features_in=None):
        super().__init__(estimator, n_jobs=n_jobs)
        self.features_in = features_in
        
    def fit(self, X, y, sample_weight=None, **fit_params):
        """Fit the model to data, separately for each output variable.
        
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.
        
        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
            estimation.
        
        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If `None`, then samples are equally weighted.
            Only supported if the underlying regressor supports sample
            weights.
        
        **fit_params : dict of string -> object
            Parameters passed to the ``estimator.fit`` method of each step.
            .. versionadded:: 0.23
        
        Returns
        -------
        self : object
            Returns a fitted instance.
        """

        if not hasattr(self.estimator, "fit"):
            raise ValueError("The base estimator should implement a fit method")

        y = self._validate_data(X="no_validation", y=y, multi_output=True)

        if y.ndim == 1:
            raise ValueError(
                "y must have at least two dimensions for "
                "multi-output regression but has only one."
            )

        if sample_weight is not None and not has_fit_parameter(
            self.estimator, "sample_weight"
        ):
            raise ValueError("Underlying estimator does not support sample weights.")

        fit_params_validated = _check_fit_params(X, fit_params)

        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator, X[:, self.features_in[i]], y[:, i], sample_weight, **fit_params_validated
            )
            for i in range(y.shape[1])
        )

        if hasattr(self.estimators_[0], "n_features_in_"):
            self.n_features_in_ = self.estimators_[0].n_features_in_
        if hasattr(self.estimators_[0], "feature_names_in_"):
            self.feature_names_in_ = self.estimators_[0].feature_names_in_

        return self
    
    def predict(self, X):
        """Predict multi-output variable using model for each target variable.
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.
        Returns
        -------
        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets predicted across multiple predictors.
            Note: Separate models are generated for each predictor.
        """
        check_is_fitted(self)
        if not hasattr(self.estimators_[0], "predict"):
            raise ValueError("The base estimator should implement a predict method")

        y = Parallel(n_jobs=self.n_jobs)(
            delayed(e.predict)(X[:, self.features_in[i]]) for i, e in enumerate(self.estimators_)
        )

        return np.asarray(y).T
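
For readers who would rather not depend on private scikit-learn helpers, the same idea can be sketched with public APIs only. This is a simplified illustration and not the code from this answer: the class name `PerTargetRegressor` is hypothetical, it assumes X can be converted to a NumPy array (as it is after StandardScaler in the question's pipeline), and it omits sample weights, fit parameters, and parallelism.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import Ridge


class PerTargetRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical helper: one clone of `estimator` per target column,
    each trained on its own subset of feature columns."""

    def __init__(self, estimator, features_in=None):
        self.estimator = estimator
        self.features_in = features_in

    def _columns(self, X, i):
        # Use all columns when no mapping (or no entry for target i) is given.
        if self.features_in is None or i not in self.features_in:
            return X
        return X[:, self.features_in[i]]

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.estimators_ = [
            clone(self.estimator).fit(self._columns(X, i), y[:, i])
            for i in range(y.shape[1])
        ]
        return self

    def predict(self, X):
        X = np.asarray(X)
        preds = [est.predict(self._columns(X, i))
                 for i, est in enumerate(self.estimators_)]
        # Stack per-target predictions into shape (n_samples, n_targets).
        return np.column_stack(preds)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=(30, 3))

# Target 2 has no entry, so it falls back to all three features.
model = PerTargetRegressor(Ridge(), features_in={0: [0, 1], 1: [0, 2]}).fit(X, y)
print([est.n_features_in_ for est in model.estimators_])
print(model.predict(X).shape)
```

Since the constructor only stores its arguments, `get_params` exposes the base regressor's settings as `estimator__<name>`, so inside the question's pipeline the existing `estimator__estimator__...` search space should carry over unchanged.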

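@tfkLSTM Did this answer help? - Miguel Trejo
Hi Miguel, this looks very close to what I am trying to achieve. I need a few days to test it, but I will get back to you as soon as I can. - tfkLSTM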
