Scikit-Learn: How to perform best subset GLM Poisson regression?


Could someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I don't know which scikit-learn function performs best subset selection, or how to embed it in a pipeline and GridSearchCV.

Also, how can I include interaction terms among the features and have the best subset algorithm select from them? And how would I embed that in the pipeline and GridSearchCV?

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor
continuous_transformer = Pipeline(steps=[('std_scaler',StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder',OneHotEncoder(drop='first'))])
preprocessor =  ColumnTransformer(transformers = [('continuous',continuous_transformer,continuous_col),
                                                  ('discrete',discrete_transformer,discrete_col)],remainder='passthrough')

pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                           ('glm_model',PoissonRegressor(alpha=0, fit_intercept=True))])

param_grid = {  ??? different combinations of features ????}

gs_en_cv = GridSearchCV(pipeline,
                        param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1,
                        return_train_score=True)
1 Answer


As far as I know, sklearn currently has no "brute force"/exhaustive feature search for best subset selection. However, there are various feature selection classes in sklearn.feature_selection (e.g., RFE, SelectFromModel, SelectKBest) that can stand in for it, and they are used below.


Now, pipelining this can get a bit tricky. When you stack classes/methods in a Pipeline and call .fit(), every step up to (but not including) the last must expose .transform(). The output of each step's .transform() is used as the input to the next step, and so on. The final step can be any valid model, but every preceding step must expose .transform() in order to chain to the next one. So depending on which feature selection method you pick, your code will differ. See below.
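As a minimal sketch of that chaining rule (my illustration, not part of the original answer):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import PoissonRegressor

### every step before the last must expose .transform(); the last can be any estimator
ok = Pipeline(steps=[('scale', StandardScaler()),     # has .transform()
                     ('model', PoissonRegressor())])  # final step, no .transform() needed

### reversing the order would raise a TypeError at .fit(), because
### PoissonRegressor exposes no .transform() to feed the next step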


Pablo Picasso once said, "Good artists borrow, great artists steal." ... So, with a nod to this great answer https://dev59.com/ZZ_ha4cB1Zd3GeqPuzLW#42271829, let's borrow, fix, and extend it further.

Imports

### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop


from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor

### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data

X, y = load_diabetes(as_frame=True, return_X_y=True)
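One caveat (my note, not from the original answer): PoissonRegressor requires a non-negative target, which the diabetes target happens to satisfy:

### Poisson regression assumes count-like data, i.e. y >= 0; sanity-check the target
assert (y >= 0).all()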

Helper functions



### make the parameter grid for your GridSearchCV
### code borrowed from the answer mentioned above and adjusted to work with Python 3+

def make_param_grids(steps, param_grids):

    final_params=[]

    # itertools.product builds the Cartesian product, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name must correspond,
        # e.g. the step 'transform' pairs with the estimator key 'ss'
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name]=[value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name+'__'+param]=value
        #Append this dictionary to final params            
        final_params.append(current_grid)

    return final_params
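For intuition, here is a tiny call with hypothetical step/estimator names showing the shape of what it returns: one grid dict per estimator combination.

### toy illustration (hypothetical names) of the returned structure
toy_steps = {'transform': ['ss'], 'classifier': ['pr']}
toy_grids = {'ss': {'object': StandardScaler(), 'with_mean': [True, False]},
             'pr': {'object': PoissonRegressor(), 'alpha': [0.1, 1.0]}}
make_param_grids(toy_steps, toy_grids)
### -> [{'transform': [StandardScaler()], 'transform__with_mean': [True, False],
###      'classifier': [PoissonRegressor()], 'classifier__alpha': [0.1, 1.0]}]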

#1 Example with the RFE feature selection class

(i.e., a feature selection class of the wrapper type: it wraps an estimator rather than being a plain transform)

### pipelines work from one step to another as long as previous step returns transform 

### adjust next steps to fit your problem space
### below in all_params_grid 

### RFE is a wrapper that wraps your model; another feature selection
### algorithm of the wrapper type is SelectFromModel:
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

### if you wanted to try several alternatives for a step you could list them,
### e.g. ['ss', 'xx'], and add 'xx' to all_param_grids as well -- or plug in
### the preprocessor from your question here
pipeline_steps = {'transform':['ss'],
                  'classifier':['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(), ### you could put your feature preprocessing code here instead; this is just an example
                          'with_mean':[True,False]
                         }, 

                   'rf':{'object':RFE(estimator=PoissonRegressor(), 
                                        step=1,
                                        verbose=0),
                         'n_features_to_select':[1,2,3,4,5,6,7,8,9,10], ### change this to e.g. 1 to see how it influences the accuracy of your grid search
                         'estimator__fit_intercept':[True,False], ### tuning your model's hyperparams
                         'estimator__alpha':[0.1,0.5,0.7,1] ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


### put your pipe together, using classes like StandardScaler() merely as
### placeholders to initialize the pipeline (useful if a step tries several
### alternative transforms); at .fit() the parameters from the param grid
### are passed in and evaluated
pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',RFE(estimator=PoissonRegressor()))])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#2 Example with the SelectKBest feature selection class

(an example of a feature selection class that exposes a .transform() method)


### if you have another feature selector that exposes .transform(), you could
### add it to the 'select' list (and to all_param_grids) to get grids for
### transform -> select[1] -> classifier and transform -> select[2] -> classifier
pipeline_steps = {'transform':['ss'],
                  'select':['kbest'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   
                   ### note: SelectKBest defaults to f_classif (classification);
                   ### f_regression suits this continuous target
                   'kbest': {'object': SelectKBest(score_func=f_regression),
                             'k' : [1,2,3,4,5,6,7,8,9,10] ### set this to 1 to see how it influences grid search accuracy and to validate it influences the next step
                             },

                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('select',SelectKBest()), ### only an initializer; the param grid swaps in the configured objects at .fit()
                       ('classifier',PoissonRegressor())])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#3 Brute force / looping the pipeline over all possible feature combinations

With p features this loops over 2^p - 1 non-empty subsets (1,023 for the 10 diabetes features), so it gets expensive quickly.

pipeline_steps = {'transform':['ss'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',PoissonRegressor())])
pipe


feature_combo = []  ### record feature combination
score = [] ### record GridSearchCV best score
params = [] ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff)+1)): ### tqdm lets you see overall progress bar here
    for subset in itertools.combinations(stuff, L): ### create all possible combinations of features
        ### run it
        gs_en_cv = GridSearchCV(pipe, 
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle = True,
                                         random_state=123),
                               scoring = 'neg_root_mean_squared_error',
                               return_train_score=True,

                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that 
                                ### all parameters you specify are getting fit 
                               verbose = 0)

        fitted = gs_en_cv.fit(X[list(subset)],y)
    
        score.append(fitted.best_score_) ### append results
        params.append(fitted.best_params_) ### append results
        feature_combo.append(list(subset)) ### append results


### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo':feature_combo,
                   'score':score,
                   'params':params})

df.sort_values(by='score', ascending=False,inplace=True)
df.head(1)
df.head(1).params.iloc[0]
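From here you could refit the pipeline on the winning subset with the winning parameters, a sketch of mine assuming the df above:

### refit on the best feature combo using the best found parameters
best = df.iloc[0]
final_model = pipe.set_params(**best.params)
final_model.fit(X[best.feature_combo], y)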



PS

For interactions (I assume you mean creating new features by combining the original ones?), I would create those interaction features directly before .fit() and include them from the start. Otherwise, how would you know you got the best interacted features, given that you would only be interacting them after subset selection? Why not interact them from the very beginning and let the feature selection part of the grid search CV tell you what works best?
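For example, a sketch of that idea using sklearn's PolynomialFeatures (my suggestion, not code from the original answer); interaction_only=True adds the pairwise products without the squared terms:

from sklearn.preprocessing import PolynomialFeatures

### expand X with all pairwise interaction terms before any fitting/selection
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_int = pd.DataFrame(interactions.fit_transform(X),
                     columns=interactions.get_feature_names_out(X.columns))

### X_int holds the original columns plus every pairwise product; feed it to
### any of the three approaches above and let the selector pick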


Note this is just starter code; I did not include ALL the https://scikit-learn.org/stable/modules/feature_selection.html classes, but it is easy to extend my code to them. PoissonRegressor() appears twice because one instance is the MODEL and the other lives inside RFE, which requires an estimator as one of its parameters. Including PoissonRegressor or DecisionTreeRegressor inside RFE means the CV treats them as hyperparameters to try; you can add other models inside RFE as well and they will be tried during cross-validation. I only put in 2 models so you can see how to leverage it. - Yev Guyduy
FYI, I edited my answer a bit to include more feature selection classes and more comments. - Yev Guyduy
By the way, my original answer did not sit well with me, so I went back, experimented, and rewrote it as a result; along the way I also found a bunch of SO answers on similar questions that do not use these methods correctly... check the updates and the comments to verify the outputs are affected. Note that even with only 10 features, method #3 takes about 3 minutes to run on my end. - Yev Guyduy
Regarding #3: SelectKBest uses a scoring function to pick the K highest-scoring features; for example, you can pass it chi2 or f_regression as the scoring function and set a cutoff to choose which features to use based on the returned scores. It does not truly perform an exhaustive search over all possible combinations. I am a bit busy right now, but once I am free I will include the SS for y. - Yev Guyduy
Yes, that is exactly what my #3 does, since neither #1 nor #2 does it. - Yev Guyduy
