Scikit-Learn: How to perform best subset GLM Poisson regression?


Could someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I don't know which scikit-learn function performs best subset selection, or how to embed it in a pipeline and GridSearchCV.

Also, how can I include interaction terms among the features and have the best subset algorithm select from them? And how would I embed that in the pipeline and GridSearchCV?

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor
continuous_transformer = Pipeline(steps=[('std_scaler',StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder',OneHotEncoder(drop='first'))])
preprocessor =  ColumnTransformer(transformers = [('continuous',continuous_transformer,continuous_col),
                                                  ('discrete',discrete_transformer,discrete_col)],remainder='passthrough')

pipeline = Pipeline(steps=[('preprocessor',preprocessor),
                           ('glm_model',PoissonRegressor(alpha=0, fit_intercept=True))])

param_grid = {  ??? different combinations of features ????}

gs_en_cv = GridSearchCV(pipeline,
                        param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1,
                        return_train_score=True)
1 Answer


As far as I know, sklearn currently has no "brute force"/exhaustive feature search for best subset selection. However, there are various feature selection classes in sklearn.feature_selection (e.g., RFE, SelectFromModel, SelectKBest) that can stand in for it, and they are used below.


Now, pipelining this can get a bit tricky. When you stack classes/methods in a Pipeline and call .fit(), every step up to (but not including) the last must expose .transform(). The output of each step's .transform() is used as the input to the next step, and so on. The final step can be any valid model, but every preceding step must expose .transform() in order to chain to the next one. So depending on which feature selection method you pick, your code will differ. See below.
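As a minimal sketch of that chaining rule (my illustration, not part of the original answer):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import PoissonRegressor

### every step before the last must expose .transform(); the last can be any estimator
ok = Pipeline(steps=[('scale', StandardScaler()),     # has .transform()
                     ('model', PoissonRegressor())])  # final step, no .transform() needed

### reversing the order would raise a TypeError at .fit(), because
### PoissonRegressor exposes no .transform() to feed the next step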


Pablo Picasso once said, "Good artists borrow, great artists steal." ... So, with a nod to this great answer https://dev59.com/ZZ_ha4cB1Zd3GeqPuzLW#42271829, let's borrow, fix, and extend it further.

Imports

### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop


from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor

### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data

X, y = load_diabetes(as_frame=True, return_X_y=True)
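One caveat (my note, not from the original answer): PoissonRegressor requires a non-negative target, which the diabetes target happens to satisfy:

### Poisson regression assumes count-like data, i.e. y >= 0; sanity-check the target
assert (y >= 0).all()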

Helper functions



### make the parameter grid for your GridSearchCV
### code borrowed from the answer mentioned above and adjusted to work with Python 3+

def make_param_grids(steps, param_grids):

    final_params=[]

    # itertools.product builds the Cartesian product, so that
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name must correspond,
        # e.g. the step 'transform' pairs with the estimator key 'ss'
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name]=[value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name+'__'+param]=value
        #Append this dictionary to final params            
        final_params.append(current_grid)

    return final_params
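For intuition, here is a tiny call with hypothetical step/estimator names showing the shape of what it returns: one grid dict per estimator combination.

### toy illustration (hypothetical names) of the returned structure
toy_steps = {'transform': ['ss'], 'classifier': ['pr']}
toy_grids = {'ss': {'object': StandardScaler(), 'with_mean': [True, False]},
             'pr': {'object': PoissonRegressor(), 'alpha': [0.1, 1.0]}}
make_param_grids(toy_steps, toy_grids)
### -> [{'transform': [StandardScaler()], 'transform__with_mean': [True, False],
###      'classifier': [PoissonRegressor()], 'classifier__alpha': [0.1, 1.0]}]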

#1 Example with the RFE feature selection class

(i.e., a feature selection class of the wrapper type: it wraps an estimator rather than being a plain transform)

### pipelines work from one step to another as long as previous step returns transform 

### adjust next steps to fit your problem space
### below in all_params_grid 

### RFE is a wrapper that wraps your model; another feature selection
### algorithm of the wrapper type is SelectFromModel:
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

### if you wanted to try several alternatives for a step you could list them,
### e.g. ['ss', 'xx'], and add 'xx' to all_param_grids as well -- or plug in
### the preprocessor from your question here
pipeline_steps = {'transform':['ss'],
                  'classifier':['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(), ### you could put your feature preprocessing code here instead; this is just an example
                          'with_mean':[True,False]
                         }, 

                   'rf':{'object':RFE(estimator=PoissonRegressor(), 
                                        step=1,
                                        verbose=0),
                         'n_features_to_select':[1,2,3,4,5,6,7,8,9,10], ### change this to e.g. 1 to see how it influences the accuracy of your grid search
                         'estimator__fit_intercept':[True,False], ### tuning your model's hyperparams
                         'estimator__alpha':[0.1,0.5,0.7,1] ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


### put your pipe together, using classes like StandardScaler() merely as
### placeholders to initialize the pipeline (useful if a step tries several
### alternative transforms); at .fit() the parameters from the param grid
### are passed in and evaluated
pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',RFE(estimator=PoissonRegressor()))])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#2 Example with the SelectKBest feature selection class

(an example of a feature selection class that exposes a .transform() method)


### if you have another feature selector that exposes .transform(), you could
### add it to the 'select' list (and to all_param_grids) to get grids for
### transform -> select[1] -> classifier and transform -> select[2] -> classifier
pipeline_steps = {'transform':['ss'],
                  'select':['kbest'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   
                   ### note: SelectKBest defaults to f_classif (classification);
                   ### f_regression suits this continuous target
                   'kbest': {'object': SelectKBest(score_func=f_regression),
                             'k' : [1,2,3,4,5,6,7,8,9,10] ### set this to 1 to see how it influences grid search accuracy and to validate it influences the next step
                             },

                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('select',SelectKBest()), ### only an initializer; the param grid swaps in the configured objects at .fit()
                       ('classifier',PoissonRegressor())])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#3 Brute force / looping the pipeline over all possible feature combinations

With p features this loops over 2^p - 1 non-empty subsets (1,023 for the 10 diabetes features), so it gets expensive quickly.

pipeline_steps = {'transform':['ss'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',PoissonRegressor())])
pipe


feature_combo = []  ### record feature combination
score = [] ### record GridSearchCV best score
params = [] ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff)+1)): ### tqdm lets you see overall progress bar here
    for subset in itertools.combinations(stuff, L): ### create all possible combinations of features
        ### run it
        gs_en_cv = GridSearchCV(pipe, 
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle = True,
                                         random_state=123),
                               scoring = 'neg_root_mean_squared_error',
                               return_train_score=True,

                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that 
                                ### all parameters you specify are getting fit 
                               verbose = 0)

        fitted = gs_en_cv.fit(X[list(subset)],y)
    
        score.append(fitted.best_score_) ### append results
        params.append(fitted.best_params_) ### append results
        feature_combo.append(list(subset)) ### append results


### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo':feature_combo,
                   'score':score,
                   'params':params})

df.sort_values(by='score', ascending=False,inplace=True)
df.head(1)
df.head(1).params.iloc[0]
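From here you could refit the pipeline on the winning subset with the winning parameters, a sketch of mine assuming the df above:

### refit on the best feature combo using the best found parameters
best = df.iloc[0]
final_model = pipe.set_params(**best.params)
final_model.fit(X[best.feature_combo], y)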



PS

For interactions (I assume you mean creating new features by combining the original ones?), I would create those interaction features directly before .fit() and include them from the start. Otherwise, how would you know you got the best interacted features, given that you would only be interacting them after subset selection? Why not interact them from the very beginning and let the feature selection part of the grid search CV tell you what works best?
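For example, a sketch of that idea using sklearn's PolynomialFeatures (my suggestion, not code from the original answer); interaction_only=True adds the pairwise products without the squared terms:

from sklearn.preprocessing import PolynomialFeatures

### expand X with all pairwise interaction terms before any fitting/selection
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_int = pd.DataFrame(interactions.fit_transform(X),
                     columns=interactions.get_feature_names_out(X.columns))

### X_int holds the original columns plus every pairwise product; feed it to
### any of the three approaches above and let the selector pick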


Note this is just starter code; I did not include ALL the https://scikit-learn.org/stable/modules/feature_selection.html classes, but it is easy to extend my code to them. PoissonRegressor() appears twice because one instance is the MODEL and the other lives inside RFE, which requires an estimator as one of its parameters. Including PoissonRegressor or DecisionTreeRegressor inside RFE means the CV treats them as hyperparameters to try; you can add other models inside RFE as well and they will be tried during cross-validation. I only put in 2 models so you can see how to leverage it. - Yev Guyduy
FYI, I edited my answer a bit to include more feature selection classes and more comments. - Yev Guyduy
By the way, my original answer did not sit well with me, so I went back, experimented, and rewrote it as a result; along the way I also found a bunch of SO answers on similar questions that do not use these methods correctly... check the updates and the comments to verify the outputs are affected. Note that even with only 10 features, method #3 takes about 3 minutes to run on my end. - Yev Guyduy
Regarding #3: SelectKBest uses a scoring function to pick the K highest-scoring features; for example, you can pass it chi2 or f_regression as the scoring function and set a cutoff to choose which features to use based on the returned scores. It does not truly perform an exhaustive search over all possible combinations. I am a bit busy right now, but once I am free I will include the SS for y. - Yev Guyduy
Yes, that is exactly what my #3 does, since neither #1 nor #2 does it. - Yev Guyduy
