What is the difference between pipeline and make_pipeline in scikit-learn?

Tags: python, machine-learning, scikit-learn, pipeline

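I got this from the sklearn webpage:

Pipeline: Pipeline of transforms with a final estimator.
make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.

But I still do not understand when I should use each one. Can anyone give me an example?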

- Aizzaac

2 Answers

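If we look at the source code, make_pipeline() creates a Pipeline object, so the two are equivalent. As @Mikhail Korobov mentioned, the only difference is that make_pipeline() does not accept estimator names; it sets them to the lowercase of their type names instead. In other words, type(estimator).__name__.lower() is used to create the estimator names (source). So it really is just a simpler way to build a pipeline.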
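A minimal sketch of that naming rule, assuming three arbitrary estimators that all have distinct class names (StandardScaler is added here only for illustration and is not part of the original answer). It compares the step names chosen by make_pipeline() with names derived by hand from the class names:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative estimators; any estimators with distinct class names would do
estimators = [StandardScaler(), PCA(), LogisticRegression()]
auto_pipe = make_pipeline(*estimators)

# Names that make_pipeline assigned to each step
auto_names = [name for name, _ in auto_pipe.steps]
# Names derived by hand from the class names
expected_names = [type(est).__name__.lower() for est in estimators]

print(auto_names)                    # ['standardscaler', 'pca', 'logisticregression']
print(auto_names == expected_names)  # True
```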

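On a related note, to get the parameter names you can use get_params(). This is useful if you want to know the parameter names to pass to GridSearchCV(). The parameter names are formed by recursively joining the estimator names with their kwargs using a double underscore (e.g. max_iter of LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter in OneVsRestClassifier(LogisticRegression()) is stored as 'onevsrestclassifier__estimator__C'; the latter is because, written with kwargs, it is OneVsRestClassifier(estimator=LogisticRegression())).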

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())

print(pipe.get_params())

# {'memory': None,
#  'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
#  'verbose': False,
#  'pca': PCA(),
#  'logisticregression': LogisticRegression(),
#  'pca__copy': True,
#  'pca__iterated_power': 'auto',
#  'pca__n_components': None,
#  'pca__n_oversamples': 10,
#  'pca__power_iteration_normalizer': 'auto',
#  'pca__random_state': None,
#  'pca__svd_solver': 'auto',
#  'pca__tol': 0.0,
#  'pca__whiten': False,
#  'logisticregression__C': 1.0,
#  'logisticregression__class_weight': None,
#  'logisticregression__dual': False,
#  'logisticregression__fit_intercept': True,
#  'logisticregression__intercept_scaling': 1,
#  'logisticregression__l1_ratio': None,
#  'logisticregression__max_iter': 100,
#  'logisticregression__multi_class': 'auto',
#  'logisticregression__n_jobs': None,
#  'logisticregression__penalty': 'l2',
#  'logisticregression__random_state': None,
#  'logisticregression__solver': 'lbfgs',
#  'logisticregression__tol': 0.0001,
#  'logisticregression__verbose': 0,
#  'logisticregression__warm_start': False}

# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

best_score = gs.score(X, y)
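The nested OneVsRestClassifier case described earlier can be checked the same way. This is only a small sketch; the variable names ovr_pipe and c_keys are made up for the example:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Wrap LogisticRegression in OneVsRestClassifier, then in a pipeline
ovr_pipe = make_pipeline(OneVsRestClassifier(LogisticRegression()))
params = ovr_pipe.get_params()

# Keep only the keys that refer to a parameter named C at any nesting depth
c_keys = [key for key in params if key.endswith("__C")]
print(c_keys)  # ['onevsrestclassifier__estimator__C']
```

The middle segment, estimator, is the keyword argument of OneVsRestClassifier that holds the wrapped model.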

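Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming the parameters, but if you name each estimator using the lowercase of its type, Pipeline and make_pipeline will end up with the same params and steps attributes.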

pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])

make_pipe.get_params() == pipe.get_params()   # True
make_pipe.steps == pipe.steps                 # True
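To illustrate the naming flexibility, here is a hedged sketch in which the step names 'reduce' and 'clf' are arbitrary choices made for this example. Whatever names are given to Pipeline become the prefixes accepted by set_params():

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# 'reduce' and 'clf' are arbitrary step names chosen for this example
named_pipe = Pipeline([("reduce", PCA()), ("clf", LogisticRegression())])

# Parameters are addressed as <step name>__<parameter name>
named_pipe.set_params(reduce__n_components=2, clf__C=0.1)

print(named_pipe.named_steps["reduce"].n_components)  # 2
print(named_pipe.named_steps["clf"].C)                # 0.1
```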

- cottontail

- Mikhail Korobov · Accepted Answer

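The only difference is that make_pipeline generates names for the steps automatically.

Step names are needed in some cases, for example if you want to use a pipeline with model selection utilities such as GridSearchCV. With grid search you need to specify parameters for the individual steps of the pipeline: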

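from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: iterable of raw text documents, y: their labels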

Compare it with make_pipeline:

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
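The snippets in this answer leave X and y undefined. As a rough, self-contained sketch, the search can be run on a tiny made-up corpus (the texts, labels, cv=2 and the shortened C grid are all assumptions added for illustration, not part of the original answer):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, only to make the search runnable
texts = ["good movie", "great film", "nice plot", "fine acting",
         "bad movie", "awful film", "poor plot", "weak acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

text_pipe = make_pipeline(CountVectorizer(), LogisticRegression())
text_grid = {"logisticregression__C": [1, 10]}

# cv=2 keeps both classes present in every fold of this tiny dataset
text_search = GridSearchCV(text_pipe, text_grid, cv=2)
text_search.fit(texts, labels)

# The best parameters are reported under the auto-generated step name
print(list(text_search.best_params_))  # ['logisticregression__C']
```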

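So, with Pipeline:

names are explicit, so you do not have to figure them out when you need them;
names do not change if you swap the estimator/transformer used in a step (e.g. if you replace LogisticRegression() with LinearSVC()), so you can still use clf__C.

make_pipeline:

shorter and arguably more readable notation;
names are auto-generated using a simple rule (the lowercase name of the estimator's class).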
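The name-stability point can be sketched as follows. The helper c_param_names is hypothetical and exists only to list the parameter keys that end in '__C':

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC


def c_param_names(pipeline):
    """Return the pipeline parameter keys that end in '__C' (illustrative helper)."""
    return sorted(key for key in pipeline.get_params() if key.endswith("__C"))


# Explicit names: the key stays 'clf__C' when the classifier is swapped
explicit_lr = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])
explicit_svc = Pipeline([("vec", CountVectorizer()), ("clf", LinearSVC())])

# Auto-generated names: the key follows the class name of the classifier
auto_lr = make_pipeline(CountVectorizer(), LogisticRegression())
auto_svc = make_pipeline(CountVectorizer(), LinearSVC())

print(c_param_names(explicit_lr), c_param_names(explicit_svc))  # ['clf__C'] ['clf__C']
print(c_param_names(auto_lr), c_param_names(auto_svc))  # ['logisticregression__C'] ['linearsvc__C']
```

With explicit names, a param_grid written against clf__C keeps working after the swap; with make_pipeline, the grid key has to be renamed along with the estimator.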

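When to use which is up to you :) I like make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.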