Comparing the results of StandardScaler and Normalizer in linear regression

Question

python machine-learning scikit-learn linear-regression

25

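I'm working through some linear regression examples under different scenarios, comparing the results of using Normalizer and StandardScaler, and the results are puzzling.

I'm using the Boston housing dataset and preparing it like this: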

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

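# load the data
# (load_boston and the normalize= argument used below were removed in
# scikit-learn 1.2, so this code assumes an older release)
boston = load_boston()
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target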

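I'm currently trying to reason about the results I get from the following scenarios:

1. Initializing LinearRegression with the parameter normalize=True vs. using Normalizer
2. Initializing LinearRegression with the parameter fit_intercept=False, with and without standardization

Collectively, I find the results confusing.

Here is how I set everything up: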

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

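Then I created 3 separate dataframes to compare the R² scores, coefficient values, and predictions from each model.

To create the dataframe comparing the coefficient values from each model, I did the following: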

#Create a dataframe of the coefficients
coef = pd.DataFrame({
    'coeff':                       reg1.coef_[0],
    'coeff_normalize_true':        reg2.coef_[0],
    'coeff_normalizer':            reg3.coef_[0],
    'coeff_scaler':                reg4.coef_[0],
    'coeff_scaler_no_int':         reg5.coef_[0]
})

Here is how I created the dataframe comparing the R² values from each model:

scores = pd.DataFrame({
    'score':                        reg1.score(X, y),
    'score_normalize_true':         reg2.score(X, y),
    'score_normalizer':             reg3.score(normal_X, y),
    'score_scaler':                 reg4.score(scaled_X, y),
    'score_scaler_no_int':          reg5.score(scaled_X, y)
    }, index=range(1)
)

Lastly, here is the dataframe that compares the predictions from each model:

predictions = pd.DataFrame({
    'pred':                        reg1.predict(X).ravel(),
    'pred_normalize_true':         reg2.predict(X).ravel(),
    'pred_normalizer':             reg3.predict(normal_X).ravel(),
    'pred_scaler':                 reg4.predict(scaled_X).ravel(),
    'pred_scaler_no_int':          reg5.predict(scaled_X).ravel()
}, index=range(len(y)))

Here are the resulting dataframes:

Coefficients: [image]

Scores: [image]

Predictions: [image]

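I have three questions that I can't reconcile:

1. Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand the predictions and R² values being the same, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.

2. I don't understand why the model using Normalizer produces such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.

If you look at the documentation for each, they appear very similar, if not identical.

From the docs on sklearn.linear_model.LinearRegression():

normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.

Meanwhile, the docs on sklearn.preprocessing.Normalizer state that it normalizes to the l2 norm by default.

I don't see a difference between what these two options do, and I don't see why one would produce such radically different coefficient values from the other.

3. The results from the model using StandardScaler are coherent to me, but I don't understand why the model using StandardScaler together with fit_intercept=False performs so poorly.

From the docs on the linear regression module:

fit_intercept : boolean, optional, default True
Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

StandardScaler centers the data, so I don't understand why using it with fit_intercept=False produces incoherent results.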

- Jonathan Bechtel

3 Answers

5

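Answer to Q1

I'm assuming that by the first two models you mean reg1 and reg2. Let us know if that is not the case.

A linear regression has the same predictive power whether or not you normalize the data. Therefore, using normalize=True has no impact on the predictions. One way to understand this is to see that column-wise normalization is a linear operation on each column ((x - a)/b), and a linear transformation of the features does not change what the least-squares fit predicts; it only changes the values of the coefficients. Note that this statement does not hold for Lasso / Ridge / ElasticNet.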
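The invariance claim can be checked without scikit-learn. The following is a minimal NumPy sketch, not the library's implementation: fit_ols is a hypothetical helper built on np.linalg.lstsq, and the data mirrors the synthetic setup used in this answer. It fits once on raw features and once on column-standardized features, then compares predictions and coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 10, size=(100, 2))
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[0], beta[1:]

# Column-wise affine transform (x - a) / b; here a and b are the column mean and std.
a, b = X.mean(axis=0), X.std(axis=0)
Z = (X - a) / b

intercept_raw, coef_raw = fit_ols(X, y)
intercept_std, coef_std = fit_ols(Z, y)

pred_raw = intercept_raw + X @ coef_raw
pred_std = intercept_std + Z @ coef_std

same_predictions = np.allclose(pred_raw, pred_std)
same_coefficients = np.allclose(coef_raw, coef_std)
print("same predictions: ", same_predictions)
print("same coefficients:", same_coefficients)
```

The predictions agree while the coefficients do not; dividing coef_std by b brings them back to the raw-feature scale.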

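So why are the coefficients of reg1 and reg2 the same? Well, normalize=True also takes into account that what the user normally wants is the coefficients on the original features, not on the normalized features. As such, it adjusts the coefficients. One way to check that this makes sense is to use a simpler example: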

np.random.seed(0)  # fixed seed so the checks below are reproducible

# two features, normally distributed with sigma=10
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)

# y is related to each of them plus some noise
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)

X = np.array([x1, x2]).T  # X has two columns

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)

# check that the coefficients are the same and close to [2, 1]
# (rtol is loose because the estimates carry sampling noise)
np.testing.assert_allclose(reg1.coef_, reg2.coef_)
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.05)

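This confirms that both methods correctly capture the real signal between [x1, x2] and y, namely 2 and 1 respectively.

Answer to Q2

Normalizer is not what you would expect. It normalizes each row, row-wise. So the results change dramatically, and it will likely destroy the relationship between the features and the target, which you want to avoid except in specific cases (e.g. TF-IDF).

To see how, take the example above but add a different feature, x3, that is not related to y. Using Normalizer causes x1 to be modified by the value of x3, which weakens its relationship with y.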
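A rough way to see the x3 effect, sketched in NumPy only. The row-wise division by the l2 norm stands in for what Normalizer does by default; the scale of 100 for x3, the sample size, and the helper r2_ols are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 10, size=n)
x2 = rng.normal(0, 10, size=n)
x3 = rng.normal(0, 100, size=n)   # unrelated to y; the larger scale is an assumption
y = 3 + 2 * x1 + 1 * x2 + rng.normal(0, 1, size=n)
X = np.column_stack([x1, x2, x3])

def r2_ols(features, target):
    """R^2 of a least-squares fit with an intercept, scored on the training data."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return 1 - resid @ resid / np.sum((target - target.mean()) ** 2)

# Row-wise l2 normalization: every sample is divided by its own norm,
# so the value of x3 in a row rescales x1 and x2 in that same row.
X_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

r2_raw = r2_ols(X, y)
r2_rows = r2_ols(X_rows, y)
print("R^2 on raw features:           ", r2_raw)
print("R^2 on row-normalized features:", r2_rows)
```

On the raw features the fit is nearly perfect; after row normalization the linear relationship between x1, x2 and y is largely gone, so the score falls well below that.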

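Discrepancy between the coefficients of models (1, 2) and (4, 5)

The discrepancy arises because, when you standardize before fitting, the coefficients are expressed with respect to the standardized features. These are the same coefficients I referred to in the first part of the answer. They can be mapped back to the original parameters using reg4.coef_ / scaler.scale_: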

np.random.seed(0)  # fixed seed so the checks below are reproducible
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)
X = np.array([x1, x2]).T

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
scaler = StandardScaler()
reg4 = LinearRegression().fit(scaler.fit_transform(X), y)

# rtol is loose because the estimates carry sampling noise
np.testing.assert_allclose(reg1.coef_, reg2.coef_)
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.05)

# map the coefficients on standardized features back to the original scale
coefficients = reg4.coef_ / scaler.scale_
np.testing.assert_allclose(coefficients, np.array([2, 1]), rtol=0.05)

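This works because, mathematically, setting z = (x - mu)/sigma, the model reg4 solves y = a1*z1 + a2*z2 + a0. We can recover the relationship between y and x through simple algebra: y = a1*[(x1 - mu1)/sigma1] + a2*[(x2 - mu2)/sigma2] + a0, which simplifies to y = (a1/sigma1)*x1 + (a2/sigma2)*x2 + (a0 - a1*mu1/sigma1 - a2*mu2/sigma2).

In this notation, reg4.coef_ / scaler.scale_ represents [a1/sigma1, a2/sigma2], which is exactly the adjustment normalize=True makes to guarantee that the coefficients are the same.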
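The algebra also gives the intercept on the original scale, a0 - a1*mu1/sigma1 - a2*mu2/sigma2. Here is a small NumPy sketch of both mappings; fit_ols is a hypothetical least-squares helper, mu and sigma play the role of the scaler's mean_ and scale_, and the feature mean of 5 is an assumption chosen so that the intercept shift is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5, 10, size=(100, 2))   # non-zero feature means (assumption)
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[0], beta[1:]

mu, sigma = X.mean(axis=0), X.std(axis=0)   # population std (ddof=0)
Z = (X - mu) / sigma

a0, a = fit_ols(Z, y)   # model on standardized features
b0, b = fit_ols(X, y)   # model on raw features

coef_back = a / sigma                          # [a1/sigma1, a2/sigma2]
intercept_back = a0 - np.sum(a * mu / sigma)   # a0 - a1*mu1/sigma1 - a2*mu2/sigma2

print(coef_back, b)
print(intercept_back, b0)
```

Both pairs agree up to floating-point error, so the standardized model and the raw model describe the same fitted plane.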

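Incoherence of the score of model 5

Standardized features have zero mean, but the target variable does not necessarily. Therefore, not fitting an intercept causes the model to disregard the mean of the target. In the example I have been using, the "3" in y = 3 + ... is not fitted, which naturally decreases the predictive power of the model. :)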
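To put a number on this, here is a NumPy sketch that scores the same standardized features with and without an intercept column. The intercept of 25 (instead of 3) is an assumption made so that the drop is easy to see, and r2 is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 10, size=(100, 2))
# An intercept of 25 is an arbitrary choice that makes the effect obvious.
y = 25 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized features, zero column means

def r2(target, pred):
    """Coefficient of determination."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

# With an intercept: append a column of ones to the design matrix.
design = np.column_stack([np.ones(len(Z)), Z])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
r2_with_intercept = r2(y, design @ beta)

# Without an intercept: the fitted plane is forced through the origin.
coef_no_int, *_ = np.linalg.lstsq(Z, y, rcond=None)
r2_no_intercept = r2(y, Z @ coef_no_int)

print("R^2 with intercept:   ", r2_with_intercept)
print("R^2 without intercept:", r2_no_intercept)
```

The slopes are identical in both fits because the standardized columns are orthogonal to the constant column; only the missing constant term hurts the score.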

- Jorge Leitao

2

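The last question (3), about the incoherent results with fit_intercept=0 and standardized data, has not been answered fully.

The OP is likely expecting StandardScaler to standardize both X and y, which would make the intercept necessarily 0 (a proof appears about a third of the way down the linked page).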
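The zero-intercept claim is easy to check numerically. This is a NumPy-only sketch under assumed synthetic data (intercept of 25, two features): it standardizes both X and y by hand, fits with an intercept column to show that the fitted intercept vanishes, then fits without one and maps the predictions back to the original scale of y.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 10, size=(100, 2))
y = 25 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

Z = (X - X.mean(axis=0)) / X.std(axis=0)
y_mean, y_std = y.mean(), y.std()
y_z = (y - y_mean) / y_std               # standardized target

# Fit WITH an intercept on standardized X and y: the intercept comes out as ~0.
design = np.column_stack([np.ones(len(Z)), Z])
beta, *_ = np.linalg.lstsq(design, y_z, rcond=None)
fitted_intercept = beta[0]

# Fit WITHOUT an intercept, then undo the target scaling on the predictions.
coef, *_ = np.linalg.lstsq(Z, y_z, rcond=None)
pred = (Z @ coef) * y_std + y_mean
r2_back = 1 - np.sum((y - pred) ** 2) / np.sum((y - y_mean) ** 2)

print("fitted intercept:", fitted_intercept)
print("R^2 after inverse-transforming predictions:", r2_back)
```

Once y is centered as well, dropping the intercept costs nothing, which is the behaviour the OP seems to have expected.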

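However, StandardScaler ignores y. See the API.

TransformedTargetRegressor offers a solution. This approach is also useful for non-linear transformations of the dependent variable, such as a log transformation of y (but note the linked caveat).

Here is an example that resolves the OP's question 3: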

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

# define a custom transformer
class stdY(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self,Y):
        self.std_err_=np.std(Y)
        self.mean_=np.mean(Y)
        return self
    def transform(self,Y):
        return (Y-self.mean_)/self.std_err_
    def inverse_transform(self,Y):
        return Y*self.std_err_+self.mean_

# standardize X and no intercept pipeline
no_int_pipe=make_pipeline(StandardScaler(),LinearRegression(fit_intercept=0)) # only standardizing X, so not expecting a great fit by itself.

# standardize y pipeline
std_lin_reg=TransformedTargetRegressor(regressor=no_int_pipe, transformer=stdY()) # transforms y, estimates the model, then reverses the transformation for evaluating loss.

# after returning to re-read my answer, there's an even easier solution: use StandardScaler as the transformer
std_lin_reg_easy=TransformedTargetRegressor(regressor=no_int_pipe, transformer=StandardScaler())

# generate some simple data
X, y, w = make_regression(n_samples=100,
                          n_features=3, # x variables generated and returned 
                          n_informative=3, # x variables included in the actual model of y
                          effective_rank=3, # make less than n_informative for multicollinearity
                          coef=True,
                          noise=0.1,
                          random_state=0,
                          bias=10)

std_lin_reg.fit(X,y)
print('custom transformer on y and no intercept r2_score: ',std_lin_reg.score(X,y))

std_lin_reg_easy.fit(X,y)
print('standard scaler on y and no intercept r2_score: ',std_lin_reg_easy.score(X,y))

no_int_pipe.fit(X,y)
print('\nonly standard scalar and no intercept r2_score: ',no_int_pipe.score(X,y))

which returns

custom transformer on y and no intercept r2_score:  0.9999343800041816

standard scaler on y and no intercept r2_score:  0.9999343800041816

only standard scalar and no intercept r2_score:  0.3319175799267782

- Chappy Hickens


- Venkatachalam · Accepted Answer

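The reason there is no difference in coefficients between the first two models is that sklearn de-normalizes the coefficients behind the scenes after calculating them from the normalized input data. Reference

This de-normalization is done so that, for test data, we can apply the coefficients directly and get predictions without normalizing the test data.

Hence, setting normalize=True does affect the coefficients computed internally, but it does not change the best-fit line.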
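The de-normalization step can be imitated by hand. This NumPy sketch follows the docstring's description (subtract the column mean, divide by the column l2 norm); it is an assumption about the general idea, not a copy of sklearn's internal code, and X_new is a set of hypothetical unseen rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5, 10, size=(100, 2))
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 1, size=100)

# Normalize as the docstring describes: center each column, divide by its l2 norm.
x_mean = X.mean(axis=0)
x_norm = np.linalg.norm(X - x_mean, axis=0)
X_n = (X - x_mean) / x_norm

# Fit on the normalized features (y is centered, so no intercept column is needed).
coef_n, *_ = np.linalg.lstsq(X_n, y - y.mean(), rcond=None)

# De-normalize so the model can be applied to raw, un-normalized rows.
coef = coef_n / x_norm
intercept = y.mean() - x_mean @ coef

# Reference: plain least squares on the raw features.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

X_new = rng.normal(5, 10, size=(5, 2))   # hypothetical unseen rows, left un-normalized
pred_denormalized = intercept + X_new @ coef
pred_reference = beta[0] + X_new @ beta[1:]
print(coef, beta[1:])
```

The rescaled coefficients match the plain fit, so the raw test rows can be used as they are.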

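Normalizer normalizes each sample individually (that is, row-wise). See the reference code here.

From the documentation:

Normalize samples individually to unit norm.

In contrast, normalize=True normalizes each column/feature. Reference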
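A tiny NumPy illustration of the two axes, using a made-up 3x2 matrix. The column version here only rescales, without the mean subtraction that normalize=True is documented to perform, so it corresponds to plain l2 scaling along axis 0.

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Row-wise: each sample is divided by its own l2 norm.
rows = X / np.linalg.norm(X, axis=1, keepdims=True)

# Column-wise: each feature is divided by the l2 norm of its column.
cols = X / np.linalg.norm(X, axis=0, keepdims=True)

print(np.linalg.norm(rows, axis=1))   # every row now has unit norm
print(np.linalg.norm(cols, axis=0))   # every column now has unit norm
print(rows[0], rows[1])               # rows 0 and 1 collapse onto the same point
```

Rows 0 and 1 differ by a factor of two in the raw data, yet they become identical after row normalization. That loss of magnitude information is what disturbs the regression, whereas column scaling keeps the ordering and relative spacing within each feature.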

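Here is an example to understand the impact of normalizing along different dimensions of the data. Take two dimensions, x1 and x2, with y as the target variable. The target value is color-coded in the figure.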

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer, StandardScaler, normalize

n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise

fig,ax=plt.subplots(1,4,figsize=(20,6))

ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)

X = np.column_stack((x1,x2))

column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)

row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)

standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)

plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()

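You can see that the best-fit line for the data in figures 1, 2 and 4 would be the same, which means the R² score does not change under column/feature normalization or standardization of the data. You just end up with different coefficient values.

Note: the best-fit line for figure 3 would be different.

When you set fit_intercept=False, the bias term is dropped from the prediction. That is, the intercept is set to zero, whereas for centered features it would otherwise have been the mean of the target variable.

A prediction with a zero intercept is expected to perform badly on problems where the target variable is not centered (mean = 0). You can see a difference of 22.532 in every row, which shows the effect on the output.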
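The constant gap per row can be reproduced on synthetic data. This NumPy sketch reuses the x1, x2, y setup from the plotting example in this answer and compares predictions with and without an intercept on standardized features; on the Boston data the same gap would be the mean of PRICE.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 2, size=(50, 2))
y = 5 + 0.5 * X[:, 0] + 2.5 * X[:, 1] + rng.normal(0, 1, size=50)

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized features

# Predictions from a fit that includes an intercept column.
design = np.column_stack([np.ones(len(Z)), Z])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
pred_with_intercept = design @ beta

# Predictions from a fit that is forced through the origin.
coef_no_int, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred_no_intercept = Z @ coef_no_int

gap = pred_with_intercept - pred_no_intercept
print("gap range:  ", gap.min(), gap.max())
print("mean of y:  ", y.mean())
```

Every row is off by the same amount, and that amount is the mean of the target.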