I expected LASSO with no penalty ($\lambda=0$) to produce the same (or very similar) coefficient estimates as an OLS fit. However, when I put the same data (x, y) into the following two models:
glmnet(x, y, alpha=1, lambda=0)
for the LASSO fit with no penalty, and
lm(y ~ x)
for the OLS fit, why do the coefficient estimates come out different?
I once ran into the same problem. I asked around without getting an answer, then emailed the package maintainer (Trevor Hastie), who provided the solution. The problem occurs when the series are highly correlated, and the fix is to lower the convergence threshold in the glmnet()
call itself (rather than through glmnet.control()
). The code below uses the built-in dataset EuStockMarkets
and fits a VAR with lambda=0
. For XSMI, the OLS coefficient is below 1, the default glmnet
coefficient is above 1 (a difference of about 0.03), and the glmnet
coefficient with thresh=1e-14
is very close to the OLS coefficient (a difference of 1.8e-7
).
# Load glmnet and use built-in panel data with integrated series
library(glmnet)
data("EuStockMarkets")
selected_market <- 2
# Take logs for good measure
EuStockMarkets <- log(EuStockMarkets)
# Get dimensions
num_entities <- dim(EuStockMarkets)[2]
num_observations <- dim(EuStockMarkets)[1]
# Build the response with the most recent observations at the top
Y <- as.matrix(EuStockMarkets[num_observations:2, selected_market])
X <- as.matrix(EuStockMarkets[(num_observations - 1):1, ])
# Run OLS, which adds an intercept by default
ols <- lm(Y ~ X)
ols_coef <- coef(ols)
# run glmnet with lambda = 0
fit <- glmnet(y = Y, x = X, lambda = 0)
lasso_coef <- coef(fit)
# run again, but with a stricter threshold
fit_threshold <- glmnet(y = Y, x = X, lambda = 0, thresh = 1e-14)
lasso_threshold_coef <- coef(fit_threshold)
# build a dataframe to compare the two approaches
comparison <- data.frame(ols = ols_coef,
lasso = lasso_coef[1:length(lasso_coef)],
lasso_threshold = lasso_threshold_coef[1:length(lasso_threshold_coef)]
)
comparison$difference <- comparison$ols - comparison$lasso
comparison$difference_threshold <- comparison$ols - comparison$lasso_threshold
# Show the two values for the autoregressive parameter and their difference
comparison[1 + selected_market, ]
This returns:
ols lasso lasso_threshold difference difference_threshold
XSMI 0.9951249 1.022945 0.9951248 -0.02782045 1.796699e-07
You are using the function incorrectly. The x
should be the model matrix, not the raw predictor values. Doing so gives exactly the same results:
x <- rnorm(500)
y <- rnorm(500)
mod1 <- lm(y ~ x)
xmm <- model.matrix(mod1)
mod2 <- glmnet(xmm, y, alpha=1, lambda=0)
coef(mod1)
coef(mod2)
I ran the following code on the "prostate" example dataset from Hastie's book:
out.lin1 = lm( lpsa ~ . , data=yy )
out.lin1$coeff
out.lin2 = glmnet( as.matrix(yy[ , -9]), yy$lpsa, family="gaussian", lambda=0, standardize=T )
coefficients(out.lin2)
The resulting coefficients are very similar. Even when we use the standardize option, the coefficients returned by glmnet() are expressed in the original units of the input variables. Make sure you are using the "gaussian" family.
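To illustrate the point above, here is a minimal sketch (with simulated data, not the prostate dataset) checking that at lambda = 0 the glmnet coefficients land on the original scale and agree with OLS whether or not standardize is set, provided the threshold is tightened:

```r
# Sketch with simulated data: at lambda = 0, glmnet coefficients are
# always reported on the original scale, so both standardize settings
# should match OLS once thresh is small enough.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)
y <- x %*% c(1, -2, 0.5) + rnorm(200)

ols <- coef(lm(y ~ x))
g_std   <- coef(glmnet(x, y, family = "gaussian", lambda = 0,
                       standardize = TRUE,  thresh = 1e-14))
g_nostd <- coef(glmnet(x, y, family = "gaussian", lambda = 0,
                       standardize = FALSE, thresh = 1e-14))

# Both differences should be tiny
max(abs(as.vector(g_std) - ols))
max(abs(as.vector(g_nostd) - ols))
```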