Confidence and prediction intervals with StatsModels

Question

62

I am running this linear regression with StatsModels:

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

n = 100

x = np.linspace(0, 10, n)
e = np.random.normal(size=n)
y = 1 + 0.5*x + 2*e
X = sm.add_constant(x)

re = sm.OLS(y, X).fit()
print(re.summary())

prstd, iv_l, iv_u = wls_prediction_std(re)

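My question is: are iv_l and iv_u the lower and upper bounds of the confidence interval or of the prediction interval?

How do I get the other one?

I need the confidence and prediction intervals for all points in order to make a plot.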

- F.N.B

1

https://dev59.com/j14d5IYBdhLWcg3wG_aH#63560689 - Marco Cerliani

7 Answers

49

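Update: see the second answer, which is more recent. Many models and results classes now have a get_prediction method that provides additional information, including prediction intervals and/or confidence intervals for the predicted mean.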

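Old answer:

iv_l and iv_u give you the lower and upper limits of the prediction interval for each point.

A prediction interval is the confidence interval for an observation, so it includes the estimate of the error.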
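For reference, the distinction can be written out with the standard textbook formulas for OLS. The notation here is an assumption, not taken from the answer: $x_0$ is the regressor vector at the point of interest, $s^2$ the residual mean square, $n$ the number of observations and $p$ the number of parameters.

```latex
% Confidence interval for the mean response at x_0
\hat{y}_0 \pm t_{1-\alpha/2,\; n-p} \sqrt{ s^2 \, x_0^\top (X^\top X)^{-1} x_0 }

% Prediction interval for a new observation at x_0
\hat{y}_0 \pm t_{1-\alpha/2,\; n-p} \sqrt{ s^2 \left( 1 + x_0^\top (X^\top X)^{-1} x_0 \right) }
```

The extra $1$ under the square root is the contribution of the error term, which is why the prediction interval is always the wider of the two.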

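I think the confidence interval for the mean prediction is not yet available in statsmodels. (Actually, the confidence interval for the fitted values is hiding inside the summary_table of influence_outlier, but I need to verify this.)

Proper prediction methods are on the statsmodels TODO list.

Addition:

Confidence intervals are there for OLS, but the access is a bit clumsy.

To be included after running your script: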

import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import summary_table

# summary_table returns the formatted table, the underlying data array, and the column names
st, data, ss2 = summary_table(re, alpha=0.05)

fittedvalues = data[:, 2]
predict_mean_se  = data[:, 3]
predict_mean_ci_low, predict_mean_ci_upp = data[:, 4:6].T
predict_ci_low, predict_ci_upp = data[:, 6:8].T

# Check we got the right things
print(np.max(np.abs(re.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))

plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
# Outer pair: prediction interval; inner pair: confidence interval of the mean
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
plt.show()

This should give the same results as SAS, http://jpktd.blogspot.ca/2012/01/nice-thing-about-seeing-zeros.html

- Josef

1

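The problem with this approach is that if the points are sparse, the plots of predict_mean_ci_low and predict_mean_ci_upp come out jagged, because they exist only at the fitted values rather than over a range of points. The fitted line, however, is defined everywhere. There is a comment at https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/outliers_influence.py#L693 saying "works only for fitted values with hat matrix" - any ideas how to get around this? - Andrew Mao

I ran into a problem applying this answer to my data set and have posted a separate question here: https://dev59.com/rZPfa4cB1Zd3GeqPCmMi. Any suggestions would be much appreciated! - pjw

This is an old question, but based on this answer, how can I get only those data points that fall below the 95% CI? I posted it as a new question: https://stackoverflow.com/questions/50585837/data-points-below-confidence-interval - PedroA

Isn't there a way to do the same when using "fit_regularized()"? It seems all of these methods only work with a normal "fit()". - azal

Confidence intervals are now also possible for an OLS fit of a non-linear curve, as long as the model is linear in the parameters. - JeeyCi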

4

For time series results, the get_forecast() method gives a much smoother plot. Here is a time series example:

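from statsmodels.tsa.statespace.sarimax import SARIMAX

# Seasonal Arima Modeling, no exogenous variable
# (train and test are the author's own data frames with an 'MI' column)
model = SARIMAX(train['MI'], order=(1,1,1), seasonal_order=(1,1,0,12), enforce_invertibility=True)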

results = model.fit()

results.summary()

The next step is to make the predictions, which generates the confidence intervals.

# make the predictions for 11 steps ahead
predictions_int = results.get_forecast(steps=11)
predictions_int.predicted_mean

These can be put in a data frame, but they need some cleaning up:

# get a better view
predictions_int.conf_int()

Concatenate the data frames, then clean up the headers:

conf_df = pd.concat([test['MI'],predictions_int.predicted_mean, predictions_int.conf_int()], axis = 1)

conf_df.head()

Then we rename the columns.

conf_df = conf_df.rename(columns={0: 'Predictions', 'lower MI': 'Lower CI', 'upper MI': 'Upper CI'})
conf_df.head()

Make the plot.

# make a plot of model fit
# color = 'skyblue'

fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)


x = conf_df.index.values


upper = conf_df['Upper CI']
lower = conf_df['Lower CI']

conf_df['MI'].plot(color = 'blue', label = 'Actual')
conf_df['Predictions'].plot(color = 'orange',label = 'Predicted' )
upper.plot(color = 'grey', label = 'Upper CI')
lower.plot(color = 'grey', label = 'Lower CI')

# plot the legend for the first plot
plt.legend(loc = 'lower left', fontsize = 12)


# fill between the conf intervals
plt.fill_between(x, lower, upper, color='grey', alpha=0.2)

plt.ylim(1000,3500)

plt.show()

- Bryan Butler

3

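summary_frame and summary_table work well when you need exact results for a single quantile, but they don't vectorize well. The following gives a normal approximation of the prediction interval (not the confidence interval) and works for a vector of quantiles: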

def ols_quantile(m, X, q):
  # m: Statsmodels OLS model.
  # X: X matrix of data to predict.
  # q: Quantile.
  #
  from scipy.stats import norm
  mean_pred = m.predict(X)
  se = np.sqrt(m.scale)
  return mean_pred + norm.ppf(q) * se

- Max Ghenis

3

To add to Max Ghenis' answer: you can use .get_prediction() to generate confidence intervals, not just prediction intervals, by calling .conf_int() on the result.

predictions = result.get_prediction(out_of_sample_df)
predictions.conf_int(alpha = 0.05)

- Arthur Vandervoort

3

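You can get prediction intervals with the LRPI() class from the IPython notebook in my repo (https://github.com/shahejokarian/regression-prediction-interval). You need to set the t value to get the desired confidence level; otherwise the default is a 95% confidence level. The LRPI class uses LinearRegression from sklearn.linear_model, plus numpy and pandas. The notebook also contains an example.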

- Shahe Jokarian

0

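You can calculate them from the results given by statsmodels and the normality assumptions.

Here is an example for OLS and the CI of the mean value: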

import statsmodels.api as sm
import numpy as np
from scipy import stats

#Significance level:
sl = 0.05
#Evaluate mean value at a required point x0. Here, at the point (0.0,2.0) for N_model=2:
x0 = np.asarray([1.0, 0.0, 2.0])# If you have no constant in your model, remove the first 1.0. For more dimensions, add the desired values.

#Get an OLS model based on output y and the prepared vector X (as in your notation):
model = sm.OLS(endog = y, exog = X )
results = model.fit()
#Get two-tailed t-values:
# Confidence level is passed positionally; newer SciPy renamed the alpha keyword to confidence
(t_minus, t_plus) = stats.t.interval(1.0 - sl, df =  len(results.resid) - len(x0) )
y_value_at_x0 = np.dot(results.params, x0)
lower_bound = y_value_at_x0 + t_minus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))
upper_bound = y_value_at_x0 +  t_plus*np.sqrt(results.mse_resid*( np.dot(np.dot(x0.T,results.normalized_cov_params),x0) ))

You can wrap a nice function around this that takes the results, the point x0 and the significance level sl as input.

I am not sure whether you can use this for WLS(), since there are extra things happening there.

Reference: Chapter 3 in [D.C. Montgomery and E.A. Peck. “Introduction to Linear Regression Analysis.” 4th. Ed., Wiley, 1992].

- fabrica

- Julius · Accepted Answer

For test data you can try the following.

predictions = result.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)

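The summary_frame() and get_prediction() methods are both covered in the statsmodels documentation. You can change the significance level of the confidence interval and the prediction interval by modifying the "alpha" parameter.

I am posting this here because this was the first post that came up when I was looking for a solution for confidence and prediction intervals, even though this answer only deals with test data.

Here is a function that uses this approach to take a model, new data, and an arbitrary quantile: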

def ols_quantile(m, X, q):
  # m: OLS model.
  # X: X matrix.
  # q: Quantile.
  #
  # Set alpha based on q.
  a = q * 2
  if q > 0.5:
    a = 2 * (1 - q)
  predictions = m.get_prediction(X)
  frame = predictions.summary_frame(alpha=a)
  if q > 0.5:
    return frame.obs_ci_upper
  return frame.obs_ci_lower