Calculating the standard deviation of estimation errors for an ensemble model

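I have a model whose residuals I want to analyze. Ultimately, I want to identify, for each day, the extreme residuals that fall outside a confidence interval. However, I am having trouble calculating the point-wise standard deviation of the residuals across the individual models in my bagging regressor.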

Here is my sample code:

import datetime

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor

# Sample DataFrame
df = pd.DataFrame(np.random.randint(0,200,size=(500, 4)), columns=list('ABCD'))

# Add dates to sample data
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(500)]
df['date'] = date_list
df['date'] = df['date'].astype('str')

# Split dataset into training (first 80%) and testing (last 20%)
split = int(len(df) * 0.80)
train = df[:split]
test = df[split:]

X_train = train[['B','C','D','date']]
X_test = test[['B','C','D','date']]

y_train = train[['A']]
y_test = test[['A']].copy()  # copy so columns can be added later without warnings

# Function to Encode the data
def encode_and_bind(data_in, feature_to_encode):
    dummies = pd.get_dummies(data_in[[feature_to_encode]])
    data_out = pd.concat([data_in, dummies], axis=1)
    data_out = data_out.drop([feature_to_encode], axis=1)
    return data_out

# Encode train and test together so both end up with the same dummy columns,
# then separate them again by index
encoded = encode_and_bind(pd.concat([X_train, X_test]), 'date')
X_train_final = encoded.loc[X_train.index]
X_test_final = encoded.loc[X_test.index]

# Define Model
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
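# The base model is passed positionally because the keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
regr = BaggingRegressor(svr_lin, random_state=5).fit(X_train_final, y_train.values.ravel())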

# Predictions
y_pred = regr.predict(X_test_final)

# Join the predictions back into the original dataframe
y_test['predict'] = y_pred

# Calculate residuals
y_test['residuals'] = y_test['A'] - y_test['predict']

I found this approach online:

raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]

But I'm not sure what I should use in place of the x.predict([[0, 0, 0, 0]]) part, since I have far more than 4 features.

Edit:

Building on @2MuchC0ff33's answer, I tried the following:

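stdevs = []

for dates in X_test_final.columns[3:]:
  # Rows of the test set that fall on this date (named rows so the test frame is not overwritten)
  rows = X_test_final[X_test_final[dates]==1]
  if rows.empty:
    continue  # no test observation for this date column
  # Predict the first matching row with every estimator in the ensemble
  raw_pred = [x.predict(rows.iloc[[0]].to_numpy(dtype=float)) for x in regr.estimators_]

  sdev = np.std(raw_pred)
  stdevs.append(dates + "," + str(sdev))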

This looks right, but I don't understand these calculations well enough to tell whether it works the way I intend.

1 Answer

F, thanks for sharing your attempt at applying my answer.

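I'll try to break everything down and hopefully give you the solution you need. Apologies in advance if I repeat some of your code; that's just how my brain works, haha.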

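You can group the residuals by date and compute the standard deviation of each group to get the point-wise standard deviation of the residuals for each day. Here are the steps: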

# y_test has no date column yet, so take the dates from X_test (same index) and keep YYYY-MM-DD
y_test['date'] = X_test['date'].str[:10]
grouped = y_test.groupby(['date'])
residual_groups = grouped['residuals']
residual_stds = residual_groups.std()

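This gives you the standard deviation of the residuals for each day. For each day, multiply the standard deviation by a constant such as 1.96 (for a 95% confidence interval, assuming roughly normal residuals) and add it to, or subtract it from, the mean of the residuals.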

residual_means = residual_groups.mean()
CI = 1.96 * residual_stds
upper_bound = residual_means + CI
lower_bound = residual_means - CI

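Finally, by comparing the residuals against the lower and upper bounds, you can identify the extreme residuals that fall outside the confidence interval for each day:
# The bounds are indexed by date, so map them back onto the individual rows before comparing
row_upper = y_test['date'].map(upper_bound)
row_lower = y_test['date'].map(lower_bound)
extreme_residuals = y_test[(y_test['residuals'] > row_upper) | (y_test['residuals'] < row_lower)]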
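
The per-day flagging described above can be tried out on a small made-up table. This is only a sketch under assumptions: the `demo` frame, its three dates, the 30 residuals per day, and the planted outlier are all hypothetical and not taken from the question. It also highlights a caveat: a per-day standard deviation needs several residuals per day. In the sample data from the question each date appears only once, so `std()` would return NaN for every day.

```python
import numpy as np
import pandas as pd

# Hypothetical residuals: 3 days with 30 observations each
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "date": np.repeat(["2023-01-01", "2023-01-02", "2023-01-03"], 30),
    "residuals": rng.normal(0, 1, size=90),
})
demo.loc[5, "residuals"] = 8.0  # planted outlier on the first day

# transform() returns one value per row, so the bounds line up with the residuals
by_day = demo.groupby("date")["residuals"]
center = by_day.transform("mean")
spread = by_day.transform("std")

demo["lower"] = center - 1.96 * spread
demo["upper"] = center + 1.96 * spread
demo["extreme"] = (demo["residuals"] < demo["lower"]) | (demo["residuals"] > demo["upper"])

print(demo[demo["extreme"]])
```

Using `transform` instead of `std()` followed by a lookup keeps the bounds on the same index as the residuals, which avoids comparing Series with different labels.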

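You can extend this approach to compute the standard deviation of the ensemble's predictions for each day:
# Group the test data by day; the encoded frame no longer has a 'date' column,
# so group on the dates kept in X_test (same index)
grouped = X_test_final.groupby(X_test['date'].str[:10])

stdevs = []
for name, group in grouped:
  raw_pred = [x.predict(group.to_numpy(dtype=float)) for x in regr.estimators_]
  # Calculate the standard deviation of the predictions for each group
  # (pooled over all estimators and all rows of that day)
  sdev = np.std(raw_pred)
  stdevs.append((name, sdev))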

I think we can replace 0, 0, 0, 0 with X_test_final. Please see my updated approach below and let me know what you think:

raw_pred = [x.predict([X_test_final.iloc[0]]) for x in regr.estimators_]
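
The single-row idea can be extended to every test row at once. The following is a minimal sketch on synthetic data, not the asker's dataset: the arrays, the coefficients, and the names `all_preds`, `pointwise_std`, and `ensemble_mean` are assumptions made for illustration. Each estimator is given the columns listed in `regr.estimators_features_`, which mirrors what `BaggingRegressor.predict` does internally and should stay correct even if an estimator was fitted on a subset or reordering of the columns.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, size=120)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# Base model passed positionally so the call works across scikit-learn versions
regr = BaggingRegressor(SVR(kernel="linear", C=1.0), n_estimators=10, random_state=5)
regr.fit(X_train, y_train)

# One row of predictions per fitted estimator: shape (n_estimators, n_test_rows)
all_preds = np.stack([
    est.predict(X_test[:, feats])
    for est, feats in zip(regr.estimators_, regr.estimators_features_)
])

# Spread of the ensemble members at each test point
pointwise_std = all_preds.std(axis=0)
# Averaging the members reproduces the bagged prediction
ensemble_mean = all_preds.mean(axis=0)

print(pointwise_std[:5])
```

Taking the standard deviation along `axis=0` gives one value per test row, so it can be attached to the test frame as a new column and then aggregated by day if needed.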

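Great, so I can then use np.std(raw_pred) to calculate the standard deviation. But I'm not sure how to extend this approach to find the standard deviation for each day. - Rbc.F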
I've tried to break it down and hopefully given you a solution. *fingers crossed* - 2MuchC0ff33
