Are your actual values confined to the range [-0.05, 0.05)?
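When I first built a sample dataset to look into your problem, I drew random floats from [0, 1]. With that data I got the same result you observed: for every sequence, the sixth entry was always the maximum (a, b, c), and the predicted class was always the same. But my data (uniformly distributed between 0 and 1, so centered near 0.5) had a much larger central tendency than the grid-search values for the sixth entry (between -.05 and .05). As a result, the HMM always chose the highest-valued triple, (.04, .04, .04), because it is the candidate closest to the main density of the distribution the model was trained on.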
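To illustrate the point above, here is a minimal sketch using a single 1-D Gaussian rather than a full HMM. It assumes an emission whose mean and standard deviation match Uniform(0, 1) (0.5 and roughly 0.289); `gaussian_logpdf` is a hypothetical helper written for this example, not an hmmlearn function.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian (illustrative helper, not part of hmmlearn)."""
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

# Assumption: emission parameters equal to the mean and std of Uniform(0, 1).
mean, std = 0.5, np.sqrt(1 / 12)

# Candidate values searched for each feature of the sixth observation.
grid = np.arange(-0.05, 0.05, 0.01)

# The whole grid lies below the mean, so the density rises monotonically
# across it and the largest grid value always wins.
log_dens = gaussian_logpdf(grid, mean, std)
best_value = grid[np.argmax(log_dens)]
print("Highest-density grid value:", round(float(best_value), 2))
```

Because every candidate sits on the same side of the mean, the winner does not depend on the first five observations, which is consistent with getting the same triple for every sequence.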
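When I instead drew from a uniform distribution over the same range of values that we allow for the sixth entry, the output was much more varied: both the chosen O_t+1 and the class prediction showed reasonable variation from sequence to sequence. From your example data, it looks like you at least have both positive and negative values, but you could try plotting the distribution of each feature to check whether the range of candidate sixth-entry values is plausible for all of them.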
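As a quick numeric alternative to plotting, the sketch below reports what fraction of each feature's observations fall inside the grid range. The function name `grid_coverage` and the two synthetic datasets are assumptions made for illustration; substitute your own data for `narrow` or `wide`.

```python
import numpy as np

def grid_coverage(values, lo=-0.05, hi=0.05):
    """Fraction of observations per feature inside [lo, hi) (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    inside = (arr >= lo) & (arr < hi)
    return inside.mean(axis=0)

rng = np.random.default_rng(0)
narrow = rng.uniform(-0.05, 0.05, size=(365, 3))  # same range as the grid
wide = rng.uniform(0.0, 1.0, size=(365, 3))       # mostly outside the grid

cov_narrow = grid_coverage(narrow)
cov_wide = grid_coverage(wide)
print("narrow:", cov_narrow)
print("wide:  ", cov_wide)
```

A feature whose coverage is close to zero is a sign that the grid only probes one tail of its distribution, so the search will tend to land on the same edge of the grid every time.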
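Here is some sample data and evaluation code. It prints a message every time there is a new optimal (a, b, c) sequence, or when the prediction for the sixth observation changes (just to show that they are not all the same). The highest likelihood for each 6-element sequence, along with the predictions and the best sixth data point, is stored in best_per_span.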
First, build a sample dataset:

    import numpy as np
    import pandas as pd

    # One row per day of 2001, three features drawn uniformly from [-0.05, 0.05)
    dates = pd.date_range(start="2001-01-01", end="2001-12-31", freq='D')
    n_obs = len(dates)
    n_feat = 3
    values = np.random.uniform(-.05, .05, size=n_obs*n_feat).reshape((n_obs, n_feat))
    df = pd.DataFrame(values, index=dates)
    print(df.head())

                       0         1         2
    2001-01-01  0.020891 -0.048750 -0.027131
    2001-01-02  0.013571 -0.011283  0.041322
    2001-01-03 -0.008102  0.034088 -0.029202
    2001-01-04 -0.019666 -0.005705 -0.003531
    2001-01-05 -0.000238 -0.039251  0.029307

Now split the dataset into training and test sets:

    train_pct = 0.7
    train_size = round(train_pct*n_obs)
    # Sample training rows without replacement; the remaining rows form the test set
    train_ix = np.random.choice(range(n_obs), size=train_size, replace=False)
    train_dates = df.index[train_ix]
    train = df.loc[train_dates]
    test = df.loc[~df.index.isin(train_dates)]
    print(train.shape)
    print(test.shape)

Fit a 3-state HMM on the training data:

    import warnings

    # Suppress DeprecationWarnings raised while importing hmmlearn
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        from hmmlearn import hmm

    mdl = hmm.GaussianHMM(n_components=3, covariance_type='diag', n_iter=1000)
    mdl.fit(train)

Now run a grid search to find the best sixth (t+1) observation:

    span = 5
    best_per_span = []
    current_abc = None
    current_pred = None

    for start in range(len(test)-span):
        flag = False
        end = start + span
        first_five = test.iloc[start:end].values
        output = []
        # Try every candidate (a, b, c) as the sixth observation
        for a in np.arange(-0.05, 0.05, .01):
            for b in np.arange(-0.05, 0.05, .01):
                for c in np.arange(-0.05, 0.05, .01):
                    sixth = np.array([a, b, c])[:, np.newaxis].T
                    all_six = np.append(first_five, sixth, axis=0)
                    # decode() returns (log probability, hidden state sequence)
                    output.append((mdl.decode(all_six), (a, b, c)))

        # Keep the candidate with the highest log probability
        best = max(output, key=lambda x: x[0][0])

        best_dict = {"start": start,
                     "end": end,
                     "sixth": best[1],
                     "preds": best[0][1],
                     "lik": best[0][0]}
        best_per_span.append(best_dict)

        # Everything below here is reporting only
        if best_dict["sixth"] != current_abc:
            current_abc = best_dict["sixth"]
            flag = True
            print("New abc for range {}:{} = {}".format(start, end, current_abc))

        if best_dict["preds"][-1] != current_pred:
            current_pred = best_dict["preds"][-1]
            flag = True
            print("New pred for 6th position: {}".format(current_pred))

        if flag:
            print("Test sequence StartIx: {}, EndIx: {}".format(start, end))
            print("Best 6th value: {}".format(best_dict["sixth"]))
            print("Predicted hidden state sequence: {}".format(best_dict["preds"]))
            # The key is "lik"; the original "nLL" lookup would raise a KeyError
            print("Likelihood: {}\n".format(best_dict["lik"]))

Example of the output reported while the loop runs:

    New abc for range 3:8 = [-0.01, 0.01, 0.0]
    New pred for 6th position: 1
    Test sequence StartIx: 3, EndIx: 8
    Best 6th value: [-0.01, 0.01, 0.0]
    Predicted hidden state sequence: [0 2 2 1 0 1]
    Likelihood: 35.30144407374163

    New abc for range 18:23 = [-0.01, -0.01, -0.01]
    New pred for 6th position: 2
    Test sequence StartIx: 18, EndIx: 23
    Best 6th value: [-0.01, -0.01, -0.01]
    Predicted hidden state sequence: [0 0 0 1 2 2]
    Likelihood: 34.31813078939214

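The three nested loops in the grid search can also be flattened into a single candidate array, which is easier to reuse or subsample. This is only a sketch: `with_candidate` is a hypothetical helper, and `first_five` here is a zero-filled placeholder standing in for `test.iloc[start:end].values`.

```python
import itertools
import numpy as np

axis = np.arange(-0.05, 0.05, 0.01)

# Every (a, b, c) combination, one row per candidate sixth observation.
candidates = np.array(list(itertools.product(axis, repeat=3)))

def with_candidate(first_five, candidate):
    """Stack five observed rows and one candidate row (hypothetical helper)."""
    return np.vstack([first_five, candidate])

# Placeholder for the five observed rows of one test window.
first_five = np.zeros((5, 3))
all_six = with_candidate(first_five, candidates[0])
print(candidates.shape, all_six.shape)
```

Each `all_six` array built this way has the same shape as the one passed to `mdl.decode` in the loop, so the scoring step itself does not need to change.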
Sample output for best_per_span:

    [{'end': 5,
      'lik': 33.791537281734904,
      'preds': array([0, 2, 0, 1, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 0},
     {'end': 6,
      'lik': 33.28967307589143,
      'preds': array([0, 0, 1, 2, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 1},
     {'end': 7,
      'lik': 34.446813870838156,
      'preds': array([0, 1, 2, 2, 2, 2]),
      'sixth': [-0.01, -0.01, -0.01],
      'start': 2}]

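Apart from the reporting elements, this is not a major change from your original approach, but it seems to work as expected, without hitting the maximum every time.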
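To check at a glance whether the results really vary, one option is to count the distinct sixth values and final predicted states in best_per_span. The sketch below copies the three sample entries shown above so that it runs on its own; `summarize` is a hypothetical helper name.

```python
from collections import Counter
import numpy as np

# The three sample entries from the output above, copied for illustration.
best_per_span = [
    {"start": 0, "end": 5, "lik": 33.791537281734904,
     "preds": np.array([0, 2, 0, 1, 2, 2]), "sixth": [-0.01, -0.01, -0.01]},
    {"start": 1, "end": 6, "lik": 33.28967307589143,
     "preds": np.array([0, 0, 1, 2, 2, 2]), "sixth": [-0.01, -0.01, -0.01]},
    {"start": 2, "end": 7, "lik": 34.446813870838156,
     "preds": np.array([0, 1, 2, 2, 2, 2]), "sixth": [-0.01, -0.01, -0.01]},
]

def summarize(results):
    """Count distinct best sixth values and final hidden states (hypothetical helper)."""
    sixth_counts = Counter(
        tuple(round(float(v), 2) for v in r["sixth"]) for r in results
    )
    state_counts = Counter(int(r["preds"][-1]) for r in results)
    return sixth_counts, state_counts

sixth_counts, state_counts = summarize(best_per_span)
print("Distinct sixth values:", len(sixth_counts))
print("Final-state counts:", dict(state_counts))
```

Run over the full list rather than these three entries, a single dominant key in either counter would point back to the range mismatch discussed at the top.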