How do I create input samples from a pandas DataFrame for an LSTM model?

4
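I'm trying to build an LSTM model that gives me a binary output: buy or don't buy. My data has the format [date_time, close, volume] and runs to millions of rows. I'm stuck on shaping the data into the 3D form (samples, time steps, features). I have already read the data with pandas. I want each sample to contain 400 time steps and two features (close and volume), for 4000 samples in total. Can someone guide me on how to do this?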
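One way to get the 3D shape described above, without any Keras helper, is to slice the frame into overlapping windows with NumPy. This is only a minimal sketch: `make_windows` is a hypothetical helper, the data is synthetic with the question's column names assumed, and the sizes are shrunk for illustration (the question would use `time_steps=400`).

```python
import numpy as np
import pandas as pd


def make_windows(values, time_steps):
    """Stack overlapping windows of `time_steps` rows into a 3D array.

    `values` is a 2D array of shape (rows, features); the result has
    shape (rows - time_steps + 1, time_steps, features).
    """
    values = np.asarray(values, dtype="float32")
    n_windows = len(values) - time_steps + 1
    if n_windows <= 0:
        raise ValueError("need at least `time_steps` rows")
    return np.stack([values[i:i + time_steps] for i in range(n_windows)])


# Synthetic stand-in for the real data (column names assumed from the question).
df = pd.DataFrame({
    "date_time": pd.date_range("2020-01-01", periods=10, freq="min"),
    "close": np.arange(10, dtype="float32"),
    "volume": np.arange(10, 20, dtype="float32"),
})

# Only the numeric feature columns go into the model input.
features = df[["close", "volume"]].to_numpy()
X = make_windows(features, time_steps=4)
print(X.shape)  # (samples, time steps, features)
```

With millions of rows, materializing every overlapping window at once can use a lot of memory, so a generator that yields one batch at a time may be the more practical choice.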
Edit: I'm now using TimeseriesGenerator, but I don't know how to inspect my sequences or how to replace the output Y with my own binary buy signal.
df = normalize_data(df)

print("Creating sequences for NN \n")
targets = df.drop('date_time', 1)
train = keras.preprocessing.sequence.TimeseriesGenerator(df, targets, 1, sampling_rate=1, stride=1,
                                                         start_index=0, end_index=int(len(df.index)*0.8),
                                                         shuffle=True, reverse=False, batch_size=time_steps)

This code runs without errors, but the output it produces is the first close value following the input sequence.

Edit 2: So far, my code looks like this:

df = data.normalize_data(df)
targets = df.iloc[:, 3]  # Buy signal target

df.drop('y1', axis=1, inplace=True)
df.drop('y2', axis=1, inplace=True)

train = TimeseriesGenerator(df, targets, length=1, sampling_rate=1, stride=1,
                            start_index=0, end_index=int(len(df.index) * 0.8),
                            shuffle=True, reverse=False, batch_size=time_steps)

# number of samples
print("Samples: " + str(len(train)))
x, y = train[0]
print(str(x))

The output is:
Samples: 8
Traceback (most recent call last):
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: range(418, 419)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./main.py", line 94, in <module>
data_menu()
File "./main.py", line 42, in data_menu
data_menu()
File "./main.py", line 56, in data_menu
nn_menu()
File "./main.py", line 76, in nn_menu
nn.nn_gen(pre_processed_data)
File "/home/stian/git/stian9k/nn.py", line 33, in nn_gen
x, y = train[0]
File "/home/stian/.local/lib/python3.6/site-packages/keras_preprocessing/sequence.py", line 378, in __getitem__
samples[j] = self.data[indices]
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/stian/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: range(418, 419)

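It looks as though I get 8 objects from the generator but can't look any of them up. If I check the type with print(str(type(train))), I get a TimeseriesGenerator object. Thanks again for any advice.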

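Edit 3: It turns out that TimeseriesGenerator doesn't like pandas DataFrames. Converting the data to a NumPy array, and converting the pandas Timestamp values to floats, solved the problem.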
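The conversion described in Edit 3 might look roughly like the following. This is a sketch on a tiny synthetic frame, not the asker's actual code; representing timestamps as seconds since the Unix epoch is one assumed choice among several.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame with the column names assumed from the question.
df = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 00:01", "2020-01-01 00:02"]
    ),
    "close": [10.0, 10.5, 10.2],
    "volume": [100.0, 120.0, 90.0],
})

# Turn timestamps into seconds since the Unix epoch so every column is numeric.
epoch = pd.Timestamp("1970-01-01")
df["date_time"] = (df["date_time"] - epoch).dt.total_seconds()

# A plain float array is indexed by row position. As the traceback above
# suggests, indexing the DataFrame the same way is treated as a column lookup.
data = df.to_numpy(dtype=np.float64)
print(data.dtype, data.shape)
```

Raw epoch seconds are very large compared with typical price or volume values, so in practice the timestamp column would likely be dropped or scaled before training.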


EDIT 3 solved my problem =) - Psychotechnopath
2 Answers

1

Thanks! I was getting a lot of strange numbers out of the DataFrame. Converting it with to_numpy() before using it solved that!

input_convertido = df.to_numpy()
output_convertido = df["close"].to_numpy()
gerador = TimeseriesGenerator(input_convertido, output_convertido, length=n_input, batch_size=1, sampling_rate=1)

1
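You can simply use the Keras TimeseriesGenerator for this purpose. It lets you set the length (i.e. the number of time steps in each sample), as well as the sampling rate and stride for sub-sampling the data.
It returns an instance of the Sequence class, which you can then pass to fit_generator to fit the model on the data it generates. I strongly recommend reading the documentation for more about this class, its arguments, and its usage.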
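To make the window/target pairing concrete, here is a NumPy-only sketch that mimics how TimeseriesGenerator is documented to pair data with targets: the window `data[i - length:i]` goes with `targets[i]`. `windows_with_labels` is a hypothetical helper and the labels are made up; with the real class, the equivalent would be to pass a per-row label array as `targets`.

```python
import numpy as np


def windows_with_labels(data, labels, length):
    """Pair each window data[i - length:i] with labels[i]."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same number of rows")
    if length >= len(data):
        raise ValueError("`length` must be smaller than the number of rows")
    rows = range(length, len(data))
    X = np.stack([data[i - length:i] for i in rows])
    y = np.array([labels[i] for i in rows])
    return X, y


# Toy features (close, volume) and a made-up 0/1 "buy" label for every row.
data = np.arange(16, dtype="float32").reshape(8, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1])

X, y = windows_with_labels(data, labels, length=3)
print(X.shape, y)
```

Note that `length` controls the number of time steps per sample, while `batch_size` controls how many samples are grouped per batch, so `len()` of the generator counts batches rather than samples. Setting `length=1` with `batch_size=time_steps` would therefore give one-step windows rather than windows of `time_steps` rows.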

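Thanks for the suggestion. I have implemented TimeseriesGenerator and it runs fine, but do you know how to replace the output Y so that it is my own computed binary Y instead of the next close value? - Stian Hafslund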
1
@StianHafslund Do you have a label (e.g. buy/don't buy) for every row of your time series? If so, just pass the array containing all the labels as the `targets` argument. - today
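I tried to print the generator's output but got a KeyError. print("Samples: " + str(len(x_train))) x, y = x_train[0] print(str(x)) prints Samples: 41 and then fails when looking up x_train[0]. model.fit_generator runs for one epoch and then raises the KeyError. - Stian Hafslund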
1
@StianHafslund Could you try the equivalent NumPy arrays, i.e. pass df.values and targets.values to TimeseriesGenerator? The implementation may not support pandas DataFrames. - today
Thanks for all your help! You were right. I converted the data to NumPy arrays and converted my timestamps to floats. The neural network now runs fine. - Stian Hafslund
