ValueError when using make_column_transformer to normalize values after encoding with OneHotEncoder


I'm trying to convert my data's timestamps from Unix time to a more readable date format. I wrote a simple Java program to do this and write the results to a .csv file, and that step went smoothly. I then tried to use the data in my model, converting the timestamps to numbers with one-hot encoding and then normalizing everything. However, after my attempt at one-hot encoding (I'm not sure whether it succeeded), the normalization with make_column_transformer fails.

# model 4
# next model
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tensorflow.keras import layers
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

np.set_printoptions(precision=3, suppress=True)
btc_data = pd.read_csv(
    "/content/drive/MyDrive/Science Fair/output2.csv",
    names=["Time", "Open"])

X_btc = btc_data[["Time"]]
y_btc = btc_data["Open"]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)

X_btc = enc.transform(X_btc)

print(X_btc)

X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)

ct = make_column_transformer(
    (MinMaxScaler(), ["Time"])
)

ct.fit(X_train)
X_train_normal = ct.transform(X_train)
X_test_normal = ct.transform(X_test)

callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

btc_model_4 = tf.keras.Sequential([
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(100, activation="relu"),
  layers.Dense(1, activation="linear")
])

btc_model_4.compile(loss = tf.losses.MeanSquaredError(),
                      optimizer = tf.optimizers.Adam())

history = btc_model_4.fit(X_train_normal, y_train, batch_size=8192, epochs=100, callbacks=[callback])

btc_model_4.evaluate(X_test_normal, y_test, batch_size=8192)

y_pred = btc_model_4.predict(X_test_normal)

btc_model_4.save("btc_model_4")
btc_model_4.save("btc_model_4.h5")

# plot model
def plot_evaluations(train_data=X_train_normal,
                     train_labels=y_train,
                     test_data=X_test_normal,
                     test_labels=y_test,
                     predictions=y_pred):
  print(test_data.shape)
  print(predictions.shape)

  plt.figure(figsize=(100, 15))
  plt.scatter(train_data, train_labels, c='b', label="Training")
  plt.scatter(test_data, test_labels, c='g', label="Testing")
  plt.scatter(test_data, predictions, c='r', label="Results")
  plt.legend()

plot_evaluations()

# plot loss curve
pd.DataFrame(history.history).plot()
plt.ylabel("loss")
plt.xlabel("epochs")

My data is formatted as follows:

2015-12-05 12:52:00,377.48
2015-12-05 12:53:00,377.5
2015-12-05 12:54:00,377.5
2015-12-05 12:56:00,377.5
2015-12-05 12:57:00,377.5
2015-12-05 12:58:00,377.5
2015-12-05 12:59:00,377.5
2015-12-05 13:00:00,377.5
2015-12-05 13:01:00,377.79
2015-12-05 13:02:00,377.5
2015-12-05 13:03:00,377.79
2015-12-05 13:05:00,377.74
2015-12-05 13:06:00,377.79
2015-12-05 13:07:00,377.64
2015-12-05 13:08:00,377.79
2015-12-05 13:10:00,377.77
2015-12-05 13:11:00,377.7
2015-12-05 13:12:00,377.77
2015-12-05 13:13:00,377.77
2015-12-05 13:14:00,377.79
2015-12-05 13:15:00,377.72
2015-12-05 13:16:00,377.5
2015-12-05 13:17:00,377.49
2015-12-05 13:18:00,377.5
2015-12-05 13:19:00,377.5
2015-12-05 13:20:00,377.8
2015-12-05 13:21:00,377.84
2015-12-05 13:22:00,378.29
2015-12-05 13:23:00,378.3
2015-12-05 13:24:00,378.3
2015-12-05 13:25:00,378.33
2015-12-05 13:26:00,378.33
2015-12-05 13:28:00,378.31
2015-12-05 13:29:00,378.68

The first value is the date, and the second value after the comma is the price of Bitcoin at that time. After the "one-hot encoding", I added a print statement to show these X values and got the following:

  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
  (4, 4)    1.0
  (5, 5)    1.0
  (6, 6)    1.0
  (7, 7)    1.0
  (8, 8)    1.0
  (9, 9)    1.0
  (10, 10)  1.0
  (11, 11)  1.0
  (12, 12)  1.0
  (13, 13)  1.0
  (14, 14)  1.0
  (15, 15)  1.0
  (16, 16)  1.0
  (17, 17)  1.0
  (18, 18)  1.0
  (19, 19)  1.0
  (20, 20)  1.0
  (21, 21)  1.0
  (22, 22)  1.0
  (23, 23)  1.0
  (24, 24)  1.0
  : :
  (2526096, 2526096)    1.0
  (2526097, 2526097)    1.0
  (2526098, 2526098)    1.0
  (2526099, 2526099)    1.0
  (2526100, 2526100)    1.0
  (2526101, 2526101)    1.0
  (2526102, 2526102)    1.0
  (2526103, 2526103)    1.0
  (2526104, 2526104)    1.0
  (2526105, 2526105)    1.0
  (2526106, 2526106)    1.0
  (2526107, 2526107)    1.0
  (2526108, 2526108)    1.0
  (2526109, 2526109)    1.0
  (2526110, 2526110)    1.0
  (2526111, 2526111)    1.0
  (2526112, 2526112)    1.0
  (2526113, 2526113)    1.0
  (2526114, 2526114)    1.0
  (2526115, 2526115)    1.0
  (2526116, 2526116)    1.0
  (2526117, 2526117)    1.0
  (2526118, 2526118)    1.0
  (2526119, 2526119)    1.0
  (2526120, 2526120)    1.0

After fitting the normalizer, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    408         try:
--> 409             all_columns = X.columns
    410         except AttributeError:

5 frames
AttributeError: columns not found

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in _get_column_indices(X, key)
    410         except AttributeError:
    411             raise ValueError(
--> 412                 "Specifying the columns using strings is only "
    413                 "supported for pandas DataFrames"
    414             )

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Am I doing the one-hot encoding correctly? What is the proper approach? Should I apply the one-hot encoder directly inside the normalization step?

2 Answers


Using OneHotEncoder is not a good way to solve this problem. A better approach is to extract features from the column, such as year, month, day, hour, and minute, and feed those columns to the model as inputs:

btc_data['Year'] = btc_data['Time'].astype('datetime64[ns]').dt.year
btc_data['Month'] = btc_data['Time'].astype('datetime64[ns]').dt.month
btc_data['Day'] = btc_data['Time'].astype('datetime64[ns]').dt.day
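
These numeric columns can then be scaled and fed to the network. A minimal sketch of the wiring, assuming the Year/Month/Day columns extracted above (my own illustration, not the answerer's exact code):

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# btc_data is assumed to already hold the Year/Month/Day columns from above.
X = btc_data[["Year", "Month", "Day"]]
y = btc_data["Open"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=62)

# make_column_transformer now receives a DataFrame, so selecting
# columns by string name is valid and no ValueError is raised.
ct = make_column_transformer((MinMaxScaler(), ["Year", "Month", "Day"]))
X_train_normal = ct.fit_transform(X_train)
X_test_normal = ct.transform(X_test)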
    

The problem is with OneHotEncoder: it returns a scipy sparse matrix, and the "Time" column is dropped in the process. To fix this, convert the output back into a pandas DataFrame and add the "Time" column back:
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_btc)
X_btc = enc.transform(X_btc)
X_btc = pd.DataFrame(X_btc.todense())
X_btc["Time"] = btc_data["Time"]

One way to solve the memory problem:

  1. Generate two splits with the same random_state, one for the pandas DataFrame and one for the scipy sparse matrix:
X_train, X_test, y_train, y_test = train_test_split(X_btc, y_btc, test_size=0.2, random_state=62)
X_train_pd, X_test_pd, y_train_pd, y_test_pd = train_test_split(btc_data, y_btc, test_size=0.2, random_state=62)
  2. Fit MinMaxScaler() on the pandas DataFrame:
   ct = make_column_transformer((MinMaxScaler(), ["Time"]))
   ct.fit(X_train_pd)
   result_train = ct.transform(X_train_pd)
   result_test = ct.transform(X_test_pd)
  3. Use generators during training and testing (this is what solves the memory problem) and include the scaled time inside the generator, as in the two functions below:
def nn_batch_generator(X_data, y_data, scaled, batch_size):
    # Yields dense batches that stack the sparse one-hot features with the
    # pre-scaled "Time" column, so the full dense matrix never sits in memory.
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()  # densify only this batch
        y_batch = y_data.iloc[index_batch]
        counter += 1
        yield np.array(np.hstack((np.array(X_batch), scaled_array))), np.array(y_batch)
        if counter > number_of_batches:
            counter = 0


def nn_batch_generator_test(X_data, scaled, batch_size):
    # Same batching scheme for inference: no labels are yielded.
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0
    index = np.arange(np.shape(X_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        scaled_array = scaled[index_batch]
        X_batch = X_data[index_batch, :].todense()
        counter += 1
        yield np.hstack((X_batch, scaled_array))
        if counter > number_of_batches:
            counter = 0

Finally, fit the model:


history = btc_model_4.fit(nn_batch_generator(X_train, y_train, scaled=result_train, batch_size=2), steps_per_epoch=#Todetermine,
                         batch_size=2, epochs=10,
                         callbacks=[callback])

btc_model_4.evaluate(nn_batch_generator(X_test, y_test, scaled=result_test, batch_size=2), batch_size=2, steps=#Todetermine)
y_pred = btc_model_4.predict(nn_batch_generator_test(X_test, scaled=result_test, batch_size=2), steps=#Todetermine)
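
The #Todetermine placeholders depend on the dataset size and batch size. A hedged sketch of one common way to fill them in (my own addition, not part of the answer):

import numpy as np

batch_size = 2
# One full pass over the data per epoch: ceil(num_samples / batch_size) steps.
steps_per_epoch = int(np.ceil(X_train.shape[0] / batch_size))
test_steps = int(np.ceil(X_test.shape[0] / batch_size))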


Your solution makes sense, but on the large dataset I'm using it consumes too much memory. Is there a better way to avoid the runtime crashing even with 24 GB of RAM? - Khosraw Azizi
You can reduce the memory footprint by loading the "Time" column as timestamps rather than strings. Try pd.read_csv(file, parse_dates=["Time"], names=['Time', 'Open']). - 0x26res


To complement the existing answer: if you convert the SciPy Compressed Sparse Row (CSR) matrix to a pandas DataFrame and convert the timestamp strings to datetime64, the model will start training - at least on the small subset provided:

    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(X_btc)
    X_btc = enc.transform(X_btc)
    X_btc = pd.DataFrame(X_btc.todense())
    X_btc["Time"] = btc_data["Time"]
    X_btc['Time'] = X_btc['Time'].astype('datetime64[ns]')
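
This works because selecting columns by string name (["Time"]) is only supported when the input is a pandas DataFrame, which is exactly what the ValueError above is complaining about.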


Regarding your comment about memory usage, this is inherent to the approach of one-hot encoding timestamps. If you have a feature matrix in which every row contains a distinct value (which is what we expect with timestamps), one-hot encoding produces an n x n matrix, which can be enormous. To verify this, if you step through the code or print the intermediate matrices on the test data, you will observe that X_btc starts out as a 34 x 1 matrix and becomes 34 x 34 after the encoder is applied (X_btc = enc.transform(X_btc)).
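
A minimal toy sketch of that blow-up (my own illustration, not from the answer):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three rows, each holding a distinct timestamp string.
X = pd.DataFrame({"Time": ["2015-12-05 12:52:00",
                           "2015-12-05 12:53:00",
                           "2015-12-05 12:54:00"]})
enc = OneHotEncoder(handle_unknown="ignore")
X_enc = enc.fit_transform(X)
print(X.shape)      # (3, 1)
print(X_enc.shape)  # (3, 3) -- one column per distinct timestamp
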
I'm not sure what the end goal of this problem is, but if you want to keep this approach, you can group samples at a coarser granularity, i.e., truncate the timestamps to the hour before one-hot encoding, and then apply the encoder:
    X_btc['Time'] = X_btc['Time'].astype('datetime64[h]')  # convert to units to hours before one hot encoding
    enc = OneHotEncoder(handle_unknown="ignore")
    enc.fit(X_btc)
    X_btc = enc.transform(X_btc)
    X_btc = pd.DataFrame(X_btc.todense())
    X_btc["Time"] = btc_data["Time"].astype('datetime64[ns]')  # Use 'ns' here to retain the full timestamp information

With the sample data provided, since there are two distinct hours (12 and 13), applying one-hot encoding now yields only 2 distinct categories instead of 34. This should alleviate the memory problem, since you should have far fewer distinct hours than total records in this data.
Along the same lines, instead of truncating to the hour, you could extract the hour (and possibly the minute) from the timestamp for one-hot encoding:
    X_btc['Time'] = X_btc['Time'].astype('datetime64[ns]').dt.hour.astype(str)
    #  + ":" + X_btc['Time'].astype('datetime64[ns]').dt.minute.astype(str)  # UNCOMMENT TO INCLUDE minute

The benefit of this approach is that if you save the encoder, you can reuse the logic as new data is ingested into the system, whereas with the current approach of encoding the training data directly, you cannot run the model on a data stream containing dates that are not in the training set - they would fall into a new category and require refitting the encoder and the model.
If you use only the hour, you will have 24 distinct categories from the one-hot encoder. If you also use the minute, you will have 24*60 = 1440 distinct categories (still far fewer than the number of records you are working with).
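
As a quick sanity check (a sketch of my own, assuming the hour-string encoding above), the number of categories can be read off the fitted encoder:

# OneHotEncoder exposes the categories it learned per input column;
# encoding hours only gives at most 24, hour:minute at most 24 * 60 = 1440.
n_categories = len(enc.categories_[0])
print(n_categories)  # <= 24 when encoding hours only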
