How to split a tf.data.Dataset into x_train, y_train, x_test, and y_test for Keras

Question

Tags: python, image, tensorflow, machine-learning, keras-2

Suppose I have a dataset:

dataset = tf.keras.preprocessing.image_dataset_from_directory(
    directory,
    labels="inferred",
    label_mode="int",
    class_names=None,
    color_mode="rgb",
    batch_size=32,
    image_size=(32, 32),
    shuffle=True,
    seed=None,
    validation_split=None,
    subset=None,
    interpolation="bilinear",
    follow_links=False,
)

How can I split this into x and y arrays, where the x array holds the image arrays and the y array holds the class of each image?

- sameerp815

See the code added at the bottom of the edited answer. - Gerry P

1 Answer

- Gerry P · Accepted Answer

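This will do the separation for you. Create a directory, say c:\train. Inside it, create one subdirectory per class. For example, if you have images of dogs and cats and want to build a classifier that tells them apart, create two subdirectories in the train directory, one named cats and one named dogs. Put all the cat images in the cats subdirectory and all the dog images in the dogs subdirectory. Now suppose you want to use 75% of the images for training and 25% for validation. The code below creates a training set and a validation set.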

import tensorflow as tf

train_batch_size = 50  # Set the training batch size you desire
valid_batch_size = 50  # Set this so that .25 X total sample/valid_batch_size is an integer
train_dir = r'c:\train'  # Directory holding one subdirectory per class
img_size = 224  # Set this to the desired image size you want to use
split_seed = 123  # Any fixed integer; validation_split with shuffle=True requires a seed

# Both calls use the same shuffle setting and seed so that the files are indexed
# in the same order and the training and validation subsets do not overlap.
train_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=train_dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=train_batch_size, image_size=(img_size, img_size),
    shuffle=True, seed=split_seed, validation_split=.25, subset="training",
    interpolation='nearest', follow_links=False)
valid_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=train_dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=valid_batch_size, image_size=(img_size, img_size),
    shuffle=True, seed=split_seed, validation_split=.25, subset="validation",
    interpolation='nearest', follow_links=False)

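With labels='inferred', the labels are the names of the subdirectories; in this example they are cats and dogs. With label_mode='categorical', the label data are one-hot vectors, so when you compile the model, set the loss to 'CategoricalCrossentropy'. Note that the split is taken after the file list is shuffled, so the training and validation calls need the same shuffle setting and the same seed; otherwise the two subsets can overlap. When you build the model, the top layer should have 2 nodes and the activation should be softmax.

When you use model.fit to train the model, it is best to go through the validation set once per epoch. Say that in the dog-cat example you have 1000 dog images and 1000 cat images, 2000 in total. 75% = 1500 will be used for training and 500 for validation. If you set valid_batch_size=50, it takes 10 steps to go through all the validation images once per epoch. Likewise, if train_batch_size=50, it takes 30 steps to go through the training set. When you run model.fit, set steps_per_epoch to 30 and validation_steps to 10.

Actually, I prefer to use tf.keras.preprocessing.image.ImageDataGenerator to generate the datasets. It is similar but more flexible; see its documentation. It lets you specify a preprocessing function if you wish, and it lets you rescale the image values. Typically you would use 1/255 as the rescale value.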
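The model setup and step arithmetic described above can be sketched as follows. This is a minimal sketch, not the answer's own code: the small convolutional model, the 32x32 image size, and the `make_fake_set` helper that builds random stand-ins for train_set and valid_set are all assumptions made so the example runs without an image directory.

```python
import math

import numpy as np
import tensorflow as tf

# Step arithmetic from the dog/cat example: 2000 images, 75/25 split, batch size 50.
total_images = 2000
train_count = int(total_images * 0.75)           # images used for training
valid_count = total_images - train_count         # images used for validation
steps_per_epoch = math.ceil(train_count / 50)    # batches per training pass
validation_steps = math.ceil(valid_count / 50)   # batches per validation pass


def make_fake_set(n, batch_size, seed):
    """Hypothetical helper: random RGB images with one-hot labels,
    the same structure label_mode='categorical' produces."""
    rng = np.random.default_rng(seed)
    images = rng.random((n, 32, 32, 3), dtype=np.float32)
    labels = tf.keras.utils.to_categorical(rng.integers(0, 2, size=n), num_classes=2)
    return tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)


train_set = make_fake_set(40, 10, seed=0)
valid_set = make_fake_set(20, 10, seed=1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # one node per class
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.CategoricalCrossentropy(),  # matches one-hot labels
    metrics=["accuracy"],
)

# A finite tf.data.Dataset is consumed in full each epoch, so the step
# arguments are left unset here; validation runs once per epoch.
history = model.fit(train_set, validation_data=valid_set, epochs=1, verbose=0)
```

Because these datasets are finite, Keras can work out the number of batches itself; passing steps_per_epoch and validation_steps explicitly matters mainly when the dataset repeats indefinitely.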

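If you just want to split the training data, you can use train_test_split from sklearn; see its documentation. The code below shows how to split the data into training, validation, and test sets. Suppose you want 80% of the data for training, 10% for validation, and 10% for testing. Assume X is a NumPy array of images and y is the array of corresponding labels.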

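from sklearn.model_selection import train_test_split

# First split: 80% for training, 20% held out
X_train, X_tv, y_train, y_tv = train_test_split(X, y, train_size=0.8, random_state=42)
# Second split: divide the held-out 20% evenly into test and validation (10% each)
X_test, X_valid, y_test, y_valid = train_test_split(X_tv, y_tv, train_size=0.5, random_state=20)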
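The split above assumes X and y already exist as NumPy arrays. One possible way to get them from a batched tf.data.Dataset is to iterate over it once and concatenate the batches. This is a minimal sketch: `dataset_to_numpy` is a hypothetical helper name, and the random data stands in for the output of image_dataset_from_directory. It loads everything into memory, so it only suits datasets that fit in RAM.

```python
import numpy as np
import tensorflow as tf


def dataset_to_numpy(dataset):
    """Collect every (images, labels) batch of a tf.data.Dataset into two arrays."""
    image_batches, label_batches = [], []
    # Read images and labels in the same pass: a shuffled dataset is reordered
    # on every iteration, so two separate passes would not line up.
    for images, labels in dataset:
        image_batches.append(images.numpy())
        label_batches.append(labels.numpy())
    return np.concatenate(image_batches), np.concatenate(label_batches)


# Synthetic stand-in for a dataset built with label_mode="int" (assumption).
rng = np.random.default_rng(0)
fake_images = rng.random((100, 32, 32, 3), dtype=np.float32)
fake_labels = rng.integers(0, 2, size=100)
dataset = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels)).batch(32)

X, y = dataset_to_numpy(dataset)
print(X.shape, y.shape)
```

The resulting X and y can then be passed to train_test_split as shown above.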