How to choose a strategy to reduce overfitting?

Question

python tensorflow machine-learning keras deep-learning

7

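I am applying transfer learning on a pretrained ResNet50 network with Keras. I have image patches with binary class labels and would like to use a CNN to predict a class label in the range [0; 1] for unseen image patches.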

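Network: ResNet50 pretrained on ImageNet, with 3 layers added
Data: 70305 training samples, 8000 validation samples, 66823 test samples, all with a balanced number of class labels
Images: 3 bands (RGB) and 224x224 pixels
Settings: batch size 32, size of the added fully connected layer: 16
Results: after a few epochs the training accuracy is already close to 1 and the loss close to 0, while on the validation data the accuracy stays at about 0.5 and the loss varies from epoch to epoch. In the end, the CNN predicts only one class for all unseen patches.
Problem: my network seems to be overfitting.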

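The following strategies could reduce overfitting:

Increase the batch size
Decrease the size of the fully connected layer
Add a dropout layer
Add data augmentation
Apply regularization by modifying the loss function
Unfreeze more pretrained layers
Use a different network architecture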

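I have already tried batch sizes up to 512 and changed the size of the fully connected layer, without much success. Before randomly testing the remaining strategies, I would like to ask how to investigate what is going wrong, in order to find out which of the strategies above has the most potential.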

Here is my code:

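from datetime import datetime

from keras import optimizers
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator

def generate_data(imagePathTraining, imagesize, nBatches):
    datagen = ImageDataGenerator(rescale=1./255)
    generator = datagen.flow_from_directory(
         directory=imagePathTraining,                           # path to the target directory
         target_size=(imagesize,imagesize),                     # dimensions to which all images found will be resized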
         color_mode='rgb',                                      # whether the images will be converted to have 1, 3, or 4 channels
         classes=None,                                          # optional list of class subdirectories
         class_mode='categorical',                              # type of label arrays that are returned
         batch_size=nBatches,                                   # size of the batches of data
         shuffle=True)                                          # whether to shuffle the data
    return generator

def create_model(imagesize, nBands, nClasses):
    print("%s: Creating the model..." % datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
    # Create pre-trained base model
    basemodel = ResNet50(include_top=False,                     # exclude final pooling and fully connected layer in the original model
                         weights='imagenet',                    # pre-training on ImageNet
                         input_tensor=None,                     # optional tensor to use as image input for the model
                         input_shape=(imagesize,                # shape tuple
                                      imagesize,
                                      nBands),
                         pooling=None,                          # output of the model will be the 4D tensor output of the last convolutional layer
                         classes=nClasses)                      # number of classes to classify images into
    print("%s: Base model created with %i layers and %i parameters." %
          (datetime.now().strftime('%Y-%m-%d_%H-%M-%S'),
           len(basemodel.layers),
           basemodel.count_params()))

    # Create new untrained layers
    x = basemodel.output
    x = GlobalAveragePooling2D()(x)                             # global spatial average pooling layer
    x = Dense(16, activation='relu')(x)                         # fully-connected layer
    y = Dense(nClasses, activation='softmax')(x)                # logistic layer making sure that probabilities sum up to 1

    # Create model combining pre-trained base model and new untrained layers
    model = Model(inputs=basemodel.input,
                  outputs=y)
    print("%s: New model created with %i layers and %i parameters." %
          (datetime.now().strftime('%Y-%m-%d_%H-%M-%S'),
           len(model.layers),
           model.count_params()))

    # Freeze weights on pre-trained layers
    for layer in basemodel.layers:
        layer.trainable = False

    # Define learning optimizer
    optimizerSGD = optimizers.SGD(lr=0.01,                      # learning rate.
                                  momentum=0.0,                 # parameter that accelerates SGD in the relevant direction and dampens oscillations
                                  decay=0.0,                    # learning rate decay over each update
                                  nesterov=False)               # whether to apply Nesterov momentum

    # Compile model
    model.compile(optimizer=optimizerSGD,                       # stochastic gradient descent optimizer
                  loss='categorical_crossentropy',              # objective function
                  metrics=['accuracy'],                         # metrics to be evaluated by the model during training and testing
                  loss_weights=None,                            # scalar coefficients to weight the loss contributions of different model outputs
                  sample_weight_mode=None,                      # sample-wise weights
                  weighted_metrics=None,                        # metrics to be evaluated and weighted by sample_weight or class_weight during training and testing
                  target_tensors=None)                          # tensor model's target, which will be fed with the target data during training
    print("%s: Model compiled." % datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
    return model

def train_model(model, nBatches, nEpochs, imagePathTraining, imagesize, nSamples, valX,valY, resultPath):
    history = model.fit_generator(generator=generate_data(imagePathTraining, imagesize, nBatches),
                                  steps_per_epoch=nSamples//nBatches,     # total number of steps (batches of samples)
                                  epochs=nEpochs,               # number of epochs to train the model
                                  verbose=2,                    # verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch
                                  callbacks=None,               # keras.callbacks.Callback instances to apply during training
                                  validation_data=(valX,valY),  # generator or tuple on which to evaluate the loss and any model metrics at the end of each epoch
                                  class_weight=None,            # optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function
                                  max_queue_size=10,            # maximum size for the generator queue
                                  workers=32,                   # maximum number of processes to spin up when using process-based threading
                                  use_multiprocessing=True,     # whether to use process-based threading
                                  shuffle=True,                 # whether to shuffle the order of the batches at the beginning of each epoch
                                  initial_epoch=0)              # epoch at which to start training
    print("%s: Model trained." % datetime.now().strftime('%Y-%m-%d_%H-%M-%S')) 
    return history

- Sophie Crommelinck

3

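You are clearly overfitting. The oscillation may mean that your learning rate is too high for stochastic gradient descent. I also see that you set the decay to 0. You could try lowering the learning rate and using a different evaluation metric. - Nihal Sangeeth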

Do both classes have the same number of labels? - Baschdl

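@nihal: By lowering the learning rate, do you mean reducing lr=0.01 to a value closer to zero? What do you mean by a different metric? "decay=0.0" is the default value in the Keras documentation. I think I need to learn more about how the parameters of the loss function work. - Sophie Crommelinck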

1

@baschdl: Yes, that is what I meant by "...all with a balanced number of class labels". - Sophie Crommelinck

1

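Yes, @nihal means reducing the `lr` parameter. You are currently using categorical cross-entropy; you could use another loss function, but cross-entropy should be fine. - Baschdl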

2 Answers

- Jon Nordby · Answer 1

0

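These results look too bad to be a case of overfitting. Instead, I suspect a discrepancy between the training and validation data.

I notice that you use ImageDataGenerator(rescale=1./255) for the training data, but I do not see any such processing for valX. I recommend using a separate ImageDataGenerator with the same rescale configuration for the validation data. That way the differences are as small as possible.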
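
To make the suspected mismatch concrete, here is a minimal NumPy sketch. The arrays are random stand-ins, not the data from the question, and the helper name `rescale` is hypothetical. It only shows that inputs scaled with 1/255 and unscaled inputs occupy very different value ranges, so a network trained on one would see the other as out-of-range data.

```python
import numpy as np

def rescale(images):
    """Scale raw 0-255 pixel values to the 0-1 range (same effect as rescale=1./255)."""
    return images.astype("float32") / 255.0

rng = np.random.default_rng(0)
raw_train = rng.integers(0, 256, size=(4, 224, 224, 3))  # hypothetical training patches
raw_val = rng.integers(0, 256, size=(4, 224, 224, 3))    # hypothetical validation patches

train_x = rescale(raw_train)                 # what the training generator yields
val_x_unscaled = raw_val.astype("float32")   # validation arrays passed as-is
val_x_scaled = rescale(raw_val)              # validation with the same preprocessing

print("train max:", train_x.max())
print("val max (unscaled):", val_x_unscaled.max())
print("val max (scaled):", val_x_scaled.max())
```

If the printed ranges differ between training and validation inputs, the two sets are not going through the same preprocessing.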

- Jon Nordby

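I now generate the validation data in the same way as the training data, using valGenerator = generate_data(imagePathValidation, imagesize, nBatches), and updated the arguments of fit_generator() accordingly: validation_data=valGenerator, validation_steps=valGenerator.samples//nBatches. In addition, I lowered the learning rate as @nihal suggested. train_acc now increases slowly, but val_acc stays at 0.5. train_loss decreases linearly from 0.7 to 0.5, while val_loss fluctuates between 1 and 3. Does this count as overfitting? Maybe I need to find a way to give the loss function a convex curve shape. - Sophie Crommelinck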

That sounds strange. You should analyze a few batches and check that the labels and data look reasonable. - Jon Nordby
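
A batch check along these lines could look like the following sketch. The function name `summarize_batch` and the toy arrays are hypothetical; with the generators from the question, `x` and `y` would instead come from `next(generator)`.

```python
import numpy as np

def summarize_batch(x, y):
    """Return basic statistics for one batch of images and one-hot labels."""
    return {
        "x_shape": x.shape,
        "x_min": float(x.min()),
        "x_max": float(x.max()),
        "x_mean": float(x.mean()),
        # Column sums of one-hot labels give the number of samples per class.
        "class_counts": y.sum(axis=0).astype(int).tolist(),
    }

# Hypothetical stand-in for `x, y = next(generator)`.
x = np.zeros((4, 224, 224, 3), dtype="float32")
x[0] = 1.0
y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype="float32")

stats = summarize_batch(x, y)
print(stats)
```

Comparing these statistics for a training batch and a validation batch should reveal differing value ranges or skewed class counts. For generators created with flow_from_directory, the `class_indices` attribute additionally shows which folder maps to which label index.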

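You should try replacing the pretrained model with a small CNN trained from scratch. That way you will know whether learning is possible at all in your setup. - Jon Nordby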

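Also, try reducing the number of samples per epoch so that an epoch is not the whole dataset. This makes it easier to see what happens during development. With this many images, a lot happens before the first epoch is finished. - Jon Nordby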
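
As a rough illustration of that suggestion, the sketch below computes how many batches an epoch contains for the sample count and batch size given in the question, and how many it would contain if only a fraction of the data were used. The helper name and the 10% fraction are assumptions chosen for illustration.

```python
def steps_per_epoch(n_samples, batch_size, fraction=1.0):
    """Number of batches per epoch when only `fraction` of the samples is used."""
    return max(1, int(n_samples * fraction) // batch_size)

full_epoch = steps_per_epoch(70305, 32)         # whole training set
short_epoch = steps_per_epoch(70305, 32, 0.1)   # roughly a tenth of it

print(full_epoch, short_epoch)
```

Passing the smaller value as steps_per_epoch means metrics are reported about ten times as often, at the cost of each "epoch" no longer covering every sample.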

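I now understand why my val_acc always stayed at 0.5 (because I rescaled the images by 1/255, as I explained here). Together with the modified optimizer settings, fixing this helped to reduce the loss oscillation and to obtain exponentially shaped curves for loss and accuracy. Once I have found the best hyperparameters for my data, I will post an answer. - Sophie Crommelinck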

Yes, the preprocessing_function is probably the most critical one. - Jon Nordby
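
For context, here is a hedged NumPy sketch of how the two kinds of preprocessing differ. To my understanding, the preprocess_input functions for ResNet50 and VGG19 in keras.applications convert RGB to BGR and subtract the ImageNet channel means without scaling to 0-1; the constants and the function names below are my reconstruction of that behaviour, not code taken from this thread.

```python
import numpy as np

# ImageNet channel means in BGR order (assumed to match the Keras "caffe" mode).
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")

def caffe_style_preprocess(rgb_images):
    """Flip RGB to BGR and subtract the channel means; values stay on a 0-255 scale."""
    bgr = rgb_images[..., ::-1].astype("float32")
    return bgr - IMAGENET_BGR_MEAN

def rescale_only(rgb_images):
    """Scale to 0-1, as ImageDataGenerator(rescale=1./255) does."""
    return rgb_images.astype("float32") / 255.0

white = np.full((1, 2, 2, 3), 255, dtype="uint8")  # a tiny all-white test image

centered = caffe_style_preprocess(white)
scaled = rescale_only(white)

print("mean-centered range:", centered.min(), centered.max())
print("rescaled range:", scaled.min(), scaled.max())
```

The frozen pretrained layers were fitted to inputs in the first range, so feeding them values between 0 and 1 gives them inputs unlike anything they saw during pretraining.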

- Sophie Crommelinck · Answer 2

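Based on the suggestions above, I made the following modifications:

I modified the optimizer (lowered the learning rate to 0.001 and added learning rate decay)
I unified the data generators (training and validation use the same ImageDataGenerator)
I used a different pretrained base CNN (VGG19 instead of ResNet50)
I increased the number of nodes in the trainable fully connected layer (from 16 to 1024), which improved the final validation accuracy
I increased the dropout rate (from 0.5 to 0.8), which minimized the gap between training and validation accuracy and thereby limited overfitting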
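
To illustrate what a dropout rate of 0.8 means, here is a minimal NumPy sketch of inverted dropout, which is how I understand dropout layers to behave during training. The function name and the toy activations are hypothetical.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of the units and rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 1024), dtype="float32")  # stand-in for the Dense(1024) output
dropped = dropout_train(activations, rate=0.8, rng=rng)

kept_fraction = float((dropped != 0).mean())
mean_activation = float(dropped.mean())
print(kept_fraction, mean_activation)
```

With rate=0.8 only about a fifth of the 1024 units are active in any training step, which is a strong regularizer. At inference time no units are dropped.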

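    from keras import optimizers
    from keras.applications.vgg19 import VGG19, preprocess_input
    from keras.layers import Dense, Dropout, GlobalAveragePooling2D
    from keras.models import Model
    from keras.preprocessing.image import ImageDataGenerator

    def generate_data(path, imagesize, nBatches):
        datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
        generator = datagen.flow_from_directory(directory=path,     # path to the target directory
             target_size=(imagesize,imagesize),                     # dimensions to which all images found will be resized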
             color_mode='rgb',                                      # whether the images will be converted to have 1, 3, or 4 channels
             classes=None,                                          # optional list of class subdirectories
             class_mode='categorical',                              # type of label arrays that are returned
             batch_size=nBatches,                                   # size of the batches of data
             shuffle=True,                                          # whether to shuffle the data
             seed=42)                                               # random seed for shuffling and transformations
        return generator

    def create_model(imagesize, nBands, nClasses):
        # Create pre-trained base model
        basemodel = VGG19(include_top=False,                        # exclude final pooling and fully connected layer in the original model
                             weights='imagenet',                    # pre-training on ImageNet
                             input_tensor=None,                     # optional tensor to use as image input for the model
                             input_shape=(imagesize,                # shape tuple
                                          imagesize,
                                          nBands),
                             pooling=None,                          # output of the model will be the 4D tensor output of the last convolutional layer
                             classes=nClasses)                      # number of classes to classify images into

        # Freeze weights on pre-trained layers
        for layer in basemodel.layers:
            layer.trainable = False   

        # Create new untrained layers
        x = basemodel.output
        x = GlobalAveragePooling2D()(x)                             # global spatial average pooling layer
        x = Dense(1024, activation='relu')(x)                       # fully-connected layer
        x = Dropout(rate=0.8)(x)                                    # dropout layer
        y = Dense(nClasses, activation='softmax')(x)                # logistic layer making sure that probabilities sum up to 1

        # Create model combining pre-trained base model and new untrained layers
        model = Model(inputs=basemodel.input,
                      outputs=y)

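        # Define learning optimizer
        # Note: learningRate and nEpochs are not parameters of this function; they are assumed to be
        # defined at module level (e.g. learningRate = 0.001 and nEpochs = 100, the values reported in this answer)
        optimizerSGD = optimizers.SGD(lr=0.001,                     # learning rate.
                                      momentum=0.9,                 # parameter that accelerates SGD in the relevant direction and dampens oscillations
                                      decay=learningRate/nEpochs,   # learning rate decay over each update
                                      nesterov=True)                # whether to apply Nesterov momentum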
        # Compile model
        model.compile(optimizer=optimizerSGD,                       # stochastic gradient descent optimizer
                      loss='categorical_crossentropy',              # objective function
                      metrics=['accuracy'],                         # metrics to be evaluated by the model during training and testing
                      loss_weights=None,                            # scalar coefficients to weight the loss contributions of different model outputs
                      sample_weight_mode=None,                      # sample-wise weights
                      weighted_metrics=None,                        # metrics to be evaluated and weighted by sample_weight or class_weight during training and testing
                      target_tensors=None)                          # tensor model's target, which will be fed with the target data during training
        return model

    def train_model(model, nBatches, nEpochs, trainGenerator, valGenerator, resultPath):
        history = model.fit_generator(generator=trainGenerator,
                                      steps_per_epoch=trainGenerator.samples // nBatches,   # total number of steps (batches of samples)
                                      epochs=nEpochs,               # number of epochs to train the model
                                      verbose=2,                    # verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch
                                      callbacks=None,               # keras.callbacks.Callback instances to apply during training
                                      validation_data=valGenerator, # generator or tuple on which to evaluate the loss and any model metrics at the end of each epoch
                                      validation_steps=valGenerator.samples // nBatches,    # number of steps (batches of samples) to yield from validation_data generator before stopping at the end of every epoch
                                      class_weight=None,            # optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function
                                      max_queue_size=10,            # maximum size for the generator queue
                                      workers=1,                    # maximum number of processes to spin up when using process-based threading
                                      use_multiprocessing=False,    # whether to use process-based threading
                                      shuffle=True,                 # whether to shuffle the order of the batches at the beginning of each epoch
                                      initial_epoch=0)              # epoch at which to start training

        return history, model
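
As a side note on decay=learningRate/nEpochs: as far as I understand the legacy Keras SGD optimizer (the one that accepts lr and decay), the decay is applied per update, i.e. per batch, not per epoch. The sketch below works through that schedule under the assumptions that learningRate is 0.001, nEpochs is 100, and the training set has the 70305 samples mentioned in the question. The function name is hypothetical.

```python
def time_based_lr(initial_lr, decay, iteration):
    """Learning rate after `iteration` updates with time-based decay."""
    return initial_lr / (1.0 + decay * iteration)

learning_rate = 0.001
n_epochs = 100
decay = learning_rate / n_epochs       # 1e-05, as in decay=learningRate/nEpochs
updates_per_epoch = 70305 // 32        # batches per epoch at batch size 32

lr_after_first_epoch = time_based_lr(learning_rate, decay, updates_per_epoch)
lr_at_end = time_based_lr(learning_rate, decay, updates_per_epoch * n_epochs)

print(lr_after_first_epoch, lr_at_end)
```

Under these assumptions the learning rate falls to roughly a third of its initial value over the whole run, which is a fairly gentle schedule.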

With these modifications, I achieved the following metrics after training for 100 epochs with a batch size of 32:

train_acc: 0.831
train_loss: 0.436
val_acc: 0.692
val_loss: 0.568

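I consider these settings optimal because:

the accuracy and loss curves behave similarly for training and validation
train_acc only exceeds val_acc after 30 epochs
overfitting is minimal (the difference between train_acc and val_acc is small)
train_loss and val_loss decrease continuously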

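However, I am wondering:

whether I should increase val_acc by training for more epochs, which would come at the cost of more overfitting.
why the f1-score, precision and recall computed with the sklearn.metrics classification_report() method on the predictions from predict_generator() are all around 0.5, which suggests that nothing useful was learned for the 2-class classification.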
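
One possible explanation for the second point, which is an assumption on my part and not confirmed anywhere in this thread: generate_data() creates its generator with shuffle=True. If such a generator is passed to predict_generator() and the predictions are then compared with labels in directory order (for example generator.classes), predictions and labels no longer line up. The NumPy sketch below simulates this with a hypothetical perfect classifier.

```python
import numpy as np

rng = np.random.default_rng(42)
true_labels = np.array([0, 1] * 500)          # balanced labels in on-disk order

# Order in which a shuffling generator would yield the samples.
order = rng.permutation(len(true_labels))

# A hypothetical perfect classifier predicts the label of each sample it is shown,
# so its outputs come back in the shuffled order.
preds_in_shuffled_order = true_labels[order]

acc_misaligned = float(np.mean(preds_in_shuffled_order == true_labels))
acc_aligned = float(np.mean(preds_in_shuffled_order == true_labels[order]))

print(acc_misaligned, acc_aligned)
```

Even a perfect classifier scores near chance when predictions and labels are misaligned in this way. If this is the cause, creating the evaluation generator with shuffle=False should make the report agree with val_acc.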

Maybe I should open a new discussion for these questions.