Loss calculation over different batch sizes in Keras

Question

7

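I know that, in theory, the loss of a network over a batch is just the sum of all the individual losses. This is reflected in the Keras code that calculates the total loss. The relevant part: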

    for i in range(len(self.outputs)):
        if i in skip_target_indices:
            continue
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        loss_weight = loss_weights_list[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
        if len(self.outputs) > 1:
            self.metrics_tensors.append(output_loss)
            self.metrics_names.append(self.output_names[i] + '_loss')
        if total_loss is None:
            total_loss = loss_weight * output_loss
        else:
            total_loss += loss_weight * output_loss

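However, when I train a neural network with batch_size=32 and then with batch_size=64, the loss value for every epoch still comes out roughly the same, differing by only ~0.05%. The accuracy of the two networks is exactly the same. So the batch size does not seem to have much effect on the network.

My question is: if the losses really are being summed, shouldn't the loss be twice as large, or even larger, when I double the batch size? The explanation that the network may have learned better with the larger batch size is ruled out by the fact that the accuracy stayed the same.

The fact that the loss stays more or less the same regardless of the batch size makes me think it is being averaged.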
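
The reasoning in the question can be checked with plain arithmetic. The sketch below uses made-up per-sample losses (the values and the helper names batch_sum and batch_mean are illustrative, not part of Keras) and compares how a sum and a mean react when the batch grows from 32 to 64 samples:

```python
# Hypothetical per-sample losses; in a real model these come from the loss function.
per_sample = [0.5 + 0.01 * (i % 5) for i in range(64)]

def batch_sum(losses):
    """Aggregate a batch by summing the per-sample losses."""
    return sum(losses)

def batch_mean(losses):
    """Aggregate a batch by averaging the per-sample losses."""
    return sum(losses) / len(losses)

small, large = per_sample[:32], per_sample

# A summed loss roughly doubles with the batch size; an averaged loss barely moves.
print("sum :", batch_sum(small), batch_sum(large))
print("mean:", batch_mean(small), batch_mean(large))
```

A loss that stays nearly constant across batch sizes is therefore what one would expect from averaging, not from summing.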

- Jonathan

4

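The loss is the average, not the sum, of the individual losses. - enumaris

Can you confirm this with the code? - Jonathan

@enumaris When I followed the code of fit(), it seemed to average, but compile() seems to sum. Why are there both? - Jonathan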

2

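Take a look here: https://github.com/keras-team/keras/blob/master/keras/losses.py All the loss functions are wrapped in K.mean(), which shows that they are averages and not sums. - enumaris

@enumaris Please see the comments on the accepted answer. - Jonathan

My understanding may be wrong. I will have to take another look later, since I don't have time right now. - enumaris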

2 Answers

3

I will summarize the excellent answers on this page.

A model certainly needs a scalar value to optimize (i.e. for gradient descent).
This value is calculated at the batch level. (If you set batch_size=1, you are in stochastic gradient descent mode, so the gradient is calculated on that single data point.)
In the loss function, an aggregation function such as K.mean() comes into play for problems such as multi-class classification, where obtaining the loss of one data point requires aggregating many scalars across the labels.
In the loss history printed by model.fit, the loss value is a running average over the batches. So the value we see is actually an estimated loss, scaled for batch_size * per data point.
Be aware that even if we set batch_size=1, the printed history may use a different batch interval for printing. In my case:
```
self.model.fit(x=np.array(single_day_piece),y=np.array(single_day_reward),batch_size=1)
```

The printed output is:

 1/24 [>.............................] - ETA: 0s - loss: 4.1276
 5/24 [=====>........................] - ETA: 0s - loss: -2.0592
 9/24 [==========>...................] - ETA: 0s - loss: -2.6107
13/24 [===============>..............] - ETA: 0s - loss: -0.4840
17/24 [====================>.........] - ETA: 0s - loss: -1.8741
21/24 [=========================>....] - ETA: 0s - loss: -2.4558
24/24 [==============================] - 0s 16ms/step - loss: -2.1474

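In my problem, the loss of a single data point cannot possibly reach the scale of 4.xxx. So I guess the model summed the losses of the first 4 data points. However, the batch size used for training is not 4.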
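
The running-average behaviour described above can be sketched in plain Python. This assumes the progress bar shows the mean of the batch losses seen so far in the epoch and only refreshes at some steps. The loss values and the helper running_means are made up for illustration and are not taken from the log above:

```python
def running_means(batch_losses):
    """Return the mean of the batch losses seen so far after each step."""
    total = 0.0
    means = []
    for step, loss in enumerate(batch_losses, start=1):
        total += loss
        means.append(total / step)
    return means

# Hypothetical per-batch losses (with batch_size=1 there is one value per sample).
batch_losses = [4.0, -6.0, -1.0, -5.0, 3.0, -7.0]
means = running_means(batch_losses)

# A progress bar that refreshes only every few steps shows a subset of these values.
for step in (1, 3, 6):
    print(f"{step}/{len(batch_losses)} - loss: {means[step - 1]:.4f}")
```

Under this assumption, the value printed at step k averages k batch losses, so the steps at which the bar refreshes are independent of the training batch size.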

- ZhaoPan Song

- today · Accepted Answer

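The code you posted concerns multi-output models, where each output may have its own loss and weight. Hence the loss values of the different output layers are summed together. The individual losses within a batch, however, are averaged. For example, this is the code for binary cross-entropy loss in the losses.py file: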

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
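
To see what axis=-1 does here, the following is a NumPy sketch of the same computation, not the Keras implementation. The eps clipping constant and the sample arrays are assumptions chosen for illustration:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy, then mean over the last axis."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    elementwise = -(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))
    return elementwise.mean(axis=-1)

# Hypothetical batch: 4 samples, 3 output units each.
y_true = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.9, 0.2, 0.7], [0.1, 0.3, 0.8],
                   [0.6, 0.7, 0.2], [0.4, 0.9, 0.1]])

per_sample = binary_crossentropy(y_true, y_pred)
print(per_sample.shape)   # (4,): one loss per sample, batch axis still present
print(per_sample.mean())  # averaging over the batch is a separate, later step
```

The mean over the last axis collapses the output units of each sample, so the result still has one entry per sample in the batch.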

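Update: Right after adding the second part of this answer (i.e. the loss functions), I was, like the asker, puzzled by the axis=-1 in the definition of the loss function, and I wondered whether it should be axis=0 to indicate the average over the batch. Then I realized that all the K.mean() calls in the loss function definitions are there for the case of an output layer consisting of multiple units. So where is the loss averaged over the batch? I inspected the code to find the answer: to get the loss value for a specific loss function, a function is called that takes the true and predicted labels, as well as the sample weights and mask, as its inputs: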

weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)

What is this weighted_losses[i] function? As you may find, it is an element of the list of (augmented) loss functions:

weighted_losses = [
    weighted_masked_objective(fn) for fn in loss_functions]

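fn is actually one of the loss functions defined in the losses.py file, or it may be a user-defined custom loss function. And what is this weighted_masked_objective function? It is defined in the training_utils.py file: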

def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.
    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.
    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.
    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.
        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.
        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            #  the loss per batch should be proportional
            #  to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted

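As you can see, the per-sample loss is computed first, in the line score_array = fn(y_true, y_pred), and at the end the average of the losses is returned, i.e. return K.mean(score_array). This confirms that the reported loss is the average of the per-sample losses in each batch.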
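
The effect on batch size can be checked with a simplified NumPy paraphrase of the wrapper. This is a sketch under assumptions, not the Keras code: mask support and the weight-dimension reduction are left out, and the names mse and weighted and the random data are illustrative only:

```python
import numpy as np

def mse(y_true, y_pred):
    # Per-sample loss: mean over the last (unit) axis, batch axis kept.
    return np.mean((y_true - y_pred) ** 2, axis=-1)

def weighted(fn, y_true, y_pred, weights=None):
    """Simplified paraphrase of the wrapper above (no mask handling)."""
    score_array = fn(y_true, y_pred)
    if weights is not None:
        score_array = score_array * weights
        score_array = score_array / np.mean((weights != 0).astype(float))
    return np.mean(score_array)  # final average over the batch

rng = np.random.default_rng(0)
y_true = rng.normal(size=(32, 5))
y_pred = rng.normal(size=(32, 5))

loss_32 = weighted(mse, y_true, y_pred)
# Doubling the batch by repeating the same samples leaves the mean unchanged.
loss_64 = weighted(mse, np.tile(y_true, (2, 1)), np.tile(y_pred, (2, 1)))
print(loss_32, loss_64)
```

Because the last step is a mean, the returned scalar depends on the typical per-sample loss and not on how many samples the batch contains.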

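Note that K.mean(), when TensorFlow is the backend, calls the tf.reduce_mean() function. When K.mean() is called without an axis argument (the default value of axis is None), as it is in the weighted_masked_objective function, the corresponding call to tf.reduce_mean() computes the mean over all axes and returns a single value. That is why, regardless of the shape of the output layer and the loss function used, Keras uses and reports only one loss value (and it should be this way, because optimization algorithms need to minimize a scalar value, not a vector or tensor).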
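
NumPy's mean follows the same axis convention, so the difference between reducing one axis and reducing all axes can be illustrated without a backend. The array below is a made-up stand-in for a loss tensor:

```python
import numpy as np

# Hypothetical loss tensor: batch of 4 samples, 3 timesteps, 2 units.
scores = np.arange(24, dtype=float).reshape(4, 3, 2)

per_sample = scores.mean(axis=-1)  # reduces only the last axis
overall = scores.mean()            # axis=None: reduces every axis

print(per_sample.shape)  # (4, 3)
print(overall)           # 11.5
```

Whatever the shape of the input, the call without an axis argument yields a single scalar, which is what the optimizer needs.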