How should weight decay be properly implemented for the Adam optimizer?

26

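Since the Adam optimizer keeps a pair of running averages of the gradients (a mean and a variance), I wonder how it should properly handle weight decay. I have seen two ways of implementing it.

  1. Update the mean/variance only from the gradients of the objective loss, and decay the weights explicitly at each mini-batch. (The following code is taken from https://github.com/dmlc/mxnet/blob/v0.7.0/python/mxnet/optimizer.py)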

    weight[:] -= lr*mean/(sqrt(variance) + self.epsilon)
    
    wd = self._get_wd(index)
    if wd > 0.:
        weight[:] -= (lr * wd) * weight
    
  2. Update the mean/variance from the gradients of the objective loss plus the regularization loss, and update the weights as usual. (The following code is taken from https://github.com/dmlc/mxnet/blob/master/src/operator/optimizer_op-inl.h#L210)

    grad = scalar<DType>(param.rescale_grad) * grad +
        scalar<DType>(param.wd) * weight;
    // stuff
    Assign(out, req[0],
           weight -
           scalar<DType>(param.lr) * mean /
           (F<square_root>(var) + scalar<DType>(param.epsilon)));
    

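These two approaches sometimes produce significantly different training results. I actually think the first one makes more sense (and I find that it sometimes gives better results). Caffe and older versions of mxnet follow the first approach, while torch, tensorflow and newer versions of mxnet follow the second.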
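To make the difference between the two approaches concrete, here is a minimal NumPy sketch. It is not the MXNet code: the function name `adam_step`, the `decoupled` flag, the toy loss and all hyperparameter values are assumptions chosen for illustration, and bias correction is left out to keep it short.

```python
import numpy as np

def adam_step(w, grad, m, v, lr=0.1, wd=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, decoupled=True):
    """One simplified Adam step (no bias correction) with either style of decay."""
    if not decoupled:
        # Approach 2: fold wd * w into the gradient, so it enters mean/variance.
        grad = grad + wd * w
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    w = w - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        # Approach 1: shrink the weights directly, outside the moment estimates.
        w = w - lr * wd * w
    return w, m, v

def run(decoupled, steps=5):
    w = np.array([1.0, -2.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for _ in range(steps):
        grad = 0.5 * w  # gradient of the toy loss 0.25 * ||w||^2 (an assumption)
        w, m, v = adam_step(w, grad, m, v, decoupled=decoupled)
    return w

w_decoupled = run(decoupled=True)
w_coupled = run(decoupled=False)
print(w_decoupled, w_coupled)
```

In this toy setting the two variants already disagree after a single step, because in approach 2 the decay term is rescaled by the variance estimate along with the rest of the gradient.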

I would really appreciate your help!


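The difference between the two is huge for low bit-width training; my guess is that weight regularization can have a negative effect in that case. (This may also apply to other similar situations.) - Kato
Are you sure tensorflow supports weight decay for AdamOptimizer? I just looked at the code and didn't see anything about weight decay. https://github.com/tensorflow/tensorflow/blob/9bdb72e124e50e1b12b3286b38cbb1c971552741/tensorflow/core/kernels/training_ops.cc#L284 - iron9light
2 Answers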

25

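Edit: see also this PR, which has been merged into TF.

When using pure SGD (without momentum) as the optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. With any other optimizer, this is not true.

Weight decay (I don't know how to use TeX here, so excuse my pseudo-notation):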

w[t+1] = w[t] - learning_rate * dw - weight_decay * w

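L2-regularization:

loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)

Computing the gradient of the extra term in L2-regularization gives lambda * w, so inserting it into the SGD update equation

dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dloss_dw

gives the same result as weight decay, except that lambda gets mixed with the learning_rate (that is, weight_decay = learning_rate * lambda). Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)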
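The equivalence for plain SGD, and how it breaks once momentum is added, can be checked numerically. The following is only a sketch: the toy gradient `grad_actual`, the helper names and the values of `lr`, `lam` and `mu` are assumptions chosen for illustration.

```python
import numpy as np

def grad_actual(w):
    # Gradient of a toy loss 0.5 * ||w - 1||^2 (an assumption for illustration).
    return w - 1.0

lr, lam = 0.1, 0.5
weight_decay = lr * lam  # lambda mixed with the learning rate

def sgd_weight_decay(w):
    return w - lr * grad_actual(w) - weight_decay * w

def sgd_l2(w):
    dloss_dw = grad_actual(w) + lam * w
    return w - lr * dloss_dw

def momentum_weight_decay(w, buf, mu=0.9):
    # The decay is applied directly to the weights, outside the momentum buffer.
    buf = mu * buf + grad_actual(w)
    return w - lr * buf - weight_decay * w, buf

def momentum_l2(w, buf, mu=0.9):
    # The L2 gradient is accumulated in the momentum buffer.
    buf = mu * buf + grad_actual(w) + lam * w
    return w - lr * buf, buf

w0 = np.array([2.0, -3.0])

w_wd, w_l2 = w0.copy(), w0.copy()
for _ in range(3):
    w_wd, w_l2 = sgd_weight_decay(w_wd), sgd_l2(w_l2)
print(np.allclose(w_wd, w_l2))  # plain SGD: the two rules coincide

m_wd, m_l2 = w0.copy(), w0.copy()
b_wd, b_l2 = np.zeros(2), np.zeros(2)
for _ in range(3):
    m_wd, b_wd = momentum_weight_decay(m_wd, b_wd)
    m_l2, b_l2 = momentum_l2(m_l2, b_l2)
print(np.allclose(m_wd, m_l2))  # SGD with momentum: the two rules drift apart
```

With momentum the two rules still agree on the first step (the buffer starts at zero) and start to differ from the second step on, since past L2 gradients keep being replayed through the buffer.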

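That being said, TensorFlow doesn't seem to support "proper" weight decay yet. There are a few issues discussing it, specifically because of the paper above.

One possible way to implement it is to write an op that performs the decay step manually after every optimizer step. Another way, which is what I currently do, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude workarounds, though. My current code: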

# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    ...  # define the network.

loss = ...  # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))

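This makes some use of the bookkeeping TensorFlow provides. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph key; I then sum them all up and optimize them with SGD, which, as shown above, corresponds to actual weight decay.
Hope that helps. If anyone comes up with a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.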
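Why a separate SGD optimizer with learning_rate=1.0 amounts to weight decay can be seen in a few lines of NumPy. This sketch assumes the regularizer has the form scale * 0.5 * sum(w ** 2); the helper `l2_penalty_grad` and the numbers are made up for illustration.

```python
import numpy as np

weight_decay = 0.01
w = np.array([0.5, -1.5, 2.0])

def l2_penalty_grad(w, scale):
    # Gradient of scale * 0.5 * sum(w ** 2) with respect to w.
    return scale * w

# One plain gradient-descent step on the penalty alone, with learning rate 1.0.
w_after = w - 1.0 * l2_penalty_grad(w, weight_decay)

# The step is the same as multiplying every weight by (1 - weight_decay).
print(np.allclose(w_after, (1.0 - weight_decay) * w))
```

Because this extra step uses its own fixed learning rate, the amount of shrinkage does not depend on the learning rate or the moment estimates of the main optimizer.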

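I completely agree with you that adding momentum or using an adaptive optimizer means the effective weight-decay term differs from the one in pure SGD. However, my understanding of the term "weight-decay" has always been (apparently wrongly) that it is just the name practitioners gave to L2 regularization, because when you implement L2 for SGD it looks like you are exponentially decaying the weights. The TF and MXNet implementations also seem to match my understanding. But as you point out, "weight-decay" appears to be a regularization technique in its own right. - Sina Afrooze
I must add that, based on the 1988 paper comparing network biases (a.k.a. regularizers), weight decay was regarded as an "ad hoc" method of improving generalization, and that paper shows it is equivalent to a quadratic bias, i.e. L2 regularization with pure SGD. Given that adaptive optimizers were invented only recently, could one argue that weight decay really is L2 regularization and that the implementations in MXNet and TF are correct? - Sina Afrooze
In that case, maybe we need a different name for what is proposed in the AdamW paper :) - Sina Afrooze
I disagree: we already have two names (weight decay and L2 regularization) for two different techniques that coincide in only one special case. Unfortunately, many academics have conflated them. We can go back to Hinton's 1987 paper, which, as far as I know, first introduced weight decay, literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" (page 10). - LucasB
I believe you're right. Did you file an issue on TF? There is this issue on MXNet. - Sina Afrooze
Thanks for agreeing :) Last time I looked there were a few issues about it in the TF repository, but looking again now, it seems this PR was merged just two days ago! I think it is unnecessarily complicated, but hey, that's the TF way, so be it. - LucasB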

5

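I ran into the same problem. I think this code, which I got from here, will solve it for you. It implements an Adam optimizer with weight decay by inheriting from tf.train.Optimizer. This is the cleanest solution I have found: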

import re

import tensorflow as tf  # TF1-style API (tf.train.Optimizer, tf.get_variable)


class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """A basic Adam optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    """Constructs a AdamWeightDecayOptimizer."""
    super(AdamWeightDecayOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """See base class."""
    assignments = []
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=param_name + "/adam_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=param_name + "/adam_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want to decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      update_with_lr = self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name
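For readers who want to check the arithmetic without building a TF1 graph, the per-parameter update performed by `apply_gradients` can be mirrored in NumPy. This is a sketch for a single parameter array: the function name `adam_weight_decay_step` and the default values are assumptions, and, like the class above, it applies no bias correction.

```python
import numpy as np

def adam_weight_decay_step(param, grad, m, v, lr=1e-3, weight_decay_rate=0.01,
                           beta_1=0.9, beta_2=0.999, epsilon=1e-6):
    """One update for one parameter array; returns (next_param, next_m, next_v)."""
    # Moment estimates are computed from the loss gradient only.
    next_m = beta_1 * m + (1.0 - beta_1) * grad
    next_v = beta_2 * v + (1.0 - beta_2) * np.square(grad)
    update = next_m / (np.sqrt(next_v) + epsilon)
    # The decay term is added after the Adam normalization, so it is not
    # rescaled by the variance estimate.
    update = update + weight_decay_rate * param
    return param - lr * update, next_m, next_v

param = np.array([1.0, -2.0])
zeros = np.zeros_like(param)

# With a zero gradient only the decay acts: every weight shrinks by the
# factor (1 - lr * weight_decay_rate).
new_param, new_m, new_v = adam_weight_decay_step(param, zeros, zeros, zeros)
print(new_param / param)
```

Note that the decay is multiplied by the learning rate here, so it follows any learning-rate schedule that is applied.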

You can use it as follows (I made a few changes so that it works in more general situations). This function returns a train_op that can be run in a Session:

def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps):
  """Creates an optimizer training op."""
  global_step = tf.train.get_or_create_global_step()

  learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)

  # Implements linear decay of the learning rate.
  learning_rate = tf.train.polynomial_decay(
      learning_rate,
      global_step,
      num_train_steps,
      end_learning_rate=0.0,
      power=1.0,
      cycle=False)

  # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
  # learning rate will be `global_step/num_warmup_steps * init_lr`.
  if num_warmup_steps:
    global_steps_int = tf.cast(global_step, tf.int32)
    warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)

    global_steps_float = tf.cast(global_steps_int, tf.float32)
    warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)

    warmup_percent_done = global_steps_float / warmup_steps_float
    warmup_learning_rate = init_lr * warmup_percent_done

    is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
    learning_rate = (
        (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)

  # It is recommended that you use this optimizer for fine tuning, since this
  # is how the model was trained (note that the Adam m/v variables are NOT
  # loaded from init_checkpoint.)
  optimizer = AdamWeightDecayOptimizer(
      learning_rate=learning_rate,
      weight_decay_rate=0.01,
      beta_1=0.9,
      beta_2=0.999,
      epsilon=1e-6)


  tvars = tf.trainable_variables()
  grads = tf.gradients(loss, tvars)

  # You can clip gradients at this step if needed (in general it is not necessary).
  # (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)

  train_op = optimizer.apply_gradients(
      zip(grads, tvars), global_step=global_step)

  # Normally the global step update is done inside of `apply_gradients`.
  # However, `AdamWeightDecayOptimizer` doesn't do this. But if you use
  # a different optimizer, you should probably take this line out.
  new_global_step = global_step + 1
  train_op = tf.group(train_op, [global_step.assign(new_global_step)])
  return train_op
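The learning-rate schedule built inside `create_optimizer` (linear warmup followed by linear decay to zero) can be written as a plain Python function to see what value is used at a given step. This is a sketch only; the function name `warmup_linear_decay_lr` and the example numbers are assumptions.

```python
def warmup_linear_decay_lr(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup: ramp up from 0 to init_lr over num_warmup_steps.
        return init_lr * step / num_warmup_steps
    # Linear decay (power=1.0, end_learning_rate=0.0, no cycling): the step is
    # clamped so the rate stays at zero after num_train_steps.
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)

for s in (0, 5, 10, 50, 100, 150):
    print(s, warmup_linear_decay_lr(s, init_lr=1e-3, num_train_steps=100,
                                    num_warmup_steps=10))
```

As in the graph version, the decay is measured from step 0 rather than from the end of warmup, so the rate right after warmup is already slightly below init_lr.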

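The cleanest implementation I have seen! - Raghotham S
Note that this code is only compatible with TF1 (e.g. the tf.get_variable() method). I suggest updating the code to TF2 or using TensorFlow-Addons (tfa), which already implements tfa.optimizers.AdamW. - Triceratops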
