TensorFlow - regularization with L2 loss: how to apply it to all weights, not just the last layer?

68

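I am playing with an ANN that is part of the Udacity Deep Learning course.

My assignment is to introduce generalization, using L2 loss, into a network with one hidden ReLU layer. I wonder how to introduce it properly so that ALL weights are penalized, not only the weights of the output layer.

The code for the network without generalization is at the bottom of the post (the code that actually runs the training is out of the scope of the question).

The obvious way of introducing L2 is to replace the loss calculation with something like this (if beta is 0.01):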

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_train_labels) + 0.01*tf.nn.l2_loss(out_weights))

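But in this case it takes into account only the values of the output layer's weights. I am not sure how to properly penalize the weights that come INTO the hidden ReLU layer. Is that needed at all, or will penalizing the output layer somehow keep the hidden weights in check as well?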

#some importing
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
from six.moves import range

#loading data
pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)


#prepare data to have right format for tensorflow
#i.e. data is flat matrix, labels are onehot

image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)


#now is the interesting part - we are building a network with
#one hidden ReLU layer and our usual output linear layer

#we are going to use SGD so here is our size of batch
batch_size = 128

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #that's how many hidden neurons we want
  num_hidden_neurons = 1024
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels  
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))  

  out_biases = tf.Variable(tf.zeros([num_labels]))  

  #compute output  
  out_layer = tf.matmul(hidden_layer,out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=tf_train_labels))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(  tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases) 

  test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)

3
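An alternative to collecting all the weight variables manually is to add them to a collection, typically using tf.GraphKeys.REGULARIZATION_LOSSES. See this question for an example solution. - bluenote10
3 Answers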
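A minimal sketch of the collection approach mentioned in the comment above, not taken from the linked question. It assumes a TensorFlow build that ships the `tensorflow.compat.v1` module, and it stores the already-scaled penalty terms (rather than the raw variables) in the collection; the shapes and `beta` simply mirror the question's network.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

image_size, num_hidden_neurons, num_labels = 28, 1024, 10
beta = 0.01  # regularization strength, same value as in the question

graph = tf.Graph()
with graph.as_default():
    hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
    out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))

    # Register one penalty term per weight matrix (biases are left out)
    for w in (hidden_weights, out_weights):
        tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES,
                             beta * tf.nn.l2_loss(w))

    # Later, possibly in another part of the code, collect and sum them
    reg_terms = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    reg_loss = tf.add_n(reg_terms)
    init = tf.global_variables_initializer()

with tf.Session(graph=graph) as sess:
    sess.run(init)
    reg_value = sess.run(reg_loss)
    print('L2 penalty:', reg_value)
```

The resulting `reg_loss` would then be added to the mean cross-entropy to form the total loss.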

106

A shorter and more scalable way of doing this would be:

vars   = tf.trainable_variables() 
lossL2 = tf.add_n([ tf.nn.l2_loss(v) for v in vars ]) * 0.001

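This basically sums the l2_loss of all your trainable variables. You could also build a dictionary that contains only the variables you want to add to your cost and apply the second line above to it. You can then add lossL2 to your softmax cross-entropy value to calculate your total loss.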
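Written out, and using the fact that tf.nn.l2_loss(t) computes sum(t ** 2) / 2, the total loss described here would look roughly as follows. The symbols N (batch size) and CE_n (cross-entropy of example n) are notation introduced for this sketch, not names from the answer.

```latex
\mathrm{l2\_loss}(v) = \tfrac{1}{2}\sum_i v_i^2,
\qquad
L_{\text{total}} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{CE}_n
  \;+\; \beta \sum_{v \,\in\, \text{vars}} \tfrac{1}{2}\lVert v \rVert_2^2,
\qquad \beta = 0.001
```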

Edit: As mentioned by Piotr Dabkowski, the code above will also regularize the biases. This can be avoided by adding an if statement to the second line:

lossL2 = tf.add_n([ tf.nn.l2_loss(v) for v in vars
                    if 'bias' not in v.name ]) * 0.001

This can be used to exclude other variables.
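To make the name-based filter concrete without needing a TensorFlow graph, here is a small NumPy sketch of what the filtered sum computes. The variable names in the dictionary are hypothetical stand-ins for `v.name`, and `l2_loss` is a local helper that mirrors tf.nn.l2_loss (sum of squares divided by 2).

```python
import numpy as np

def l2_loss(t):
    """Same quantity as tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(np.square(t)) / 2.0

# Hypothetical variable names; in TensorFlow these would come from v.name
params = {
    "hidden_weights:0": np.full((3, 2), 2.0),
    "hidden_bias:0": np.full((2,), 5.0),
    "out_weights:0": np.full((2, 2), 1.0),
    "out_b:0": np.full((2,), 7.0),  # no "bias" in the name, so it is NOT excluded
}

beta = 0.001
kept = [name for name in params if "bias" not in name]
loss_l2 = beta * sum(l2_loss(params[name]) for name in kept)

print(kept)
print(loss_l2)
```

Note that `out_b:0` is still penalized because the filter only looks at the substring "bias", so the filter is only as reliable as the naming scheme.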


6
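Note that the way the list comprehension picks out the biases depends on the actual names of the tf variables, so if you did not name them with "bias" somewhere in the name, this example will not exclude them. - stolsvik
Of course! That is why I specified "This can be used to exclude other variables". Good thing to point out, thanks. - PhABC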

62

hidden_weights, hidden_biases, out_weights, and out_biases are all the model parameters that you are creating. You can add L2 regularization to ALL of these parameters as follows:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(hidden_biases) +
    0.01*tf.nn.l2_loss(out_weights) +
    0.01*tf.nn.l2_loss(out_biases))
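On the question of whether penalizing only the output layer keeps the hidden weights in check: the penalty term itself contributes a gradient of beta * W to each matrix that appears in it, and exactly zero to any matrix left out. The NumPy sketch below checks this numerically for a penalty on the output weights only; the array shapes, the `penalty` function, and the `numeric_grad` helper are illustrative assumptions. The hidden weights still receive gradients from the cross-entropy term, which is not modeled here.

```python
import numpy as np

def l2_loss(t):
    # Mirrors tf.nn.l2_loss: sum(t ** 2) / 2
    return np.sum(np.square(t)) / 2.0

def penalty(hidden_w, out_w, beta=0.01):
    # Penalize only the output weights, as in the question's first attempt
    return beta * l2_loss(out_w)

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one entry at a time
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
hidden_w = rng.standard_normal((3, 4))
out_w = rng.standard_normal((4, 2))

grad_out = numeric_grad(lambda w: penalty(hidden_w, w), out_w)
grad_hidden = numeric_grad(lambda w: penalty(w, out_w), hidden_w)

print(np.allclose(grad_out, 0.01 * out_w, atol=1e-6))  # penalty pulls out_w toward zero
print(np.allclose(grad_hidden, 0.0))                   # no direct pull on hidden_w
```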

Following Keith Johnson's comment, to leave the biases unregularized:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(out_weights))

9
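Why do we add L2 regularization to the bias terms? I do not think it is necessary to apply L2 regularization to them. - GoingMyWay
69
Regularization should be applied only to the weights, not to the biases. - Keith Johnson
5
@AlexanderYau: You are correct: "...for these reasons, we typically do not include bias terms when regularizing" (see here). - johndodo
Why use reduce_mean? Shouldn't the output of l2_loss be a scalar? - Swair
1
Why don't you divide by the number of samples? - SpaceMonkey
@Keith Johnson, can you explain? - mrgloom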
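On the reduce_mean question above: tf.nn.softmax_cross_entropy_with_logits returns one loss value per example, so reduce_mean averages that vector over the batch, while tf.nn.l2_loss is already a scalar. Adding the scalar inside or outside the mean gives the same number, as this NumPy sketch illustrates; the random vector is only a stand-in for the per-example cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the per-example cross-entropy, shape (batch_size,)
per_example_ce = rng.random(128)

# Scalar penalty, computed the way tf.nn.l2_loss does: sum(w ** 2) / 2
out_weights = rng.standard_normal((4, 3))
penalty = 0.01 * np.sum(out_weights ** 2) / 2.0

inside = np.mean(per_example_ce + penalty)    # penalty broadcast, then averaged
outside = np.mean(per_example_ce) + penalty   # penalty added after averaging

print(np.isclose(inside, outside))
```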

19

In fact, we usually do not regularize bias terms (intercepts). So, I go for:

loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=out_layer, labels=tf_train_labels)) +
    0.01*tf.nn.l2_loss(hidden_weights) +
    0.01*tf.nn.l2_loss(out_weights))

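Because the intercept is added to the y values, penalizing the intercept term changes the y values; it amounts to adding a constant c to the intercepts. Including it or not will not change the results, but it costs some extra computation.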


Content provided by Stack Overflow.