TensorFlow: running the training phase on the GPU and the test phase on the CPU

3

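I want to run the training phase of my TensorFlow code on the GPU and then, once training has finished and the results are saved, load the resulting model and run the test phase on the CPU.

I have written the code below. It is only an excerpt for reference, because the full code is too large. I know the rule is to include a complete, runnable example, and I apologise for not doing so.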

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.contrib.rnn.python.ops import rnn_cell, rnn

# Import MNIST data http://yann.lecun.com/exdb/mnist/
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
x_train = mnist.train.images 
# Check that the dataset contains 55,000 rows and 784 columns
N,D = x_train.shape

tf.reset_default_graph()
sess = tf.InteractiveSession()

x = tf.placeholder("float", [None, n_steps,n_input]) 
y_true = tf.placeholder("float", [None, n_classes]) 
keep_prob = tf.placeholder(tf.float32,shape=[])
learning_rate = tf.placeholder(tf.float32,shape=[]) 

#[............Build the RNN graph model.............]

sess.run(tf.global_variables_initializer())
# Because I am using my GPU for the training, I avoid allocating the whole
# mnist.validation set because of memory errors, so I fragment it into
# small batches (100)
x_validation_bin, y_validation_bin = mnist.validation.next_batch(batch_size)
x_validation_bin = binarize(x_validation_bin, threshold=0.1)
x_validation_bin = x_validation_bin.reshape((-1,n_steps,n_input))

for k in range(epochs):

    steps = 0

    for i in range(training_iters):
        #Stochastic descent
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        batch_x = binarize(batch_x, threshold=0.1)
        batch_x = batch_x.reshape((-1,n_steps,n_input))
        sess.run(train_step, feed_dict={x: batch_x, y_true: batch_y,keep_prob: keep_prob,eta:learning_rate})

        if do_report_err == 1:
            if steps % display_step == 0:
                # Calculate batch accuracy
                acc = sess.run(accuracy, feed_dict={x: batch_x, y_true: batch_y,keep_prob: 1.0})
                # Calculate batch loss
                loss = sess.run(total_loss, feed_dict={x: batch_x, y_true: batch_y,keep_prob: 1.0})
                print("Iter " + str(i) + ", Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy = " + "{:.5f}".format(acc))
        steps += 1




    # Validation Accuracy and Cost
    validation_accuracy = sess.run(accuracy,feed_dict={x:x_validation_bin, y_true:y_validation_bin, keep_prob:1.0})
    validation_cost = sess.run(total_loss,feed_dict={x:x_validation_bin, y_true:y_validation_bin, keep_prob:1.0})

    validation_loss_array.append(validation_cost)
    validation_accuracy_array.append(validation_accuracy)
    saver.save(sess, savefilename)
    total_epochs = total_epochs + 1

    np.savez(datasavefilename,epochs_saved = total_epochs,learning_rate_saved = learning_rate,keep_prob_saved = best_keep_prob, validation_loss_array_saved = validation_loss_array,validation_accuracy_array_saved = validation_accuracy_array,modelsavefilename = savefilename)

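After this, my model has been trained successfully and the relevant data has been saved, so I want to load the file and run the final training and test evaluation on the CPU. The reason is that the GPU cannot handle the whole mnist.train.images and mnist.train.labels dataset at once.

So I manually select this part and run it: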

with tf.device('/cpu:0'):
    # Initialise variables
    sess.run(tf.global_variables_initializer())

    # Accuracy and Cost
    saver.restore(sess, savefilename)
    x_train_bin = binarize(mnist.train.images, threshold=0.1)
    x_train_bin = x_train_bin.reshape((-1,n_steps,n_input))
    final_train_accuracy = sess.run(accuracy,feed_dict={x:x_train_bin, y_true:mnist.train.labels, keep_prob:1.0})
    final_train_cost = sess.run(total_loss,feed_dict={x:x_train_bin, y_true:mnist.train.labels, keep_prob:1.0})

    x_test_bin = binarize(mnist.test.images, threshold=0.1)
    x_test_bin = x_test_bin.reshape((-1,n_steps,n_input))
    final_test_accuracy = sess.run(accuracy,feed_dict={x:x_test_bin, y_true:mnist.test.labels, keep_prob:1.0})
    final_test_cost = sess.run(total_loss,feed_dict={x:x_test_bin, y_true:mnist.test.labels, keep_prob:1.0})

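But I get a GPU OOM (out-of-memory) error, which makes no sense to me, because I thought I had forced the program to use only the CPU. I did not put a sess.close() call in the first (batched training) code, but I am not sure whether that is really the cause. I actually followed this post for running on the CPU. Any suggestions on how to run only the last part on the CPU?

1 Answer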

4

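The with tf.device() statement applies only to graph construction, not to execution, so calling sess.run inside a device block is equivalent to having no device block at all.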
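A minimal sketch of that point, assuming a TensorFlow build that ships the tf.compat.v1 graph API. The names unpinned and pinned are hypothetical and not from the question. The device request is recorded on an op when the op is created, so wrapping sess.run in a device block afterwards leaves existing ops untouched:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # use TF1-style graphs and sessions

graph = tf.Graph()
with graph.as_default():
    # Built outside any device scope: no device request is recorded.
    unpinned = tf.constant([1.0, 2.0]) * 2.0
    # Built inside a device scope: the op itself carries the CPU request.
    with tf.device("/cpu:0"):
        pinned = tf.constant([1.0, 2.0]) * 2.0

with tf.Session(graph=graph) as sess:
    with tf.device("/cpu:0"):
        # No new ops are created in this block, so the scope has no effect:
        # `unpinned` still runs wherever TensorFlow chooses to place it.
        result = sess.run(unpinned)

print("unpinned device request:", repr(unpinned.device))
print("pinned device request:  ", repr(pinned.device))
```

In the question, accuracy and total_loss were built before the with tf.device('/cpu:0') block, so they keep whatever placement they had during training.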

To do what you want, you need to build separate training and test graphs that share variables.
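One way this could look is sketched below. It is an illustration under assumptions, not the asker's model: build_logits, N_INPUT, N_CLASSES and the single linear layer are hypothetical stand-ins for the RNN, and tf.compat.v1 is assumed to be available. The variables are created once on the CPU and reused through tf.AUTO_REUSE, while the forward pass is built twice, once per device:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # use TF1-style graphs and sessions

N_INPUT, N_CLASSES = 4, 2  # hypothetical sizes, for illustration only


def build_logits(x, compute_device):
    """Build the forward pass on `compute_device`, reusing shared variables."""
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        # Variables live on the CPU so both forward passes read the same weights.
        with tf.device("/cpu:0"):
            w = tf.get_variable("w", [N_INPUT, N_CLASSES],
                                initializer=tf.zeros_initializer())
            b = tf.get_variable("b", [N_CLASSES],
                                initializer=tf.zeros_initializer())
        # Only the compute ops are pinned to the requested device.
        with tf.device(compute_device):
            return tf.matmul(x, w) + b


graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, [None, N_INPUT])
    train_logits = build_logits(x, "/gpu:0")  # use for the training step
    test_logits = build_logits(x, "/cpu:0")   # use for full-dataset evaluation
    init = tf.global_variables_initializer()
    n_variables = len(tf.global_variables())  # still one `w` and one `b`

# allow_soft_placement lets the sketch run on machines without a GPU.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(init)
    test_out = sess.run(test_logits,
                        feed_dict={x: np.ones((3, N_INPUT), np.float32)})

print(train_logits.device, test_logits.device, test_out.shape)
```

In a real script the loss and train_step would be built from train_logits, and accuracy for the final evaluation from test_logits, so that evaluating on the whole dataset allocates its activations in host memory rather than on the GPU.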


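Thanks, very interesting. Do you mean building the training graph under with tf.device('gpu:0')'gpu:1' and then putting tf.Session() outside it for evaluation? Does this still work in TF2? I ran into some batch_norm/dropout problems during inference. - Zézouille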
