How do I make sure Keras is using the GPU with the TensorFlow backend?


I created a virtual notebook on Paperspace's cloud infrastructure, with a TensorFlow GPU P5000 virtual instance as the backend. However, when I start training my network, it runs 2x slower than on my MacBook Pro with a pure-CPU runtime engine. How can I make sure the Keras NN uses the GPU instead of the CPU during training?

Please find my code below:

from tensorflow.contrib.keras.api.keras.models import Sequential
from tensorflow.contrib.keras.api.keras.layers import Dense
from tensorflow.contrib.keras.api.keras.layers import Dropout
from tensorflow.contrib.keras.api.keras import utils as np_utils
import numpy as np
import pandas as pd

# Read data
pddata = pd.read_csv('data/data.csv', delimiter=';')

# Helper function (prepare & test data)
def split_to_train_test(data):
    trainLength = len(data) - len(data)//10

    trainData = data.loc[:trainLength].sample(frac=1).reset_index(drop=True)
    testData = data.loc[trainLength+1:].sample(frac=1).reset_index(drop=True)

    trainLabels = trainData.loc[:,"Label"].as_matrix()
    testLabels = testData.loc[:,"Label"].as_matrix()

    trainData = trainData.loc[:,"Feature 0":].as_matrix()
    testData  = testData.loc[:,"Feature 0":].as_matrix()

    return (trainData, testData, trainLabels, testLabels)

# prepare train & test data
(X_train, X_test, y_train, y_test) = split_to_train_test(pddata)

# Convert labels to one-hot notation
Y_train = np_utils.to_categorical(y_train, 3)
Y_test  = np_utils.to_categorical(y_test, 3)

# Define model in Keras
def create_model(init):
    model = Sequential()
    model.add(Dense(101, input_shape=(101,), kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(3, kernel_initializer=init, activation='softmax'))
    return model

# Train the model
uniform_model = create_model("glorot_normal")
uniform_model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
uniform_model.fit(X_train, Y_train, batch_size=1, epochs=300, verbose=1, validation_data=(X_test, Y_test)) 

Not sure it's the best way, but you could create a huge batch and train on it. If you get an OOM error, it's the GPU; if it freezes your computer, it's the CPU. - Daniel Möller
Another thing you could try is forcing the GPU device before declaring the model: with tf.device('/gpu:0'): (see the sketch after these comments). - Daniel Möller
When I set batch_size to 32 or even 64, the program behaves the same way: execution slows down. It is still two times slower than the pure-CPU run on my MacBook Pro with the same settings. - Yury Kochubeev
Changed the code to run with with tf.device('/gpu:0'):, but execution is still very slow compared to my MacBook Pro... - Yury Kochubeev
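
A minimal sketch of the device pinning Daniel suggests, applied to the create_model helper from the question ('/gpu:0' is the TF 1.x device string; everything else is assumed unchanged from the code above):

import tensorflow as tf

# Build, compile and fit inside the device scope so the model's variables
# and ops are explicitly placed on the first GPU
with tf.device('/gpu:0'):
    uniform_model = create_model("glorot_normal")
    uniform_model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    uniform_model.fit(X_train, Y_train, batch_size=1, epochs=300, verbose=1, validation_data=(X_test, Y_test))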
3 Answers


You need to run your network with log_device_placement = True set in the TensorFlow session (the second-to-last line of the example code below.) Interestingly, if you set that in a session, it still applies when Keras does the fitting. So the code below (tested) does output the placement of each tensor. Please note that I've short-circuited the data reading, since your data is not available, so I'm just running the network on random data. Written this way, the code is self-contained and runnable by anyone. One more note: if you run this from a Jupyter Notebook, the log_device_placement output goes to the terminal from which Jupyter Notebook was started, not to the output of the notebook cell.

from tensorflow.contrib.keras.api.keras.models import Sequential
from tensorflow.contrib.keras.api.keras.layers import Dense
from tensorflow.contrib.keras.api.keras.layers import Dropout
from tensorflow.contrib.keras.api.keras import utils as np_utils
import numpy as np
import pandas as pd
import tensorflow as tf

# Read data
#pddata=pd.read_csv('data/data.csv', delimiter=';')
pddata = "foobar"

# Helper function (prepare & test data)
def split_to_train_test(data):

    # Short-circuited: return random data of the right shapes so the example
    # is self-contained; the original preparation below is never reached
    return (
        np.random.uniform( size = ( 100, 101 ) ),
        np.random.uniform( size = ( 100, 101 ) ),
        np.random.randint( 0, size = ( 100 ), high = 3 ),
        np.random.randint( 0, size = ( 100 ), high = 3 )
    )

    trainLength = len(data) - len(data)//10

    trainData = data.loc[:trainLength].sample(frac=1).reset_index(drop=True)
    testData = data.loc[trainLength+1:].sample(frac=1).reset_index(drop=True)

    trainLabels = trainData.loc[:,"Label"].as_matrix()
    testLabels = testData.loc[:,"Label"].as_matrix()

    trainData = trainData.loc[:,"Feature 0":].as_matrix()
    testData  = testData.loc[:,"Feature 0":].as_matrix()

    return (trainData, testData, trainLabels, testLabels)

# prepare train & test data
(X_train, X_test, y_train, y_test) = split_to_train_test(pddata)

# Convert labels to one-hot notation
Y_train = np_utils.to_categorical(y_train, 3)
Y_test  = np_utils.to_categorical(y_test, 3)

# Define model in Keras
def create_model(init):
    model = Sequential()
    model.add(Dense(101, input_shape=(101,), kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(101, kernel_initializer=init, activation='tanh'))
    model.add(Dense(3, kernel_initializer=init, activation='softmax'))
    return model

# Train the model
uniform_model = create_model("glorot_normal")
uniform_model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
with tf.Session( config = tf.ConfigProto( log_device_placement = True ) ):
    uniform_model.fit(X_train, Y_train, batch_size=1, epochs=300, verbose=1, validation_data=(X_test, Y_test)) 

Terminal output (partial; it was far too long to include in full):

...
VarIsInitializedOp_13: (VarIsInitializedOp): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-21 21:54:33.485870: I tensorflow/core/common_runtime/placer.cc:884]
VarIsInitializedOp_13: (VarIsInitializedOp)/job:localhost/replica:0/task:0/device:GPU:0
training/SGD/mul_18/ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-21 21:54:33.485895: I tensorflow/core/common_runtime/placer.cc:884]
training/SGD/mul_18/ReadVariableOp: (ReadVariableOp)/job:localhost/replica:0/task:0/device:GPU:0
training/SGD/Variable_9/Read/ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0
2018-04-21 21:54:33.485903: I tensorflow/core/common_runtime/placer.cc:884]
training/SGD/Variable_9/Read/ReadVariableOp: (ReadVariableOp)/job:localhost/replica:0/task:0/device:GPU:0
...

Notice the GPU:0 at the end of many lines.

Relevant page of the TensorFlow manual: Using GPUs: Logging Device Placement


Yes, you're right - log_device_placement - shows that my training runs on the GPU... What's strange is that an epoch takes 230 s on the GPU, but only 120 s on the MacBook... - Yury Kochubeev
You could also try it on https://colab.research.google.com. Make sure you go to Runtime, Change runtime type, and set Hardware accelerator to GPU. See if it's any faster. If it is, the service you're using may simply not be fast enough... - Peter Szoldan
Thanks Peter, my code does indeed use the GPU, but strangely, GPU execution is even slower than the CPU... - Yury Kochubeev
Maybe try a different cloud provider than Paperspace and see if it gets better. Colab is free, by the way. It could be that Paperspace has too many customers and their GPUs are overloaded. - Peter Szoldan

Put the following near the top of your Jupyter notebook. Comment out what you don't need.

# confirm TensorFlow sees the GPU
from tensorflow.python.client import device_lib
assert 'GPU' in str(device_lib.list_local_devices())

# confirm Keras sees the GPU (for TensorFlow 1.X + Keras)
from keras import backend
assert len(backend.tensorflow_backend._get_available_gpus()) > 0

# confirm PyTorch sees the GPU
from torch import cuda
assert cuda.is_available()
assert cuda.device_count() > 0
print(cuda.get_device_name(cuda.current_device()))

Note: with the release of TensorFlow 2.0, Keras is now included as part of the TF API.

Original answer here.



Considering that Keras has been a built-in module of TensorFlow since version 2.0:

import tensorflow as tf
tf.test.is_built_with_cuda()
tf.test.is_gpu_available(cuda_only=True)

Note: the latter method can take several minutes to run.
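
As a side note, tf.test.is_gpu_available is deprecated in later TF 2.x releases in favor of the tf.config API; a minimal sketch of the equivalent check (assumes TF >= 2.1):

import tensorflow as tf

# True if this TF build was compiled with CUDA support
print(tf.test.is_built_with_cuda())

# Physical GPUs visible to TF; an empty list means no usable GPU
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
assert len(gpus) > 0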

