Keras does not train any faster on the GPU (partial GPU usage?!)

Question

12

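I am trying to train my model on the GPU instead of the CPU, from a Jupyter Notebook running on an AWS p2.xlarge instance. I am using the tensorflow-gpu backend (only tensorflow-gpu, not tensorflow, is installed and listed in requirements.txt).

When training the model I see no speed-up compared to the CPU. In fact, each epoch takes almost as long as it does on my 4-core laptop CPU (the p2.xlarge also has 4 vCPUs, along with a Tesla K80 GPU). I am not sure whether I need to change my code to take advantage of the faster, parallel processing the GPU can offer. My model code is below: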

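# Imports are assumed from the names used below (standalone Keras 2.x).
from keras.models import Sequential
from keras.layers import recurrent, core

# Two stacked LSTM layers followed by a 3-class softmax classifier.
model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# batch_size is not set, so Keras uses its default of 32.
model.fit(X_np, y_np, epochs=100, validation_split=0.25)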

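Interestingly, whenever I check the GPU status with nvidia-smi during training, the GPU shows 50%-60% utilization and almost all of its memory in use (both drop to 0% and 1MiB when nothing is training):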

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    73W / 149W |  10919MiB / 11439MiB |     52%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1665      C   ...ubuntu/aDash/MLenv/bin/python 10906MiB |
+-----------------------------------------------------------------------------+

In case you want to see the GPU-related logs from my Jupyter Notebook, here they are:

[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7

2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)

Please suggest what the problem could be. Either way, thanks for taking a look!

- Ishaan Sejwal

Could you also post your CPU usage? The bottleneck may be in the part that feeds data to the model. - dgumo

1

What is the size of your dataset (the shapes of X_np and y_np)? - Abderrahim Kitouni

@AbderrahimKitouni The input and target shapes are 34000x7x5 (samples x time steps x features) and 34000x1, respectively. - Ishaan Sejwal

3 Answers

3

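Try a larger batch_size in model.fit, since the default is 32. Increase it until you reach 100% CPU utilization.

Following @dgumo's suggestion, you can also put your data in /run/shm. This is an in-memory file system that gives you the fastest possible access to the data. Alternatively, make sure the data at least lives on an SSD, for example in /tmp.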
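
To make the batch_size suggestion concrete, here is a minimal arithmetic sketch of how many weight updates one epoch involves for several batch sizes. The sample count and validation_split come from the question; the helper name steps_per_epoch and the candidate batch sizes are only illustrative. It assumes Keras holds out the validation fraction before batching, which is how validation_split is documented to behave.

```python
import math


def steps_per_epoch(n_samples, validation_split, batch_size):
    """Approximate number of weight updates in one epoch of model.fit."""
    # The validation fraction is removed first; only the rest is batched.
    n_train = int(n_samples * (1.0 - validation_split))
    return math.ceil(n_train / batch_size)


# 34000 samples and validation_split=0.25 are taken from the question.
for batch_size in (32, 256, 1024, 4096):
    print(batch_size, steps_per_epoch(34000, 0.25, batch_size))
```

Each update carries some fixed overhead (launching GPU kernels, moving a batch to the device), so fewer and larger steps per epoch tend to keep the GPU busier. How much this helps in practice has to be measured on the actual model.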

- mcsim

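Agreed. LSTMs are slow to train, but increasing the batch size should help a lot. You could also try other recurrent layer types, such as GRU or the newly released SRU (https://github.com/titu1994/keras-SRU). - Coolness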

0

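In your case the bottleneck is transferring data to and from the GPU. The best way to speed up the computation (and maximize GPU usage) is to load as much data as possible into memory at once. Since you have enough memory, you can put all of your data in at once like this: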

model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=X_np.shape[0])

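(In this case) you should also increase the number of epochs.

Note, however, that mini-batch training has advantages of its own (for example, it handles local minima better), so you may want to choose a batch size somewhere in between.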
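
As a rough check on the two claims above (that the data fits in memory, and that full-batch training needs more epochs), here is a small arithmetic sketch. The shapes come from the question's comments; float32 storage and the helper names array_megabytes and total_updates are assumptions made for illustration.

```python
def array_megabytes(shape, bytes_per_value=4):
    """Size of a dense array in MB, assuming float32 (4 bytes per value)."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * bytes_per_value / 1e6


def total_updates(n_train, batch_size, epochs):
    """Total number of weight updates over a whole training run."""
    steps = -(-n_train // batch_size)  # ceiling division
    return steps * epochs


# X_np has shape 34000x7x5; validation_split=0.25 leaves 25500 training rows.
print(array_megabytes((34000, 7, 5)))   # input size in MB
print(total_updates(25500, 32, 100))    # default batch size, 100 epochs
print(total_updates(25500, 34000, 100)) # batch_size=X_np.shape[0], 100 epochs
```

With the whole dataset in one batch, each epoch is a single update, so 100 epochs give only 100 updates. That is the reason for raising the number of epochs, and it is also why an intermediate batch size may be the better trade-off.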

- Abderrahim Kitouni


- Daniel Möller · Accepted Answer

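That is because you are using LSTM layers.

TensorFlow's implementation of LSTM layers is not that great. The reason is probably that recurrent computations are not parallel computations, while GPUs are great at parallel processing.

I confirmed this from my own experience:

My model was terribly slow with LSTMs in it.
I decided to test the model with all LSTMs removed (which left a purely convolutional model).
The resulting speed was astonishing!

This article about using GPUs with TensorFlow confirms it as well: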

http://minimaxir.com/2017/07/cpu-or-gpu/

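Possible solutions?

You can try the new CuDNNLSTM, which seems to be made specifically for use on GPUs.

I have never tested it, but you will most likely get better performance with it.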
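
A minimal sketch of what swapping in CuDNNLSTM could look like for the model in the question. It assumes standalone Keras 2.0.9 to 2.3 with the TensorFlow GPU backend, where keras.layers.CuDNNLSTM exists; the helper names lstm_class_name and build_model are hypothetical. CuDNNLSTM only runs on a GPU and does not accept options such as a custom activation or recurrent_dropout, so the sketch falls back to the regular LSTM when no GPU is available.

```python
def lstm_class_name(gpu_available):
    """Pick the layer class: the cuDNN-backed layer only runs on a GPU."""
    return "CuDNNLSTM" if gpu_available else "LSTM"


def build_model(timesteps, features, gpu_available=True):
    """Rebuild the question's model with the chosen recurrent layer class."""
    # Imported here so the helper above can be used without Keras installed.
    from keras import layers
    from keras.models import Sequential

    rnn = getattr(layers, lstm_class_name(gpu_available))
    model = Sequential()
    model.add(rnn(64, input_shape=(timesteps, features), return_sequences=True))
    model.add(rnn(64))  # return_sequences defaults to False
    model.add(layers.Dropout(0.1))
    model.add(layers.Dense(3, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model
```

It would be called as build_model(X_np.shape[1], X_np.shape[2]). In TensorFlow 2.x the separate class is not needed, because tf.keras.layers.LSTM selects the cuDNN kernel on its own when run on a GPU with its default arguments.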

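Another thing I have not tested, and I am not sure it was designed for this purpose, though I suspect it was: you can set unroll=True in the LSTM layers. I suspect that doing so turns the recurrent computation into a parallel one.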
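
To illustrate what unrolling does, here is a toy scalar recurrence (not an LSTM; the function names and weights are made up for illustration) written once as a loop and once unrolled by hand for three time steps.

```python
import math


def rnn_loop(xs, w, u, h0=0.0):
    """Toy recurrence h_t = tanh(w * x_t + u * h_{t-1}), written as a loop."""
    h = h0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h


def rnn_unrolled_3(xs, w, u, h0=0.0):
    """The same recurrence with the loop written out for three time steps."""
    h1 = math.tanh(w * xs[0] + u * h0)
    h2 = math.tanh(w * xs[1] + u * h1)  # needs h1 first
    h3 = math.tanh(w * xs[2] + u * h2)  # needs h2 first
    return h3


xs = [0.5, -1.0, 0.25]
print(rnn_loop(xs, 0.8, 0.3) == rnn_unrolled_3(xs, 0.8, 0.3))
```

Both versions perform the same operations in the same order, and each step still needs the previous hidden state. So unrolling removes loop overhead rather than making the time steps independent. The Keras documentation describes unroll as something that can speed up an RNN at the cost of more memory and is only suitable for short sequences. The 7 time steps in the question would qualify, though whether it helps here is untested.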