Theano performance issue when copying data to the GPU

Question

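I ran into performance problems while training a deep convolutional neural network with theano and lasagne, so I ran some experiments to track down the cause. One finding is that loading image batches from main memory onto the GPU seems to take a very long time. Below is a minimal example that illustrates the issue. It simply measures how long it takes to evaluate a Theano identity function on image batches of size 1, 2, 4, 8, 16, and 32. I am working with RGB images of size 448x448.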

from __future__ import print_function  # print() behaves the same on Python 2 and 3

import time

import numpy as np
import theano
import theano.tensor as T

# Identity function on a 4D float32 tensor: the output is simply the input.
var = T.ftensor4('inputs')
f = theano.function([var], var)

for batchsize in [2**i for i in range(6)]:  # 1, 2, 4, 8, 16, 32
    X = np.zeros((batchsize, 3, 448, 448), dtype=np.float32)
    print("Batchsize", batchsize)
    times = []
    for i in range(1000):
        # Time each call on its own.
        start = time.time()
        f(X)
        times.append(time.time() - start)
    print("-> Function evaluation takes:", np.mean(times), "+/-", np.std(times), "sec")

My results are as follows:

Batchsize 1
-> Function evaluation takes: 0.000177580833435 +/- 2.78762612138e-05 sec
Batchsize 2
-> Function evaluation takes: 0.000321553707123 +/- 2.4221262933e-05 sec
Batchsize 4
-> Function evaluation takes: 0.000669012069702 +/- 0.000896798280943 sec
Batchsize 8
-> Function evaluation takes: 0.00137474012375 +/- 0.0032982626882 sec
Batchsize 16
-> Function evaluation takes: 0.176659427643 +/- 0.0330068803715 sec
Batchsize 32
-> Function evaluation takes: 0.356572513342 +/- 0.074931685704 sec

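Note that the mean evaluation time jumps by a factor of more than 100 when the batch size goes from 8 to 16. Is this normal, or do I have some kind of technical problem? If so, do you have any idea where it might come from? Any help is appreciated. It would also help if you ran the snippet and reported what you see.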
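To put a number on the jump, the ratios between consecutive mean times can be computed directly from the figures reported above. The following is a minimal sketch: the names `mean_times`, `BYTES_PER_IMAGE`, and `consecutive_ratios` are illustrative, and the timings are copied from the output above. It also prints the raw size of each input batch, assuming float32 arrays of shape (batchsize, 3, 448, 448) as in the snippet.

```python
# Mean evaluation times in seconds, copied from the results above, keyed by batch size.
mean_times = {
    1: 0.000177580833435,
    2: 0.000321553707123,
    4: 0.000669012069702,
    8: 0.00137474012375,
    16: 0.176659427643,
    32: 0.356572513342,
}

# channels * height * width * sizeof(float32)
BYTES_PER_IMAGE = 3 * 448 * 448 * 4


def consecutive_ratios(times):
    """Return {(smaller, larger): time ratio} for neighbouring batch sizes."""
    sizes = sorted(times)
    return {(a, b): times[b] / times[a] for a, b in zip(sizes, sizes[1:])}


ratios = consecutive_ratios(mean_times)
for (a, b), r in sorted(ratios.items()):
    mib = b * BYTES_PER_IMAGE / 2.0**20
    print("%2d -> %2d: x%.1f (a batch of %d is %.2f MiB)" % (a, b, r, b, mib))
```

With these numbers, doubling the batch size roughly doubles the time at every step except 8 -> 16, where the ratio is well over 100.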

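Edit: Daniel Renshaw pointed out that this probably has nothing to do with host-GPU copying. Do you have any other ideas about what could cause the problem? Some more information: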

The theano debugprint of the function:

DeepCopyOp [@A] 'inputs'   0
 |inputs [@B]

Output of the Theano profiler:

Function profiling                                                      
================== 
Message: theano_test.py:14
Time in 6000 calls to Function.__call__: 3.711728e+03s
Time in Function.fn.__call__: 3.711528e+03s (99.995%)                       
Time in thunks: 3.711491e+03s (99.994%)
Total compile time: 6.542931e-01s
    Number of Apply nodes: 1
    Theano Optimizer time: 7.912159e-03s
        Theano validate time: 0.000000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 8.321500e-02s
        Import time 2.951717e-02s

Time in all call to theano.grad() 0.000000e+00s
Class 
---

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000       1   theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000        1   DeepCopyOp
... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
100.0%   100.0%     3711.491s       6.19e-01s   6000     0 DeepCopyOp(inputs)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

INFO (theano.gof.compilelock): Waiting for existing lock by process '3642' (I am process '22124')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/bal8rng/.theano/compiledir_Linux-3.16--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.10-64/lock_dir

THEANO_FLAGS: floatX=float32,device=gpu,optimizer_including=conv_meta,mode=FAST_RUN,blas.ldflags="-L/usr/lib/openblas-base -lopenblas",device=gpu3,assert_no_cpu_op=raise

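THEANO_FLAGS is an environment variable for setting Theano configuration options. The value above sets floatX to float32, requests a GPU device (device appears twice, once as gpu and once as gpu3, where gpu3 is the GPU with index 3, i.e. the fourth device when counting from zero), enables the conv_meta convolution meta-optimizer, uses FAST_RUN mode, links against OpenBLAS, and raises an error if the compiled graph contains a CPU op (assert_no_cpu_op=raise).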
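A quick way to spot repeated keys in such a flags string is to split it into key/value pairs. This is only an illustrative sketch: `split_flags` is a hypothetical helper, not Theano's own parser, and it assumes no value contains a comma (true for the string above).

```python
# The THEANO_FLAGS value quoted above.
flags = ('floatX=float32,device=gpu,optimizer_including=conv_meta,mode=FAST_RUN,'
         'blas.ldflags="-L/usr/lib/openblas-base -lopenblas",device=gpu3,'
         'assert_no_cpu_op=raise')


def split_flags(value):
    """Split a comma-separated key=value string into ordered (key, value) pairs."""
    pairs = []
    for item in value.split(','):
        key, _, val = item.partition('=')
        pairs.append((key.strip(), val.strip()))
    return pairs


# Collect every value seen for each key, in order of appearance.
seen = {}
for key, val in split_flags(flags):
    seen.setdefault(key, []).append(val)

duplicates = {k: v for k, v in seen.items() if len(v) > 1}
print(duplicates)
```

Here the only repeated key is device, with the values gpu and gpu3.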

- lballes

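Regarding your updated question: do you really care? Wouldn't it be more fruitful to explore the performance characteristics of a more meaningful computation? Or have you already established that this behaviour causes you problems in more realistic situations as well? - Daniel Renshaw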

1 Answer

- Daniel Renshaw · Accepted Answer

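Your computation is almost certainly not running on the GPU! As long as you are using the standard configuration flags, Theano's optimizer is smart enough to see that no operation is actually being performed, so it does not add any "move data to the GPU" or "move data back from the GPU" operations to the compiled computation. You can see this by adding the following line after the f = theano.function([var], var) line.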

theano.printing.debugprint(f)
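The debugprint output quoted in the question can be checked the same way. The sketch below scans a debugprint dump for transfer ops; it assumes they appear under the names GpuFromHost and HostFromGpu, as in Theano's older CUDA backend (device=gpu), and `transfer_ops_in` is a hypothetical helper rather than part of Theano.

```python
# Assumed names of the host<->GPU transfer ops in the old CUDA backend.
TRANSFER_OPS = ('GpuFromHost', 'HostFromGpu')


def transfer_ops_in(debug_text):
    """Return the transfer op names that occur in a debugprint dump."""
    return [name for name in TRANSFER_OPS if name in debug_text]


# debugprint output quoted in the question.
question_graph = """DeepCopyOp [@A] 'inputs'   0
 |inputs [@B]
"""

found = transfer_ops_in(question_graph)
print(found or "no host<->GPU transfer ops in this graph")
```

For the graph in the question the list is empty: the compiled function consists of a single DeepCopyOp and contains no transfer op.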

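If you want to get an idea of the overhead of moving data between the GPU and the CPU, I recommend using Theano's built-in profiling tools. Turn on profiling and look in the output for how much time is spent in the GpuFromHost and HostFromGpu operations. Of course, this has to be done with a more meaningful computation in which data actually needs to be moved.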
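Profiling can be turned on with the profile=True configuration flag, for example through THEANO_FLAGS. As a rough sketch of how the resulting "Ops" table could be read, the code below sums the apply time per op and reports the share spent in transfer ops. The regular expression is an assumption fitted to the table layout shown in the question, `op_times` is a hypothetical helper, and the op names GpuFromHost and HostFromGpu are assumed to be those of the older CUDA backend.

```python
import re

# One table row: <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
ROW = re.compile(
    r'^\s*([\d.]+)%\s+[\d.]+%\s+([\d.]+)s\s+\S+s\s+\S+\s+\d+\s+\d+\s+(\S.*?)\s*$'
)


def op_times(ops_table):
    """Map op name -> total apply time in seconds for every parsable row."""
    times = {}
    for line in ops_table.splitlines():
        m = ROW.match(line)
        if m:
            times[m.group(3)] = float(m.group(2))
    return times


# "Ops" section of the profile quoted in the question.
ops_table = """<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000        1   DeepCopyOp
... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
"""

times = op_times(ops_table)
transfer = sum(t for name, t in times.items()
               if name.startswith(('GpuFromHost', 'HostFromGpu')))
total = sum(times.values())
print("transfer time: %.3fs of %.3fs" % (transfer, total))
```

For the profile in the question, all of the time is attributed to DeepCopyOp and none to transfer ops, which matches the point that this particular function never moves data to the GPU.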

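That said, the results you are seeing are strange. Even if the computation is running on the CPU, I would not expect such a large step change as the batch size increases. But this may not matter to you once the computation is truly running on the GPU.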

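Incidentally, I ran your code on my server (with device=gpu in the configuration but, as explained above, actually still running on the CPU) and did not get the same huge step change; my time multipliers were 2.6, 1.9, 4.0, 3.9, 2.0 (i.e. going from batch size 1 to batch size 2 took 2.6 times as long, and so on).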