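While training deep convolutional neural networks with Theano and Lasagne, I ran into performance problems and ran some experiments to track them down. One finding was that loading image batches from main memory onto the GPU takes a very long time. Below is a minimal example that illustrates the problem. It simply measures how long it takes to evaluate a Theano identity function on image batches of size 1, 2, 4, 8, 16, ... I am working with RGB images of size 448x448.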
import time

import numpy as np
import theano
import theano.tensor as T

# Identity function on a 4D float32 tensor (batch, channels, height, width)
var = T.ftensor4('inputs')
f = theano.function([var], var)

for batchsize in [2**i for i in range(6)]:
    X = np.zeros((batchsize, 3, 448, 448), dtype=np.float32)
    print("Batchsize %d" % batchsize)
    times = []
    start = time.time()
    for i in range(1000):
        f(X)
        # Record the duration of this call, then restart the clock
        times.append(time.time() - start)
        start = time.time()
    print("-> Function evaluation takes: %s +/- %s sec" % (np.mean(times), np.std(times)))
My results are as follows:
Batchsize 1
-> Function evaluation takes: 0.000177580833435 +/- 2.78762612138e-05 sec
Batchsize 2
-> Function evaluation takes: 0.000321553707123 +/- 2.4221262933e-05 sec
Batchsize 4
-> Function evaluation takes: 0.000669012069702 +/- 0.000896798280943 sec
Batchsize 8
-> Function evaluation takes: 0.00137474012375 +/- 0.0032982626882 sec
Batchsize 16
-> Function evaluation takes: 0.176659427643 +/- 0.0330068803715 sec
Batchsize 32
-> Function evaluation takes: 0.356572513342 +/- 0.074931685704 sec
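Note the jump when going from batch size 8 to 16: the evaluation time increases by a factor of more than 100. Is this normal, or do I have some technical problem? If so, do you have any idea where it might come from? Thanks for any help. It would also help if you ran the snippet and reported what you see.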
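To put the jump in context, it may help to know how much data each batch actually contains. The sketch below only does the arithmetic for dense float32 batches of shape (batchsize, 3, 448, 448); the helper name `batch_nbytes` is made up for illustration. It shows that the slowdown sets in between roughly 18 MiB (batch size 8) and 37 MiB (batch size 16) per batch.

```python
import numpy as np


def batch_nbytes(batchsize, channels=3, height=448, width=448, dtype=np.float32):
    """Return the size in bytes of one dense image batch (hypothetical helper)."""
    return batchsize * channels * height * width * np.dtype(dtype).itemsize


for batchsize in [2**i for i in range(6)]:
    nbytes = batch_nbytes(batchsize)
    # 2.0**20 bytes = 1 MiB
    print("Batchsize %2d -> %9d bytes (%.1f MiB)" % (batchsize, nbytes, nbytes / 2.0**20))
```

Whether any particular size threshold (cache, allocator, driver) is responsible is not something this calculation can show; it only gives the sizes to compare against.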
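Edit: Daniel Renshaw pointed out that this probably has nothing to do with host-to-GPU copying. Do you have any other ideas about what could be causing the problem? Some more information:
The Theano debug print of the function: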
DeepCopyOp [@A] 'inputs' 0
|inputs [@B]
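The debug print above contains a single DeepCopyOp applied to the input and no explicit transfer node. Assuming that this op behaves roughly like a host-side array copy (an assumption, not checked against Theano's source), one possible cross-check is to time a plain NumPy copy of the same batches without Theano involved. The function name `time_host_copy` and the repeat count are illustrative choices.

```python
import time

import numpy as np


def time_host_copy(batchsize, repeats=20):
    """Time a plain NumPy copy of one batch; returns (mean, std) in seconds."""
    X = np.zeros((batchsize, 3, 448, 448), dtype=np.float32)
    times = []
    for _ in range(repeats):
        start = time.time()
        X.copy()  # host-to-host copy, no Theano or GPU involved
        times.append(time.time() - start)
    return np.mean(times), np.std(times)


for batchsize in [2**i for i in range(6)]:
    mean, std = time_host_copy(batchsize)
    print("Batchsize %2d: NumPy copy takes %s +/- %s sec" % (batchsize, mean, std))
```

If the same jump between batch sizes 8 and 16 shows up here, that would point at the host machine rather than at Theano or the GPU; if it does not, the cause is more likely inside the Theano call.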
Output of the Theano profiler:
Function profiling
==================
Message: theano_test.py:14
Time in 6000 calls to Function.__call__: 3.711728e+03s
Time in Function.fn.__call__: 3.711528e+03s (99.995%)
Time in thunks: 3.711491e+03s (99.994%)
Total compile time: 6.542931e-01s
Number of Apply nodes: 1
Theano Optimizer time: 7.912159e-03s
Theano validate time: 0.000000e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 8.321500e-02s
Import time 2.951717e-02s
Time in all call to theano.grad() 0.000000e+00s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0% 100.0% 3711.491s 6.19e-01s C 6000 1 theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0% 100.0% 3711.491s 6.19e-01s C 6000 1 DeepCopyOp
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
100.0% 100.0% 3711.491s 6.19e-01s 6000 0 DeepCopyOp(inputs)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
INFO (theano.gof.compilelock): Waiting for existing lock by process '3642' (I am process '22124')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/bal8rng/.theano/compiledir_Linux-3.16--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.10-64/lock_dir
THEANO_FLAGS:
floatX=float32,device=gpu,optimizer_including=conv_meta,mode=FAST_RUN,blas.ldflags="-L/usr/lib/openblas-base -lopenblas",device=gpu3,assert_no_cpu_op=raise