CUDA if statement in MATLAB

Question

3

I have the following MATLAB code:

N = 1000;
randarray = gpuArray(rand(N,1));

tic
g=0;
for i=1:N

    if randarray(i)>10
        g=g+1;
    end

end
toc

secondrandarray = rand(N,1);
g=0;

tic
for i=1:N

    if secondrandarray(i)>10
        g=g+1;
    end

end
toc



Elapsed time is 0.221710 seconds.
Elapsed time is 0.000012 seconds.

1) Why is the if statement so slow on the GPU? It slows down all my attempts at optimization.

2) What can I do to get around this limitation?

Thanks.

- RNs_Ghost

4 Answers

1

Using MATLAB R2011b and Parallel Computing Toolbox on a now rather old GPU (Tesla C1060), here is what I see:

>> g = 100*parallel.gpu.GPUArray.rand(1, 1000);
>> tic, sum(g>10); toc
Elapsed time is 0.000474 seconds.

Operating on the scalar elements of a gpuArray one at a time is always going to be slow, so using the vectorized sum method is much faster.

- Edric

1

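I am not an expert on MATLAB's gpuArray implementation, but I suspect that each randarray(i) access in the first loop triggers a PCI-e transaction to retrieve the value from GPU memory, which incurs a very large latency penalty. You would be better off calling gather to transfer the whole array in a single transaction and then looping over the local copy in host memory.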
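A minimal sketch of that suggestion, not taken from the original answer. It assumes Parallel Computing Toolbox is available; the name hostarray and the threshold of 0.5 are illustrative choices (rand never returns values above 1, so a threshold of 10 would never match).

```matlab
N = 1000;
randarray = gpuArray(rand(N, 1));   % data lives in GPU memory

% Copy the whole array to host memory in one transfer
% instead of fetching one element per loop iteration.
hostarray = gather(randarray);

g = 0;
for i = 1:N
    if hostarray(i) > 0.5           % illustrative threshold
        g = g + 1;
    end
end
```

After the single gather call the loop only touches ordinary host memory, so it should behave like the CPU loop in the question.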

- talonmies

0

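I cannot comment on the previous solutions because I am too new here, but I can expand on Pavan's solution. The nnz function is not yet implemented for gpuArrays, at least in the MATLAB version I am using (R2012a).

In general, vectorized MATLAB code is far better. In some cases, however, looped code can still run fast in MATLAB thanks to JIT compilation.

Consider the results from: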

N = 1000;
randarray_cpu = rand(N,1);
randarray_gpu = gpuArray(randarray_cpu);
threshold     = 0.5;

% CPU: looped
g=0;
tic
for i=1:N
    if randarray_cpu(i)>threshold
        g=g+1;
    end
end
toc

% CPU: vectorized
tic
g = nnz(randarray_cpu>threshold);
toc

% GPU: looped
tic
g=0;
for i=1:N
    if randarray_gpu(i)>threshold
        g=g+1;
    end
end
toc

% GPU: vectorized
tic
g_d = sum(randarray_gpu > threshold);
g = gather(g_d); % I'm assuming that you want this in the CPU at some point
toc

On my Core i7 + GeForce 560Ti this gives:

Elapsed time is 0.000014 seconds.
Elapsed time is 0.000580 seconds.
Elapsed time is 0.310218 seconds.
Elapsed time is 0.000558 seconds.

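So what we can see from this case is:

Loops are not considered good practice in MATLAB, but in your particular case the loop runs fast because MATLAB somehow "precompiles" it internally (the JIT compilation mentioned above). I changed your threshold from 10 to 0.5, because rand never returns a value greater than 1.

The looped GPU version performs horribly because on every loop iteration a kernel is launched (or data is read back from the GPU, however TMW implemented it...), which is slow. Doing lots of tiny memory transfers while computing essentially nothing is the worst thing you can do on a GPU.

From the last (best) GPU result, the answer is: unless the data is already on the GPU, it makes no sense to compute this on the GPU. Since the arithmetic complexity of your operation is basically nonexistent, the memory transfer overhead does not pay off in any way. If this is part of a bigger GPU computation, that is fine. If not... better stick to the CPU ;)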
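To make the transfer-overhead point concrete, here is a hedged sketch that times the same count three ways. It is not part of the original answer; the variable names are illustrative, and the timings will depend on the hardware and MATLAB release. Each GPU timing ends with gather, so the result is back on the host before toc is reached.

```matlab
N = 1000;
threshold = 0.5;
x_cpu = rand(N, 1);

% Case 1: data starts on the host, so the upload counts toward the cost.
tic
x_gpu = gpuArray(x_cpu);
g1 = gather(sum(x_gpu > threshold));
toc

% Case 2: data is already on the GPU; only the reduction and a
% one-scalar download are timed.
tic
g2 = gather(sum(x_gpu > threshold));
toc

% Case 3: plain CPU version for comparison.
tic
g3 = nnz(x_cpu > threshold);
toc
```

All three variants compute the same count, so any difference in the reported times comes from the transfer and launch overhead rather than from the arithmetic.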

- Josep

- Pavan Yalamanchili · Accepted Answer

This is generally bad practice, whether you do it on the CPU or on the GPU.

The following is a good way to do what you are looking for.

N = 1000;
randarray = gpuArray(100 * rand(N,1));
tic
g = nnz(randarray > 10);
toc

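I do not have PCT, so I cannot verify that this works (the number of functions supported on the GPU is fairly limited).

However, if you have Jacket, you will definitely be able to do the following.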

N = 1000;
randarray = gdouble(100 * rand(N, 1));
tic
g = nnz(randarray > 10);
toc

Full disclosure: I am one of the engineers developing Jacket.