为什么我的Matlab for循环代码比向量化版本更快

Question

为什么我的Matlab for循环代码比向量化版本更快

performancematlabvectorvectorization

6

我一直听说在MATLAB中，向量化的代码比for循环运行速度更快。然而，当我尝试将我的MATLAB代码向量化时，它似乎运行得更慢了。

我使用了tic和toc来测量时间。我只改变了程序中一个函数的实现方式。我的向量化版本运行了47.228801秒，而我的for循环版本运行了16.962089秒。

此外，在我的主程序中，我使用了一个很大的N值，即N = 1000000，数据集大小为1 301，我对每个版本都进行了多次运行，使用相同大小和N的不同数据集。

为什么向量化的速度如此之慢，我该如何进一步提高速度？

“向量化”版本

function [RNGSet] = RNGAnal(N,DataSet)
%Creates a random number generated set of numbers to check accuracy overall
%   This function will produce random numbers and normalize a new Data set
%   that is derived from an old data set by multiply random numbers and
%   then dividing by N/2
randData = randint(N,length(DataSet));
tempData = repmat(DataSet,N,1);
RNGSet = randData .* tempData;
RNGSet = sum(RNGSet,1) / (N/2); % sum and normalize by the N
end

“for循环”版本

function [RNGData] = RNGAnsys(N,Data)
%RNGAnsys This function produces statistical RNG data using a for loop
%   This function will produce RNGData that will be used to plot on another
%   plot that possesses the actual data
multData = zeros(N,length(Data));
for i = 1:length(Data)
    photAbs = randint(N,1); % Create N number of random 0's or 1's
    multData(:,i) = Data(i) * photAbs; % multiply each element in the molar data by the random numbers
end

sumData = sum(multData,1); % sum each individual energy level's data point
RNGData = (sumData/(N/2))'; % divide by n, but account for 0.5 average by n/2
end

- MichaelGofron

2

你能否使用二进制单例扩展（bsxfun）来避免使用 repmat？ - Ben Voigt

2

repmat通常比bsxfun慢：尝试使用bsxfun(@times, randData, DataSet)。此外，在旧版本的Matlab中，循环比向量化更慢，但使用JIT编译器和预分配的循环可以相当有效。 - Dan

我使用了bsxfun(@times,randData,Dataset)函数，并在使用后计时程序。我节省了约4秒钟，现在运行时间为43.182087秒。但仍然比for循环版本慢得多。 - MichaelGofron

3个回答

3

尝试使用matlab分析器来确定哪行或哪几行代码使用的时间最长。这样，您就可以找出是否像所建议的那样，repmat函数正在拖慢您的速度。请告诉我们您发现了什么，我很感兴趣！

- Trogdor

2

这并没有提供问题的答案。如果你想对作者进行批判或请求澄清，可以在他们的帖子下方留言。你可以随时在自己的帖子下面发表评论，并且一旦你拥有足够的声望，你将能够在任何帖子上发表评论。 - ScottJShea

由于我无法在他们的帖子下留言，这是我能提供帮助的唯一方式。我意识到我的回答并没有涵盖整个问题，但我决定将其发布为答案，因为原始问题包括“如何进一步提高速度？” - Trogdor

0

randData = randint(N,length(DataSet));

该语句分配了一个1.2GB的数组。(4*301*1000000)。隐式地在程序中创建了最多4个这样的大数组，导致连续的缓存未命中。

你的for循环代码几乎可以在处理器缓存中运行(或者在更大的Xeon上确实如此)。

- Alexander Kemp

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Divakar · Accepted Answer

向量化

第一眼看到for循环代码，我们可以看出由于photAbs是一个二进制数组，每一列都根据Data的每个元素进行缩放，因此这个二进制特征可以用于向量化。代码中滥用了这一点——

function RNGData = RNGAnsys_vect1(N,Data)

%// Get the 2D Matrix of random ones and zeros
photAbsAll = randint(N,numel(Data));

%// Take care of multData internally by summing along the columns of the
%// binary 2D matrix and then multiply each element of it with each scalar 
%// taken from Data by performing elementwise multiplication
sumData = Data.*sum(photAbsAll,1);

%// Divide by n, but account for 0.5 average by n/2
RNGData = (sumData./(N/2))'; %//'

return;

经过分析，瓶颈出现在随机二进制数组的创建部分。因此，可以按照这个聪明的解决方案所建议的使用更快的随机二进制数组创建器来进一步优化上述函数 -

function RNGData = RNGAnsys_vect2(N,Data)

%// Create a random binary array and sum along the columns on the fly to
%// save on any variable space that would be required otherwise. 
%// Also perform the elementwise multiplication as discussed before.
sumData = Data.*sum(rand(N,numel(Data))<0.5,1);

%// Divide by n, but account for 0.5 average by n/2
RNGData = (sumData./(N/2))'; %//'

return;

使用智能二进制随机数组生成器，可以对原始代码进行优化，这将在稍后用于优化后的 for 循环和向量化代码之间的公平基准测试。优化后的 for 循环代码如下所示 -

function RNGData = RNGAnsys_opt1(N,Data)

multData = zeros(N,numel(Data));
for i = 1:numel(Data)

    %// Create N number of random 0's or 1's using a smart approach
    %// Then, multiply each element in the molar data by the random numbers
    multData(:,i) = Data(i) * rand(N,1)<.5; 
end

sumData = sum(multData,1); % sum each individual energy level's data point
RNGData = (sumData/(N/2))'; % divide by n, but account for 0.5 average by n/2
return;

基准测试

基准测试代码

N = 15000; %// Kept at this value as it going out of memory with higher N's.
           %// Size of dataset is more important anyway as that decides how
           %// well is vectorized code against a for-loop code

DS_arr = [50 100 200 500 800 1500 5000]; %// Dataset sizes
timeall = zeros(2,numel(DS_arr));

for k1 = 1:numel(DS_arr)
    DS = DS_arr(k1);
    Data = rand(1,DS);

    f = @() RNGAnsys_opt1(N,Data);%// Optimized for-loop code
    timeall(1,k1) = timeit(f);
    clear f

    f = @() RNGAnsys_vect2(N,Data);%// Vectorized Code
    timeall(2,k1) = timeit(f);
    clear f
end

%// Display benchmark results
figure,hold on, grid on
plot(DS_arr,timeall(1,:),'-ro')
plot(DS_arr,timeall(2,:),'-kx')
legend('Optimized for-loop code','Vectorized code')
xlabel('Dataset size ->'),ylabel('Time(sec) ->')
avg_speedup = mean(timeall(1,:)./timeall(2,:))
title(['Average Speedup with vectorized code = ' num2str(avg_speedup) 'x'])

结果

enter image description here

结论

根据我迄今为止使用 MATLAB 的经验，无论是 for 循环还是向量化技术都不适用于所有情况，一切都是具体问题具体分析。