这个基准测试结果的原因是什么？

Question

这个基准测试结果的原因是什么？

11

将RGB图像转换为灰度图像的两个函数：

function rgb2gray_loop{T<:FloatingPoint}(A::Array{T,3})
  r,c = size(A)
  gray = similar(A,r,c)
  for i = 1:r
    for j = 1:c
      @inbounds gray[i,j] = 0.299*A[i,j,1] + 0.587*A[i,j,2] + 0.114 *A[i,j,3]
    end
  end
  return gray
end

并且：

function rgb2gray_vec{T<:FloatingPoint}(A::Array{T,3})
  gray = similar(A,size(A)[1:2]...)
  gray = 0.299*A[:,:,1] + 0.587*A[:,:,2] + 0.114 *A[:,:,3]
  return gray
end

第一种方法使用循环，而第二种方法则使用向量化。

使用Benchmark包对它们进行基准测试时（f1是循环版本，f2是向量化版本），我得到了不同大小输入图像的以下结果： A = rand(50,50,3)：

| Row | Function | Average     | Relative | Replications |
|-----|----------|-------------|----------|--------------|
| 1   | "f1"     | 3.23746e-5  | 1.0      | 1000         |
| 2   | "f2"     | 0.000160214 | 4.94875  | 1000         |

A = rand(500,500,3):

| Row | Function | Average    | Relative | Replications |
|-----|----------|------------|----------|--------------|
| 1   | "f1"     | 0.00783007 | 1.0      | 100          |
| 2   | "f2"     | 0.0153099  | 1.95527  | 100          |

A = rand(5000,5000,3):

| Row | Function | Average  | Relative | Replications |
|-----|----------|----------|----------|--------------|
| 1   | "f1"     | 1.60534  | 2.56553  | 10           |
| 2   | "f2"     | 0.625734 | 1.0      | 10           |

我本以为一个函数会比另一个函数更快（可能是因为inbounds宏），但我无法解释为什么矢量化版本在处理更大的图像时会更快。这是为什么呢？

- reschu

2

我认为向量化版本中的语句 gray = similar(A,size(A)[1:2]...) 是不必要的，因为语言会直接从第二个语句创建正确大小的数组。但是这并不能解释向量化版本为什么更快。 - cfh

1

离题了，但如果你正在使用Images，可以说convert(Array{Gray{Float64}}, A)。 - tholy

3个回答

1

只是猜测，因为我不了解Julia-Lang：

我认为向量化形式中的语句gray = ...创建一个新的数组，其中存储了所有计算出的值，而旧数组被丢弃。在f1中，值被就地覆盖，因此不需要新的内存分配。内存分配非常昂贵，因此对于较小的数字，使用就地覆盖的循环版本更快。

但是，内存分配通常是静态开销（分配两倍的时间不会花费两倍的时间），而向量化版本的计算速度更快（可能是并行的？）因此，如果数字足够大，则更快的计算比内存分配更重要。

- Falco

2

在Julia中，矢量化操作通常比逐个元素的操作慢，因为后者产生的临时变量更少。这里的矢量化版本将创建三个临时数组，然后将它们相加，而逐个元素的版本不需要任何额外的临时变量，并且只使用单个循环。 - cfh

@cfh 我的想法是，矢量化对内存的影响更大。但另一方面，矢量化版本可以在4个核上并行计算。并且可能存在一个平衡点，在这个点上，使用4倍CPU带来的好处要比内存分配成本更多。你在四核上进行过测试吗？ - Falco

我认为目前Julia不能自动分配这些计算到多个核心上。 - cfh

0

我无法重现你的结果。

请查看这个 IJulia 笔记本：http://nbviewer.ipython.org/urls/gist.githubusercontent.com/anonymous/24c17478ae0f5562c449/raw/8d5d32c13209a6443c6d72b31e2459d70607d21b/rgb2gray.ipynb

我得到的数字是：

In [5]:

@time rgb2gray_loop(rand(50,50,3));
@time rgb2gray_vec(rand(50,50,3));

elapsed time: 7.591e-5 seconds (80344 bytes allocated)
elapsed time: 0.000108785 seconds (241192 bytes allocated)

In [6]:

@time rgb2gray_loop(rand(500,500,3));
@time rgb2gray_vec(rand(500,500,3));

elapsed time: 0.021647914 seconds (8000344 bytes allocated)
elapsed time: 0.012364489 seconds (24001192 bytes allocated)

In [7]:

@time rgb2gray_loop(rand(5000,5000,3));
@time rgb2gray_vec(rand(5000,5000,3));

elapsed time: 0.902367223 seconds (800000440 bytes allocated)
elapsed time: 1.237281103 seconds (2400001592 bytes allocated, 7.61% gc time)

正如预期，对于大输入值，循环版本更快。此外，请注意矢量化版本分配了三倍的内存。

我还想指出语句gray = similar(A,size(A)[1:2]...)是冗余的，可以省略。如果不进行这个不必要的分配，最大问题的结果为：

@time rgb2gray_loop(rand(5000,5000,3));
@time rgb2gray_vec(rand(5000,5000,3));

elapsed time: 0.953746863 seconds (800000488 bytes allocated, 3.06% gc time)
elapsed time: 1.203013639 seconds (2200001200 bytes allocated, 7.28% gc time)

因此内存使用量减少了，但速度并没有明显提高。

- cfh

我可以使用@time重现我的结果。我猜Falco是对的，结果与我的机器上的某种并行化有关.... - reschu

1

@reschu：听起来不太对。首先，Julia 不会自动并行化。此外，请注意，您给出的循环版本的时间随问题规模呈超线性增长：从第二个到第三个问题，速度慢了 200 倍，尽管问题规模只增加了 100 倍。这里一定有什么奇怪的事情发生了。 - cfh

1

是的，你说得对。我找到了它是什么。这是我在开始使用Julia时首先读到的东西：列优先顺序。在嵌套循环中交换行和列会导致预期的结果。 - reschu

1

@reschu：很有趣！你会添加一个答案来描述你做了什么吗？我也可能要撤回我的声明，即Julia不会自动并行化——它使用OpenBLAS，并且在某些情况下可以利用多个核心。因此，这可能是这两个因素的结合。 - cfh

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- reschu · Accepted Answer

结果的答案是，Julia中的多维数组以列为主序存储。请参见Julias内存顺序。

修复循环版本，涉及列主序（内部和外部循环变量已交换）：

function rgb2gray_loop{T<:FloatingPoint}(A::Array{T,3})
  r,c = size(A)
  gray = similar(A,r,c)
  for j = 1:c
    for i = 1:r
      @inbounds gray[i,j] = 0.299*A[i,j,1] + 0.587*A[i,j,2] + 0.114 *A[i,j,3]
    end
  end
  return gray
end

A = rand(5000,5000,3)的新结果：

| Row | Function | Average  | Relative | Replications |
|-----|----------|----------|----------|--------------|
| 1   | "f1"     | 0.107275 | 1.0      | 10           |
| 2   | "f2"     | 0.646872 | 6.03004  | 10           |

对于较小的数组，结果如下：

A = rand(500,500,3)：

| Row | Function | Average    | Relative | Replications |
|-----|----------|------------|----------|--------------|
| 1   | "f1"     | 0.00236405 | 1.0      | 100          |
| 2   | "f2"     | 0.0207249  | 8.76671  | 100          |

A = rand(50,50,3):

| Row | Function | Average     | Relative | Replications |
|-----|----------|-------------|----------|--------------|
| 1   | "f1"     | 4.29321e-5  | 1.0      | 1000         |
| 2   | "f2"     | 0.000224518 | 5.22961  | 1000         |