如何优化内存访问模式/减少缓存失效，以适用于此数组抽取/下采样程序？

Question

如何优化内存访问模式/减少缓存失效，以适用于此数组抽取/下采样程序？

9

我最近被要求提供一段能够“就地”降采样/下采样数组的代码。这个“降采样”函数接受一个整数数组，并将数组中偶数索引i处的条目存储在索引i/2处。对于数组中所有条目都是如此。

这会将原始数组中的所有偶索引条目移动到数组的前半部分。然后可以将数组的其余部分初始化为0。总体结果是一个数组，保留了原始数组中所有偶数索引条目（通过将它们移动到前半部分），并且数组的后半部分为0。显然，这用于信号处理中的降采样。

代码大致如下：

void decimate (vector<int>& a) {
   int sz = a.size();
   for (int i =0; i < sz; i++) {
     if (i%2 == 0) {
        a[i/2] = a[i];
     }
    }
    for (int i =(sz-1)/2; i < sz; i++) a[i] = 0;
}

在建议将某些变量保留在寄存器中的基本改进后，我无法找到进一步优化的方法，但不确定是否可以做到。

有没有办法优化循环中的内存访问模式以获得更好的缓存性能？ 或者通过矢量化等方式优化压缩/降采样数组的主要复制操作以使其更高效？（例如对于支持该功能的平台）

   for (int i =0; i < sz; i++) {
     if (i%2 == 0) {
        a[i/2] = a[i];
     }
    }

是否有任何循环转换（例如平铺/条带挖掘），可以为这种减少循环产生高效的代码？ 编辑：下面的答案中提出了一些不同的方法，似乎利用了memset/fill或指针算术来提高速度效率。本问题主要关注是否存在明确定义的循环变换，可以显著改善局部性或缓存失效（例如，如果它是一个循环嵌套，有两个循环，可以潜在地研究循环平铺以优化缓存失效）。

- Joe Black

5个回答

3

这里有一个版本，使用指针算术和放置 new，利用了 std::vector 内部使用连续内存布局的事实：

void down_sample(std::vector<int> & v){ 
    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;
    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();
}

在我的机器上，这段代码在禁用优化时运行速度比你的快3倍，在使用gcc7.2编译时加入-o3参数后比你的版本快30%。我使用了20000000个元素大小的向量进行测试。

而且，我认为在你的版本中，这一行：

for (int i =(sz-1)/2; i < sz; i++) a[i] = 0;

应该是

for (int i =(sz-1)/2 + 1; i < sz; i++) a[i] = 0;

否则将会设置太多的元素为零。

考虑到John Zwinck的问题，我进行了一些使用memset和std::fill而不是placement new的快速测试。

以下是结果：

n = 20000000
compiled with -o0
orginal 0.111396 seconds
mine    0.0327938 seconds
memset  0.0303007 seconds
fill    0.0507268 seconds

compiled with -o3
orginal 0.0181994 seconds
mine    0.014135 seconds
memset  0.0141561 seconds
fill    0.0138893 seconds

n = 2000
compiled with -o0
orginal 3.0119e-05 seconds
mine    9.171e-06 seconds
memset  9.612e-06 seconds
fill    1.3868e-05 seconds

compiled with -o3
orginal 5.404e-06 seconds
mine    2.105e-06 seconds
memset  2.04e-06 seconds
fill    1.955e-06 seconds

n= 500000000 (with -o3)
mine=     0,350732
memeset = 0.349054  
fill =    0.352398

在大型向量上，memset似乎稍微快一点，而std::fill在小型向量上稍微快一些。但是差异非常小。

- Hatatister

这个版本比@JohnZwink的更快。我认为你应该使用stop = begin + v.size()而不是取出向量范围之外的元素的地址。 - user2807083

int * a = new (half_position) int[size]() 相对于 memset(half_position, 0, size * sizeof(int) 或者 std::fill(half_position, half_position + size, 0) 有什么优劣之处吗？ - John Zwinck

我一开始没有考虑到这两种方法。 - Hatatister

当启用优化编译时，它们的执行时间似乎几乎相同。（抱歉格式不好看。注释中换行符被忽略了。也许我应该为此创建一个新的答案？） - Hatatister

你可以[编辑]这个答案并添加新信息。你不需要发布一个新的。但我看到你已经发现了这一点。 :-) - Cody Gray

显示剩余3条评论

1

我实现的一次性 decimate() 版本：

void decimate (std::vector<int>& a) {
    const std::size_t sz = a.size();
    const std::size_t half = sz / 2;

    bool size_even = ((sz % 2) == 0);

    std::size_t index = 2;
    for (; index < half; index += 2) {
        a[index/2] = a[index];
    }

    for (; index < sz; ++index) {
        a[(index+1)/2] = a[index];
        a[index] = 0;
    }

    if (size_even && (half < sz)) {
        a[half] = 0;
    }
}

并为此进行测试：

#include <vector>
#include <iostream>
#include <cstddef>

void decimate(std::vector<int> &v);

void print(std::vector<int> &a) {
    std::cout << "{";
    bool f = false;

    for(auto i:a) {
        if (f) std::cout << ", ";
        std::cout << i;
        f = true;
    }
    std::cout << "}" << std::endl;
}

void test(std::vector<int> v1, std::vector<int> v2) {
    auto v = v1;
    decimate(v1);

    bool ok = true;

    for(std::size_t i = 0; i < v1.size(); ++i) {
        ok = (ok && (v1[i] == v2[i]));
    }

    if (ok) {
        print(v);
        print(v1);
    } else {
        print(v);
        print(v1);
        print(v2);
    }
    std::cout << "--------- " << (ok?"ok":"fail") << "\n" << std::endl;
}

int main(int, char**)
{
    test({},
        {});

    test({1},
        {1});

    test({1, 2},
        {1, 0});

    test({1, 2, 3},
        {1, 3, 0});

    test({1, 2, 3, 4},
        {1, 3, 0, 0});

    test({1, 2, 3, 4, 5},
        {1, 3, 5, 0, 0});

    test({1, 2, 3, 4, 5, 6},
        {1, 3, 5, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7},
        {1, 3, 5, 7, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8},
        {1, 3, 5, 7, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9},
        {1, 3, 5, 7, 9, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
        {1, 3, 5, 7, 9, 0, 0, 0, 0, 0});

    test({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},
        {1, 3, 5, 7, 9, 11, 0, 0, 0, 0, 0});

    return 0;
}

- user2807083

在我的系统上，这与原始速度大致相同。那么，在您的系统上呢？ - John Zwinck

在我的系统上，我的版本需要大约原始版本时间的72％，而你的版本需要大约原始版本时间的59％。 - user2807083

@Hatatister的版本只需要46％的时间来处理500,000,000个元素向量。 - user2807083

修改了 q 以反映重点。 - Joe Black

0

如果之后将其设置为零，请勿将其增加到sz。

如果sz是偶数，则转到sz/2，否则转到(sz-1)/2。

for (int i =0; i < sz_half; i++) 
        a[i] = a[2*i];

- schorsch312

是的，这应该没问题，但我不指望它会有显著的加速效果，因为我们在 i 为奇数时跳过了复制，也就是说，代价高昂的操作仍然存在（除了循环内部的 if 条件语句，编译器可能会优化掉）。 - Joe Black

0

我比较了这里提供的所有答案。我使用了 Intel 编译器 icc 版本 15.0.3。优化级别 O3 被使用。

Orig: Time difference [micro s] = 79506
JohnZwinck: Time difference [micro s] = 69127   
Hatatister: Time difference [micro s] = 79838
user2807083: Time difference [micro s] = 80000
Schorsch312: Time difference [micro s] = 84491

所有时间都是指长度为100000000的向量。

#include <vector>
#include <cstddef>
#include <iostream>
#include <chrono>

const int MAX = 100000000;

void setup(std::vector<int> & v){
    for (int i = 0 ; i< MAX; i++) {
        v.push_back(i);
    }
}


void checkResult(std::vector<int> & v) {
    int half_length;
    if (MAX%2==0)
        half_length = MAX/2;
    else
        half_length = MAX-1/2;

    for (int i = 0 ; i< half_length; i++) {
        if (v[i] != i*2)
            std::cout << "Error: v[i]="  << v[i] << " but should be "  <<     2*i <<  "\n";
    }

    for (int i = half_length+1; i< MAX; i++) {
        if (v[i] != 0)
            std::cout << "Error: v[i]="  << v[i] << " but should be 0 \n";
    }
}

void down_sample(){
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;
    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();

    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "Orig: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
    checkResult(v);
}

void down_sample_JohnZwinck () {
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    size_t slow = 1, fast = 2;

    // read the first half, write the first quarter
    size_t stop = (v.size()+1)/2;
    while (fast < stop) {
        v[slow++] = v[fast];
        fast += 2;
    }

    // read and clear the second half, write the second quarter
    stop = v.size();
    while (fast < stop) {
        v[slow++] = v[fast];
        v[fast++] = 0;
        v[fast++] = 0;
    }

    // clean up (only really needed when length is even)
    v[slow] = 0;

    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "JohnZwinck: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
    checkResult(v);

}

void down_sample_Schorsch312(){ 
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();
    int half_length;

    if (v.size()%2==0)
        half_length = MAX/2;
    else
        half_length = MAX-1/2;

    for (int i=0; i < half_length; i++) 
        v[i] = v[2*i];
    for (int i=half_length+1; i< MAX; i++) 
        v[i]=0;

    auto duration = std::chrono::steady_clock::now() - start_time;
std::cout << "Schorsch312: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;
}

void down_sample_Hatatister(){ 
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    int * begin = &v[0];
    int * stop =  begin + v.size();
    int * position = begin + 2;
    int * half_position = begin +1;

    while( position < stop){
        *half_position = *position;
        ++half_position;
        position += 2;
    }
    size_t size = v.size()/2;
    int * a = new (half_position) int[size]();
    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "Hatatister: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;

    checkResult(v);
}

void down_sample_user2807083 () {
    std::vector<int> v;
    setup(v);

    auto start_time = std::chrono::steady_clock::now();

    const std::size_t sz = v.size();
    const std::size_t half = sz / 2;

    bool size_even = ((sz % 2) == 0);

    std::size_t index = 2;

    for (; index < half; index += 2) {
        v[index/2] = v[index];
    }

    for (; index < sz; ++index) {
        v[(index+1)/2] = v[index];
        v[index] = 0;
    }

    if (size_even && (half < sz)) {
        v[half] = 0;
    }
    auto duration = std::chrono::steady_clock::now() - start_time;
    std::cout << "user2807083: Time difference [micro s] = " << std::chrono::duration_cast<std::chrono::microseconds>(duration).count() <<std::endl;

    checkResult(v);

}

int main () {
    down_sample();
    down_sample_JohnZwinck ();
    down_sample_Schorsch312();
    down_sample_Hatatister();
    down_sample_user2807083();
}

- schorsch312

有趣。这表明性能优化可能很困难。而且性能提升可能与您的代码无关，可能是由其他因素引起的。即使像定义环境变量这样的事情也会产生影响。关于这个话题有一个有趣的 CppCon 演讲，但我找不到链接了。 - Hatatister

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- John Zwinck · Accepted Answer

你有一个这样的数组：

0 1 2 3 4 5 6 7 8 9

您想要达到这样的效果：

0 2 4 6 8 0 0 0 0 0

我会这样做：

void decimate (vector<int>& a) {
  size_t slow = 1, fast = 2;

  // read the first half, write the first quarter
  size_t stop = (a.size()+1)/2;
  while (fast < stop) {
    a[slow++] = a[fast];
    fast += 2;
  }

  // read and clear the second half, write the second quarter
  stop = a.size();
  while (fast < stop) {
    a[slow++] = a[fast];
    a[fast++] = 0;
    a[fast++] = 0;
  }

  // clean up (only really needed when length is even)
  a[slow] = 0;
}

在我的系统上，这个版本大约比你原来的版本快20%。

现在轮到你测试并告诉我们它在你的系统上是否更快了！