布尔判断为什么如此缓慢?

5

我正在优化这个函数,我尝试了各种方法,甚至使用了SSE指令,并修改代码以从不同位置返回以查看计算时间,但最终我发现大部分时间花费在布尔判断上。即使我用一个简单的加法操作替换if语句中的所有代码,它仍然需要6000毫秒。

我的平台是gcc 4.7.1 e5506 CPU。其输入'a'和'b'是一个1000大小的整数数组,而'asize'和'bsize'是相应的数组大小。MATCH_MASK = 16383,我运行该函数100000次以统计时间间隔。有没有好的解决方案?谢谢!

   if (aoffsets[i] && boffsets[i])  // this line costs most time

代码:

uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
uint16_t* boffsets = aoffsets + MATCH_MASK;
uint8_t* seen = (uint8_t *)aoffsets;
auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
{
    for (int i = 0; i < n_size; ++i)
        offsets[MATCH_STRIP(x[i])] = i;
};
fn_init_offsets(a, asize, aoffsets);
fn_init_offsets(b, bsize, boffsets);

uint8_t topcount = 0;
int topoffset = 0;
{
    std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   // it's the fastest way already, very near to tls
    uint8_t* counts = &(count_vec[0]);
            //return aoffsets[0];   // cost 1375 ms
    for (int i = 0; i < MATCH_MASK; ++i)
    {
        if (aoffsets[i] && boffsets[i])  // this line costs most time
        {
                            //++affsets[i];  // for test
            int offset = (aoffsets[i] -= boffsets[i]);
            if ((-n_maxoffset <= offset && offset <= n_maxoffset))
            {
                offset += bsize;
                uint8_t n_cur_count = ++counts[offset];
                if (n_cur_count > topcount)
                {
                    topcount = n_cur_count;
                    topoffset = offset;
                }
            }
        }
    }
}
    return aoffsets[0];   // cost 6000ms

这个函数是线程化的吗?你可能有一个线程在等待吗? - Fantastic Mr Fox
1
听起来像是分支预测失败。 - user1773602
1
你是如何对其进行基准测试的? - Cory Nelson
1
还可以参考这个经典问题:https://dev59.com/iWgu5IYBdhLWcg3wkH6P。你可能想尝试将`aoffsets[DOUBLE_MATCH_MASK]`拆分成两个数组,每个数组的长度为`MATCH_MASK`,并在处理之前对它们进行排序。这可能会显著提高分支预测的准确性。 - user1773602
1
为什么这个问题被关闭了?它是一个关于发现底层硬件对程序性能产生有趣影响的例子吗? - Philipp
显示剩余13条评论
2个回答

4

首先,使用 memaligned 缓冲区的 memset(count_vec,0, N); 比 std::vector 快 30%。

您可以尝试使用无分支表达式 (aoffsets[i] * boffsets[i]) 并同时计算一些不需要使用的表达式: offset = aoffset[i]-boffset[i]; offset+bsize; offset+n_maxoffset;

根据 offset 的典型范围,人们可能会想到计算 (offset+bsize) 的最小/最大值以限制下一次迭代所需的 memset(count_vec):没有必要清除已经为零的值。

正如 philipp 所指出的那样,最好交错操作 -- 然后,可以从 uint32_t aboffset[N]; 同时读取 aoffset[i] 和 boffset[i],通过一些巧妙的位掩码(生成用于:aoffset[i],aoffset[i+1] 的更改掩码),可以在纯 c 中使用 64 位模拟 SIMD 处理 2 组数据(直到直方图累加部分)。


表达式(aoffsets [i] * boffsets [i])比(aoffsets [i] && boffsets [i])更快,gcc大约快了百分之十,8775毫秒对9800毫秒,但在vs2012上慢了百分之十。 - magicyang

3

您可以通过减少缓存未命中来提高程序的速度:aoffsets[i]boffsets[i]在内存中相对较远。将它们放在一起,可以显著加快程序运行速度。在我的电脑上(e5400 CPU,VS2012),执行时间从3.0秒减少到2.3秒:

#include <vector>

#include <windows.h> 
#include <iostream> 


typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;

#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14; 
static const int MATCH_LEFT = (32 - MATCH_BITS); 
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)

static const int n_maxoffset = 1000;

uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
    uint16_t offsets[DOUBLE_MATCH_MASK] = {0}; 

    auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
    {
        for (int i = 0; i < n_size; ++i)
            offsets[MATCH_STRIP(x[i])*2 /*important. leave space for other offsets*/] = i;
    };
    fn_init_offsets(a, asize, offsets);
    fn_init_offsets(b, bsize, offsets+1);

    uint8_t topcount = 0;
    int topoffset = 0;
    {
        std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   
        uint8_t* counts = &(count_vec[0]);
        for (int i = 0; i < MATCH_MASK; i+=2)
        {
            if (offsets[i] && offsets[i+1])  
            {
                int offset = (offsets[i] - offsets[i+1]); //NOTE: I removed 
                if ((-n_maxoffset <= offset && offset <= n_maxoffset))
                {
                    offset += bsize;
                    uint8_t n_cur_count = ++counts[offset];
                    if (n_cur_count > topcount)
                    {
                        topcount = n_cur_count;
                        topoffset = offset;
                    }
                }
            }
        }
    }
    return offsets[0];   
}


int main(int argc, char* argv[])
{
    const int sizes = 1000;
    int32_t* a = new int32_t[sizes];
    int32_t* b = new int32_t[sizes];
    for (int i=0;i<sizes;i++)
    {
        a[i] = rand()*rand();
        b[i] = rand()*rand();
    }

    //Variablen 
    LONGLONG g_Frequency, g_CurentCount, g_LastCount; 

    QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount); 

    int sum = 0;

    for (int i=0;i<100000;i++)
    {
        sum += test(a,sizes,b,sizes);
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount); 
    double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency)); 

    std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl; 


    delete[] a;
    delete[] b;
    return 0;
}

相对于您的test()版本,进行比较。

#include <vector>

#include <windows.h> 
#include <iostream> 


typedef unsigned short uint16_t;
typedef int int32_t;
typedef unsigned int uint32_t;
typedef unsigned char uint8_t;

#define MATCH_MASK 16383
#define DOUBLE_MATCH_MASK (MATCH_MASK*2)
static const int MATCH_BITS = 14; 
static const int MATCH_LEFT = (32 - MATCH_BITS); 
#define MATCH_STRIP(x) ((uint32_t)(x) >> MATCH_LEFT)
static const int n_maxoffset = 1000;

uint16_t test(int32_t* a, int asize, int32_t* b, int bsize)
{
    uint16_t aoffsets[DOUBLE_MATCH_MASK] = {0}; // important! or it will only be right on the first time
    uint16_t* boffsets = aoffsets + MATCH_MASK;

    auto fn_init_offsets = [](const int32_t* x, int n_size, uint16_t offsets[])->void
    {
        for (int i = 0; i < n_size; ++i)
            offsets[MATCH_STRIP(x[i])] = i;
    };
    fn_init_offsets(a, asize, aoffsets);
    fn_init_offsets(b, bsize, boffsets);

    uint8_t topcount = 0;
    int topoffset = 0;
    {
        std::vector<uint8_t> count_vec(asize + bsize + 1, 0);   
        uint8_t* counts = &(count_vec[0]);

        for (int i = 0; i < MATCH_MASK; ++i)
        {
            if (aoffsets[i] && boffsets[i]) 
            {

                int offset = (aoffsets[i] - boffsets[i]); //NOTE: I removed the -= because otherwise offset would always be positive!
                if ((-n_maxoffset <= offset && offset <= n_maxoffset))
                {
                    offset += bsize;
                    uint8_t n_cur_count = ++counts[offset];
                    if (n_cur_count > topcount)
                    {
                        topcount = n_cur_count;
                        topoffset = offset;
                    }
                }
            }
        }
    }
    return aoffsets[0];   
}


int main(int argc, char* argv[])
{
    const int sizes = 1000;
    int32_t* a = new int32_t[sizes];
    int32_t* b = new int32_t[sizes];
    for (int i=0;i<sizes;i++)
    {
        a[i] = rand()*rand();
        b[i] = rand()*rand();
    }

    LONGLONG g_Frequency, g_CurentCount, g_LastCount; 
    QueryPerformanceFrequency((LARGE_INTEGER*)&g_Frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&g_CurentCount); 

    int sum = 0;

    for (int i=0;i<100000;i++)
    {
        sum += test(a,sizes,b,sizes);
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&g_LastCount); 

    double dTimeDiff = (((double)(g_LastCount-g_CurentCount))/((double)g_Frequency)); 

    std::cout << "Result: " << sum << std::endl <<"time: " << dTimeDiff << std::endl; 


    delete[] a;
    delete[] b;
    return 0;
}

非常感谢!我使用了您的方法,将时间从大约8800毫秒减少到8199毫秒,快了约10%!因此,结合您和(aoffsets [i] * boffsets [i])的方法,该函数的速度已经提高了约20%。 - magicyang

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接