针对基于ARM架构的设备进行C代码优化

Question

针对基于ARM架构的设备进行C代码优化

3

最近，我遇到了一道面试题，需要编写一段针对ARM优化的代码，特别是针对iPhone：

编写一个函数，该函数接受一个char（ASCII符号）数组，并查找出现频率最高的字符。

char mostFrequentCharacter(char* str, int size)

该函数应该经过优化，以在双核基于ARM的处理器上运行，并具有无限的内存。

乍一看，问题本身似乎很简单，下面是我头脑中想出的简单实现函数：

#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char current_char;
    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(size_t i = 0; i<size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }

    return frequent_char;
}

首先，我进行了基本的代码优化；将每次迭代计算最常见字符的代码移动到另一个for循环中，并获得了显著的速度提升，而不是对以下代码块进行size次评估。

if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}

我们可以在O(RESULT_SIZE)的时间复杂度内找到最常见的字符，其中RESULT_SIZE == 127。

char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(int i = 0; i<size; i++)
    {
        set_char(str[i], result);
    }

    for(int i = 0; i<RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    return frequent_char;
}

基准测试：iPhone 5s

size = 1000000
iterations = 500

// seconds = 7.842381
char mostFrequentChar(char str[], int size)

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

平均而言，mostFrequentCharOpt1 的工作速度比基本实现快约24%。 类型优化 ARM 核心寄存器的长度为32位。因此，将所有具有 char 类型的局部变量更改为 int 类型，可以防止处理器在每次赋值后执行额外的指令来考虑局部变量的大小。

注意：ARM64 提供了31个寄存器（x0-x30），每个寄存器宽度为64位，并且还有一个32位形式（w0-w30）。因此，没有必要对 int 数据类型进行特殊操作。 infocenter.arm.com - ARMv8 Registers 在比较汇编语言版本的函数时，我注意到 ARM 处理 int 类型和 char 类型的方式之间存在差异。ARM 使用 LDRB 指令将字节加载到内存中的单个字节，并使用 STRB 指令将字节存储到内存中的单个字节。因此，在我看来，LDRB 比 LDR 稍慢一些，因为每次访问内存并加载到寄存器时，LDRB 都会进行零扩展。换句话说，我们不能只将一个字节加载到32位寄存器中，我们应该将字节转换为单词。

基准测试：iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)

将 char 类型转换为 int 并没有显著提高 iPhone 5s 的速度，相比之下，在 iPhone 4 上运行相同的代码则有不同的结果：

基准测试：iPhone 4

size = 1000000
iterations = 500

// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)

// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)

循环优化

接下来，我进行了一次循环优化，不再使用i的自增，而是使用自减。

before    
for(int i = 0; i<size; i++) { ... }

after
for(int i = size; i--) { ... }

再次比较汇编代码，让我清楚地区分了这两种方法。

mostFrequentCharOpt2                                              |      mostFrequentCharOpt3
0x10001250c <+88>:  ldr    w8, [sp, #28] ; w8 = i                 |      0x100012694 <+92>:  ldr    w8, [sp, #28]                                             ; w8 = i
0x100012510 <+92>:  ldr    w9, [sp, #44] ; w9 = size              |      0x100012698 <+96>:  sub    w9, w8, #1 ; w9 = i - 1                                           
0x100012514 <+96>:  cmp    w8, w9 ; if i<size                     |      0x10001269c <+100>: str    w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge   0x100012548 ; if true => end loop      |      0x1000126a0 <+104>: cbz    w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine                    |      0x1000126a4 <+108>: ... set_char start routine
...                                                               |      ...
0x100012534 <+128>: ... set_char end routine                      |      0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr    w8, [sp, #28] ; w8 = i                 |      0x1000126c0 <+136>: b      0x100012694 ; back to the first line
0x10001253c <+136>: add    w8, w8, #1 ; i++                       |      0x1000126c4 <+140>: ...
0x100012540 <+140>: str    w8, [sp, #28] ; save i to $sp+28       |      
0x100012544 <+144>: b      0x10001250c ; back to the first line   |      
0x100012548 <+148>: str    ...                                    |

在这里，我们不再从内存中访问size并将其与增量变量i进行比较，而是将i减去0x1并将寄存器（其中存储了i）与0进行比较。

基准测试：iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization

// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop otimization

线程优化

准确理解问题至少可以给我们提供一种更多的优化方式。尤其是这一行 ..优化为在双核基于 ARM 的处理器上运行...，提示我们使用 pthread 或 gcd 优化代码。

int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));

    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;

    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;

        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
        tinfo[tnum].str = str;

        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;

    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }

    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;

        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    free(tinfo);

    return frequent_char;
}

基准测试：iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

注意：mostFrequentCharThreadOpt在iPhone 4上比mostFrequentCharOpt2运行得更慢。

基准测试：iPhone 4

size = 1000000
iterations = 500

// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

问题

mostFrequentCharOpt3和mostFrequentCharThreadOpt的优化程度如何？换句话说，是否有其他方法可以优化这两种方法？

源代码

- tikhop

2

这些手机有多少个核心，每个核心有多少缓存？一旦主内存带宽成为瓶颈，那么使用额外的线程将无法提高性能，反而会使算法变慢。 - rcgldr

你尝试过在线程中使用GCD吗？（它在技术上是C语言）它可能比手动使用pthread更好。 - Fonix

@auselen 嗯，如果我想看到优化的结果，我应该使用哪个优化级别？ - tikhop

-O2或-O3是最容易使用的选项。 - koldewb

关闭优化选项会让编译器在您的代码中填充一些混乱，这并非毫无根据，但更多地用于调试。尝试切换这个选项并查看生成的汇编代码 http://goo.gl/N0T048 - auselen

显示剩余8条评论

2个回答

1

有很多你可以做的事情，但结果真正取决于代码运行在哪种具体的ARM硬件上。例如，旧款iPhone硬件与新款64位设备完全不同。硬件架构和指令集完全不同。旧的32位ARM硬件包含一些真正的“诀窍”，可以使事情变得更快，例如，不是加载字节而是加载32位字，然后使用位移在寄存器中对每个字节进行操作。如果您正在使用2个线程，则另一种方法是将内存访问分解为1个线程处理1个内存页，然后第二个线程在第2个内存页上操作，依此类推。这样，不同处理器中的不同寄存器可以在不读取或写入相同内存页的情况下进行最大的计算（通常是内存访问慢的部分）。我还建议您从一个好的计时框架开始，我为ARM+iOS构建了一个计时框架，您可能会发现它对此目的有用。

- MoDJ

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- koldewb · Accepted Answer

好的，以下是你可以尝试的事情，我无法百分之百确定什么对你的情况有效，但从经验来看，如果你关闭所有可能的优化，并且考虑到即使循环优化也对你有用的事实：你的编译器相当迟钝。

这在一定程度上取决于你的THREAD_COUNT，你说默认值为2，但如果你确定是2，那么你可能能节省一些时间。你知道你所使用的平台，如果速度是你的优先考虑因素，请不要随意进行任何动态操作。

如果THREAD == 2，则num_threads是一个不必要的变量，可以删除。

int minCharCountPerThread = size/num_threads;

对于许多关于位移的讨论，旧的方法是尝试它：

int minCharCountPerThread = size >> 1; //divide by 2

下一步可以尝试的是展开循环：多个循环仅使用2次，如果大小不是问题，为什么不去掉循环方面呢？这确实是你应该尝试的事情，看看会发生什么，如果对你有用的话。我见过循环展开效果很好的情况，也见过循环展开减慢我的代码的情况。

最后一点：尝试使用无符号数字而不是带符号/整数（除非你真的需要带符号）。众所周知，某些技巧/指令仅适用于无符号变量。