Recently I came across an interview question that asks for code optimized for ARM, specifically for the iPhone:

Write a function that accepts an array of chars (ASCII symbols) and finds the most frequently occurring character.

char mostFrequentCharacter(char* str, int size)

The function should be optimized to run on a dual-core ARM-based processor, with an unlimited amount of memory available.

At first glance the problem looks simple; here is the naive implementation that first came to mind:
#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}
char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char current_char;
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(size_t i = 0; i < size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }
    return frequent_char;
}
First I applied a basic optimization: instead of evaluating the following block size times, I moved the selection of the most frequent character out into a second for loop, which gave a noticeable speedup.

    if(current_char_frequency >= char_frequency)
    {
        char_frequency = current_char_frequency;
        frequent_char = current_char;
    }
The winner is then selected in O(RESULT_SIZE) time, where RESULT_SIZE == 127.

char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    char frequent_char = '\0';
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = 0; i < size; i++)
    {
        set_char(str[i], result);
    }
    for(int i = 0; i < RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}
Benchmark: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
On average, mostFrequentCharOpt1 runs about 24% faster than the basic implementation.
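The post does not show the measurement harness itself; below is a minimal sketch of one that matches the reported setup (random ASCII input and CPU-time measurement via clock() are my assumptions), reusing RESULT_SIZE and the functions above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 1000000
#define ITERATIONS 500

int main(void)
{
    char *str = malloc(SIZE);
    for (int i = 0; i < SIZE; i++)
        str[i] = rand() % RESULT_SIZE; /* random ASCII input (assumption) */

    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++)
        mostFrequentChar(str, SIZE);
    printf("seconds = %f\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    free(str);
    return 0;
}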
Type optimization
ARM core registers are 32 bits wide, so changing every local variable of type char to int spares the processor the extra instructions it would otherwise execute after each assignment to account for the smaller size of the variable.

Note: ARM64 provides 31 general-purpose registers (x0-x30), each 64 bits wide and also accessible in a 32-bit form (w0-w30), so no special handling of the int data type is needed. infocenter.arm.com - ARMv8 Registers

Comparing the assembly versions of the functions revealed a difference in how ARM handles the int and char types. ARM uses the LDRB instruction to load a byte from memory into a register, and the STRB instruction to store the low byte of a register back to memory. It seems to me that LDRB is slightly slower than LDR, because LDRB zero-extends the byte on every load from memory: we cannot load just one byte into a 32-bit register, the byte has to be widened to a word.
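The post benchmarks mostFrequentCharOpt2 without listing it; here is a sketch of what it presumably looks like, assuming the only change from mostFrequentCharOpt1 is widening the char locals and the return type to int:

int mostFrequentCharOpt2(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    int frequent_char = 0;          /* int instead of char: no byte loads/stores */
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = 0; i < size; i++)
    {
        set_char(str[i], result);
    }
    for(int i = 0; i < RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}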
Benchmark: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Converting char to int did not noticeably speed things up on the iPhone 5s; running the same code on an iPhone 4 gives a different picture:
Benchmark: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization

Next I optimized the loop itself: instead of incrementing i, I count it down.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--; ) { ... }
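mostFrequentCharOpt3 is not listed in full either; a sketch, assuming it is mostFrequentCharOpt2 with both loops rewritten in the count-down form (counting down also reverses which character wins a tie, matching the threaded version below):

int mostFrequentCharOpt3(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    int frequent_char = 0;
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = size; i--; )        /* compare against 0 instead of size */
    {
        set_char(str[i], result);
    }
    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}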
Comparing the assembly again makes the difference between the two approaches clear.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; compare i with size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memory
0x100012518 <+100>: b.ge 0x100012548 ; if i >= size => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, instead of loading size from memory and comparing it with the incrementing variable i, we subtract 0x1 from i and compare the register holding i against zero.

Benchmark: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
int mostFrequentCharOpt3(char str[], int size) //Loop optimization
Thread optimization

Reading the problem statement carefully offers at least one more optimization. In particular, the phrase "...optimized to run on a dual-core ARM-based processor..." hints at parallelizing the code with pthreads or GCD.
int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; // by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));
    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    // split the input into num_threads chunks, one per thread
    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;
    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = startIndex;
        // the last chunk also takes the size % num_threads remainder
        tinfo[tnum].str_size = (tnum == num_threads - 1) ? (size - startIndex) : minCharCountPerThread;
        tinfo[tnum].str = str;
        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;

    // wait for all counting threads to finish
    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }

    // merge the per-thread histograms and pick the winner
    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;
        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    free(tinfo);
    return frequent_char;
}
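The listing above relies on struct thread_info and a worker function _mostFrequentChar that the post does not show; here is a sketch consistent with the fields it references (the exact names and layout are my assumptions):

#include <pthread.h>

struct thread_info
{
    pthread_t thread_id;                 /* filled in by pthread_create */
    int       thread_num;
    int       startIndex;                /* first index of this thread's chunk */
    int       str_size;                  /* number of chars in this chunk */
    char     *str;                       /* shared input buffer */
    int       resultArray[RESULT_SIZE];  /* per-thread histogram, zeroed by calloc */
};

static void *_mostFrequentChar(struct thread_info *tinfo)
{
    char *chunk = tinfo->str + tinfo->startIndex;
    /* count-down loop, as in mostFrequentCharOpt3 */
    for (int i = tinfo->str_size; i--; )
    {
        tinfo->resultArray[(int)chunk[i]]++;
    }
    return NULL;
}

The cast in the pthread_create call adapts this signature to void *(*)(void *).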
Benchmark: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.577797
int mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
int mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Note: on the iPhone 4, mostFrequentCharThreadOpt runs slower than mostFrequentCharOpt3.
Benchmark: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
int mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
int mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
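Since GCD was mentioned above as an alternative to pthreads, here is a sketch of the same split-count-merge scheme built on dispatch_apply (my own variant, not code from the post; mostFrequentCharGCD and CHUNK_COUNT are hypothetical names):

#include <dispatch/dispatch.h>
#include <stdlib.h>

#define CHUNK_COUNT 2

int mostFrequentCharGCD(char str[], int size)
{
    /* one heap-allocated histogram per chunk; the block captures the
       pointer by value, which it could not do with a stack array */
    int (*histograms)[RESULT_SIZE] = calloc(CHUNK_COUNT, sizeof(*histograms));
    int chunk_size = size / CHUNK_COUNT;

    dispatch_apply(CHUNK_COUNT,
                   dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                   ^(size_t chunk) {
        int start = (int)chunk * chunk_size;
        /* the last chunk also takes the size % CHUNK_COUNT remainder */
        int end = (chunk == CHUNK_COUNT - 1) ? size : start + chunk_size;
        for (int i = end; i-- > start; )
            histograms[chunk][(int)str[i]]++;
    });

    /* merge the per-chunk histograms, as in mostFrequentCharThreadOpt */
    int frequent_char = 0;
    int char_frequency = 0;
    for (int i = RESULT_SIZE; i--; )
    {
        int current_char_frequency = 0;
        for (int c = CHUNK_COUNT; c--; )
            current_char_frequency += histograms[c][i];
        if (current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    free(histograms);
    return frequent_char;
}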
Question

How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words, is there any other way to optimize both methods?