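This code processes a block of data of size 4k... it adds up every 3 consecutive bytes and stores the results in a temporary buffer of size 4k. That temporary buffer is then used to build a histogram.
The summing of 3 consecutive bytes can be vectorized with SIMD instructions.
As Dietrich suggested, if I skip the histogram and simply add up the values in the temporary buffer, execution is very fast. Building the histogram is the part that takes the time. I profiled the code with Cachegrind... the output is: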
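For readers who want something concrete, here is a minimal scalar sketch of the two steps described above. The original code is not shown in this post, so everything here is an assumption: the function names `sum3` and `histogram`, the sliding (overlapping) window, the 16-bit sums, and the `NUM_BINS` bin count are all hypothetical and may not match the real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: sums are kept at full width, so the largest value is 3 * 255 = 765. */
#define NUM_BINS (3 * 255 + 1)

/* Step 1: tmp[i] = data[i] + data[i+1] + data[i+2] (sliding window, assumed).
 * Writes n - 2 sums. Iterations are independent, so this loop can be vectorized. */
void sum3(const uint8_t *data, size_t n, uint16_t *tmp)
{
    assert(n >= 3);
    for (size_t i = 0; i + 2 < n; i++)
        tmp[i] = (uint16_t)(data[i] + data[i + 1] + data[i + 2]);
}

/* Step 2: count how often each sum occurs.
 * The address of each store depends on the data, which makes this loop
 * much harder to vectorize than step 1. */
void histogram(const uint16_t *tmp, size_t n, uint32_t *counter)
{
    memset(counter, 0, NUM_BINS * sizeof counter[0]);
    for (size_t i = 0; i < n; i++)
        counter[tmp[i]]++;
}
```

Step 1 is the part that maps naturally onto SIMD; step 2 is a scatter-style update, which fits the observation that the histogram is where the time goes.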
==11845==
==11845== I refs: 212,171
==11845== I1 misses: 842
==11845== LLi misses: 827
==11845== I1 miss rate: 0.39%
==11845== LLi miss rate: 0.38%
==11845==
==11845== D refs: 69,179 (56,158 rd + 13,021 wr)
==11845== D1 misses: 2,905 ( 2,289 rd + 616 wr)
==11845== LLd misses: 2,470 ( 1,895 rd + 575 wr)
==11845== D1 miss rate: 4.1% ( 4.0% + 4.7% )
==11845== LLd miss rate: 3.5% ( 3.3% + 4.4% )
==11845==
==11845== LL refs: 3,747 ( 3,131 rd + 616 wr)
==11845== LL misses: 3,297 ( 2,722 rd + 575 wr)
==11845== LL miss rate: 1.1% ( 1.0% + 4.4% )
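As a sanity check on how to read the summary above, the miss rates are just misses divided by references. The small sketch below recomputes three of them from the counts in the output; the helper names `miss_rate_percent` and `print_rates` are made up for illustration.

```c
#include <stdio.h>

/* Miss rate in percent: misses divided by references (0 if there were no references). */
double miss_rate_percent(unsigned long misses, unsigned long refs)
{
    return refs ? 100.0 * (double)misses / (double)refs : 0.0;
}

/* Recompute the rates from the counts reported in the summary. */
void print_rates(void)
{
    unsigned long i_refs = 212171, d_refs = 69179;

    printf("I1 miss rate: %.4f%%\n", miss_rate_percent(842, i_refs));
    printf("D1 miss rate: %.4f%%\n", miss_rate_percent(2905, d_refs));
    /* The LL rate is taken over all references, instruction plus data. */
    printf("LL miss rate: %.4f%%\n", miss_rate_percent(3297, i_refs + d_refs));
}
```

The D1 figure comes out just under 4.2%, while the summary prints 4.1%, so this Cachegrind version appears to truncate the percentages rather than round them.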
The full output is:
I1 cache: 65536 B, 64 B, 2-way associative
D1 cache: 65536 B, 64 B, 2-way associative
LL cache: 1048576 B, 64 B, 16-way associative
Command: ./a.out
Data file: cachegrind.out.11845
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
212,171 842 827 56,158 2,289 1,895 13,021 616 575 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
97,335 651 642 26,648 1,295 1,030 10,883 517 479 ???:???
59,413 13 13 13,348 886 829 17 1 0 ???:_dl_addr
40,023 7 7 12,405 10 8 223 18 17 ???:core_get_signature
5,123 2 2 1,277 64 19 256 64 64 ???:core_get_signature_parallel
3,039 46 44 862 9 4 665 8 8 ???:vfprintf
2,344 11 11 407 0 0 254 1 1 ???:_IO_file_xsputn
887 7 7 234 0 0 134 1 0 ???:_IO_file_overflow
720 9 7 250 5 2 150 0 0 ???:__printf_chk
538 4 4 104 0 0 102 2 2 ???:__libc_memalign
507 6 6 145 0 0 114 0 0 ???:_IO_do_write
478 2 2 42 1 1 0 0 0 ???:strchrnul
350 3 3 80 0 0 50 0 0 ???:_IO_file_write
297 4 4 98 0 0 23 0 0 ???:_IO_default_xsputn
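...the register keyword, and counter[arr[i]]++ is more readable (the resulting code is the same). - Dietrich Epp
...for loop, but then you undid it. Does that make sense? - Jared Farrish
counter is only 512 bytes, so it has spatial locality and easily fits in the L1 cache. - Dietrich Epp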
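To illustrate the readability point from the comments, here is a hedged sketch of the two styles side by side. The pointer-arithmetic version is a guess at what a `register`-based loop might look like, not the code from the question. The element types are also assumptions: 256 counters of 16 bits each is just one layout that would give the 512 bytes mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 256 counters x 2 bytes = 512 bytes. */
#define NUM_COUNTERS 256

/* Indexed form, as suggested in the comment. */
void count_indexed(const uint8_t *arr, size_t n, uint16_t *counter)
{
    for (size_t i = 0; i < n; i++)
        counter[arr[i]]++;
}

/* Hypothetical pointer-arithmetic form using `register`.
 * It performs exactly the same increments as count_indexed. */
void count_pointer(const uint8_t *arr, size_t n, uint16_t *counter)
{
    register const uint8_t *p = arr;
    register const uint8_t *end = arr + n;

    while (p != end) {
        *(counter + *p) += 1;
        p++;
    }
}
```

Both functions leave `counter` in the same state for the same input, so the indexed form costs nothing in behavior and is easier to read.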