Why do the same gcc compiler options behave differently on different computer architectures?

Question

Tags: c++, multithreading, caching, gcc, architecture

3

I use the following two makefiles to compile my Gaussian blur program:

g++ -Ofast -ffast-math -march=native -flto -fwhole-program -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp
g++ -O3 -std=c++11 -fopenmp -o interpolateFloatImg interpolateFloatImg.cpp

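My two test environments are:

i7-4710HQ, 4 cores / 8 threads
E5 2650

However, the binary from the first command runs about 2x faster on the E5 than on the i7, reaching only about 0.5x the speed on the i7. The binary from the second command is faster on the i7 but slower on the E5.

Can anyone offer an explanation?

Here is the source code: https://github.com/makeapp007/interpolateFloatImg I will provide more details as soon as possible.

On the i7 the program runs on 8 threads. I don't know how many threads the program spawns on the E5.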

==== Update ====

I am a teammate of the project's original author; here are the results.

Arch-Lenovo-Y50 ~/project/ca/3/12 (git)-[master] % perf stat -d ./interpolateFloatImg lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 211199093
Performance counter stats for './interpolateFloatImg lobby.bin out.bin 255 20':
1423026.281358      task-clock:u (msec)       #    6.516 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
         2,604      page-faults:u             #    0.002 K/sec                  
4,167,572,543,807      cycles:u                  #    2.929 GHz                      (46.79%)
6,713,517,640,459      instructions:u            #    1.61  insn per cycle           (59.29%)
725,873,982,404      branches:u                #  510.092 M/sec                    (57.28%)
23,468,237,735      branch-misses:u           #    3.23% of all branches          (56.99%)
544,480,682,764      L1-dcache-loads:u         #  382.622 M/sec                    (37.00%)
545,000,783,842      L1-dcache-load-misses:u   #  100.10% of all L1-dcache hits    (31.44%)
38,696,703,292      LLC-loads:u               #   27.193 M/sec                    (26.68%)
1,204,703,652      LLC-load-misses:u         #    3.11% of all LL-cache hits     (35.70%)
218.384387536 seconds time elapsed

These are the results from the workstation:

workstation:~/mossCAP3/repos/liuyh1_liujzh/12$  perf stat -d ./interpolateFloatImg ../../../lobby.bin out.bin 255 20
Kernel kernelSize  : 255
Standard deviation : 20
Kernel maximum: 0.000397887
Kernel minimum: 1.22439e-21
Reading width 20265 height  8533 = 172921245
Micro seconds: 133661220
Performance counter stats for './interpolateFloatImg ../../../lobby.bin out.bin 255 20':
2035379.528531      task-clock (msec)         #   14.485 CPUs utilized          
         7,370      context-switches          #    0.004 K/sec                  
           273      cpu-migrations            #    0.000 K/sec                  
         3,123      page-faults               #    0.002 K/sec                  
5,272,393,071,699      cycles                    #    2.590 GHz                     [49.99%]
             0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
             0      stalled-cycles-backend    #    0.00% backend  cycles idle   
7,425,570,600,025      instructions              #    1.41  insns per cycle         [62.50%]
370,199,835,630      branches                  #  181.882 M/sec                   [62.50%]
47,444,417,555      branch-misses             #   12.82% of all branches         [62.50%]
591,137,049,749      L1-dcache-loads           #  290.431 M/sec                   [62.51%]
545,926,505,523      L1-dcache-load-misses     #   92.35% of all L1-dcache hits   [62.51%]
38,725,975,976      LLC-loads                 #   19.026 M/sec                   [50.00%]
 1,093,840,555      LLC-load-misses           #    2.82% of all LL-cache hits    [49.99%]
140.520016141 seconds time elapsed

==== Update ==== E5 specifications:

workstation:~$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
     20  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
workstation:~$ dmesg | grep cache
[    0.041489] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
[    0.047512] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
[    0.050088] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.050121] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes)
[    0.558666] PCI: pci_cache_line_size set to 64 bytes
[    0.918203] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.948808] xhci_hcd 0000:00:14.0: cache line size of 32 is not supported
[    1.076303] ehci-pci 0000:00:1a.0: cache line size of 32 is not supported
[    1.089022] ehci-pci 0000:00:1d.0: cache line size of 32 is not supported
[    1.549796] sd 4:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552711] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.552955] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

- makeapp

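Makeapp, could you run perf stat ./interpolateFloatImg and perf stat -d ./interpolateFloatImg on both platforms and post the results? They will include the actual CPU frequency (the "cycles ... GHz" line). Which E5 model do you have (Xeon E5 CPUs come in several versions: v1, v2, v3, v4)? Without the source code, detailed hotspot profiling results, and a way to reproduce the test on their own machine, nobody can answer your question (http://stackoverflow.com/help/mcve - your question has no minimal, complete, and verifiable example). - osgx

Makeapp, thanks for the code. Which operating system do you use? Which gcc version (is it the same on the i7 and the E5)? Can you provide an image to run the code on? What are the kernel size and image size (args)? What do the 4 outputs of perf stat (program1 on system1, program1 on system2, program2 on system1, program2 on system2) and the 4 outputs of perf stat -d look like? You use omp parallel for; have you tried limiting the thread count to the same value on both machines (export OMP_NUM_THREADS=4) and/or export OMP_PROC_BIND=true? - osgx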

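The i7 runs Arch Linux and the E5 runs Ubuntu 14.04. The g++ version is 4.8.2 on the E5 and 6.1.1 on the i7. The kernel arguments are 277 10 and the precision is 0.002. The test input image is 1000*1000. See http://shtech.org/course/ca/projects/3/ for details. I did not limit the number of threads. - makeapp

The convolution kernel size is 277 and the standard deviation is 10. I am not sure whether the E5 machine limits the number of threads; it is my teacher's computer. - makeapp

Makeapp, thanks, I will send the link to this question to the teacher. What are your input image and run time? And perf stat? - osgx

Your E5-26* v3 is a Haswell part; it has the same AVX2 vector extensions as the i7-4*. Profile the program with perf record / perf report to find the hotspot (see my updated answer), and try rewriting the reference program. - osgx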

2 Answers

3

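Your program has a very high cache miss rate. Is that good or bad for the program?

545,000,783,842 L1-dcache-load-misses:u # 100.10% of all L1-dcache hits

545,926,505,523 L1-dcache-load-misses # 92.35% of all L1-dcache hits

The cache sizes of the i7 and the E5 may differ, which is one source of the difference. Others are different assembly code, different gcc versions, and different gcc options.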
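The percentage perf prints next to L1-dcache-load-misses is simply misses divided by loads. The following is a minimal sketch of that arithmetic using the counter values from the two runs in the question; `miss_ratio_percent` and the two wrapper functions are hypothetical helpers, not part of perf. A ratio slightly above 100% is most likely an artifact of counter multiplexing: the trailing percentages in the perf output show that each event was only counted for part of the run, so the printed totals are scaled estimates.

```cpp
#include <cassert>

// Hypothetical helper: L1-dcache-load-misses as a percentage of
// L1-dcache-loads, the ratio that perf stat -d prints beside the miss counter.
double miss_ratio_percent(double misses, double loads) {
    assert(loads > 0.0);  // guard against division by zero
    return 100.0 * misses / loads;
}

// Counter values copied from the i7 run in the question.
double i7_miss_percent() {
    return miss_ratio_percent(545000783842.0, 544480682764.0);
}

// Counter values copied from the E5 workstation run in the question.
double e5_miss_percent() {
    return miss_ratio_percent(545926505523.0, 591137049749.0);
}
```

Both machines therefore miss L1 on nearly every load, which suggests the memory access pattern, rather than the compiler flags alone, dominates the run time.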

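You should look inside the code, find the hotspot, and analyze how many pixels each instruction processes and how to order the work better for the CPU and memory. Rewriting the hotspot (the part of the code where most of the run time is spent) is the key to solving the task http://shtech.org/course/ca/projects/3/.

You can find the hotspot with the perf profiler in record / report / annotate mode (this is easier if you recompile the project with the -g option):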

# Profile program using cpu cycle performance counter; write profile to perf.data file
perf record ./test test_arg1 test_arg2
# Read perf.data file and report functions where time was spent 
#  (Do not change ./test file, or recompile it after record and before report)
perf report
# Find the hotspot in the top functions by annotation
#  you may use Arrows and Enter to do "annotate" action from report; or:
perf annotate -s top_function_name
perf annotate -s top_function_name > annotate_func1.txt

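On my mobile i5-4* (Intel Haswell) with 2 cores (4 virtual cores with HT enabled) and AVX2+FMA, I was able to get a 7x speedup with the small bin file and the 277 10 arguments. Some loops / loop nests need to be rewritten. You should understand how the CPU cache works and what is easier for it: missing often or missing rarely. Also, gcc can be dumb and may not always detect the pattern in which data is read; that detection may require processing several pixels at once.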
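To illustrate what "rewriting a loop nest" for the cache can mean, here is a minimal sketch, not the code from the project. It assumes the image is stored row-major in a flat float array (an assumption about the layout), and the function names are hypothetical. Both functions compute the same sum; only the order in which memory is touched differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inner loop runs over x, so consecutive accesses are adjacent in memory
// and each cache line that is fetched gets fully used.
double sum_row_major(const std::vector<float>& img,
                     std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    double sum = 0.0;
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            sum += img[y * width + x];
    return sum;
}

// Loops swapped: consecutive accesses are `width` floats apart, so on a
// large image almost every access can land on a different cache line.
double sum_column_major(const std::vector<float>& img,
                        std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    double sum = 0.0;
    for (std::size_t x = 0; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            sum += img[y * width + x];
    return sum;
}
```

On an image as wide as the 20265-pixel one in the question, the second version would be expected to miss the cache far more often. The vertical pass of a blur has the same strided pattern, which is why it is a common candidate for restructuring.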

- osgx

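Thank you very much. I did not know there were tools for analyzing a program's performance. What do you mean by "find the hotspot"? I am confused about how to make this data fit the cache. Also, I used SSE to improve performance, and it improved the time by 30%. - makeapp

perf annotate -s top_function_name - makeapp

Makeapp, you can't just "use" SSE (which one? There are SSE, SSE2, SSE3, AVX, AVX2, FMA, AVX512; some are wider SIMD; see the wiki https://en.wikipedia.org/wiki/X86_instruction_listings#SIMD_instructions). You should look at the code and how it accesses data, and whether that access pattern is a high-performance one or not. Then you should look at the assembly (and which compiler was used). On x86_64, float/double arithmetic in hardware is done with SSE2 at minimum, but even SSE2 can be used either for scalar operations ("scalar", ss suffix) or for vectorized operations ("packed", ps suffix). Optimizing the program is your task, not mine. - osgx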
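The scalar versus packed distinction can be sketched with SSE intrinsics. This is an illustrative example under stated assumptions, not code from the project: the function names are hypothetical, and a plain C++ fallback is included for targets where the compiler does not define `__SSE__`.

```cpp
#include <cassert>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (_mm_add_ss, _mm_add_ps, ...)
#endif

// Scalar add: only the lowest lane of the register does useful work.
// _mm_add_ss typically maps to the addss instruction ("ss" suffix).
float add_scalar(float a, float b) {
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(a), _mm_set_ss(b)));
#else
    return a + b;
#endif
}

// Packed add: four floats are added by one instruction.
// _mm_add_ps typically maps to addps ("ps" suffix).
void add_packed4(const float* a, const float* b, float* out) {
    assert(a && b && out);
#if defined(__SSE__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

A binary can be full of SSE instructions and still process one pixel at a time; checking the annotated assembly for ss versus ps suffixes shows which case applies.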

1

@makeapp, please see this post: https://dev59.com/LGkw5IYBdhLWcg3wY5jq "Why does the order of the loops affect performance when iterating over a 2D array?" - osgx


- Pyves · Accepted Answer

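Judging by the compiler flags you specified, the first Makefile uses the -march=native flag, which partly explains why you observe different performance gaps on the two CPUs with and without the flag.

This flag allows GCC to use instructions specific to the CPU architecture of the build machine, which are not necessarily available on other architectures. It also implies -mtune=native, which tunes the compiled code for that machine's specific CPU and favors instruction sequences that run faster on it. Note that code compiled with -march=native may not work at all on a system with a different CPU, or may be noticeably slower there.

So even though the options look identical, they behave differently under the hood depending on the machine you compile on. You can find more information about this flag in the GCC documentation.

To see exactly which options are enabled for each CPU, run the following command on each machine: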

gcc -march=native -Q --help=target
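A complementary check can be made from inside the program. GCC defines macros such as `__SSE2__`, `__AVX__`, `__AVX2__`, and `__FMA__` when the corresponding instruction sets are enabled for the compilation, so a small function can report what a given build was allowed to use. This is a minimal sketch; the function name `enabled_simd_features` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Returns a space-separated list of the SIMD extensions that were enabled
// at compile time, based on the compiler's predefined macros.
std::string enabled_simd_features() {
    std::string s;
#ifdef __SSE2__
    s += "sse2 ";
#endif
#ifdef __AVX__
    s += "avx ";
#endif
#ifdef __AVX2__
    s += "avx2 ";
#endif
#ifdef __FMA__
    s += "fma ";
#endif
    if (s.empty()) s = "none";
    assert(!s.empty());  // always reports something, even if only "none"
    return s;
}
```

Printing the result from binaries built with each of the two command lines, on each machine, would show whether -march=native actually enabled AVX2 and FMA under both compiler versions.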

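In addition, the GCC version also affects how the compiler flags optimize your code. This applies in particular to -march=native, which enables fewer tunings on older GCC versions (newer architectures may not have been fully supported at the time). This may further explain the gap you are observing.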