Why does gcc -march=znver1 restrict uint64_t vectorization?

3

I'm trying to make sure gcc vectorizes my loops. It turns out that with -march=znver1 (or -march=native), gcc skips some loops even though they can be vectorized. Why does this happen?

In this code, the second loop, which multiplies each element by a scalar, is not vectorized:

#include <stdio.h>
#include <inttypes.h>

int main() {
    const size_t N = 1000;
    uint64_t arr[N];
    for (size_t i = 0; i < N; ++i)
        arr[i] = 1;

    for (size_t i = 0; i < N; ++i)
        arr[i] *= 5;

    for (size_t i = 0; i < N; ++i)
        printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}

gcc -O3 -fopt-info-vec-all -mavx2 main.c:

main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DI

gcc -O3 -fopt-info-vec-all -march=znver1 main.c:

main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI

-march=znver1 includes -mavx2, so I think gcc chooses not to vectorize the loop for some reason:

~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
  -m128bit-long-double              [enabled]
  -m16                              [disabled]
  -m32                              [disabled]
  -m3dnow                           [disabled]
  -m3dnowa                          [disabled]
  -m64                              [enabled]
  -m80387                           [enabled]
  -m8bit-idiv                       [disabled]
  -m96bit-long-double               [disabled]
  -mabi=                            sysv
  -mabm                             [enabled]
  -maccumulate-outgoing-args        [disabled]
  -maddress-mode=                   long
  -madx                             [enabled]
  -maes                             [enabled]
  -malign-data=                     compat
  -malign-double                    [disabled]
  -malign-functions=                0
  -malign-jumps=                    0
  -malign-loops=                    0
  -malign-stringops                 [enabled]
  -mamx-bf16                        [disabled]
  -mamx-int8                        [disabled]
  -mamx-tile                        [disabled]
  -mandroid                         [disabled]
  -march=                           znver1
  -masm=                            att
  -mavx                             [enabled]
  -mavx2                            [enabled]
  -mavx256-split-unaligned-load     [disabled]
  -mavx256-split-unaligned-store    [enabled]
  -mavx5124fmaps                    [disabled]
  -mavx5124vnniw                    [disabled]
  -mavx512bf16                      [disabled]
  -mavx512bitalg                    [disabled]
  -mavx512bw                        [disabled]
  -mavx512cd                        [disabled]
  -mavx512dq                        [disabled]
  -mavx512er                        [disabled]
  -mavx512f                         [disabled]
  -mavx512ifma                      [disabled]
  -mavx512pf                        [disabled]
  -mavx512vbmi                      [disabled]
  -mavx512vbmi2                     [disabled]
  -mavx512vl                        [disabled]
  -mavx512vnni                      [disabled]
  -mavx512vp2intersect              [disabled]
  -mavx512vpopcntdq                 [disabled]
  -mavxvnni                         [disabled]
  -mbionic                          [disabled]
  -mbmi                             [enabled]
  -mbmi2                            [enabled]
  -mbranch-cost=<0,5>               3
  -mcall-ms2sysv-xlogues            [disabled]
  -mcet-switch                      [disabled]
  -mcld                             [disabled]
  -mcldemote                        [disabled]
  -mclflushopt                      [enabled]
  -mclwb                            [disabled]
  -mclzero                          [enabled]
  -mcmodel=                         [default]
  -mcpu=                            
  -mcrc32                           [disabled]
  -mcx16                            [enabled]
  -mdispatch-scheduler              [disabled]
  -mdump-tune-features              [disabled]
  -menqcmd                          [disabled]
  -mf16c                            [enabled]
  -mfancy-math-387                  [enabled]
  -mfentry                          [disabled]
  -mfentry-name=                    
  -mfentry-section=                 
  -mfma                             [enabled]
  -mfma4                            [disabled]
  -mforce-drap                      [disabled]
  -mforce-indirect-call             [disabled]
  -mfp-ret-in-387                   [enabled]
  -mfpmath=                         sse
  -mfsgsbase                        [enabled]
  -mfunction-return=                keep
  -mfused-madd                      -ffp-contract=fast
  -mfxsr                            [enabled]
  -mgeneral-regs-only               [disabled]
  -mgfni                            [disabled]
  -mglibc                           [enabled]
  -mhard-float                      [enabled]
  -mhle                             [disabled]
  -mhreset                          [disabled]
  -miamcu                           [disabled]
  -mieee-fp                         [enabled]
  -mincoming-stack-boundary=        0
  -mindirect-branch-register        [disabled]
  -mindirect-branch=                keep
  -minline-all-stringops            [disabled]
  -minline-stringops-dynamically    [disabled]
  -minstrument-return=              none
  -mintel-syntax                    -masm=intel
  -mkl                              [disabled]
  -mlarge-data-threshold=<number>   65536
  -mlong-double-128                 [disabled]
  -mlong-double-64                  [disabled]
  -mlong-double-80                  [enabled]
  -mlwp                             [disabled]
  -mlzcnt                           [enabled]
  -mmanual-endbr                    [disabled]
  -mmemcpy-strategy=                
  -mmemset-strategy=                
  -mmitigate-rop                    [disabled]
  -mmmx                             [enabled]
  -mmovbe                           [enabled]
  -mmovdir64b                       [disabled]
  -mmovdiri                         [disabled]
  -mmpx                             [disabled]
  -mms-bitfields                    [disabled]
  -mmusl                            [disabled]
  -mmwaitx                          [enabled]
  -mneeded                          [disabled]
  -mno-align-stringops              [disabled]
  -mno-default                      [disabled]
  -mno-fancy-math-387               [disabled]
  -mno-push-args                    [disabled]
  -mno-red-zone                     [disabled]
  -mno-sse4                         [disabled]
  -mnop-mcount                      [disabled]
  -momit-leaf-frame-pointer         [disabled]
  -mpc32                            [disabled]
  -mpc64                            [disabled]
  -mpc80                            [disabled]
  -mpclmul                          [enabled]
  -mpcommit                         [disabled]
  -mpconfig                         [disabled]
  -mpku                             [disabled]
  -mpopcnt                          [enabled]
  -mprefer-avx128                   -mprefer-vector-width=128
  -mprefer-vector-width=            128
  -mpreferred-stack-boundary=       0
  -mprefetchwt1                     [disabled]
  -mprfchw                          [enabled]
  -mptwrite                         [disabled]
  -mpush-args                       [enabled]
  -mrdpid                           [disabled]
  -mrdrnd                           [enabled]
  -mrdseed                          [enabled]
  -mrecip                           [disabled]
  -mrecip=                          
  -mrecord-mcount                   [disabled]
  -mrecord-return                   [disabled]
  -mred-zone                        [enabled]
  -mregparm=                        6
  -mrtd                             [disabled]
  -mrtm                             [disabled]
  -msahf                            [enabled]
  -mserialize                       [disabled]
  -msgx                             [disabled]
  -msha                             [enabled]
  -mshstk                           [disabled]
  -mskip-rax-setup                  [disabled]
  -msoft-float                      [disabled]
  -msse                             [enabled]
  -msse2                            [enabled]
  -msse2avx                         [disabled]
  -msse3                            [enabled]
  -msse4                            [enabled]
  -msse4.1                          [enabled]
  -msse4.2                          [enabled]
  -msse4a                           [enabled]
  -msse5                            -mavx
  -msseregparm                      [disabled]
  -mssse3                           [enabled]
  -mstack-arg-probe                 [disabled]
  -mstack-protector-guard-offset=   
  -mstack-protector-guard-reg=      
  -mstack-protector-guard-symbol=   
  -mstack-protector-guard=          tls
  -mstackrealign                    [disabled]
  -mstringop-strategy=              [default]
  -mstv                             [enabled]
  -mtbm                             [disabled]
  -mtls-dialect=                    gnu
  -mtls-direct-seg-refs             [enabled]
  -mtsxldtrk                        [disabled]
  -mtune-ctrl=                      
  -mtune=                           znver1
  -muclibc                          [disabled]
  -muintr                           [disabled]
  -mvaes                            [disabled]
  -mveclibabi=                      [default]
  -mvect8-ret-in-mem                [disabled]
  -mvpclmulqdq                      [disabled]
  -mvzeroupper                      [enabled]
  -mwaitpkg                         [disabled]
  -mwbnoinvd                        [disabled]
  -mwidekl                          [disabled]
  -mx32                             [disabled]
  -mxop                             [disabled]
  -mxsave                           [enabled]
  -mxsavec                          [enabled]
  -mxsaveopt                        [enabled]
  -mxsaves                          [enabled]

  Known assembler dialects (for use with the -masm= option):
    att intel

  Known ABIs (for use with the -mabi= option):
    ms sysv

  Known code models (for use with the -mcmodel= option):
    32 kernel large medium small

  Valid arguments to -mfpmath=:
    387 387+sse 387,sse both sse sse+387 sse,387

  Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
    keep thunk thunk-extern thunk-inline

  Known choices for return instrumentation with -minstrument-return=:
    call none nop5

  Known data alignment choices (for use with the -malign-data= option):
    abi cacheline compat

  Known vectorization library ABIs (for use with the -mveclibabi= option):
    acml svml

  Known address mode (for use with the -maddress-mode= option):
    long short

  Known preferred register vector length (to use with the -mprefer-vector-width= option):
    128 256 512 none

  Known stack protector guard (for use with the -mstack-protector-guard= option):
    global tls

  Valid arguments to -mstringop-strategy=:
    byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop

  Known TLS dialects (for use with the -mtls-dialect= option):
    gnu gnu2

  Known valid arguments for -march= option:
    i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native

  Known valid arguments for -mtune= option:
    generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3

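I also tried clang, and in both cases the loops are, I believe, vectorized with 32-byte vectors:

remark: vectorized loop (vectorization width: 4, interleaved count: 4)

I'm using gcc 11.2.0.

Edit: At Peter Cordes's request I ran a benchmark, and I realized that I had actually been benchmarking multiplication by 4.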

Makefile:

all:
    gcc -O3 -mavx2 main.c -o 3
    gcc -O3 -march=znver2 main.c -o 32  
    gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
    gcc -O3 -march=znver1 main.c -o 31
    gcc -O2 -mavx2 main.c -o 2
    gcc -O2 -march=znver2 main.c -o 22
    gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
    gcc -O2 -march=znver1 main.c -o 21
    hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21

clean:
    rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21

Code:

#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>

int main() {
    const size_t N = 500;
    uint64_t arr[N];
    for (size_t i = 0; i < N; ++i)
        arr[i] = 1;

    for (int j = 0; j < 20000000; ++j)
        for (size_t i = 0; i < N; ++i)
            arr[i] *= 4;

    srand(time(0));
    printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}

N = 500, arr[i] *= 4:

Benchmark 1: ./3
  Time (mean ± σ):      1.780 s ±  0.011 s    [User: 1.778 s, System: 0.000 s]
  Range (min … max):    1.763 s …  1.791 s    5 runs

Benchmark 2: ./32
  Time (mean ± σ):      1.785 s ±  0.016 s    [User: 1.783 s, System: 0.000 s]
  Range (min … max):    1.773 s …  1.810 s    5 runs

Benchmark 3: ./32128
  Time (mean ± σ):      1.740 s ±  0.026 s    [User: 1.737 s, System: 0.000 s]
  Range (min … max):    1.724 s …  1.785 s    5 runs

Benchmark 4: ./31
  Time (mean ± σ):      1.757 s ±  0.022 s    [User: 1.754 s, System: 0.000 s]
  Range (min … max):    1.727 s …  1.785 s    5 runs

Benchmark 5: ./2
  Time (mean ± σ):      3.467 s ±  0.031 s    [User: 3.462 s, System: 0.000 s]
  Range (min … max):    3.443 s …  3.519 s    5 runs

Benchmark 6: ./22
  Time (mean ± σ):      3.475 s ±  0.028 s    [User: 3.469 s, System: 0.001 s]
  Range (min … max):    3.447 s …  3.512 s    5 runs

Benchmark 7: ./22128
  Time (mean ± σ):      3.464 s ±  0.034 s    [User: 3.459 s, System: 0.001 s]
  Range (min … max):    3.431 s …  3.509 s    5 runs

Benchmark 8: ./21
  Time (mean ± σ):      3.465 s ±  0.013 s    [User: 3.460 s, System: 0.001 s]
  Range (min … max):    3.443 s …  3.475 s    5 runs

N = 500, arr[i] *= 5:

Benchmark 1: ./3
  Time (mean ± σ):      1.789 s ±  0.004 s    [User: 1.786 s, System: 0.001 s]
  Range (min … max):    1.783 s …  1.793 s    5 runs

Benchmark 2: ./32
  Time (mean ± σ):      1.772 s ±  0.017 s    [User: 1.769 s, System: 0.000 s]
  Range (min … max):    1.755 s …  1.800 s    5 runs

Benchmark 3: ./32128
  Time (mean ± σ):      2.911 s ±  0.023 s    [User: 2.907 s, System: 0.001 s]
  Range (min … max):    2.880 s …  2.943 s    5 runs

Benchmark 4: ./31
  Time (mean ± σ):      2.924 s ±  0.013 s    [User: 2.921 s, System: 0.000 s]
  Range (min … max):    2.906 s …  2.934 s    5 runs

Benchmark 5: ./2
  Time (mean ± σ):      3.850 s ±  0.029 s    [User: 3.846 s, System: 0.000 s]
  Range (min … max):    3.823 s …  3.896 s    5 runs

Benchmark 6: ./22
  Time (mean ± σ):      3.816 s ±  0.036 s    [User: 3.812 s, System: 0.000 s]
  Range (min … max):    3.777 s …  3.855 s    5 runs

Benchmark 7: ./22128
  Time (mean ± σ):      3.813 s ±  0.026 s    [User: 3.809 s, System: 0.000 s]
  Range (min … max):    3.780 s …  3.834 s    5 runs

Benchmark 8: ./21
  Time (mean ± σ):      3.783 s ±  0.010 s    [User: 3.779 s, System: 0.000 s]
  Range (min … max):    3.773 s …  3.798 s    5 runs

N = 512, arr[i] *= 4:

Benchmark 1: ./3
  Time (mean ± σ):      1.849 s ±  0.015 s    [User: 1.847 s, System: 0.000 s]
  Range (min … max):    1.831 s …  1.873 s    5 runs

Benchmark 2: ./32
  Time (mean ± σ):      1.846 s ±  0.013 s    [User: 1.844 s, System: 0.001 s]
  Range (min … max):    1.832 s …  1.860 s    5 runs

Benchmark 3: ./32128
  Time (mean ± σ):      1.756 s ±  0.012 s    [User: 1.754 s, System: 0.000 s]
  Range (min … max):    1.744 s …  1.771 s    5 runs

Benchmark 4: ./31
  Time (mean ± σ):      1.788 s ±  0.012 s    [User: 1.785 s, System: 0.001 s]
  Range (min … max):    1.774 s …  1.801 s    5 runs

Benchmark 5: ./2
  Time (mean ± σ):      3.476 s ±  0.015 s    [User: 3.472 s, System: 0.001 s]
  Range (min … max):    3.458 s …  3.494 s    5 runs

Benchmark 6: ./22
  Time (mean ± σ):      3.449 s ±  0.002 s    [User: 3.446 s, System: 0.000 s]
  Range (min … max):    3.446 s …  3.452 s    5 runs

Benchmark 7: ./22128
  Time (mean ± σ):      3.456 s ±  0.007 s    [User: 3.453 s, System: 0.000 s]
  Range (min … max):    3.446 s …  3.462 s    5 runs

Benchmark 8: ./21
  Time (mean ± σ):      3.547 s ±  0.044 s    [User: 3.542 s, System: 0.001 s]
  Range (min … max):    3.482 s …  3.600 s    5 runs

N = 512, arr[i] *= 5:

Benchmark 1: ./3
  Time (mean ± σ):      1.847 s ±  0.013 s    [User: 1.845 s, System: 0.000 s]
  Range (min … max):    1.836 s …  1.863 s    5 runs

Benchmark 2: ./32
  Time (mean ± σ):      1.830 s ±  0.007 s    [User: 1.827 s, System: 0.001 s]
  Range (min … max):    1.820 s …  1.837 s    5 runs

Benchmark 3: ./32128
  Time (mean ± σ):      2.983 s ±  0.017 s    [User: 2.980 s, System: 0.000 s]
  Range (min … max):    2.966 s …  3.012 s    5 runs

Benchmark 4: ./31
  Time (mean ± σ):      3.026 s ±  0.039 s    [User: 3.021 s, System: 0.001 s]
  Range (min … max):    2.989 s …  3.089 s    5 runs

Benchmark 5: ./2
  Time (mean ± σ):      4.000 s ±  0.021 s    [User: 3.994 s, System: 0.001 s]
  Range (min … max):    3.982 s …  4.035 s    5 runs

Benchmark 6: ./22
  Time (mean ± σ):      3.940 s ±  0.041 s    [User: 3.934 s, System: 0.001 s]
  Range (min … max):    3.890 s …  3.981 s    5 runs

Benchmark 7: ./22128
  Time (mean ± σ):      3.928 s ±  0.032 s    [User: 3.922 s, System: 0.001 s]
  Range (min … max):    3.898 s …  3.979 s    5 runs

Benchmark 8: ./21
  Time (mean ± σ):      3.908 s ±  0.029 s    [User: 3.904 s, System: 0.000 s]
  Range (min … max):    3.879 s …  3.954 s    5 runs

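I think the run where -O2 -march=znver1 was the same speed as -O3 -march=znver1 was a mistake on my part: I misnamed the files. I had not created the makefile yet at that point and was relying on my shell history.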


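-O2 doesn't include -ftree-vectorize (until GCC12), so it's no surprise that the -O2 results are about the same regardless of the -march option. To reduce overhead even further, you could end main with `return arr[argc];` or assign the element to a `volatile uint64_t`. As far as the compiler is concerned it could still be any element, and it avoids extra system calls, especially printing to a terminal. What you have is fine, but if you were going to file a missed-optimization bug report at https://gcc.gnu.org/bugzilla/, you could tighten it up that way. And of course `srand` isn't needed. - Peter Cordes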
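Interesting that for the *=4 case (which does vectorize for Zen1), -O3 -march=znver2 (256-bit vectors) is slightly slower than -O3 -march=znver1 (128-bit vectors), which seems to validate GCC's tuning choice. Does _Alignas(32) on the array recover the performance of the -O3 -march=znver2 *=4 case? If so, that would indicate that misaligned 32-byte stores (or even loads) have an extra cost on Zen1 even when each half is 16-byte aligned. I'd hope that's not the case, since they run as separate uops, unlike on Sandybridge, where a single uop uses the AGU once. - Peter Cordes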
It didn't change anything; tested with N = 500 and N = 512. - TheHardew
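Thanks for checking; I'd have expected the "loop unrolling" effect of using 256-bit vectors on Zen1 to at least not hurt, but apparently it does, a little. - Peter Cordes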
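
The tightening suggested in the first comment above can be sketched as follows. This is only an illustration: the names `bench_mul5` and `sink` are hypothetical and not from the question, and the benchmark body is wrapped in a function so the index can come from the caller (for example `argc`).

```c
#include <stddef.h>
#include <stdint.h>

enum { N = 500 };   /* small enough to stay in L1d, as in the question */

/* Hypothetical sink: a volatile store is an observable side effect, so the
   compiler has to actually produce the value stored to it. */
volatile uint64_t sink;

/* Runs the "= 1" init once, then `reps` passes of the "*= 5" loop.
   `idx` should come from something the compiler cannot see through
   (e.g. argc), so that any element might be the one that is used. */
void bench_mul5(unsigned reps, size_t idx)
{
    uint64_t arr[N];
    for (size_t i = 0; i < N; ++i)
        arr[i] = 1;

    for (unsigned j = 0; j < reps; ++j)
        for (size_t i = 0; i < N; ++i)
            arr[i] *= 5;            /* wraps modulo 2^64, which is well defined */

    sink = arr[idx % N];            /* no printf, no srand/rand */
}
```

A `main` for timing could then be as small as `int main(int argc, char **argv) { (void)argv; bench_mul5(20000000, (size_t)argc); return 0; }`. Whether a given compiler keeps the whole loop nest intact is still worth confirming in the generated asm.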
1 Answer

4
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
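znver1 implies -mprefer-vector-width=128, because that is the native width of the hardware. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be OK; the pipeline handles 2-uop instructions efficiently. (I think it's 6 uops wide but only 5 instructions wide, so maximum front-end throughput isn't available using only 1-uop instructions.) But when vectorization requires shuffling (e.g. with arrays of different element widths), GCC's code generation can get messier with 256-bit or wider vectors. Also, mov-elimination for vmovdqa ymm0, ymm1 only works on the low 128-bit half on Zen1. And normally, using 256-bit vectors implies you should use vzeroupper afterwards to avoid performance problems on other CPUs (but not on Zen1).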
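I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but the halves are in separate cache lines. If that performs well, GCC might want to consider raising the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code when the size isn't known to be a multiple of the vector width.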
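Ideally, GCC would be able to detect easy cases like this one and use 256-bit vectors there (pure vertical, no mixing of element widths, constant size that is a multiple of 32 bytes). At least on CPUs where that's fine: znver1, but not bdver2, for example, where 256-bit stores are always slow because of a CPU design bug.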
You can see the result of this choice in the way it vectorizes your first loop, the memset-like one, with a vmovdqu [rdx], xmm0: https://godbolt.org/z/E5Tq7Gfzc
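So, given that GCC has decided to use only 128-bit vectors, which can hold just two uint64_t elements, it (rightly or wrongly) decides it isn't worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, versus doing it in integer registers with one LEA instruction.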
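Almost certainly wrongly in this case, since scalar code still needs a separate load and store for every element, versus one per pair of elements with 128-bit vectors. (Plus loop overhead, since GCC's default is not to unroll loops except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)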
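
As a rough sketch of the 128-bit vectorization GCC declines to do here, the loop can be written by hand with SSE2 intrinsics, assuming an x86-64 target. The function name `mul5_sse2` is hypothetical; `_mm_slli_epi64` and `_mm_add_epi64` correspond to the psllq / paddq operations (VEX-encoded as vpsllq / vpaddq when AVX is enabled).

```c
#include <emmintrin.h>   /* SSE2 intrinsics; SSE2 is baseline on x86-64 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply n uint64_t elements by 5, two per iteration,
   computing v * 5 as (v << 2) + v. */
void mul5_sse2(uint64_t *arr, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(arr + i)); /* 16-byte load  */
        __m128i t = _mm_slli_epi64(v, 2);                        /* v << 2        */
        t = _mm_add_epi64(t, v);                                 /* (v << 2) + v  */
        _mm_storeu_si128((__m128i *)(arr + i), t);               /* 16-byte store */
    }
    for (; i < n; ++i)   /* scalar cleanup when n is odd */
        arr[i] *= 5;
}
```

The trailing scalar loop is the kind of cleanup code that grows with vector width when the element count is not known to be a multiple of it.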
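I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 has no SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And only after getting over that hump would it realize that the multiply is actually quite cheap for a constant like 5, which has only 2 set bits?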
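
For a sense of what that general case costs, here is a minimal sketch of a full 64x64 => low-64-bit multiply built from `_mm_mul_epu32` (pmuludq), assuming an x86-64 target. The name `mul64_sse2` is hypothetical, and this version uses shifts rather than shuffles to reach the high halves; it is not claimed to be the sequence GCC's cost model assumes.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 64 bits of a*b in each of the two 64-bit lanes.
   With a = a_hi*2^32 + a_lo and b = b_hi*2^32 + b_lo:
   a*b mod 2^64 = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32). */
__m128i mul64_sse2(__m128i a, __m128i b)
{
    __m128i a_hi  = _mm_srli_epi64(a, 32);     /* high halves moved down  */
    __m128i b_hi  = _mm_srli_epi64(b, 32);
    __m128i lo    = _mm_mul_epu32(a, b);       /* a_lo * b_lo, 64-bit     */
    __m128i x1    = _mm_mul_epu32(a_hi, b);    /* a_hi * b_lo             */
    __m128i x2    = _mm_mul_epu32(a, b_hi);    /* a_lo * b_hi             */
    __m128i cross = _mm_slli_epi64(_mm_add_epi64(x1, x2), 32);
    return _mm_add_epi64(lo, cross);
}
```

That is three multiplies plus several shifts and adds per vector, compared with one shift and one add for a constant multiplier of 5.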
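That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a piece of machinery as complex as a compiler. A skilled human can easily make smarter choices, but compilers just run sequences of optimization passes that don't always consider the big picture and all the details at the same time.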

-mprefer-vector-width=256 doesn't help:

Not vectorizing uint64_t *= 5 seems to be a GCC9 regression

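(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from processing 2 uint64_t elements in 6 uops versus 1 element in 5 uops with scalar code, or 4 uint64_t elements per iteration with 256-bit vectors, including two 128-bit stores, which would be the throughput bottleneck along with the front end.)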
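Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or with current trunk. As you say, we do with -march=znver2: https://godbolt.org/z/dMTh7Wxcq

For uint32_t, we do get vectorization with those options (even leaving the vector width at 128 bits). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128-bit or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost model decide not to vectorize, or whether it's just 2 vs. 4 elements per 128-bit internal uop.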
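With uint64_t, changing the loop body to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; arr[i] += 123; in the same loop vectorizes, to the same instructions GCC considers not worth using for *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost model isn't looking as far as the final x86 machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1;, which is exactly the same thing.

GCC8 does vectorize your loop, even with a 128-bit vector width: https://godbolt.org/z/5o6qjc7f6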
# GCC8.5 -march=znver1 -O3  (-mprefer-vector-width=128)
.L12:                                            # do{
        vmovups xmm1, XMMWORD PTR [rsi]            # 16-byte load
        add     rsi, 16                             # ptr += 2 elements
        vpsllq  xmm0, xmm1, 2                      # v << 2
        vpaddq  xmm0, xmm0, xmm1                   # tmp += v
        vmovups XMMWORD PTR [rsi-16], xmm0         # store
        cmp     rax, rsi
        jne     .L12                             # } while(p != endp)
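
The loop variants compared above can be collected into one file for pasting into Compiler Explorer with `-O3 -march=znver1 -fopt-info-vec-all`. This is a sketch with hypothetical function names, written over a pointer and a length so that each function stands alone; the answer's own links use their own source, so the vectorization reports for this version may differ by GCC version.

```c
#include <stddef.h>
#include <stdint.h>

/* The question's operation: reported above as not vectorized by GCC 9-11. */
void mul5(uint64_t *arr, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        arr[i] *= 5;
}

/* The same arithmetic spelled out by hand: reported as still not vectorized. */
void shl2_add_self(uint64_t *arr, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        arr[i] += arr[i] << 2;
}

/* A plain shift: reported as vectorized. */
void shl1(uint64_t *arr, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        arr[i] <<= 1;
}

/* Shift, then add a constant instead of the original value:
   reported as vectorized. */
void shl2_add_const(uint64_t *arr, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        arr[i] <<= 2;
        arr[i] += 123;
    }
}
```

Comparing the `optimized:` and `missed:` lines per function is a quick way to see which source-level shapes the cost model accepts.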

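With -march=znver1 -mprefer-vector-width=256, the store is done as two 16-byte halves with vmovups xmm / vextracti128; see "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?". znver1 implies -mavx256-split-unaligned-store, which affects every store where GCC doesn't know for sure that the address is aligned, so it costs extra instructions even when the data does happen to be aligned.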

znver1 doesn't imply -mavx256-split-unaligned-load, though, so GCC is willing to fold loads into ALU instructions as memory source operands in code where that's useful.
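
Since the store splitting applies only when GCC can't prove alignment, one way to experiment is to promise the alignment explicitly. The sketch below uses the GCC/Clang builtin `__builtin_assume_aligned`; the function name `shl2_aligned32` is hypothetical, and whether a given GCC version then emits a single 32-byte store with -mprefer-vector-width=256 is an assumption to check in the asm, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. The caller must really pass a 32-byte-aligned
   pointer; breaking that promise is undefined behaviour. */
void shl2_aligned32(uint64_t *arr, size_t n)
{
    /* Tells the compiler that arr is 32-byte aligned. */
    uint64_t *p = __builtin_assume_aligned(arr, 32);
    for (size_t i = 0; i < n; ++i)
        p[i] <<= 2;   /* the *= 4 case, which does vectorize for znver1 */
}
```

A caller could declare the buffer as `_Alignas(32) uint64_t arr[N];` to satisfy the promise. Note that in the question's comments, `_Alignas(32)` on the array did not change the measured times.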


If I create a separate array, fill it with 5s, and multiply by those, it works too. - TheHardew
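@TheHardew: OK, that sounds reasonable, although increasing the size that much means you're now testing L3 bandwidth rather than ALU / front-end throughput. A size that fits in L1d is good. If you can confirm a speedup at the larger size (e.g. using -march=znver2 and/or -mprefer-vector-width=128 to show that 128-bit vectorization of this loop is a win on Zen1), you could report a missed-optimization bug at https://gcc.gnu.org/bugzilla/. - Peter Cordes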
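It didn't change anything. Also, I realized I made a mistake: the array is 4000 B now. With 4096 B, znver2 with 256-bit vectors and generic -mavx2 actually get 1% slower (5 standard deviations), but znver2 with 128-bit vectors and znver1 do not. - TheHardew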
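@TheHardew: What?? So auto-vectorization (enabled at -O3 but not at -O2) makes your program twice as fast on a Zen1 CPU when you use -march=znver2? Did you repeat the "= 1" init loop as well as the "*= 5" loop in your benchmark? Now I'm not sure what you were comparing earlier, because -O3 -march=znver2 -mprefer-vector-width=128 vs. znver1 should be the same SIMD vs. scalar (respectively) benchmark, and you said there was no speedup on Zen1. Maybe edit your question with details on what you tested, and the absolute times. - Peter Cordes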
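I updated the post. I made some mistakes when writing my comments; for example, I had changed *5 to *4. It looks like there really is a bug in gcc with znver1. I'm very sorry for taking up so much of your time, and thank you very much for your help. - TheHardew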