3

I wanted to explore auto-vectorization by gcc (10.3). I have the following short program (see https://godbolt.org/z/5v9a53aj6) which computes the sum of all elements of a vector:

#include <stdio.h>
#define LEN 1024

// -ffast-math -march=tigerlake -O3 -fno-unroll-loops
  
int
main()
{
  float v[LEN] __attribute__ ((aligned(64)));
  float s = 0;
  for (unsigned int i = 0; i < LEN; i++) s += v[i];
  printf("%g\n", s);
  return 0;
}

I compile with the options -ffast-math -march=tigerlake -O3 -fno-unroll-loops. Since tigerlake processors have avx512, I would expect that gcc autovectorization uses zmm registers, but it actually uses ymm registers (avx/avx2) in the innermost loop:

vaddps  ymm0, ymm0, YMMWORD PTR [rax]

If I replace -march=tigerlake with -mavx512f, zmm registers are used:

vaddps  zmm0, zmm0, ZMMWORD PTR [rax]

Why aren't zmm registers used, if I just specify -march=tigerlake?

    Try `-mprefer-vector-width=512`? Maybe using the avx512 instructions often results in slower code for this processor. – Marc Glisse Oct 21 '22 at 10:53
  • @MarcGlisse: Thanks a lot! Including this option produces code with zmm registers. – Ralf Oct 21 '22 at 11:56

2 Answers

8

-march=tigerlake defaults to -mprefer-vector-width=256 because there are tradeoffs to actually using 512-bit vectors, unlike other AVX-512 features like masking and new instructions.

For a program that you hope might benefit, try compiling with -mprefer-vector-width=512. (And all the same other options, like -march=native -O3 -flto -ffast-math or -fno-math-errno -fno-trapping-math, and ideally -fprofile-generate / -fprofile-use.)
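For the loop from the question, that can be as simple as appending the width preference to the original options (a sketch of the idea; sum here is just a hypothetical standalone version of the question's loop so the vectorized part is easy to see in isolation):

// -ffast-math -march=tigerlake -O3 -fno-unroll-loops -mprefer-vector-width=512

float sum(const float *v, unsigned int n)
{
  float s = 0;
  for (unsigned int i = 0; i < n; i++) s += v[i];  // with the flags above, gcc should use zmm here
  return s;
}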

In your case, you're mostly going to bottleneck on page faults because you loop over some uninitialized stack memory, only once without warm-up. (Or your loop will be too short to time.) I hope that was just to demo how it auto-vectorized, not a micro-benchmark.
Idiomatic way of performance evaluation?
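If you do want to time something like this, a rough sketch along these lines avoids the page-fault problem (not a careful benchmark; REPS is an arbitrary repeat count I picked, and the empty asm statement is only there to stop the compiler from hoisting the sum out of the repeat loop):

#include <stdio.h>
#define LEN 1024
#define REPS 100000   // arbitrary, just enough work to be measurable

int main()
{
  static float v[LEN] __attribute__ ((aligned(64)));   // static: zero-initialized, not cold stack pages
  for (unsigned int i = 0; i < LEN; i++) v[i] = i;     // touch/initialize the data first (warm-up)

  float total = 0;
  for (unsigned int r = 0; r < REPS; r++) {
    __asm__ volatile("" ::: "memory");                 // pretend v may have changed, so the sum can't be hoisted
    float s = 0;
    for (unsigned int i = 0; i < LEN; i++) s += v[i];
    total += s;                                        // use the result so the work isn't optimized away
  }
  printf("%g\n", total);
  return 0;
}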


Most programs spend significant fractions of their time in code that doesn't auto-vectorize, so lowering max turbo isn't worth it by default. See SIMD instructions lowering CPU frequency.

The frequency downside is small on Ice Lake client (non-server) CPUs, but it does still exist on most of them, so there's at least a short stall while the frequency transitions if the CPU had been running at max turbo. There's also at least a few percent downside in clock speed for the whole program, including non-vectorized code, and for anything else running on the same CPU.

The benefit of 512-bit vectors isn't as big as you'd hope for FP throughput: Ice/Tiger Lake client CPUs only have 1/clock throughput for 512-bit FMA/add/mul (combining the two halves of the normal 256-bit FMA/add/mul units), not having the extra 512-bit FMA unit on port 5 that some Skylake-X and Ice Lake Xeon CPUs have. So for pure FP add/mul/FMA work, peak throughput per clock is the same either way: two 256-bit operations or one 512-bit operation both process 16 floats per cycle.

(Integer SIMD throughput could sometimes benefit more, since most integer instructions do have 2/clock throughput at 512-bit. Not 3/clock like you get with 256-bit vectors; having any 512-bit uop in the pipeline disables the vector ALUs on port 1, not just the FMA unit. So SIMD uop throughput is reduced, which can reduce the speedup for code with good computational intensity that doesn't spend a lot of time loading/storing.)

512-bit vectors are more sensitive to alignment, even for loops that bottleneck on DRAM bandwidth (where 256-bit vectors could easily keep up with available off-core bandwidth). So you can get maybe a 10 to 15% regression vs. 256-bit vectors in a loop over a big unaligned array that's not cache blocked. With 256-bit vectors, misaligned data only costs maybe 1 or 2% vs. aligned when looping over a big array. At least that was true on SKX; I haven't heard if that changed on ICL / ICX.

(Misalignment isn't great when data is hot in L1d cache; every other load being misaligned does hurt cache throughput. But some real-world code isn't well tuned with cache-blocking, or has parts that weren't amenable to it, so performance with cache-miss loads matters, too.)

Glibc's default malloc likes to do big allocations by grabbing some fresh pages from the OS and using the first 16 bytes for bookkeeping info about them, so you always get the worst case for alignment, ptr % 4096 == 16. The required alignment is 64, or 32 if you only use 256-bit vectors.
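If you control the allocation, you can ask for that alignment explicitly instead of taking whatever malloc gives you; a minimal sketch using C11 aligned_alloc (alloc_floats is a hypothetical helper; aligned_alloc wants the size to be a multiple of the alignment, hence the round-up):

#include <stdlib.h>

// allocate n floats with 64-byte alignment (enough for zmm; 32 would do for ymm-only code)
float *alloc_floats(size_t n)
{
  size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63;  // round up to a multiple of 64
  return aligned_alloc(64, bytes);                        // free() the result as usual
}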


See also some specific discussions of compiler tuning defaults, at least for clang where they adopted the same -mprefer-vector-width=256 default for -march=icelake-client as GCC.

  • https://reviews.llvm.org/D111029#3674440 2021 Oct and 2022 Jun - discussion of (not) bumping up the vector width on Ice Lake client or server because the frequency penalty is smaller. Still turned out not to be worth it, 1% regression on SPEC CPU 2017 on Icelake Server, in Intel's testing of clang -mprefer-vector-width=512 vs. the current default 256.

  • https://reviews.llvm.org/D67259 2019 discussion of deciding to follow GCC's lead and limit to 256, for skylake-avx512, icelake-client, and icelake-server, etc. (But not of course KNL which doesn't even have AVX-512VL.)

  • Lots of useful background information, thanks a lot! – Ralf Oct 21 '22 at 14:44
  • Outside the L1D cache, data is transferred in 64B cache lines anyway, so (to first order) 512-bit vectors in AVX-512 only help if almost all of the data is in the L1D cache. In that case 512-bit vectors double the available load/store bandwidth and SIMD execution bandwidth per cycle (slightly less than double once the frequency decrease is taken into account). For data coming from the L2 cache, read bandwidth drops by a factor of 2-3, giving enough extra cycles to perform the same arithmetic using 256-bit vectors. – John D McCalpin Oct 21 '22 at 14:47
1

The newest compilers (gcc 13 / clang 16) do auto-vectorize to 512-bit registers by default for AMD Zen 4 processors.

With -march=znver4, your example compiles to:

vaddps  zmm0, zmm0, ZMMWORD PTR [rax]

Nevertheless, Intel processors (including the newest Sapphire Rapids) still behave the same way as Tiger Lake. The reason is that, unlike Intel, AMD processors don't lower their clock frequency when running these SIMD instructions, so there is no reason to avoid them. The bottleneck Zen CPUs do have, however, is the number of instructions they can decode per cycle, so using AVX-512 can give a performance improvement because it takes fewer µops to do the same amount of work (for the 1024-element loop above, 64 iterations handling 16 floats per zmm register instead of 128 iterations with ymm), as noted by Agner Fog.
