The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.
If you pass a number in AL¹ lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC4.5 or earlier.)
This Q&A shows a current build of glibc printf which just checks for AL!=0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL>8, making the computed jump go somewhere it shouldn't.)
"Why does eax contain the number of vector parameters?" quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.
Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for a zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.
GCC4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, to only actually save as many XMM regs as necessary.
Nate's simple example from comments, on Godbolt with GCC4.5 vs. GCC11, unsurprisingly shows the same difference as the linked answer's disassembly of old/new glibc (built by GCC). This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function so it can use the red-zone (128 bytes below RSP).
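The exact C source of that example isn't shown here, but it presumably looked something like this (my reconstruction; the count parameter n is an assumption — the key property is that only va_arg(v, double) is ever used):

```c
#include <stdarg.h>

/* Reconstruction (a guess at Nate's example): the count parameter `n`
 * is an assumption.  Only va_arg(v, double) is used, never integer
 * types, so GCC doesn't need to dump the incoming RDI...R9 anywhere. */
double add_them(int n, ...)
{
    va_list v;
    va_start(v, n);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += va_arg(v, double);
    va_end(v);
    return total;
}
```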
# GCC4.5.3 -O3 -fPIC to compile like glibc would
add_them:
movzx eax, al
sub rsp, 48 # reserve stack space, needed either way
lea rdx, 0[0+rax*4] # each movaps is 4 bytes long
lea rax, .L2[rip] # code pointer to after the last movaps
lea rsi, -136[rsp] # used later by va_arg. test/jz version does the same, but after the movaps stores
sub rax, rdx
lea rdx, 39[rsp] # used later by va_arg, test/jz version also does an LEA like this
jmp rax # AL=0 case jumps to L2
movaps XMMWORD PTR -15[rdx], xmm7 # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
movaps XMMWORD PTR -31[rdx], xmm6
movaps XMMWORD PTR -47[rdx], xmm5
movaps XMMWORD PTR -63[rdx], xmm4
movaps XMMWORD PTR -79[rdx], xmm3
movaps XMMWORD PTR -95[rdx], xmm2
movaps XMMWORD PTR -111[rdx], xmm1
movaps XMMWORD PTR -127[rdx], xmm0 # xmm0 last, will be ready for store-forwarding last
.L2:
lea rax, 56[rsp] # first stack arg (if any), I think
## rest of the function
vs.
# GCC11.2 -O3 -fPIC
add_them:
sub rsp, 48
test al, al
je .L15 # only one test&branch macro-fused uop
movaps XMMWORD PTR -88[rsp], xmm0 # xmm0 first
movaps XMMWORD PTR -72[rsp], xmm1
movaps XMMWORD PTR -56[rsp], xmm2
movaps XMMWORD PTR -40[rsp], xmm3
movaps XMMWORD PTR -24[rsp], xmm4
movaps XMMWORD PTR -8[rsp], xmm5
movaps XMMWORD PTR 8[rsp], xmm6
movaps XMMWORD PTR 24[rsp], xmm7
.L15:
lea rax, 56[rsp] # first stack arg (if any), I think
lea rsi, -136[rsp] # used by va_arg. done after the movaps stores instead of before.
...
lea rdx, 56[rsp] # used by va_arg. With a different offset than older GCC, but used somewhat similarly. Redundant with the LEA into RAX; silly compiler.
GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case². And not many more even for the AL=1 worst case (7 dead movaps stores but no work done computing a branch target).
Related Q&As:
Semi-related while we're talking about calling-convention violations:
Footnote 1: AL, not RAX, is what matters
The x86-64 System V ABI doc specifies that variadic functions must look only at AL for the number of regs; the high 7 bytes of RAX are allowed to hold garbage. mov eax, 3 is an efficient way to set AL, avoiding possible false dependencies from writing a partial register, although it is larger in machine-code size (5 bytes) than mov al,3 (2 bytes). clang typically uses mov al, 3.
Key points from the ABI doc, see Why does eax contain the number of vector parameters? for more context:
The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit.
(That last point is obsolete: XMM regs are widely used for memcpy/memset and inlined to zero-init small arrays / structs. So much so that Linux uses "eager" FPU save/restore on context switches, not "lazy" where the first use of an XMM reg faults.)
The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.
This ABI guarantee of AL <= 8 is what allows computed-jump implementations to omit bounds-checking. (Similarly, see "Does the C++ standard allow for an uninitialized bool to crash a program?" — yes, ABI violations can be assumed not to happen, e.g. by generating code that would crash in that case.)
Footnote 2: efficiency of the two strategies
Smaller static code-size (I-cache footprint) is always a good thing, and the AL!=0 strategy has that in its favour.
Most importantly, fewer total instructions executed for the AL==0 case. printf isn't the only variadic function; sscanf is not rare, and it never takes FP args (only pointers). If a compiler can see that a function never uses va_arg with an FP argument, it omits saving the XMM regs entirely, making this point moot. But the scanf/printf functions are normally implemented as wrappers for the vfscanf / vfprintf calls, so the compiler doesn't see that: it sees a va_list being passed to another function, so it has to save everything. (I think it's fairly rare for people to write their own variadic functions, so in a lot of programs the only calls to variadic functions will be to library functions.)
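That wrapper pattern looks like this (a minimal sketch using vsnprintf, not glibc's actual internals; my_snprintf is a hypothetical name): the va_list escapes into the callee, so the compiler has to assume doubles might be pulled out of it.

```c
#include <stdarg.h>
#include <stdio.h>

/* Minimal sketch of the wrapper pattern: printf is to vfprintf as this
 * hypothetical my_snprintf is to vsnprintf.  The va_list is handed to
 * another function, so the compiler can't prove FP args are never used
 * and must dump all 8 arg-passing XMM regs when AL is non-zero. */
int my_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int ret = vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return ret;
}
```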
Out-of-order exec can chew through the dead stores just fine for AL<8 but non-zero cases, thanks to wide pipelines and store buffers, getting started on the real work in parallel with those stores happening.
Computing and doing the indirect jump takes 5 total instructions, not counting the lea rsi, -136[rsp] and lea rdx, 39[rsp]. The test/jz strategy also does those or similar, just after the movaps stores, as setup for the va_arg code which has to figure out when it gets to the end of the register-save area and switch to looking at stack args.
I'm also not counting the sub rsp, 48; that's necessary either way, unless you make the XMM-save-area size variable as well, or only save the low half of each XMM reg so 8x 8 B = 64 bytes would fit in the red-zone. In theory variadic functions can take a 16-byte __m128d arg in an XMM reg, so GCC uses movaps instead of movlps. (I'm not sure if glibc printf has any conversions that would take one.) And in non-leaf functions like actual printf, you'd always need to reserve more space instead of using the red-zone. (This is one reason for the lea rdx, 39[rsp] in the computed-jump version: every movaps needs to be exactly 4 bytes, so the compiler's recipe for generating that code has to make sure the offsets are in the [-128,+127] range of a [reg+disp8] addressing mode, and not 0, unless GCC was going to use special asm syntax to force a longer instruction there.)
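As a contrived illustration of that __m128d point (my own sketch, not anything glibc does), GCC and clang really do accept a 16-byte vector through ...:

```c
#include <stdarg.h>
#include <immintrin.h>

/* Contrived sketch: a 16-byte __m128d can legally travel through `...`
 * in an XMM register.  That's why the register save area holds full
 * 16-byte slots and GCC fills it with movaps rather than movlps. */
double low_lane(int n, ...)
{
    va_list ap;
    va_start(ap, n);
    __m128d v = va_arg(ap, __m128d);  /* pulls a full 16-byte vector arg */
    va_end(ap);
    return _mm_cvtsd_f64(v);          /* low element of the vector */
}
```

Calling low_lane(1, _mm_set_pd(2.0, 1.0)) should return the low element, 1.0.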
Almost all x86-64 CPUs run 16-byte stores as a single micro-fused uop (only crusty old AMD K8 and Bobcat split them into 8-byte halves; see https://agner.org/optimize/), and we'd usually be touching stack space below that 128-byte area anyway. (Also, the computed-jump strategy stores to the bottom itself, so it doesn't avoid touching that cache line.)
So for a function with one XMM arg, the computed-jump version takes 6 total single-uop instructions (5 integer ALU/jump, one movaps) to get the XMM arg saved.
The test/jz version takes 9 total uops (10 instructions but test/jz macro-fuse in 64-bit mode on Intel since Nehalem, AMD since Bulldozer IIRC). 1 macro-fused test-and-branch, and 8 movaps stores.
And that's the best case for the computed-jump version: with more xmm args, it still runs 5 instructions to compute the jump target, but has to run more movaps instructions. The test/jz version is always 9 uops. So the break-even point for dynamic uop count (actually executed, vs. sitting there in memory taking up I-cache footprint) is 4 XMM args which is probably rare, but it has other advantages. Especially in the AL == 0 case where it's 5 vs. 1.
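The break-even arithmetic above can be spelled out (using the uop counts from this analysis, assuming Intel-style macro-fusion of test/jz):

```c
/* Dynamic uop counts from the analysis above.
 * Computed jump: movzx + lea + lea + sub + jmp = 5, plus n movaps.
 * test/jz: 1 macro-fused uop, plus all 8 movaps when AL != 0. */
static int computed_jump_uops(int n_xmm) { return 5 + n_xmm; }
static int test_jz_uops(int n_xmm)       { return n_xmm ? 1 + 8 : 1; }
```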
The test/jz branch always goes to the same place for any number of XMM args except zero, making it easier to predict than an indirect branch that's different for printf("%f %f\n", ...) vs "%f\n".
3 of the 5 instructions (not including the jmp) in the computed-jump version form a dependency chain from the incoming AL, making it take that many more cycles before a misprediction can be detected (even though the chain probably started with a mov eax, 1 right before the call). But the "extra" instructions in the dump-everything strategy are just dead stores of some of XMM1..7 that never get reloaded and aren't part of any dependency chain. As long as the store buffer and ROB/RS can absorb them, out-of-order exec can work on them at its leisure.
(To be fair, they will tie up the store-data and store-address execution units for a while, meaning that later stores won't be ready for store-forwarding as soon either. And on CPUs where store-address uops run on the same execution units as loads, later loads can be delayed by those store uops hogging those execution units. Fortunately, modern CPUs have at least 2 load execution units, and Intel from Haswell to Skylake can run store-address uops on any of 3 ports, with simple addressing modes like this. Ice Lake has 2 load / 2 store ports with no overlap.)
The computed-jump version saves XMM0 last, which is likely to be the first arg reloaded. (Most variadic functions go through their args in order.) If there are multiple XMM args, the computed-jump way won't be ready to store-forward from that store until a couple cycles later. But for cases with AL=1 that's the only XMM store, and no other work tying up load/store-address execution units, and small numbers of args are probably more common.
Most of these reasons are really minor compared to the advantage of smaller code footprint, and fewer instructions executed for the AL==0 case. It's just fun (for some of us) to think through the up/down sides of the modern simple way, to show that even in its worst case, it's not a problem.