I am following the book "Beginning x64 Assembly Programming" on a 64-bit Linux system, using NASM and gcc.
In the chapter about floating-point operations, the book gives the code below for adding two floating-point numbers. In the book, and in other online sources, I have read that register RAX specifies the number of XMM registers to be used, according to the calling convention.
The code in the book goes as follows:

extern printf
section .data
num1        dq  9.0
num2        dq  73.0
fmt     db  "The numbers are %f and %f",10,0
f_sum       db  "%f + %f = %f",10,0

section .text
global main
main:
    push rbp
    mov rbp, rsp
printn:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov rax, 2      ; for printf, rax specifies the number of XMM registers used
    call printf

sum:
    movsd xmm2, [num1]
    addsd xmm2, [num2]
printsum:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, f_sum
    mov rax, 3
    call printf

    xor eax, eax    ; added: return 0 (the snippet as posted falls off the end of main)
    pop rbp
    ret

That works as expected.
Then, before the last printf call, I tried changing

mov rax, 3

to

mov rax, 1

Then I reassembled and ran the program.

I was expecting some different nonsense output, but to my surprise the output was exactly the same. printf printed the 3 float values correctly:

The numbers are 9.000000 and 73.000000
9.000000 + 73.000000 = 82.000000

I suppose there is some kind of override when printf is expecting the use of several XMM registers: as long as RAX is not 0, it will use consecutive XMM registers. I have searched for an explanation in the calling convention and the NASM manual, but didn't find one.

Why does this work?

  • The called function may use the value in `rax` (actually `al`) but it is not required to. The caller, however, should set it properly. Some versions of `printf` only care about zero vs. non-zero. In any case it has no effect on the location of the arguments. – Jester Apr 23 '22 at 18:34
  • See for instance https://godbolt.org/z/f869oen9W , which spills no xmm registers if `al` is zero, and all of them if it's nonzero. Probably the code to spill only the right number would be more trouble than it's worth. An unnecessary store to memory doesn't cost much, especially to the stack which is likely already in L1 or L2 cache. But hopefully it is obvious that you cannot rely on the called function doing that. – Nate Eldredge Apr 23 '22 at 18:41
  • Thank you for your answers. I am just starting and got annoyed by this. – pesotsan Apr 23 '22 at 19:05
  • Just because it seems to work now doesn't mean that it will always work. – Solomon Ucko Apr 24 '22 at 01:13

1 Answer


The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.

If you pass a number in AL (see footnote 1 below) lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC4.5 or earlier.)
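
For contrast, conforming call sites look like the following. This is a NASM sketch in the style of the question's code (reusing its num1/num2/fmt labels); the exact count is the normal choice, and any over-estimate up to 8 is also legal:

    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov eax, 2          ; exact number of XMM registers holding args
    call printf

    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov eax, 8          ; legal but wasteful: AL only has to be an upper bound, <= 8
    call printf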

This Q&A shows a current build of glibc printf which just checks for AL!=0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL>8, making the computed jump go somewhere it shouldn't.)

Why does eax contain the number of vector parameters? quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.


Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.

GCC4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, saving only as many XMM regs as necessary.

Nate's simple example from comments on Godbolt with GCC4.5 vs. GCC11 unsurprisingly shows the same difference as the disassembly of old/new glibc (built by GCC) in the linked answer. This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function, so it can use the red-zone (128 bytes below RSP).

# GCC4.5.3 -O3 -fPIC    to compile like glibc would
add_them:
        movzx   eax, al
        sub     rsp, 48                  # reserve stack space, needed either way
        lea     rdx, 0[0+rax*4]          # each movaps is 4 bytes long
        lea     rax, .L2[rip]            # code pointer to after the last movaps
        lea     rsi, -136[rsp]             # used later by va_arg.  test/jz version does the same, but after the movaps stores
        sub     rax, rdx
        lea     rdx, 39[rsp]               # used later by va_arg, test/jz version also does an LEA like this
        jmp     rax                      # AL=0 case jumps to L2
        movaps  XMMWORD PTR -15[rdx], xmm7     # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
        movaps  XMMWORD PTR -31[rdx], xmm6
        movaps  XMMWORD PTR -47[rdx], xmm5
        movaps  XMMWORD PTR -63[rdx], xmm4
        movaps  XMMWORD PTR -79[rdx], xmm3
        movaps  XMMWORD PTR -95[rdx], xmm2
        movaps  XMMWORD PTR -111[rdx], xmm1
        movaps  XMMWORD PTR -127[rdx], xmm0   # xmm0 last, will be ready for store-forwarding last
.L2:
        lea     rax, 56[rsp]       # first stack arg (if any), I think
     ## rest of the function

vs.

# GCC11.2 -O3 -fPIC
add_them:
        sub     rsp, 48
        test    al, al
        je      .L15                          # only one test&branch macro-fused uop
        movaps  XMMWORD PTR -88[rsp], xmm0    # xmm0 first
        movaps  XMMWORD PTR -72[rsp], xmm1
        movaps  XMMWORD PTR -56[rsp], xmm2
        movaps  XMMWORD PTR -40[rsp], xmm3
        movaps  XMMWORD PTR -24[rsp], xmm4
        movaps  XMMWORD PTR -8[rsp], xmm5
        movaps  XMMWORD PTR 8[rsp], xmm6
        movaps  XMMWORD PTR 24[rsp], xmm7
.L15:
        lea     rax, 56[rsp]        # first stack arg (if any), I think
        lea     rsi, -136[rsp]      # used by va_arg.  done after the movaps stores instead of before.
...
        lea     rdx, 56[rsp]        # used by va_arg.  With a different offset than older GCC, but used somewhat similarly.  Redundant with the LEA into RAX; silly compiler.

GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case (see footnote 2). And not many more even for the AL=1 worst case (7 dead movaps stores but no work done computing a branch target).



Footnote 1: AL, not RAX, is what matters

The x86-64 System V ABI doc specifies that variadic functions must look only at AL for the number of regs; the high 7 bytes of RAX are allowed to hold garbage. mov eax, 3 is an efficient way to set AL, avoiding possible false dependencies from writing a partial register, although it is larger in machine-code size (5 bytes) than mov al,3 (2 bytes). clang typically uses mov al, 3.
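
For reference, here's how the alternatives assemble (byte counts from a NASM listing; this side-by-side is an illustration, not from the ABI doc):

    mov eax, 3      ; b8 03 00 00 00 (5 bytes): writes the full register, no false dependency
    mov al, 3       ; b0 03          (2 bytes): smaller, but a partial-register write
    xor eax, eax    ; 31 c0          (2 bytes): the standard way to set AL=0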

Key points from the ABI doc, see Why does eax contain the number of vector parameters? for more context:

The prologue should use %al to avoid unnecessarily saving XMM registers. This is especially important for integer only programs to prevent the initialization of the XMM unit.

(That last point is obsolete: XMM regs are widely used for memcpy/memset and inlined to zero-init small arrays / structs. So much so that Linux uses "eager" FPU save/restore on context switches, not "lazy" where the first use of an XMM reg faults.)

The contents of %al do not need to match exactly the number of registers, but must be an upper bound on the number of vector registers used and is in the range 0–8 inclusive.

This ABI guarantee of AL <= 8 is what allows computed-jump implementations to omit bounds-checking. (Similarly, Does the C++ standard allow for an uninitialized bool to crash a program? yes, ABI violations can be assumed not to happen, e.g. by making code that would crash in that case.)


Footnote 2: efficiency of the two strategies

Smaller static code-size (I-cache footprint) is always a good thing, and the AL!=0 strategy has that in its favour.

Most importantly, fewer total instructions executed for the AL==0 case. printf isn't the only variadic function; sscanf is not rare, and it never takes FP args (only pointers). If a compiler can see that a function never uses va_arg with an FP argument, it omits saving entirely, making this point moot; but the scanf/printf functions are normally implemented as wrappers for the vfscanf / vfprintf calls, so the compiler doesn't see that. It sees a va_list being passed to another function, so it has to save everything. (I think it's fairly rare for people to write their own variadic functions, so in a lot of programs the only calls to variadic functions will be to library functions.)
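
For instance, a conforming sscanf call site sets AL=0 because all of its args are pointers. A minimal NASM sketch (the labels line/fmt_int/a/b are hypothetical):

    lea rdi, [rel line]     ; buffer to parse
    lea rsi, [rel fmt_int]  ; e.g. "%d %d": pointers and integers only, no FP
    lea rdx, [rel a]        ; destinations for the two conversions
    lea rcx, [rel b]
    xor eax, eax            ; AL=0: no XMM args, so the callee can skip all 8 XMM stores
    call sscanf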

Out-of-order exec can chew through the dead stores just fine for AL<8 but non-zero cases, thanks to wide pipelines and store buffers, getting started on the real work in parallel with those stores happening.

Computing and doing the indirect jump takes 5 total instructions, not counting the lea rsi, -136[rsp] and lea rdx, 39[rsp]. The test/jz strategy also does those or similar, just after the movaps stores, as setup for the va_arg code which has to figure out when it gets to the end of the register-save area and switch to looking at stack args.

I'm not counting the sub rsp, 48 either; that's necessary either way unless you make the XMM-save-area size variable as well, or only save the low half of each XMM reg so 8x 8 B = 64 bytes would fit in the red-zone. In theory variadic functions can take a 16-byte __m128d arg in an XMM reg, so GCC uses movaps instead of movlps. (I'm not sure if glibc printf has any conversions that would take one.) And in non-leaf functions like actual printf, you'd always need to reserve more space instead of using the red-zone. (This is one reason for the lea rdx, 39[rsp] in the computed-jump version: every movaps needs to be exactly 4 bytes, so the compiler's recipe for generating that code has to make sure their offsets are in the [-128,+127] range of a [reg+disp8] addressing mode, and not 0 unless GCC was going to use special asm syntax to force a longer instruction there.)

Almost all x86-64 CPUs run 16-byte stores as a single micro-fused uop (only crusty old AMD K8 and Bobcat split them into 8-byte halves; see https://agner.org/optimize/), and we'd usually be touching stack space below that 128-byte area anyway. (Also, the computed-jump strategy stores to the bottom itself, so it doesn't avoid touching that cache line.)

So for a function with one XMM arg, the computed-jump version takes 6 total single-uop instructions (5 integer ALU/jump, one movaps) to get the XMM arg saved.

The test/jz version takes 9 total uops (10 instructions but test/jz macro-fuse in 64-bit mode on Intel since Nehalem, AMD since Bulldozer IIRC). 1 macro-fused test-and-branch, and 8 movaps stores.

And that's the best case for the computed-jump version: with more xmm args, it still runs 5 instructions to compute the jump target, but has to run more movaps instructions. The test/jz version is always 9 uops. So the break-even point for dynamic uop count (actually executed, vs. sitting there in memory taking up I-cache footprint) is 4 XMM args which is probably rare, but it has other advantages. Especially in the AL == 0 case where it's 5 vs. 1.
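
Tallying up the counts just described (my summary, using 5 integer uops for the computed jump and 1 macro-fused uop for test/jz):

    AL      computed jump      test/jz (all-or-nothing)
    0        5 uops             1 uop
    1        6                  9
    4        9  (break-even)    9
    8       13                  9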

The test/jz branch always goes to the same place for any number of XMM args except zero, making it easier to predict than an indirect branch that's different for printf("%f %f\n", ...) vs "%f\n".

3 of the 5 instructions (not including the jmp) in the computed-jump version form a dependency chain from the incoming AL, making it take that many more cycles before a misprediction can be detected (even though the chain probably started with a mov eax, 1 right before the call). But the "extra" instructions in the dump-everything strategy are just dead stores of some of XMM1..7 that never get reloaded and aren't part of any dependency chain. As long as the store buffer and ROB/RS can absorb them, out-of-order exec can work on them at its leisure.

(To be fair, they will tie up the store-data and store-address execution units for a while, meaning that later stores won't be ready for store-forwarding as soon either. And on CPUs where store-address uops run on the same execution units as loads, later loads can be delayed by those store uops hogging those execution units. Fortunately, modern CPUs have at least 2 load execution units, and Intel from Haswell to Skylake can run store-address uops on any of 3 ports, with simple addressing modes like this. Ice Lake has 2 load / 2 store ports with no overlap.)

The computed-jump version saves XMM0 last, yet XMM0 is likely to be the first arg reloaded. (Most variadic functions go through their args in order.) If there are multiple XMM args, the computed-jump way won't be ready to store-forward from that store until a couple of cycles later. But for cases with AL=1 that's the only XMM store, with no other work tying up load/store-address execution units, and small numbers of args are probably more common.

Most of these reasons are really minor compared to the advantage of smaller code footprint, and fewer instructions executed for the AL==0 case. It's just fun (for some of us) to think through the up/down sides of the modern simple way, to show that even in its worst case, it's not a problem.

Peter Cordes
  • Very nice answer and very nice addition to the signature block. You would think they would come to you for design help as well `:)` – David C. Rankin Apr 24 '22 at 04:17
  • "so GCC uses `movaps` instead of `movlps`" Did you mean `movups` here? – ecm Apr 24 '22 at 19:02
  • @ecm: No, I was saying that if GCC code-gen only needed to support 4-byte or 8-byte args in XMM registers (not `__m128` or whatever else might use a full XMM), it would only need 8-byte save slots. And dumping to those is best done with `movlps`, which is the same as `movsd` for storing, but has shorter machine code by 1 byte. – Peter Cordes Apr 24 '22 at 22:20
  • @PeterCordes re: "test/jz is easier to predict than an indirect jump" in this case wouldn't the two be equivalent? Indirect jumps IIRC [predict the previous target dynamically and statically the next instruction](https://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf#143) so couldn't you get same behavior by just having the zero case be the next instruction after the branch? Statically just as good and dynamically should essentially predict the same as zero-non-zero (last target=zero, or other=non-zero). – Noah Apr 24 '22 at 23:30
  • @Noah: Note that AL=2 needs a different prediction than AL=1, unlike with the all-or-nothing strategy, so it's a more complex pattern to predict if it's not always integer. But also, I was thinking indirect jumps might have a slightly higher miss penalty on mispredict, which is different from how I said it but still relevant. Also, it's a *computed* jump, not a jump table, so unless you did even more integer work to special-case zero to put the block of save instructions out-of-line somewhere, you're not going to get the AL=0 case to be a fall-through. (Or fall into a `jmp rel8`...) – Peter Cordes Apr 25 '22 at 00:00
  • @PeterCordes my point is that the way indirect branch prediction works now is essentially all or nothing on one target. It predicts fall through or previous path (so we can say zero/non-zero). In the non-zero case there will ofc be misses based on the value of `AL` but the `AL=0` path should be just about as optimizable. Think it wouldn't be so hard to implement zero as fall through (in asm at least, not arguing GCC would get this optimally). – Noah Apr 25 '22 at 13:33
  • @Noah: I think this asm is basically a hand-written recipe that GCC regurgitates when `va_start` uses a `__builtin` function. (That might explain failure to CSE `lea rax, 56[rsp]` with the same LEA into `rdx`.) If you have an idea that you think would actually be worth re-introducing extra code-size (and extra instructions that run even for the AL=0 case), GCC could presumably be taught to use that strategy. Or just as a thought exercise. Are you talking about putting the block of movaps instructions out-of-line so the common case is a "fall through" *not* into that? Like an extra cmov? – Peter Cordes Apr 25 '22 at 13:42
  • @PeterCordes yeah as in `case 0: start_printf: /* printf code here. */; break; case N: /* save N fp reg. */; goto start_printf; unreachable();`. Where did you see that indirect mispredicts are more expensive though? If that cost vs cost of direct branch miss is high enough then its clearly not worth doing. – Noah Apr 25 '22 at 14:26
  • @Noah: That was guesswork / intuition on my part. Maybe with a shade of influence from this specific case where the indirect branch target has a longer dep chain before it can be checked than the conditional branch, because AL needs to be zero-extended, scaled, and added to an anchor address. So that's an extra 3 or 4 cycles before the prediction can be checked and recovery can start. And those extra instructions have to run even in the AL=0 case. – Peter Cordes Apr 25 '22 at 14:45
  • @Noah: Big picture: are variadic function calls with FP args something we really want to be optimizing for in 2022, at the expense of integer-only and overall code-size? If anything, putting the block of movaps instructions out-of-line for the all-or-nothing case would make sense, so the AL=0 fast path becomes a not-taken `test al,al/jnz`. Other than printf / sprintf, I'd guess most variadic functions are integer. – Peter Cordes Apr 25 '22 at 14:46
  • @PeterCordes Thats fair and I tend to agree. Also you will be in an unpleasant spot where you will either 1) need to waste a bunch of code size to leave printf internal inlined, 2) outline printf internals, 3) use indirect jump table, or 4) give up `AL=0` as next instruction. Any of which alone would be a deal-breaker. With the extra cycles for computation can't imagine its worth it. – Noah Apr 25 '22 at 14:52
  • @PeterCordes question, what would you do about [this ssse3 memmove](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memmove-ssse3.S;h=310ff62b86de33cc26d4f1263594c8ab6119630c;hb=HEAD#l168). Recently added it with jump table. We need the jump table for `palignr` because this is for x86_64 cpus without fast unaligned loads / stores. Have some profiles, for calls that hit loop `82.5%` are naturally aligned. You think a conditional branch would be worth it there? (Also realize now I ought to move the indirect jump so next instruction is beginning of `start_loop`. – Noah Apr 25 '22 at 15:02
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244191/discussion-between-peter-cordes-and-noah). – Peter Cordes Apr 25 '22 at 15:04