
I am familiar with two basic strategies for structuring the function prologue/epilogue:

  1. "Regular" functions: Move the stack pointer to the end of the stack frame (sub rsp, n), do the actual work, then move the stack pointer back (add rsp, n) and ret. (If there are many registers used in the body, there may additionally be some pushing and popping here.)
  2. "Leaf" functions: Same as (1) but don't move the stack pointer, saving two instructions.

With strategy 2, you can't call functions inside the body unless you first move the stack pointer to where it is supposed to be, which defeats the savings; that is why it's usually used only for leaf functions.

But it occurs to me that there is a third strategy one could use:

  3. "Stackless" functions: Use mov rsi, AFTER; jmp FUNCTION; AFTER: for the call sequence, and in the function just jmp rsi at the end.

In this method, we ignore the stack pointer completely, so we have no stack space, but for a small function that might be doable. It also requires a custom calling convention, but compilers can do that if they want to for internal functions.
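For illustration only (this sketch is mine, not part of the question): the idea can be played with in GNU C using the labels-as-values extension, supported by gcc and clang, where the "return address" lives in an ordinary variable rather than on the machine stack. The function name and shape are hypothetical.

```c
/* Sketch of the "stackless call" idea using GNU C's labels-as-values
   extension. The pointer variable plays the role of rsi in the proposal;
   a real compiler would keep it in a register. */
int stackless_inc(int x) {
    void *ret;            /* holds the "return address" */

    ret = &&after;        /* mov rsi, AFTER */
    goto body;            /* jmp FUNCTION   */
after:
    return x;             /* execution resumes here after the "call" */

body:                     /* the "stackless function": no stack frame used */
    x += 1;
    goto *ret;            /* jmp rsi -- return via indirect jump */
}
```

Calling stackless_inc(41) jumps to body, increments x, and jumps back through ret, returning 42.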

Since it pairs a jmp with a jmp, it doesn't touch the return-address stack, so the branch predictor should not be thrown off (although the indirect jump at the end might be slower than a return), and there is no overhead from the memory writes incurred by call. Additionally, stackless functions can call other stackless functions (although not too deeply, since you eventually run out of registers in which to store the return addresses, and there is a global optimization problem in ensuring that if A calls B then they use different return registers).

My question is: why don't compilers use method (3) more? AFAICT it doesn't ever show up when looking at functions compiled by gcc or clang. Is there a reason this calling convention is not usable?

Peter Cordes
Mario Carneiro
  • (3) is only possible if the compiler can see the code for both functions at once, to compile them both with the same convention, and if it can do that then it can probably just inline it. – Nate Eldredge Dec 26 '20 at 23:02
  • Some other architectures have this built into their calling conventions, however; e.g. on ARM the return address is *always* put in a (fixed) "link register" instead of on the stack, and if the callee is not a leaf, then it has to take the responsibility to push and pop it. – Nate Eldredge Dec 26 '20 at 23:05
  • @NateEldredge Yes, (3) will mostly only apply to functions that can be inlined. However, (3) is a proper function, not an assembly macro, in the sense that the code only has to exist in one place, and there may be a code size advantage to not inlining it if it is frequently used and large enough that the usual heuristics suggest that it shouldn't be inlined. – Mario Carneiro Dec 26 '20 at 23:06
  • Actually, the compiler doesn't need visibility into both functions, as long as it communicates the calling convention. But this probably isn't very easy for functions from C, for instance, and the opportunities for global optimization are much reduced in this case, so it probably won't be making nested stackless function calls much in this mode. – Mario Carneiro Dec 26 '20 at 23:10
  • I'd also be skeptical as to how well the branch predictor will handle this. `call/ret` is well optimized and the CPU can keep track of where the return will go, especially for a leaf function. A branch to a register is harder to predict, and AIUI, the predictor usually does something like guess that it will branch to the same destination as the last time, which is going to fail badly when the function is called from many places. It'd be interesting to benchmark it. – Nate Eldredge Dec 26 '20 at 23:17
  • @MarioCarneiro Note that with (3), you eventually run out of registers to save return addresses, so a stack or similar will be needed to keep track of them. – fuz Dec 26 '20 at 23:21
  • @fuz Yes, I mention this in the Q. Most likely these stackless functions will be "almost leaf" functions, since they can't call regular functions at all and can only call other stackless functions with increasingly strict calling conventions as more registers get used up on the calls. Since a stackless function doesn't know where the stack pointer is, it can't spill even if it wanted to. It's probably more effective on arches with lots of GPRs like ARM. – Mario Carneiro Dec 26 '20 at 23:50
  • @NateEldredge I agree that the biggest question mark is the performance of the indirect jump in the return sequence. The reason I think it will not be as bad as a normal indirect jump is because there is no memory traffic at all in a stackless function, so the value that is being read out at the end is already in the register and would be handled by register renaming; and the actual instructions should still be in the cache assuming the stackless function is not massive. – Mario Carneiro Dec 26 '20 at 23:56
  • Another problem is register assignment if one stackless function calls another. Not only must they use separate return address registers, their other register usages cannot conflict either. This would be practical only for small functions with few callers, at which point you may as well just inline them. – Raymond Chen Dec 27 '20 at 01:23

2 Answers

4

Here's an attempt to benchmark both options.

    .text
    .align 8

subroutine:
    inc %rdx                # trivial body: just count the call
#ifdef REG_CALL
    jmp *%rsi               # "return" via indirect jump to the address in %rsi
#else
    ret
#endif

    reps = 1000000
    .global main

main:
    push %rbp               # keeps %rsp 16-byte aligned for the printf call
    mov $reps, %ecx
    xor %edx, %edx          # %rdx counts completed calls
    .align 8
top:
    .rept 1000              # 1000 distinct call sites per outer iteration
#ifdef REG_CALL
    lea 0f(%rip), %rsi      # pass the return address in %rsi ...
    jmp subroutine          # ... and plain-jump instead of call
0:
#else
    call subroutine
#endif
    .endr
    dec %ecx
    jnz top

    lea format(%rip), %rdi
    mov %rdx, %rsi
    xor %eax, %eax          # variadic call: no vector register arguments
    call printf
    xor %eax, %eax          # return 0
    pop %rbp
    ret

    .data
format: .asciz "%ld calls done\n"

It does 1000 calls to the subroutine from varying return addresses, repeated one million times. Assemble with no options for traditional call/ret, with -DREG_CALL for your indirect jump proposal.
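Concretely, the build might look like this (the file name bench.S is my assumption; the capital .S extension makes gcc run the C preprocessor over the file so that -DREG_CALL and the #ifdefs take effect, and -no-pie sidesteps relocation complaints about the direct call to printf on some toolchains):

```shell
gcc -no-pie bench.S -o bench-call              # traditional call/ret version
gcc -no-pie -DREG_CALL bench.S -o bench-jmp    # indirect-jump version
time ./bench-call
time ./bench-jmp
```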

On an i7-8565U CPU @ 1.80GHz, traditional takes 1.6 seconds and REG_CALL takes about 3.2. So your proposal seems to be about twice as slow.

As I mentioned in comments, I suspect the indirect branch predictor can't keep track of where the jmp *%rsi is going to go.

Aside from runtime inefficiencies, Raymond Chen mentions another major disadvantage of this strategy in the comments:

Another problem is register assignment if one stackless function calls another. Not only must they use separate return address registers, their other register usages cannot conflict either. This would be practical only for small functions with few callers, at which point you may as well just inline them.

Cody Gray - on strike
Nate Eldredge
  • Such a tight loop and tiny function is somewhat unrealistic (if you have visibility you'd of course just inline), and totally front-end bound. And of course this doesn't try to measure pollution of indirect branch-prediction on overall prediction rate. (Only one call site, and only the loop branch). – Peter Cordes Dec 29 '20 at 09:49
  • I'd expect it to still be slower in more realistic cases with larger functions from wasting more uops, but not 2x. (The top-of-stack cache line tends to stay hot in cache, and execution of the called function doesn't have to wait for the ret addr store to commit to cache, so that's not an advantage.) `call rel32` is more compact than lea/jmp, although `call rel32` and `ret` near are 2-uop instructions on Intel while `jmp rel8` or `jmp reg` are 1 uop. – Peter Cordes Dec 29 '20 at 09:51
  • @PeterCordes: Yes, of course. I was trying to measure only the expense of the call/return, since that's all we propose to change. Do note, however, that there are in fact 1000 call sites (I was trying to badly pollute the indirect branch predictor cache), and the loop is not all that tight. – Nate Eldredge Dec 29 '20 at 16:58
  • Oh, I missed the `.rept 1000`. Yeah that's more interesting. But note that cost isn't one-dimensional, and back-to-back jumps and calls with no straight-line blocks of code more than a couple uops is an unusual case for the front-end. – Peter Cordes Dec 29 '20 at 23:20
-2

Conventions are there for a reason: they are designed so that if each function conforms, you can call one or many functions from another and, in some languages, perform recursion.

You are perhaps assuming that compiler optimization is perfect, or that compilers themselves are perfect. They are not, and are not assumed to be. It is quite common to find missed optimizations in most normal-sized projects. (gcc, for example, has been getting worse, not better, with each new major revision.)

Reducing the number of function calls yourself may (or may not) improve the compiled output rather than hoping the compiler can guess what you wanted it to do.

A leaf function is, by definition, the end of a branch of the call tree; if it calls another function, it is not a leaf function.

You certainly do not want to break the convention, as then that code cannot be used or linked; it must conform or the whole system breaks down. What you instead want is perhaps to get more inlining to happen by controlling the optimization domain. There is no valid reason for a successful compiler to intentionally produce broken code, which is why compilers do not create unconventional functions that cannot be linked or used.

At the end of the day if this is a concern then take the compiler output and hand tune it to remove the instructions or overhead you do not like.

so

Breaking the convention means the whole tool is compromised and will not make usable programs; that is not an option. Your only two choices are to get the code in question into the same optimization domain, or to hand-tune the compiler output yourself.

and

there are at least two major toolchains that are open source; you are free to modify them, or create your own, to demonstrate this multi-convention idea, with some file format that communicates the convention to the linker, plus some other way to communicate the convention to nested functions, in a somewhat cart-before-the-horse manner. Or perhaps the compiler generates output for each function times the combinations of all the function calls within it: with two conventions, one function calling two other functions is 2 times 2 times 2, i.e. 8 generated implementations of that function in the object file. If a function calls 15 other functions, and those functions call other functions, and so on, you can see one of the problems here; god forbid you call a function in a loop. It is technically possible if you have the disk/memory space. In the long run it adds very little value but a huge amount of risk, when other, current solutions meet somewhere in the middle with respect to value/risk.

old_timer
  • "You are perhaps assuming that compiler optimization is perfect and/or compilers are perfect or something." Not at all. I'm actually implementing a compiler, and I thought I would try this compilation strategy out if there wasn't a glaring flaw in it. As far as assuming that gcc/clang are "perfect", no I don't think they are perfect, but it certainly seems like they have made an attempt to try just about everything that can be tried within the constraints of the architecture, even if their heuristics for which to use when aren't perfect. – Mario Carneiro Dec 27 '20 at 08:37
  • "Conventions are there for a reason and designed so that if each function conforms, then you can call one or many functions from another and for some languages perform recursion." I disagree with this interpretation of the term "calling convention". It's a convention between the caller and the callee, no one else. A function with a specialized calling convention can be called as long as you conform to that convention, and as long as a single compiler is doing the whole build it's not difficult to pass that convention around. ... – Mario Carneiro Dec 27 '20 at 08:39
  • ... At *external* function call boundaries, you need to adhere to whatever the ABI is, but in internal code the compiler is free to choose whatever convention it likes, because it's sitting on both sides of the convention. – Mario Carneiro Dec 27 '20 at 08:40
  • "Breaking the convention means the whole tool is compromised and it will not make usable programs, that is not an option." This is ridiculously overstated. Many programming languages have custom calling conventions and still support separate compilation; C++, Rust, Go, Haskell all have their own language-specific calling conventions. – Mario Carneiro Dec 27 '20 at 08:42
  • In any case, I'm not targeting separate compilation here, I think that this is dead weight in the C/C++ world. I assume visibility into everything, so global optimization problems are hard but not impossible to solve. If you do separate compilation, you can't use stackless functions across modules. Big deal; just use a regular function in those cases. – Mario Carneiro Dec 27 '20 at 08:45