Is there any difference between pushing registers before stack frame creation or after?

Question

Suppose I have a function called func:

PROC func:
    ;Bla bla
    ret
ENDP func

Now, suppose that I use registers ax and bx, for example, so to save their initial values I push them onto the stack inside the function.

Now to the question: is there any significant difference between pushing the registers before creating the stack frame:

PROC func:
    push bp
    push ax
    push bx
    mov bp, sp
    ;Bla bla
    ret
ENDP func

Or after?

PROC func:
    push bp
    mov bp, sp
    push ax
    push bx
    ;Bla bla
    ret
ENDP func

And what should I use in my programs? Is one method better or "more correct" than the other? I currently use the first method.

No, you'd do `push bp` and then `mov bp, sp`. The advantage is that if you do that at the **beginning**, then in 16-bit code the first parameter will always be at `[bp+4]`, the second parameter at `[bp+6]`, etc., even if you do many more pushes on the stack after doing `mov bp, sp`. That can make the code easier for humans to maintain. From a high-level compiler's perspective it doesn't make much difference. Advantage: maintainability. — Michael Petch, Dec 01 '19 at 20:42
So my approach was indeed the better? :D Glad to hear, A friend of mine is trying to convince me that the second method is the correct one, and now I can prove him wrong :) Thank you sir — Kidsm, Dec 01 '19 at 20:48
No, from a readability and maintainability standpoint for human-written assembly, your friend's approach would be better IMHO. In fact, a lot of code generated by compilers and humans follows the pattern of pushing bp first, then doing mov bp, sp, and then doing the reverse just before the ret. Developers generally find it easier to read parameters at positive offsets from BP and local variables at negative offsets. It is the more common convention. — Michael Petch, Dec 01 '19 at 20:50
I think your confusion is that the case of `push bp` `mov bp, sp` is done before all the other pushes. So it would look like `push bp` `mov bp, sp` `push ax` `push bx` etc. Most people I know find that preferable to `push ax` `push bx` `push bp` `mov bp, sp` — Michael Petch, Dec 01 '19 at 20:53
My approach was pushing before mov bp, sp XD. He suggested doing the mov before the pushes. I used the first approach because I find it easier to work with positive offsets for the registers and negative ones for the local variables, just like you said :D — Kidsm, Dec 01 '19 at 20:54
@Kidsm: To simplify your asm, you can write functions that simply *don't* save all the registers they use. In 32-bit calling conventions, functions are allowed to destroy EAX, ECX and EDX without saving/restoring them. Having a few [call-clobbered aka volatile](https://stackoverflow.com/questions/9268586/what-are-callee-and-caller-saved-registers) scratch registers means less push/pop for simple functions. — Peter Cordes, Dec 01 '19 at 20:54
No, what I am saying is that you PUSH **ONLY** BP followed by `mov bp, sp` and then you do all the other pushes that you need to save. And before you ever modify `bp` you need to save it so it can be restored to its original value. Your second code snippet is wrong because it doesn't even push bp. — Michael Petch, Dec 01 '19 at 20:55
You mean for the push bp? If you mean what is the reason for `push bp` and `mov bp, sp` it is to create a stack frame. — Michael Petch, Dec 01 '19 at 20:58
@Kidsm: Remember that `bp` is non-volatile; you have to preserve your caller's value in that register. (It would be pretty inconvenient if you had to save BP in another register across function calls your own code makes). It would be useless to `push bp` *after* you destroy the caller's value with `mov bp,sp`. Saving/restoring AX but not BP is insane. — Peter Cordes, Dec 01 '19 at 20:58
@MichaelPetch for pushing bp, then creating the stack frame and then pushing the registers — Kidsm, Dec 01 '19 at 20:59
My last comment answers that question you were asking of Michael :P — Peter Cordes, Dec 01 '19 at 20:59
I edited the code in the question, wrote the functions incorrectly — Kidsm, Dec 01 '19 at 21:00
@PeterCordes Look at the edits I've made to the question; I accidentally forgot to write push bp in the second function :/ — Kidsm, Dec 01 '19 at 21:05
What assembler are you using (MASM/TASM/JWASM?) and what version. Or are you using EMU8086? or something else? — Michael Petch, Dec 02 '19 at 00:40

Peter Cordes · Accepted Answer · 2019-12-02T00:48:05.783

The second way, push bp ; mov bp, sp before pushing any more registers, means your first stack arg is always at [bp+4] regardless of how many more pushes you do¹. This doesn't matter if you passed all the args in registers instead of on the stack, which is easier and more efficient most of the time if you only have a couple.

This is good for maintainability by humans; you can change how many registers you save/restore without changing how you access args. But you do still have to avoid the space right below BP; saving more regs means you might put the highest local var at [bp-6] instead of [bp-4].
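As a minimal sketch of that point, here is a complete near function using the second ordering. It assumes NASM-style syntax and a caller that pushes the arguments right to left and cleans up afterwards; the name add2 and its two word arguments are hypothetical, chosen only for illustration.

```asm
; Hypothetical near function: returns arg1 + arg2 in AX.
; Assumed caller: push arg2 / push arg1 / call add2 / add sp, 4
add2:
    push  bp
    mov   bp, sp          ; [bp] = saved BP, [bp+2] = return address
    push  bx              ; extra saves land *below* BP,
    push  si              ; so the arg offsets don't move

    mov   bx, [bp+4]      ; arg1 (first stack arg)
    mov   si, [bp+6]      ; arg2
    lea   ax, [bx+si]     ; AX = arg1 + arg2

    pop   si              ; restore in reverse order
    pop   bx
    pop   bp
    ret
```

With the first ordering from the question (push bp / push bx / push si / mov bp, sp), the same two arguments would sit at [bp+8] and [bp+10], and every push added or removed before the mov would shift them by 2.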

Footnote: A "far proc" has a 32-bit CS:IP return address so args start at [bp+6] in that case. See @MichaelPetch's comments about letting tools like MASM sort this out for you with symbolic names for args and local vars.
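For illustration, a sketch of what symbolic names can look like, assuming MASM 6 or later (older MASM versions and EMU8086 may not accept all of these directives). The procedure name sum2, the parameters a and b, and the local tmp are hypothetical.

```asm
; MASM 6+ syntax; small model and C language type are assumptions.
.MODEL small, C
.CODE

sum2 PROC NEAR USES bx, a:WORD, b:WORD
    LOCAL tmp:WORD        ; assembler reserves stack space below BP
    mov   bx, a           ; assembled as a [bp+4] access in a near proc
    add   bx, b           ; [bp+6]
    mov   tmp, bx         ; a BP-relative access at a negative offset
    mov   ax, tmp         ; return value in AX
    ret                   ; assembler emits the matching epilogue
sum2 ENDP

END
```

Changing NEAR to FAR would make the assembler use [bp+6] and [bp+8] for the parameters without any change to the function body, which is the drudgery the footnote refers to.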

Also, for backtracing up the call stack, it means that your caller's bp value points at a saved BP value in your caller's stack frame, forming a linked list of BP / ret-addr values a debugger can follow. Doing more pushes before mov bp,sp would leave BP pointing elsewhere. See also "When do we create base pointer in a function - before or after local variables?" for more details about this, on a very similar question for 32-bit mode. (Note that 32 and 64-bit code can use [esp +- x] addressing modes, but 16-bit code can't. 16-bit code is basically forced to set up BP as a frame pointer to access its own stack frame.)
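A rough sketch of how such a linked list could be walked, in NASM-style syntax. It assumes that every function on the call stack is a near proc that starts with push bp / mov bp, sp, and that the program's startup code zeroed BP before the first call so the chain ends at 0; both are assumptions for this example, not something DOS guarantees. The name backtrace is hypothetical.

```asm
; Walk the chain of saved-BP values, starting at the current frame.
; AX is treated as clobbered; BX is saved and restored.
backtrace:
    push  bp
    mov   bp, sp
    push  bx
    mov   bx, bp          ; BX = address of this frame's saved BP
.next:
    mov   ax, [ss:bx+2]   ; return address sits just above the saved BP
    ; ... record or print AX here (left out of this sketch)
    mov   bx, [ss:bx]     ; follow the link to the caller's frame
    test  bx, bx          ; 0 = assumed end-of-chain marker
    jnz   .next
    pop   bx
    pop   bp
    ret
```

The ss: overrides are there because [bx] defaults to the DS segment, unlike [bp], which defaults to SS. If a function had pushed other registers before mov bp, sp, the word at [bx+2] would be a saved register rather than a return address and the walk would report garbage.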

I think stack traces are one of the primary reasons that mov bp,sp right after push bp is the standard convention, as opposed to some other equally valid convention like doing all your pushes and then mov bp,sp.

If you push bp last, you can use the leave instruction before pop/pop/ret in the epilogue. (It depends on BP pointing to the saved-BP value).

The leave instruction can save code size as a compact version of mov sp,bp ; pop bp. (It's not magic; that's all it does. It's totally fine not to use it. And enter is very slow on modern x86; never use it.) You can't really use leave if you have other pops to do first: after add sp, whatever to point SP at your saved BX value, you do pop bx, and then you might as well just use pop bp instead of leave. So with that layout, leave is only useful in a function that makes a stack frame and reserves some extra space (with sub sp, 20 for example) but doesn't push any other registers afterwards, so SP isn't pointing at something you want to pop.
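A sketch of the push-bp-last ordering where leave does fit, in NASM-style syntax. The name func3, the choice of BX and SI, and the 8 bytes of locals are arbitrary, and leave needs a 186 or later CPU, so this would not run on a plain 8086.

```asm
; Save other registers first, push BP last.
func3:
    push  bx
    push  si
    push  bp
    mov   bp, sp          ; [bp] = saved BP, [bp+2] = SI, [bp+4] = BX
    sub   sp, 8           ; locals at [bp-8] .. [bp-1]

    mov   ax, [bp+8]      ; first stack arg: above 2 saved regs + near ret addr
    mov   [bp-2], ax      ; store it in a local

    leave                 ; mov sp, bp / pop bp, leaves SP at saved SI
    pop   si
    pop   bx
    ret
```

The cost of this layout is the one discussed above: the first argument is at [bp+8] here rather than [bp+4], and it moves whenever the number of registers pushed before BP changes.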

Or you might use something like this so offsets to stack args and to locals are independent of how many registers you push/pop other than BP. I don't see any obvious downside to this but maybe there's some reason I missed why it's not the usual convention.

func:
    push  bp
    mov   bp,sp
    sub   sp, 16   ; space for locals from [bp-16] to [bp-1]
    push  bx       ; save some call-preserved regs *below* that
    push  si

    ...  function body

    pop   si
    pop   bx
    leave         ; mov sp, bp;   pop bp
    ret

Modern GCC tends to save any call-preserved regs before sub esp, imm. e.g.

void ext(int);  // non-inline function call to give GCC a reason to save/restore a reg

void foo(int arg1) {
    volatile int x = arg1;
    ext(1);
    ext(arg1);
    x = 2;
 //   return x;
}

gcc9.2 -m32 -O3 -fno-omit-frame-pointer -fverbose-asm on Godbolt

foo(int):
        push    ebp     #
        mov     ebp, esp  #,
        push    ebx                                       # save a call-preserved reg
        sub     esp, 32   #,
        mov     ebx, DWORD PTR [ebp+8]    # arg1, arg1    # load stack arg

        push    1       #
        mov     DWORD PTR [ebp-12], ebx   # x = arg1
        call    ext(int) #

        mov     DWORD PTR [esp], ebx      #, arg1
        call    ext(int) #

        mov     DWORD PTR [ebp-12], 2     # x,
        mov     ebx, DWORD PTR [ebp-4]    #,      ## restore EBX with mov instead of pop
        add     esp, 16   #,                      ## missed optimization, let leave do this
        leave   
        ret

Restoring the call-preserved registers with mov instead of pop lets GCC still use leave. If you tweak the function to return a value, GCC avoids the wasted add esp,16.

BTW, you can shorten your code by letting functions destroy at least AX without saving/restoring. i.e. treat them as call-clobbered, aka volatile. Normal 32-bit calling conventions have EAX, ECX, and EDX volatile (like what GCC is compiling for in the example above: Linux's i386 System V), but many 16-bit conventions exist, and they differ from one another.

Having one of SI, DI, or BX volatile would let functions access memory without needing to push/pop their caller's copy of it.
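A small sketch of what that buys, assuming a home-made convention in which AX and BX are call-clobbered and the return value comes back in AX. The convention and the name load_word are assumptions for this example, not an existing standard.

```asm
; Hypothetical near function: returns the word its pointer argument points to.
load_word:
    push  bp
    mov   bp, sp
    mov   bx, [bp+4]      ; pointer arg; BX is volatile here, so no push/pop
    mov   ax, [bx]        ; DS-relative load through BX
    pop   bp
    ret
```

BP still has to be saved and set up, because 16-bit code has no SP-relative addressing mode for reaching the argument. A caller that wants to keep its own BX across the call has to save it itself.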

Agner Fog's calling convention guide includes some standard 16-bit calling conventions, see the table at the start of chapter 7 for 16-bit conventions used by existing C/C++ compilers. @MichaelPetch suggests the Watcom convention: AX and ES are always call-clobbered, but args are passed in AX, BX, CX, DX. Any reg used for arg-passing is also call-clobbered. And so is SI when used to pass a pointer to where the function should store a large return-value.

Or at the extreme, choose a custom calling convention on a per-function basis, according to what's most efficient for that function and for its callers. But that would quickly become a maintenance nightmare; if you want that kind of optimization just use a compiler and let it inline short functions and optimize them into the caller, or do inter-procedural optimization based on which registers are actually used by a function.

The calling conventions for DOS (not including roll-your-own ones) are quite a bit different from modern ones and varied from compiler to compiler. It should be noted that a function in 16-bit code that is reached via a far call would have the first parameter at bp+6. So it's not necessarily always true, and it depends on the nature of the function you are creating. Also, with MASM and a stack-based calling convention you can use MASM directives on and after the PROC to say what the parameters and local variables are (by name) and let the assembler handle the drudgery of computing BP offsets — Michael Petch, Dec 02 '19 at 00:22
In the case of a .COM program the default model is tiny (similar to small) so it will be a near call. In other models the default may be a far call (segment:offset) address. Using PROC and MASM local directives to define functions can reduce these headaches as it knows from the model default (or PROC overrides) if something is near or far and change `ret` to be `retn` or `retf`. Much easier to write code that may be assembled in different models. — Michael Petch, Dec 02 '19 at 00:28
@MichaelPetch: Good points. I thought about editing the opening paragraph to mention far procedures, but decided not to clutter it and just talk about the simplest case. Maybe a footnote. Do you have a suggestion for any nice 16-bit calling conventions with a well-chosen set of call-clobbered regs? — Peter Cordes, Dec 02 '19 at 00:28
The convention I use is Watcom C 16-bit. They created the first compiler that had a pass-by-register convention (Microsoft mimicked it with their own version later on). I believe the Agner Fog calling conventions include the specifics of that convention. You could always direct people to Agner's document with calling conventions. Whichever one you decide to pick, it's easier to just be consistent, and you'd have to pick the appropriate one if interfacing with a language that has a specific convention. — Michael Petch, Dec 02 '19 at 00:30
A question with an answer that is vaguely related to this discussion, but does give ideas on how you can use MASM directives to simplify parameter passing can be found here: https://stackoverflow.com/questions/36293714/pass-by-value-and-pass-by-reference-in-assembly . A caveat is that some old MASM versions don't support all the directives and EMU8086 is limited as well. I don't know what this person is using. — Michael Petch, Dec 02 '19 at 00:36

ecm · Answer 2 · 2019-12-01T22:16:02.387

In my programs I generally use the second method, that is, creating the stack frame first. This is done using push bp \ mov bp, sp and then optionally push ax once or twice or lea sp, [bp - x] to reserve space for uninitialised variables. (I let my stack frame macros create these instructions.) You can then further optionally push onto the stack to reserve space for and at the same time initialise further variables. After the variables, registers to preserve across the function's execution may be pushed.

There is a third way that you did not list as an example in your question. It looks like this:

PROC func:
    push ax
    push bx
    push bp
    mov bp, sp
    ;Bla bla
    ret
ENDP func

For my usage, the second and third ways are easily possible. I could use the third way if I push things first then for the stack frame creation specify what I call "how large the return address and other things between bp and the last parameter are" in my lframe macro invocation.

But it is easier to always push registers after setting up the frame (second method). In this case, I can always specify the "type of frame" as near, which is almost entirely equivalent to 2; that is so because the near 16-bit return address takes up 2 bytes.

Here is an example of a stack frame with registers preserved by pushing them:

        lframe near, nested
        lpar word,      inp_index_out_segment
        lpar word,      out_offset
        lpar_return
        lenter
        lvar dword,     start_pointer
         push word [sym_storage.str.start + 2]
         push word [sym_storage.str.start]
        lvar word,      orig_cx
         push cx
        mov cx, SYMSTR_index_size

        ldup

        lleave ctx
        lleave ctx

                ; INP:  ?inp_index_out_segment = index
                ;       ?start_pointer = start far pointer of this area
                ;       ?orig_cx = what to return cx to
                ;       cx = index size
.common:
        push es
        push di
        push dx
        push bx
        push ax
%if _BUFFER_86MM_SLICE
        push si
        push ds
%endif

There is a slight advantage here to using the second way: The initial stack frame is actually created several times by different function entry points. These easily share the preservation by pushing registers in the .common handling. That could not be achieved as easily if the differing intro for each entry point would follow after pushing registers to preserve their values.

Other than that, there is no vast difference, no. However, keeping the prior bp value at word [bp] (second or third way) may be helpful or even needed for debuggers or other software to follow the chain of stack frames. Likewise the second way may be useful due to it keeping the return address at word [bp + 2].

Why `lea sp, [bp - x]` instead of `sub sp, x - 2*n_pushes`? I think it's the same code size either way, unless `sub` could use an `imm8` while `lea` needed a `disp16` because a few pushes made the difference. For performance on modern Intel with a stack engine (Pentium M and later), I think both would need [a "stack sync" uop](https://stackoverflow.com/questions/36631576/what-is-the-stack-engine) even for write-only use of SP in the back-end. On PPro / PIII though, reading BP instead of SP-after-more-pushes would shorten the dependency chain. Was that the reason? — Peter Cordes, Dec 01 '19 at 23:47
@Peter Cordes: If anything, I optimise for size. I decided to use `lea` because it is equally short, but does not modify flags. Seldom, I input flags into a function, or set flags in the entry point then use the `lreserve` macro, or return flags out across the `lleave` of an `lframe x, inner`. All of these are done using `lea`, except as mentioned the first which may use `push ax` (which also does not modify the flags). By the way, my stacks usually range in the 512 bytes to 1 KiB range, so creating a stack frame with more than 255 bytes would be ill-advised. — ecm, Dec 02 '19 at 07:39
Ah, preserving flags is a good reason, I hadn't thought about that difference. Thinking about compiler-generated code makes it easy to forget about all the other possibilities even if I try to keep them in mind. (The standard C calling conventions are very limited, e.g. only returning 1 value leading to API design failures like `memcmp` that discards the position of the difference. Or maybe because they're designed for a language like C.) — Peter Cordes, Dec 02 '19 at 07:46
@Peter Cordes: Both the flag preserving and another feature of using `lea` are [documented for the `lenter` macro](https://hg.ulukai.org/ecm/lmacros/file/61cdbc252795/lmacros2.mac#l331), the other one being that `lenter` after `lenter early` is implemented "using `lea sp, [bp - x]` so it doesn't matter how many of the variables were already initialised by pushing into them." (This is true of `lreserve` too.) If you wanted to use `sub sp, x - y` instead you would have to keep track of how many of the variables are already reserved stack space to determine the y. — ecm, Dec 02 '19 at 08:38
@Peter Cordes: Interestingly enough, I [actually did use `sub` for the normal `lenter` usage](https://hg.ulukai.org/ecm/lmacros/rev/d0175d3be3f4) at first. This was before `lenter early` or `lreserve` were added. — ecm, Dec 02 '19 at 08:42

transconductance · Answer 3 · 2019-12-01T23:46:19.087

It is more common to set up the stack frame first. This is because parameters to your function are typically found on the stack. You can access them with fixed (positive) offsets from bp. If you push other registers first, then the position of the parameters within the stack frame changes.

If you need to allocate local storage on the stack, you could subtract a constant from sp to create an empty space and then push the other registers. This way your local storage has a (negative) offset from bp that doesn't change if you push more or fewer registers onto the stack.

edited Dec 01 '19 at 23:46

answered Dec 01 '19 at 21:40

Function args are above the return address at *positive* offsets from BP. Also, you reserve space for locals with `sub sp, constant`, not `sub bp, const`, so they're below BP, above SP. – Peter Cordes Dec 01 '19 at 21:51
Absolutely correct. I've been working on a PIC micro that grows the stack upwards ;) – transconductance Dec 01 '19 at 23:45
