The second way, push bp ; mov bp, sp before pushing any more registers, means your first stack arg is always at [bp+4] regardless of how many more pushes you do (see the footnote below). None of this matters if you pass all the args in registers instead of on the stack, which is easier and more efficient most of the time when you only have a couple.
This is good for maintainability by humans; you can change how many registers you save/restore without changing how you access args. But you do still have to avoid the space right below BP; saving more regs means you might put the highest local var at [bp-6] instead of [bp-4].
Footnote: A "far proc" has a 32-bit CS:IP return address so args start at [bp+6] in that case. See @MichaelPetch's comments about letting tools like MASM sort this out for you with symbolic names for args and local vars.
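To make the offset argument concrete, here is a minimal sketch in C that models a 16-bit stack and checks where the first stack arg lands relative to BP for the two prologue orders. Everything here (Stack16, st_push, first_arg_offset, the dummy values) is a hypothetical name made up for illustration, and it assumes a near call with a 2-byte return address.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a 16-bit stack: byte-addressed memory plus SP and BP.
   Hypothetical helper names; this is not an emulator. */
typedef struct {
    uint8_t  mem[256];
    uint16_t sp, bp;
} Stack16;

static void st_write16(Stack16 *s, uint16_t addr, uint16_t v) {
    assert(addr + 1u < sizeof s->mem);
    s->mem[addr]     = (uint8_t)(v & 0xFF);   /* little-endian, like x86 */
    s->mem[addr + 1] = (uint8_t)(v >> 8);
}

static uint16_t st_read16(const Stack16 *s, uint16_t addr) {
    assert(addr + 1u < sizeof s->mem);
    return (uint16_t)(s->mem[addr] | (s->mem[addr + 1] << 8));
}

static void st_push(Stack16 *s, uint16_t v) {
    assert(s->sp >= 2);
    s->sp -= 2;
    st_write16(s, s->sp, v);
}

/* Returns the offset x such that the first stack arg is at [bp+x].
   bp_first != 0: push bp ; mov bp,sp ; then n_saved more pushes.
   bp_first == 0: push bp ; n_saved more pushes ; then mov bp,sp. */
int first_arg_offset(int bp_first, int n_saved) {
    Stack16 s;
    s.sp = sizeof s.mem;             /* empty stack, grows downward */
    s.bp = 0;

    st_push(&s, 0x1234);             /* caller pushes one stack arg */
    uint16_t arg_addr = s.sp;
    st_push(&s, 0x0100);             /* near call: 2-byte return IP (assumption) */

    st_push(&s, s.bp);               /* push bp */
    if (bp_first)
        s.bp = s.sp;                 /* mov bp, sp right away */
    for (int i = 0; i < n_saved; i++)
        st_push(&s, 0xBEEF);         /* push bx / si / di ... */
    if (!bp_first)
        s.bp = s.sp;                 /* mov bp, sp only after all pushes */

    assert(st_read16(&s, arg_addr) == 0x1234);
    return arg_addr - s.bp;
}
```

With bp_first set, the result stays at 4 however large n_saved is; with the other order it grows by 2 per extra push, which is exactly the maintenance hazard described above.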
Also, for backtracing up the call stack, it means that your caller's bp value points to a saved BP value in your caller's stack frame, forming a linked list of BP / ret-addr values a debugger can follow. Doing more pushes before mov bp,sp would leave BP pointing elsewhere. See also "When do we create base pointer in a function - before or after local variables?" for more details about this, on a very similar question for 32-bit mode. (Note that 32 and 64-bit code can use [esp +- x] addressing modes, but 16-bit addressing modes can't use SP as a base. 16-bit code is basically forced to set up BP as a frame pointer to access its own stack frame.)
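The linked list of saved-BP / return-address pairs can be sketched in C as a walk over a byte array standing in for the stack segment. This is only a model of what a debugger does, under two assumptions of mine: near calls (return address at [bp+2]) and an outermost frame whose saved BP is 0 as a terminator. The names read16 and walk_frames are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian 16-bit load from the modelled stack segment. */
static uint16_t read16(const uint8_t *mem, size_t size, uint16_t addr) {
    assert((size_t)addr + 1 < size);
    return (uint16_t)(mem[addr] | (mem[addr + 1] << 8));
}

/* Follow the frame-pointer chain starting at bp:
     [bp]   = caller's saved BP (link to the next frame)
     [bp+2] = near return address into the caller
   Stores up to max_frames return addresses, returns how many were found.
   Assumes a saved BP of 0 marks the outermost frame (hypothetical convention). */
size_t walk_frames(const uint8_t *mem, size_t size, uint16_t bp,
                   uint16_t *ret_addrs, size_t max_frames) {
    size_t n = 0;
    while (bp != 0 && n < max_frames) {
        ret_addrs[n++] = read16(mem, size, (uint16_t)(bp + 2));
        uint16_t next = read16(mem, size, bp);
        /* The stack grows down, so a caller's frame must be at a higher
           address; anything else means the chain is corrupt. */
        if (next != 0 && next <= bp)
            break;
        bp = next;
    }
    return n;
}
```

If a function did extra pushes before mov bp,sp, [bp] would hold some saved register instead of the caller's BP, and a walk like this would follow garbage.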
Stack traces are one of the primary reasons for mov bp,sp right after push bp being the standard convention, as opposed to some other equally valid convention like doing all your pushes and then mov bp,sp.
If you push bp last, you can use the leave instruction before pop/pop/ret in the epilogue. (It depends on BP pointing to the saved-BP value).
The leave instruction can save code-size as a compact version of mov sp,bp ; pop bp. (It's not magic, that's all it does. It's totally fine to not use it. And enter is very slow on modern x86, never use it.) You can't really use leave if you have other pops to do first. After add sp, whatever to point SP at your saved BX value, you do pop bx and then you might as well just use pop bp instead of leave. So leave is only useful in a function that makes a stack frame but doesn't push any other registers after that, and that does reserve some extra space (with sub sp, 20 for example), so SP isn't already pointing at the saved BP, where a plain pop bp would do.
Or you might use something like this so offsets to stack args and to locals are independent of how many registers you push/pop other than BP. I don't see any obvious downside to this but maybe there's some reason I missed why it's not the usual convention.
func:
    push  bp
    mov   bp, sp
    sub   sp, 16      ; space for locals from [bp-16] to [bp-1]
    push  bx          ; save some call-preserved regs *below* that
    push  si
    ; ... function body
    pop   si
    pop   bx
    leave             ; mov sp, bp; pop bp
    ret
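As a sanity check on that layout, here is a small C model of the same prologue and epilogue. It is a sketch, not an emulator: Cpu16, push16, pop16, leave16 and run_func are hypothetical names, the call is assumed to be near, and the "function body" just clobbers BX, SI and the 16 bytes of locals to show that none of that disturbs the saved values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Just enough machine state to model the stack frame. */
typedef struct {
    uint8_t  mem[256];
    uint16_t sp, bp, bx, si;
} Cpu16;

static void push16(Cpu16 *c, uint16_t v) {
    assert(c->sp >= 2);
    c->sp -= 2;
    c->mem[c->sp]     = (uint8_t)(v & 0xFF);
    c->mem[c->sp + 1] = (uint8_t)(v >> 8);
}

static uint16_t pop16(Cpu16 *c) {
    assert(c->sp + 1u < sizeof c->mem);
    uint16_t v = (uint16_t)(c->mem[c->sp] | (c->mem[c->sp + 1] << 8));
    c->sp += 2;
    return v;
}

/* leave = mov sp, bp ; pop bp */
static void leave16(Cpu16 *c) {
    c->sp = c->bp;
    c->bp = pop16(c);
}

/* Models func: from the text. Expects the return address to be on top of
   the stack already (as after a near call) and returns the address that
   ret pops. */
uint16_t run_func(Cpu16 *c) {
    push16(c, c->bp);                       /* push bp */
    c->bp = c->sp;                          /* mov  bp, sp */
    c->sp -= 16;                            /* sub  sp, 16 */
    push16(c, c->bx);                       /* push bx */
    push16(c, c->si);                       /* push si */

    /* function body: clobber the saved regs and every local byte */
    c->bx = 0xAAAA;
    c->si = 0xBBBB;
    memset(&c->mem[c->bp - 16], 0xCC, 16);  /* [bp-16] .. [bp-1] */

    c->si = pop16(c);                       /* pop si */
    c->bx = pop16(c);                       /* pop bx */
    leave16(c);                             /* leave (also discards the locals) */
    return pop16(c);                        /* ret */
}
```

After the two pops SP points at the bottom of the locals, not at the saved BP, so leave does real work here: it drops the 16 bytes of locals and restores BP in one instruction.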
Modern GCC tends to save any call-preserved regs before sub esp, imm. e.g.
void ext(int);   // non-inline function call to give GCC a reason to save/restore a reg
void foo(int arg1) {
    volatile int x = arg1;
    ext(1);
    ext(arg1);
    x = 2;
    // return x;
}
gcc9.2 -m32 -O3 -fno-omit-frame-pointer -fverbose-asm on Godbolt
foo(int):
    push    ebp                             #
    mov     ebp, esp                        #,
    push    ebx                             # save a call-preserved reg
    sub     esp, 32                         #,
    mov     ebx, DWORD PTR [ebp+8]          # arg1, arg1   # load stack arg
    push    1                               #
    mov     DWORD PTR [ebp-12], ebx         # x = arg1
    call    ext(int)                        #
    mov     DWORD PTR [esp], ebx            #, arg1
    call    ext(int)                        #
    mov     DWORD PTR [ebp-12], 2           # x,
    mov     ebx, DWORD PTR [ebp-4]          #,    ## restore EBX with mov instead of pop
    add     esp, 16                         #,    ## missed optimization, let leave do this
    leave
    ret
Restoring the call-preserved registers with mov instead of pop lets GCC still use leave. If you tweak the function to return a value, GCC avoids the wasted add esp,16.
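For reference, the "return a value" tweak is just the commented-out line from the source above turned on. The sketch below renames the function to foo_ret (my name, not from the original) and adds a stub ext that records its arguments, purely so the file runs on its own; ext_calls and ext_count are hypothetical helpers.

```c
#include <assert.h>

/* Stub standing in for the external function, so this file is
   self-contained. It only records what it was called with. */
static int ext_calls[4];
static int ext_count;

void ext(int v) {
    assert(ext_count < 4);
    ext_calls[ext_count++] = v;
}

/* Same body as foo() in the text, but returning x instead of void. */
int foo_ret(int arg1) {
    volatile int x = arg1;   /* volatile forces a real stack slot for x */
    ext(1);
    ext(arg1);
    x = 2;
    return x;
}
```

To compare the asm on Godbolt with the same options as above, keep ext as a declaration only (as in the original source); with the stub defined in the same file, GCC at -O3 is free to inline it and the prologue and epilogue will look different.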
BTW, you can shorten your code by letting functions destroy at least AX without saving/restoring, i.e. treat such registers as call-clobbered, aka volatile. Normal 32-bit calling conventions have EAX, ECX, and EDX volatile (like what GCC is compiling for in the example above: Linux's i386 System V), but several different 16-bit conventions exist, and they don't all agree on this.
Having one of SI, DI, or BX volatile would let functions access memory without needing to push/pop their caller's copy of it.
Agner Fog's calling convention guide includes some standard 16-bit calling conventions, see the table at the start of chapter 7 for 16-bit conventions used by existing C/C++ compilers. @MichaelPetch suggests the Watcom convention: AX and ES are always call-clobbered, but args are passed in AX, BX, CX, DX. Any reg used for arg-passing is also call-clobbered. And so is SI when used to pass a pointer to where the function should store a large return-value.
Or at the extreme, choose a custom calling convention on a per-function basis, according to what's most efficient for that function and for its callers. But that would quickly become a maintenance nightmare; if you want that kind of optimization just use a compiler and let it inline short functions and optimize them into the caller, or do inter-procedural optimization based on which registers are actually used by a function.