Why doesn't the Linux kernel restore all registers when using sysenter/sysexit?

Question

In Linux kernel 2.6.11, a system call made with sysenter is handled almost the same way as one made with int 0x80: SAVE_ALL pushes all registers onto the kernel stack. But when the call finishes, if no relevant flag is set, the kernel returns with sysexit and does not restore all of the registers it saved on the stack.

Some system calls may change register values, so why don't we need to restore all registers?

I've read the corresponding i386 documentation, which says:

"All registers on the Intel386 are global and thus visible to both a calling and a called function. Registers %ebp, %ebx, %edi, %esi and %esp "belong" to the calling function. In other words, a called function must preserve these registers’ values for its caller. Remaining registers "belong" to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame."

So it is the glibc wrapper function's responsibility to do the preservation work, and I've read some glibc code to confirm this. It therefore makes sense that, when using sysenter/sysexit for a system call, we first push %ebp, %edx and %ecx onto the user stack. %edx and %ecx are not among the preserved registers, so we need to restore them after the system call finishes. We also use %ebp to hold the user stack pointer before calling the system service routine, so we need to restore it too, since it is also used for passing a parameter.

The compiler-generated implementations of the `sys_whatever` functions that Linux dispatches to will themselves preserve the call-preserved registers, so the dispatch code only needs to restore the call-clobbered regs before returning to user-space. — Peter Cordes, Dec 20 '20 at 23:00

score 2 · Answer 1 · edited May 23 '17 at 10:28

The reason is similar to why, in 64-bit mode, R10 replaces RCX for passing system call parameters: the entry and exit instructions themselves claim certain registers. Here it comes down to how the sysenter and sysexit instructions work. From the Intel documentation of the sysexit instruction:

Prior to executing SYSEXIT, software must specify the privilege level 3 code segment and code entry point, and the privilege level 3 stack segment and stack pointer by writing values into the following MSR and general-purpose registers:

• IA32_SYSENTER_CS (MSR address 174H): Contains a 32-bit value that is used to determine the segment selectors for the privilege level 3 code and stack segments (see the Operation section).

• RDX: The canonical address in this register is loaded into RIP (thus, this value references the first instruction to be executed in the user code). If the return is not to 64-bit mode, only bits 31:0 are loaded.

• RCX: The canonical address in this register is loaded into RSP (thus, this value contains the stack pointer for the privilege level 3 stack). If the return is not to 64-bit mode, only bits 31:0 are loaded.
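The register roles in the list above can be sketched as a small C model of a return to 32-bit user code. This is only an illustration: `struct exit_state` and `sysexit_model` are made-up names, and the selector arithmetic (MSR value + 16 for CS, + 24 for SS, with the privilege level forced to 3) is my reading of the Operation section that the quote points to, not something stated in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the 32-bit register state that SYSEXIT touches. */
struct exit_state {
    uint32_t eip, esp;   /* instruction pointer and stack pointer */
    uint32_t ecx, edx;   /* general-purpose registers consumed by SYSEXIT */
    uint16_t cs, ss;     /* code and stack segment selectors */
};

/* Apply the effects of SYSEXIT for a return to 32-bit user code.
 * sysenter_cs stands in for the low 16 bits of IA32_SYSENTER_CS. */
void sysexit_model(struct exit_state *c, uint16_t sysenter_cs)
{
    assert(c != NULL);
    c->eip = c->edx;                              /* entry point comes from EDX */
    c->esp = c->ecx;                              /* user stack pointer comes from ECX */
    c->cs  = (uint16_t)((sysenter_cs + 16) | 3);  /* ring-3 code selector */
    c->ss  = (uint16_t)((sysenter_cs + 24) | 3);  /* ring-3 stack selector */
}
```

Note that after the call `ecx` and `edx` still hold the stack pointer and the entry point, so from the user program's point of view whatever it had in those two registers before the system call is gone.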

Thus rdx (edx) and rcx (ecx) are reserved by the instruction. Now what about ebp? Well, from the documentation of the sysenter instruction:

The SYSENTER and SYSEXIT instructions are companion instructions, but they do not constitute a call/return pair. When executing a SYSENTER instruction, the processor does not save state information for the user code (e.g., the instruction pointer), and neither the SYSENTER nor the SYSEXIT instruction supports passing parameters on the stack.

This is apparent from the fact that sysenter loads RSP from IA32_SYSENTER_ESP without saving the old value, so the OS has no direct way to learn where the user-space stack is. Linux therefore reserves ebp for exactly this purpose: passing the user stack pointer to the kernel. The caller must save ebp first, because it has to overwrite ebp with esp before executing sysenter.
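A minimal sketch of that point, again as a toy C model rather than real kernel or CPU code. The names `entry_state`, `sysenter_msrs`, `sysenter_model` and `enter_kernel_with_saved_stack` are hypothetical, and the model ignores the segment registers that sysenter also reloads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the registers relevant to SYSENTER in 32-bit mode. */
struct entry_state {
    uint32_t eip, esp, ebp;
};

/* Hypothetical stand-ins for the IA32_SYSENTER_EIP and IA32_SYSENTER_ESP MSRs. */
struct sysenter_msrs {
    uint32_t eip, esp;
};

/* SYSENTER overwrites EIP and ESP from the MSRs and saves nothing. */
void sysenter_model(struct entry_state *c, const struct sysenter_msrs *m)
{
    assert(c != NULL && m != NULL);
    c->eip = m->eip;   /* the user EIP is not recorded anywhere */
    c->esp = m->esp;   /* the user ESP is not recorded anywhere */
}

/* What the user-side code has to do: stash ESP in EBP, then enter the kernel. */
void enter_kernel_with_saved_stack(struct entry_state *c,
                                   const struct sysenter_msrs *m)
{
    c->ebp = c->esp;   /* the only surviving copy of the user stack pointer */
    sysenter_model(c, m);
}
```

After `enter_kernel_with_saved_stack`, the model's `esp` and `eip` hold only the MSR values, and `ebp` is the one place where the old stack pointer can still be found.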

Why didn't Linux dedicate edx or ecx to passing the stack pointer, given that neither register is overwritten on sysenter? I think it's for speed: in ordinary int 0x80 calls, ebp carries the last possible (sixth) parameter. It's rare for a syscall to need more than 5 parameters, so instead of reading the user-space stack for almost every system call (as it would have to if edx or ecx carried the stack pointer), Linux only has to do this for system calls with 6 parameters. (Note that you must push ebp last before executing sysenter, precisely because that is where the kernel looks for the sixth parameter.)

All of this is summarized in the Linux sources, in arch/x86/entry/vdso/vdso32/sysenter.S:

/*
 * The caller puts arg2 in %ecx, which gets pushed. The kernel will use
 * %ecx itself for arg2. The pushing is because the sysexit instruction
 * (found in entry.S) requires that we clobber %ecx with the desired %esp.
 * User code might expect that %ecx is unclobbered though, as it would be
 * for returning via the iret instruction, so we must push and pop.
 *
 * The caller puts arg3 in %edx, which the sysexit instruction requires
 * for %eip. Thus, exactly as for arg2, we must push and pop.
 *
 * Arg6 is different. The caller puts arg6 in %ebp. Since the sysenter
 * instruction clobbers %esp, the user's %esp won't even survive entry
 * into the kernel. We store %esp in %ebp. Code in entry.S must fetch
 * arg6 from the stack.
 *
 * You can not use this vsyscall for the clone() syscall because the
 * three words on the parent stack do not get copied to the child.
 */
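To see the push/pop sequence described in that comment end to end, here is a toy simulation in C. It is a sketch under stated assumptions, not the real stub: `struct machine`, `kernel_model` and `vsyscall_model` are invented names, the "syscall" just sums its six arguments, the stack pointer is a word index into a small array, and `0xdeadbeef` is a placeholder for the return address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy machine: a few registers plus a small user stack indexed by word. */
struct machine {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp;
    uint32_t esp;                  /* index into stack[], grows downward */
    uint32_t stack[STACK_WORDS];
};

static void push(struct machine *m, uint32_t v)
{
    assert(m->esp > 0);
    m->stack[--m->esp] = v;
}

static uint32_t pop(struct machine *m)
{
    assert(m->esp < STACK_WORDS);
    return m->stack[m->esp++];
}

/* Hypothetical kernel side: sums its six arguments as a stand-in syscall. */
static void kernel_model(struct machine *m)
{
    uint32_t user_esp = m->ebp;            /* EBP carries the user stack pointer */
    assert(user_esp < STACK_WORDS);
    uint32_t arg6 = m->stack[user_esp];    /* arg6 is the word pushed last */
    m->eax = m->ebx + m->ecx + m->edx + m->esi + m->edi + arg6;
    m->edx = 0xdeadbeefu;                  /* SYSEXIT takes the return EIP from EDX */
    m->ecx = user_esp;                     /* SYSEXIT takes the user ESP from ECX */
    m->esp = m->ecx;                       /* effect of SYSEXIT on ESP */
}

/* User-side stub: save the three registers, enter, then restore them. */
void vsyscall_model(struct machine *m)
{
    push(m, m->ecx);        /* arg2, will be clobbered by SYSEXIT */
    push(m, m->edx);        /* arg3, will be clobbered by SYSEXIT */
    push(m, m->ebp);        /* arg6, pushed last so it sits at the stack top */
    m->ebp = m->esp;        /* hand the user stack pointer to the kernel */
    m->esp = 0xffffffffu;   /* SYSENTER replaces ESP; the user value is gone */
    kernel_model(m);
    m->ebp = pop(m);        /* restore in reverse order of the pushes */
    m->edx = pop(m);
    m->ecx = pop(m);
}
```

With arguments 1 through 6 in ebx, ecx, edx, esi, edi and ebp, the model returns their sum in eax and leaves ecx, edx, ebp and esp as they were before the call, even though the kernel side overwrote ecx and edx on the way out. The other registers are never touched by the stub, which matches the point that only the registers claimed by sysenter/sysexit need the extra save and restore.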

score 0 · Answer 2 · edited Apr 21 '15 at 03:09

This is defined by the ABI (calling convention) in use. Some registers are preserved across function calls and some are not. You can check the ABI used on your platform.

For x86-64, http://x86-64.org/documentation/abi.pdf documents it; see figure 3.4.

"Preserved across calls" means the register is callee-saved, so a function that modifies it must restore it before returning.

"Not preserved" means the register is caller-saved, so a function may use it freely without restoring it.

answered Apr 21 '15 at 02:17 by tristan · edited Apr 21 '15 at 03:09 by Robert Harvey

Can you briefly explain figure 3.4, and how it applies to the question asked here? – Robert Harvey Apr 21 '15 at 02:19
You can post the specific assembly code and tell us which register you have a problem with. :-) – tristan Apr 21 '15 at 02:53
Don't use "EDIT:" in your posts; this isn't a forum. Edit history is available [here](http://stackoverflow.com/posts/29761731/revisions). – Robert Harvey Apr 21 '15 at 03:10
System calls "almost never" use any specific ABI from any specific language or set of tools. The reason is that there's typically a privilege level change involved, and that requires major differences in either stack or register use that make conforming to any normal ABI impossible. – Brendan Dec 21 '20 at 01:17
