Who sets the RIP register when you call the clone syscall?

Question

I am trying to implement a minimal kernel and I am trying to implement the clone syscall. In the man pages you can see the clone syscall defined as such:

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
                 /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

As you can see, it receives a function pointer. If you read the man page more closely you can actually see that the actual syscall implementation in the kernel does not receive a function pointer:

long clone(unsigned long flags, void *stack,
                      int *parent_tid, int *child_tid,
                      unsigned long tls);

So, my question is, who modifies the RIP register after a thread is created? Is it the libc?

I found this code in glibc: https://elixir.bootlin.com/glibc/latest/source/sysdeps/unix/sysv/linux/x86_64/clone.S but I am not sure at what point the function is actually called.

Extra information:

When looking at the clone.S source code you can see that it jumps to a thread_start branch after the syscall. On the branch after the clone syscall (so only the child does this) it pops the function address and the arguments from the stack. Who actually pushed these arguments and the function address on the stack? I guess it has to happen somewhere in the kernel because at the point of the syscall instruction they were not there.

Here is some gdb output:

Right before the syscall:

[-------------------------------------code-------------------------------------]
   0x7ffff7d8af22 <clone+34>:   mov    r8,r9
   0x7ffff7d8af25 <clone+37>:   mov    r10,QWORD PTR [rsp+0x8]
   0x7ffff7d8af2a <clone+42>:   mov    eax,0x38
=> 0x7ffff7d8af2f <clone+47>:   syscall 
   0x7ffff7d8af31 <clone+49>:   test   rax,rax
   0x7ffff7d8af34 <clone+52>:   jl     0x7ffff7d8af49 <clone+73>
   0x7ffff7d8af36 <clone+54>:   je     0x7ffff7d8af39 <clone+57>
   0x7ffff7d8af38 <clone+56>:   ret
Guessed arguments:
arg[0]: 0x3d0f00 
arg[1]: 0x7ffff8020b60 --> 0x7ffff7d3fb30 (<do_something>:  push   rbx)
arg[2]: 0x7fffffffda90 --> 0x0 
[------------------------------------stack-------------------------------------]
0000| 0x7fffffffda78 --> 0x7ffff7d3f52c (<main+172>:    pop    rsi)
0008| 0x7fffffffda80 --> 0x7fffffffda94 --> 0x73658b0000000000 
0016| 0x7fffffffda88 --> 0x7fffffffda94 --> 0x73658b0000000000 
0024| 0x7fffffffda90 --> 0x0 
0032| 0x7fffffffda98 --> 0x492e085573658b00 
0040| 0x7fffffffdaa0 --> 0x7ffff7d3f0d0 (<_init>:   sub    rsp,0x8)
0048| 0x7fffffffdaa8 --> 0x7ffff7d40830 (<__libc_csu_init>: push   r15)
0056| 0x7fffffffdab0 --> 0x7ffff7d408d0 (<__libc_csu_fini>: push   rbp)
[------------------------------------------------------------------------------]

After the syscall instruction on the child thread (check the top of the stack - this does not happen on the parent's thread):

[-------------------------------------code-------------------------------------]
   0x7ffff7d8af25 <clone+37>:   mov    r10,QWORD PTR [rsp+0x8]
   0x7ffff7d8af2a <clone+42>:   mov    eax,0x38
   0x7ffff7d8af2f <clone+47>:   syscall 
=> 0x7ffff7d8af31 <clone+49>:   test   rax,rax
   0x7ffff7d8af34 <clone+52>:   jl     0x7ffff7d8af49 <clone+73>
   0x7ffff7d8af36 <clone+54>:   je     0x7ffff7d8af39 <clone+57>
   0x7ffff7d8af38 <clone+56>:   ret    
   0x7ffff7d8af39 <clone+57>:   xor    ebp,ebp
[------------------------------------stack-------------------------------------]
0000| 0x7ffff8020b60 --> 0x7ffff7d3fb30 (<do_something>:    push   rbx)
0008| 0x7ffff8020b68 --> 0x7ffff7dd5add --> 0x4c414d0074736574 ('test')
0016| 0x7ffff8020b70 --> 0x0 
0024| 0x7ffff8020b78 --> 0x411 
0032| 0x7ffff8020b80 ("Parameters: 0x7ffff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0040| 0x7ffff8020b88 ("rs: 0x7ffff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0048| 0x7ffff8020b90 ("fff7d3fb30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
0056| 0x7ffff8020b98 ("30 4001536 0x7ffff8020b70 0x7fffffffda90 0x7ffff8000b60 0x7fffffffda94\n")
[------------------------------------------------------------------------------]

Lines 92-95 in the linked libc source marked with `/* Function to call. */` ? — , Mar 29 '21 at 12:23

user123 · Accepted Answer · 2021-03-30T18:16:39.217

Normally the way it works is that, when the computer boots, Linux sets up a MSR (Model Specific Register) to work with the assembly instruction syscall. The assembly instruction syscall will make the RIP register jump to the address specified in the MSR to enter kernel mode. As stated in 64-ia-32-architectures-software-developer-vol-2b-manual from Intel:

SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR

Once in kernel mode, the kernel will look at the arguments passed into conventional registers (RAX, RBX etc.) to determine what the syscall is asking. Then the kernel will invoke one of the sys_XXX functions whose prototypes are in linux/syscalls.h (https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h#L217). The definition of sys_clone is in kernel/fork.c.

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
         int __user *, parent_tidptr,
         int __user *, child_tidptr,
         unsigned long, tls)
#endif
{
    return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}

The SYSCALLDEFINE5 macro takes the first argument and prefixes sys_ to it. This function is actually sys_clone and it calls _do_fork.

It means there really isn't a clone() function which is invoked by glibc to call into the kernel. The kernel is called with the syscall instruction, it jumps to an address specified in the MSR and then it invokes one of the syscalls in the sys_call_table.

The entry point to the kernel for x86 is here: https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/entry_64.S. If you scroll down you'll see the line: call *sys_call_table(, %rax, 8). Basically, call one of the functions of the sys_call_table. The implementation of the sys_call_table is here: https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscall_64.c#L20.

// SPDX-License-Identifier: GPL-2.0
/* System call table for x86-64. */

#include <linux/linkage.h>
#include <linux/sys.h>
#include <linux/cache.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
#include <asm/syscall.h>

#define __SYSCALL_X32(nr, sym)
#define __SYSCALL_COMMON(nr, sym) __SYSCALL_64(nr, sym)

#define __SYSCALL_64(nr, sym) extern long __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL_64

#define __SYSCALL_64(nr, sym) [nr] = __x64_##sym,

asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    /*
     * Smells like a compiler bug -- it doesn't work
     * when the & below is removed.
     */
    [0 ... __NR_syscall_max] = &__x64_sys_ni_syscall,
#include <asm/syscalls_64.h>
};

I recommend you read the following: https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-2.html. On this website is stated that

As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated by the special script at arch/x86/entry/syscalls/syscalltbl.sh and generates our header file from the syscall table (https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/entry/syscalls/syscall_64.tbl).

...

...

So, after this, our sys_call_table takes the following form:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
   [0 ... __NR_syscall_max] = &sys_ni_syscall,
   [0] = sys_read,
   [1] = sys_write,
   [2] = sys_open,
   ...
   ...
   ...
};

Once you have the table generated, one of its entries is being jumped to when you use the syscall assembly instruction. For clone() it will call sys_clone() which itself calls _do_fork(). Which is defined as such:

long _do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr,
          unsigned long tls)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Determine whether and which event to report to ptracer.  When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(clone_flags, stack_start, stack_size,
             child_tidptr, NULL, trace, tls);
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {
        struct completion vfork;
        struct pid *pid;

        trace_sched_process_fork(current, p);

        pid = get_task_pid(p, PIDTYPE_PID);
        nr = pid_vnr(pid);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        wake_up_new_task(p);

        /* forking complete and child started to run, tell ptracer */
        if (unlikely(trace))
            ptrace_event_pid(trace, pid);

        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
        }

        put_pid(pid);
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}

It calls wake_up_new_task() which puts the task on the runqueue and wakes it. I'm surprised it even wakes the task immediatly. I would have guessed that the scheduler would have done it instead and that it would have been given a high priority to run as soon as possible. In itself, the kernel doesn't have to receive a function pointer because as stated on the manpage for clone():

The raw clone() system call corresponds more closely to fork(2) in that execution in the child continues from the point of the call. As such, the fn and arg arguments of the clone() wrapper function are omitted.

The child continues execution where the syscall was made. I don't understand exactly the mechanism but in the end the child will continue execution in a new thread. The parent thread (which created the new child thread) returns and the child thread jumps to the specified function instead.

I think it works with the following lines (on the link you provided):

testq   %rax,%rax
jl  SYSCALL_ERROR_LABEL
jz  L(thread_start) //Child jumps to thread_start

ret //Parent returns to where it was

Because rax is a 64 bits register, they use the 'q' version of the GNU syntax assembly instruction test. They test if rax is zero. If it is less than zero then there was an error. If it is zero then jump to thread_start. If it is not zero nor negative (in the case of the parent thread), continue execution and return. The new thread is created with rax as 0. It allows to diffenrentiate between the parent and the child thread.

EDIT

As stated on the link you provided,

The parameters are passed in register and on the stack from userland:
rdi: fn
rsi: child_stack
rdx: flags
rcx: arg
r8d: TID field in parent
r9d: thread pointer

So when your program executes the following lines:

/* Insert the argument onto the new stack.  */
subq    $16,%rsi
movq    %rcx,8(%rsi)

/* Save the function pointer.  It will be popped off in the
      child in the ebx frobbing below.  */
movq    %rdi,0(%rsi)

it inserts the function pointer and arguments onto the new stack. Then it calls the kernel which itself doesn't have to push anything onto the stack. It just receives the new stack as an argument and then makes the child's thread RSP register point to it. I would guess this happens in the copy_process() function (called from fork()) along the lines of:

retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
    goto bad_fork_cleanup_io;

It seems to be done in the copy_thread_tls() function which itself calls copy_thread(). copy_thread() has its prototype in include/linux/sched.h and it is defined based on the architecture. I'm not sure where it is defined for x86.

Hey @user123 thank you for you detailed answer. I still have one point of confusion though. After the clone syscall returns, if the rax has value 0 then it will actually jump to the `thread_start` label. The `thread_start` label does two pops (one being the address of the function and the other one the arguments) and then it start executing that function by doing a `call %rax`. The thing is, before the syscall these two values are not on stack and are on registers, so the kernel has to put them. Do you have any idea where this happens? — danield, Mar 30 '21 at 11:49
In the end it is here: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/process.c#L125 — user123, Mar 30 '21 at 18:21
(Where the stack pointer is set in the task_struct). The wake_up_new_task() function will probably set the RSP register of the processor based on the field in the task_struct which represents the stack_pointer. In the end it is a function which calls another which calls another so you can't really go up the whole chain because you'd have to look at the whole kernel. — user123, Mar 30 '21 at 18:36

score 3 · Answer 2 · answered Apr 21 '21 at 04:02

Yes, libc; the kernel interface is like fork: it returns twice to the same place, but with different return values. (0 in the child or a PID/TID in the parent). The man page documents the glibc wrapper vs. kernel differences, like for other system calls where there's a difference.

The libc wrapper stashes the function pointer and arg you pass in the new thread's stack space, where the new thread can load it. (The kernel starts it with its RSP set to the void *stack arg passed to clone(), so it doesn't have access to old locals in stack memory or registers, and using a global wouldn't be thread-safe if multiple threads are cloning themselves at the same time.)

Note that there's also a clone3 system call that takes a struct arg, and is also more like the raw kernel interface for clone. (Or at least there is no glibc wrapper for it.)

Who sets the RIP register when you call the clone syscall?

2 Answers2

Linked