Writing a thunk to verify SysV ABI compliance

Question

The SysV ABI defines the C-level and assembly calling conventions for Linux.

I would like to write a generic thunk that verifies that a function satisfied the ABI restrictions on callee preserved registers and (perhaps) tried to return a value.

So given a target function like int foo(int, int) it's pretty easy³ to write such a thunk in assembly, something like¹:

foo_thunk:
push rbp
push rbx
push r12
push r13
push r14
push r15
call foo
cmp rbp, [rsp + 40]
jne bad_rbp
cmp rbx, [rsp + 32]
jne bad_rbx
cmp r12, [rsp + 24]
jne bad_r12
cmp r13, [rsp + 16]
jne bad_r13
cmp r14, [rsp + 8]
jne bad_r14
cmp r15, [rsp]
jne bad_r15
; discard the six saved registers so ret sees the return address
add rsp, 6 * 8
ret

Now of course I don't actually want to write a separate foo_thunk method for each call, I just want one generic one. This one should take a pointer to the underlying function (let's say in rax), and would use an indirect call (call rax) rather than call foo, but would otherwise be the same.

What I can't figure out is how to implement the transparent use of the thunk at the C level (or in C++, where there seem to be more meta-programming options - but let's stick to C here). I want to take something like:

foo(1, 2);

and translate it to a call to the thunk, but still passing the same arguments in the same places (that's needed for the thunk to work).

It is expected that I modify the source, perhaps with macro or template magic, so the call above could be changed to:

CHECK_THUNK(foo, (1, 2));

Giving the macro the name of the underlying function. In principle it could translate this to²:

check_thunk(&foo, 1, 2);

How can I declare check_thunk though? The first argument is "some type" of function pointer. We could try:

check_thunk(void (*ptr)(void), ...);

So a "generic" function pointer (any function pointer can validly be cast to this, and we'll only actually call it from assembly, outside the claws of the language standard), plus varargs.

This doesn't work though: the ... has totally different promotion rules than a properly prototyped function. It will work for the foo(1, 2) example, but if you call foo(1.0, 2) instead, the varargs version will just leave the 1.0 as a double and you'll be calling foo with a totally wrong value (a double value punned as an integer).

The above also has the disadvantage of passing the function pointer as the first argument, which means the thunk no longer works as-is: it has to save the function pointer in rdi somewhere and then shift all the values over by one (i.e., mov rdi, rsi). If there are non-register args, things get really messy.

Is there any way to make this work smoothly?

Note: this type of thunk is basically incompatible with any passing of parameters on the stack, which is an acceptable limitation of this approach (it should simply not be used for functions with that many arguments or with MEMORY class arguments).

¹ This checks only the callee-preserved registers, but the other checks are similarly straightforward.

² In fact, you don't even really need the macro for that - but it's also there so you can turn off the thunk in release builds and just do a direct call.

³ Well by "easy" I guess I mean one that doesn't work in all cases. The shown thunk doesn't correctly align the stack (easy to fix), and breaks if foo has any stack-passed arguments (significantly harder to fix).
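As a rough illustration of footnote ², here is a minimal sketch of a CHECK_THUNK macro that can be switched between a checked path and a plain direct call. The ABI_CHECK switch and the thunked_calls counter are made up for this sketch; a real checked build would route the call through the asm thunk rather than just counting it.

```c
#include <assert.h>

/* Assumption: ABI_CHECK is a made-up switch; a build system would set it. */
#ifndef ABI_CHECK
#define ABI_CHECK 1
#endif

/* Hypothetical stand-in: counts calls that took the checked path. */
static int thunked_calls;

static int foo(int a, int b) { return a + b; }

#if ABI_CHECK
/* Checked build: a real version would call through the asm thunk here. */
#define CHECK_THUNK(fn, args) (thunked_calls++, (fn) args)
#else
/* Release build: just the direct call, no overhead. */
#define CHECK_THUNK(fn, args) ((fn) args)
#endif

void demo_check_thunk(void)
{
    int before = thunked_calls;
    int res = CHECK_THUNK(foo, (1, 2));   /* ends up calling foo(1, 2) */
    assert(res == 3);
    assert(thunked_calls == before + ABI_CHECK);
}
```

Because the argument list is passed as one parenthesized macro argument, the call site looks the same in both builds.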

I wonder if you could use any `plt` infrastructure for this. e.g. modify `gcc` to call through thunk wrappers instead of through the PLT? Or modify the dynamic-linker resolution stuff to resolve PLT calls to go through the thunk as well? And compile with `-fPIC` to force all(?) calls to go through the PLT. I guess you only want this for a few hand-written functions though, not for compiler output, so that would be overkill, and something per-function would be ok. — Peter Cordes, Oct 24 '17 at 08:12
@PeterCordes Even with `PIC` I'm pretty sure only calls to other objects (e.g., another `.so`) will go through the `plt`. — BeeOnRope, Oct 24 '17 at 08:14
I think externally-visible functions in the same object go through the PLT, to allow symbol interposition. See [Sorry state of dynamic libraries on Linux](http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/) — Peter Cordes, Oct 24 '17 at 08:15
Yes, I think you are right. I'm not sure if it applies to executables with -fPIE though? I'm not writing a shared object. — BeeOnRope, Oct 24 '17 at 08:17
That's why I said to compile with `-fPIC`, not `-fPIE`. Those are the compile-time code-gen options, so you don't need `-shared` at compile time, just link time (AFAIK). — Peter Cordes, Oct 24 '17 at 08:19
gcc's [`__attribute__((ifunc ("resolver")))`](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html) might be usable here. I haven't used it, but maybe you could have it return a pointer to the thunk? Hrm, no that wouldn't get a function-pointer arg passed. How many functions do you want these checks on, and could you programmatically generate thunks for all of them? Maybe with asm macros to either define the symbol `foo` on the thunk and call a hidden internal `foo`, or to define `foo` on the real definition? — Peter Cordes, Oct 24 '17 at 09:04
@PeterCordes - perhaps a few dozen functions. For my current implementation I can just do the check entirely in asm, e.g., with a macro that either compiles the function bare or with the checking code (either using a thunk or just inlining the code on each function). I wanted a C solution too though since it seemed more generally useful, and it's not always practical to recompile the asm (and the more I write in C and the less in asm the better). — BeeOnRope, Oct 24 '17 at 19:22

BeeOnRope · Answer 1 · 2017-10-28T08:01:56.813

One way to do this, in a gcc-specific way, is to take advantage of typeof and nested functions to create a function pointer that embeds the call to the underlying function, but itself doesn't have any arguments.

This pointer can be passed to the thunk method, which calls it and verifies ABI compliance.

Here's an example of transforming a call to int add3(int, int, int) using this method:

The original call looks like:

int res = add3(a, b, c);

Then you wrap the call in a macro, like this²:

CALL_THUNKED(int res, add3, (a,b,c));

... which expands into something like:

    typedef typeof(add3  (a,b,c)) ret_type; 

    ret_type closure() {              
        return add3  (a,b,c);         
    }                                 
    typedef ret_type (*typed_closure)(void);  
    typedef ret_type (*thunk_t)(typed_closure); 

    thunk_t thunk = (thunk_t)closure_thunk; 
    int res = thunk(&closure);

We create the closure() function on the stack, which calls directly into add3 with the original arguments. We can take the address of this closure and pass it to an asm function without difficulty: calling it will have the ultimate effect of calling add3 with the arguments¹.

The rest of the typedefs is basically dealing with the return type. We have only a single closure_thunk method, declared like this void* closure_thunk(void (*)(void)); and implemented in assembly. It takes a function pointer (any function pointer is convertible to any other), but the return type is "wrong". We cast it to thunk_t which is a dynamically generated typedef for a function that has the "right" return type.

Of course, that's certainly not legal for C functions, but we are implementing the function in asm, so we kind of sidestep the issue (if you wanted to be a bit more compliant, you could perhaps ask the asm code for a function pointer of the right type, which can "generate" it each time, outside of the reach of the standard: of course it's just returning the same pointer each time).
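The return-type handling above rests on typeof applied to a call expression. A minimal sketch of just that piece, using the `__typeof__` spelling; the DECLARE_RESULT macro and the add3_calls counter are hypothetical names for this illustration, not part of the expansion shown earlier:

```c
#include <assert.h>

static int add3_calls;   /* counts real invocations of add3 */

static int add3(int a, int b, int c)
{
    add3_calls++;
    return a + b + c;
}

/* Declare `name` with whatever type the call expression has, then
   initialise it with the call.  The operand of __typeof__ is not
   evaluated here, so add3 runs exactly once. */
#define DECLARE_RESULT(name, fn, args) \
    __typeof__((fn) args) name = (fn) args

void demo_declare_result(void)
{
    int before = add3_calls;
    DECLARE_RESULT(res, add3, (1, 2, 3));   /* same as: int res = add3(1, 2, 3); */
    assert(res == 6);
    assert(sizeof res == sizeof(int));
    assert(add3_calls == before + 1);        /* typeof did not call add3 */
}
```

The same idea is what lets the macro build ret_type without the caller spelling out the function's prototype.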

The closure_thunk function in asm is implemented along the lines of:

GLOBAL closure_thunk:function

closure_thunk:

push rsi
push_callee_saved

call rdi

; set up the function name
mov rdi, [rsp + 48]

; now check whether any regs were clobbered
cmp rbp, [rsp + 40]
jne bad_rbp
cmp rbx, [rsp + 32]
jne bad_rbx
cmp r12, [rsp + 24]
jne bad_r12
cmp r13, [rsp + 16]
jne bad_r13
cmp r14, [rsp + 8]
jne bad_r14
cmp r15, [rsp]
jne bad_r15

add rsp, 7 * 8
ret

That is, push all the registers we want to check on the stack (along with the function name), call the function in rdi and then do your checks. The bad_* methods aren't shown, but they basically spit out an error message like "Function add3 overwrote rbp... naughty!" and abort() the process.

This breaks if any arguments are passed on the stack, but it does work for return values passed on the stack (because the ABI for that case passes a pointer to the location for the return value as a hidden first argument in rdi, and the callee returns that pointer in rax).
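The bad_* handlers are not shown above. As a sketch of what they might call into, assuming the function name is already in rdi (as set up by the thunk) and the asm side loads a register name into rsi, the C part could look like this. format_clobber_message and report_clobber are hypothetical names:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Format the diagnostic into buf; returns the snprintf result. */
int format_clobber_message(char *buf, size_t size,
                           const char *func, const char *reg)
{
    return snprintf(buf, size, "Function %s overwrote %s... naughty!",
                    func, reg);
}

/* What a bad_* label might jump to: print the message and abort. */
void report_clobber(const char *func, const char *reg)
{
    char msg[128];
    format_clobber_message(msg, sizeof msg, func, reg);
    fprintf(stderr, "%s\n", msg);
    abort();
}

void demo_clobber_message(void)
{
    char buf[64];
    int n = format_clobber_message(buf, sizeof buf, "add3", "rbp");
    assert(n == (int)strlen(buf));
    assert(strcmp(buf, "Function add3 overwrote rbp... naughty!") == 0);
}
```

Keeping the formatting separate from the abort() makes the message easy to test without killing the process.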

¹ How this is accomplished is kind of magic: gcc actually writes a few bytes of executable code onto the stack, and the closure function pointer points there. The few bytes basically load a register with a pointer to the region that contains the captured variables (a, b, c in this case), and then call the actual (read-only) closure() code, which can then access the captured variables through that pointer (and pass them to add3).
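To make footnote ¹ a bit more concrete, here is a portable sketch of the same idea done by hand: the captured variables live in a struct, and the closure body receives a pointer to it explicitly instead of through a register loaded by a stack trampoline. This is only a model of what gcc generates, and all names (add3_env, closure_body, call_closure) are made up for the illustration:

```c
#include <assert.h>

static int add3(int a, int b, int c) { return a + b + c; }

/* Captured variables: the region the trampoline's register would point at. */
struct add3_env { int a, b, c; };

/* The closure body reaches its captured variables only through env. */
static int closure_body(const struct add3_env *env)
{
    return add3(env->a, env->b, env->c);
}

/* Stand-in for the thunk: it gets the body and its environment explicitly,
   whereas the real trampoline hides the environment pointer in a register. */
static int call_closure(int (*body)(const struct add3_env *),
                        const struct add3_env *env)
{
    return body(env);
}

void demo_explicit_closure(void)
{
    struct add3_env env = { 1, 2, 3 };
    assert(call_closure(closure_body, &env) == 6);
}
```

The nested-function version saves you from writing the struct and the extra parameter, at the cost of needing an executable stack for the trampoline.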

² As it turns out, we could probably use gcc's statement expression syntax to write the macro in a more usual function like syntax, something like int res = CALL_THUNKED(add3, (a,b,c)).

I just tried it on Godbolt, because I was trying to remember exactly how gcc used `r10` for nested functions. https://godbolt.org/g/aS4S5M IDK if it's the same thing as a "static chain pointer", but it's only used between the trampoline on the stack and the closure, so no other code has to treat it as call-preserved. And BTW, on ARM after writing the trampoline to the stack it has to call `__clear_cache` because most non-x86 architectures don't have coherent I-cache. — Peter Cordes, Oct 26 '17 at 08:49
Yeah it's one where it's hard to see what's going on in godbolt, because you only see the code generation code, not the code it builds (you'd have to mentally work out what bytes end up on the stack and decode that...). I just used gdb and stepped into the trampoline. — BeeOnRope, Oct 26 '17 at 08:55

Peter Cordes · Answer 2 · 2017-10-24T09:13:42.050

At the C source level (without modifying gcc or the linker to insert the thunk for you), you could define different prototypes for each thunk but still share the same implementation.

You could put multiple labels on the definition in the asm source, so check_thunk_foo has the same address as check_thunk_bar, but you can use a different C prototype for each.

Or you could make weak aliases like this:

int check_thunk_foo(void*, int, int) 
    __attribute__ ((weak, alias ("check_thunk_generic")));
// or maybe this should be ((weakref ("check_thunk_generic")))

#define foo(...) check_thunk_foo((void*)&foo, __VA_ARGS__)

// or to put the args in their original slots,
// but then you'd need different thunks for different numbers of integer args.
#define foo(x, y) check_thunk_foo((x), (y), (void*)&foo)

The major downside to this is that you need to copy+modify the original prototype for every function. You could hack this up with CPP macros so there's a single point of definition for the arg list, and the real prototype (and the thunk if enabled) both use it. Possibly by re-including the same .h twice, with a wrapper macro defined differently. Once for the real prototypes, again for the thunks.
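One way the "single point of definition" idea might look is an X-macro list instead of re-including a header. This is a sketch under the assumption that plain C wrappers stand in for the per-function thunk prototypes; CHECKED_FUNCTIONS, checked_* and wrapper_calls are hypothetical names:

```c
#include <assert.h>

/* Single point of definition: return type, name, parameters, arguments. */
#define CHECKED_FUNCTIONS(X)                 \
    X(int, foo, (int a, int b), (a, b))      \
    X(int, negate, (int a), (a))

/* First expansion: the real prototypes. */
#define DECLARE_REAL(ret, name, params, args) ret name params;
CHECKED_FUNCTIONS(DECLARE_REAL)

static int wrapper_calls;   /* stand-in for "went through a thunk" */

/* Second expansion: one correctly typed wrapper per function. */
#define DEFINE_WRAPPER(ret, name, params, args) \
    ret checked_##name params { wrapper_calls++; return name args; }
CHECKED_FUNCTIONS(DEFINE_WRAPPER)

int foo(int a, int b) { return a + b; }
int negate(int a) { return -a; }

void demo_wrappers(void)
{
    int before = wrapper_calls;
    assert(checked_foo(1, 2) == 3);
    assert(checked_negate(5) == -5);
    assert(wrapper_calls == before + 2);
}
```

Adding a function to the list then updates both the real prototype and its wrapper in one place.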

BTW, passing the function pointer as an extra arg to a generic thunk is potentially problematic. I think it's not possible to reliably remove the first arg and forward the rest in the x86-64 SysV ABI. You don't know how many stack args there are, for functions that take more than 6 integer args. And you don't know if there are FP stack args before the first integer stack arg.

This should work fine for functions that pass all their register-possible args in registers. (i.e. if there are any stack args, they're large structs by value or other things that couldn't go in an integer register.)

To solve this problem, the thunk could dispatch based on return address instead of an extra hidden arg, if you had something like debug info to map call site return addresses to call targets. Or you could maybe get gcc to pass a hidden arg in rax or r11. Running call from inline asm sucks a lot, so you'd maybe need to customize gcc with support for some special attribute that passed a function pointer in an extra register.

but if you call foo(1.0, 2) instead, the varargs version will just leave the 1.0 as a double and you'll be calling foo with a totally wrong value (a double value punned as an integer).

Not that it matters, but no, you'd be calling foo(2, garbage) with xmm0=(double)1.0. Variadic functions still use register args the same as non-variadic functions (or with the option of passing FP args on the stack before you run out of registers, and setting al= less than 8).
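The difference in argument handling can be seen from C alone, without looking at registers. In this small sketch (function names are made up), a prototyped call converts a double argument to the int parameter type, while a variadic call applies only the default argument promotions, so the callee has to read the value back as a double:

```c
#include <assert.h>
#include <stdarg.h>

/* Prototyped: the compiler converts each argument to the parameter type. */
int prototyped_first(int first, int second)
{
    (void)second;
    return first;
}

/* Variadic: arguments only get the default argument promotions
   (float -> double, char/short -> int), so the value must be read
   back with the promoted type. */
double variadic_first_double(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    double first = va_arg(ap, double);   /* a float argument arrives as double */
    va_end(ap);
    return first;
}

void demo_promotions(void)
{
    double d = 1.0;
    assert(prototyped_first(d, 2) == 1);           /* converted to int 1 */
    assert(variadic_first_double(1, d) == 1.0);    /* stays a double */
    assert(variadic_first_double(1, 1.0f) == 1.0); /* float promoted to double */
}
```

Reading the variadic argument with va_arg(ap, int) instead would be undefined behavior, which is the C-level view of the mismatch described above.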

edited Oct 24 '17 at 09:13

answered Oct 24 '17 at 08:45

Peter Cordes

Right, the problem is I want a generic thunk at the C++ level too, not only because it's convenient, but because in a major use case the called function is actually a template parameter, hence unknown. Good point about `double` being even worse than I claimed. I'm OK if it fails at compile time if some reasonable number of args is exceeded - but finding a better way to pass the function pointer would be nice too (so having multiple thunks can be OK too, perhaps they can delegate to a master thunk after stuffing the function pointer somewhere common). – BeeOnRope Oct 24 '17 at 08:49
@BeeOnRope: I know this isn't a *good* answer, but posting an answer seemed like the right place to type up my first viable thought. You might need to hack the toolchain to make this more generic for the template use-case you describe. Or maybe symbol interposition can do it if you just have a list of all your asm function names. You could generate wrappers for them and LD_PRELOAD that. – Peter Cordes Oct 24 '17 at 08:53
I'm definitely not down for hacking the toolchain, it should work with the standard tools. Using interposition is an interesting idea. I realized that _any_ type of thunk like what I had in the OP can't work with stack args because the extra `call` means the stack is not as the callee expects: it will look for stack args in the wrong place. – BeeOnRope Oct 24 '17 at 23:20
One solution would be to save away "somewhere" the return address of the caller in the thunk, and then overwrite the return address with the address inside the thunk (e.g., `add rsp, 8; call foo`). Then the called function will return to the thunk (needed for the verification step) and we do our checks and finally `jmp` back to the caller using the saved address. That exact flow breaks the return stack predictor, so maybe it's better to organize it as `jmp + jmp` or `call + ret`: either is possible. – BeeOnRope Oct 24 '17 at 23:23
@BeeOnRope: Yeah, was going to suggest something like that as a hack for stack args. For leaf functions, you might get away with using TLS to stash some saved stuff. Or just really far *down* the stack if you know your callee consumes limited stack space and you have no signal handlers. And yeah, overwriting the return address in place is sensible. To return from the thunk without breaking the return stack, `push qword [fs:thunk_saved_ret]` / `ret` – Peter Cordes Oct 24 '17 at 23:28
Yeah, a separate stack pointed to by TLS is what I'm converging on as the gold-plated solution (for now I know none of my asm functions use stack args, so I'll probably just use the main stack). I imagine using TLS from asm is a giant PITA (as a practical matter I guess you can't do it portably because gcc and the linker coordinate to make TLS work in C/C++ programs and it depends on the compile options, the number of TLS variables, etc). So I'd probably call back into a C method from asm to get the side-band stack. – BeeOnRope Oct 25 '17 at 00:03
Or if we figure out how to pass some value (like the function pointer) to the asm in the first place, then the C code can just pass the storage directly into the asm thunk (and it can put the ultimate function to call in the storage too). One way would be with the `asm("rax")` syntax to force a local var into a specific register, but I don't know much about it. – BeeOnRope Oct 25 '17 at 00:04
I wrote an asm-only version of the thunk [here](https://github.com/travisdowns/nasm-utils/blob/master/nasm-util-inc.asm). It doesn't handle stack args. – BeeOnRope Oct 25 '17 at 09:20
I tested `register void *volatile fp asm("rax") = &func;` It doesn't work at all if the variable is unused. It might work without `volatile` if the variable is used, but gcc doesn't stop itself from using `eax` as a temporary in calculating a function arg. (There might be a way to use this in an inline function so the register variable has limited scope at least). Using `asm("" :: "a"(&ext));` leaves a function pointer in RAX, where it *might* stay there until `call thunk`. https://godbolt.org/g/y9A44E. This doesn't avoid needing prototypes though. – Peter Cordes Oct 25 '17 at 22:42
Boo. Seems like it's documented [not to work](https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables). Global register values seem [more promising](https://gcc.gnu.org/onlinedocs/gcc/Global-Register-Variables.html#Global-Register-Variables) though, you'd just need to put the thunk function in its own compilation unit. So perhaps the globally visible function stashes the &func somewhere then calls into the helper which can use the global register. I don't really have an intuitive sense of how this works though (e.g., threads). – BeeOnRope Oct 25 '17 at 22:57
@BeeOnRope: "stashes somewhere". Any way of solving that problem for a call across a compilation unit to a C wrapper could be used to call the ABI-checking asm function directly. Except that I guess it might let you use TLS more easily / portably. But do you really need this to be re-entrant from within the same thread? If not, just use a scalar TLS variable instead of a TLS stack. – Peter Cordes Oct 26 '17 at 01:47
_Any way of solving that problem for a call across a compilation unit to a C wrapper could be used to call the ABI-checking asm function directly._ - well not exactly, because in C you have all the power of C (and gcc extensions) to do the stashing, but at the C-asm boundary you are more restricted to things that are defined by the ABI (basically normal function calls). For example, in C-land you can call into a segregated helper function with the target func pointer, which then uses a gcc local function to close over the func pointer, returning a func pointer to the caller, which ... – BeeOnRope Oct 26 '17 at 02:03
... then calls it with the arguments to the original function. The helper function now has access to the original function pointer which it closed over. That's hard in assembly because local functions are some gcc magic. This example doesn't actually work as-is because the closure is dead as soon as you return it (once you've left the scope), but maybe it can be fixed with two layers of local functions. A more practical example is TLS: good luck using that even semi-portably from assembly :( – BeeOnRope Oct 26 '17 at 02:05
