Is there any way on Windows to work around the requirement that the XMM registers (XMM6..XMM15) are preserved across a function call? (Aside from writing it all in assembly.)

I have many AVX2 intrinsic functions that are unfortunately bloated by this.

As an example, this will be placed at the top of the function by the compiler (MSVC):

00007FF9D0EBC602 vmovaps xmmword ptr [rsp+1490h],xmm6
00007FF9D0EBC60B vmovaps xmmword ptr [rsp+1480h],xmm7
00007FF9D0EBC614 vmovaps xmmword ptr [rsp+1470h],xmm8
00007FF9D0EBC61D vmovaps xmmword ptr [rsp+1460h],xmm9
00007FF9D0EBC626 vmovaps xmmword ptr [rsp+1450h],xmm10
00007FF9D0EBC62F vmovaps xmmword ptr [rsp+1440h],xmm11
00007FF9D0EBC638 vmovaps xmmword ptr [rsp+1430h],xmm12
00007FF9D0EBC641 vmovaps xmmword ptr [rsp+1420h],xmm13
00007FF9D0EBC64A vmovaps xmmword ptr [rsp+1410h],xmm14
00007FF9D0EBC653 vmovaps xmmword ptr [rsp+1400h],xmm15

And then at the end of the function:

00007FF9D0EBD6E6 vmovaps xmm6,xmmword ptr [r11-10h]
00007FF9D0EBD6EC vmovaps xmm7,xmmword ptr [r11-20h]
00007FF9D0EBD6F2 vmovaps xmm8,xmmword ptr [r11-30h]
00007FF9D0EBD6F8 vmovaps xmm9,xmmword ptr [r11-40h]
00007FF9D0EBD6FE vmovaps xmm10,xmmword ptr [r11-50h]
00007FF9D0EBD704 vmovaps xmm11,xmmword ptr [r11-60h]
00007FF9D0EBD70A vmovaps xmm12,xmmword ptr [r11-70h]
00007FF9D0EBD710 vmovaps xmm13,xmmword ptr [r11-80h]
00007FF9D0EBD716 vmovaps xmm14,xmmword ptr [r11-90h]
00007FF9D0EBD71F vmovaps xmm15,xmmword ptr [r11-0A0h]

That is 20 instructions that accomplish nothing, since I have no need to preserve the state of the XMM registers. I have hundreds of these functions that the compiler is bloating like this. They are all invoked from the same call site via function pointers.

I tried changing the calling convention (__vectorcall/__cdecl/__fastcall), but that doesn't appear to change anything.
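
Roughly, the dispatch looks like this (a simplified sketch, not my real code):

// Simplified sketch of the dispatch setup (names are made up).
// On x64, MSVC maps __cdecl/__fastcall onto the one native convention, and
// even __vectorcall keeps XMM6..XMM15 callee-saved, so the spills remain
// no matter which keyword is used here.
using OpFn = void(__vectorcall *)(float*, float*);
OpFn ops[256];                     // runtime-compiled sequence of AVX2 helpers

void run(float* data, int n) {
    for (int i = 0; i < n; ++i)
        ops[i](data, data + 8);    // every helper is reached through a pointer
}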

Froglegs
  • Typically, these intrinsic functions are inlined. Why is that not the case in your code? – fuz May 17 '19 at 15:08
  • It is a VM being fed runtime-compiled sequences; also, many of these functions are not small – Froglegs May 17 '19 at 15:11
  • Are you sure compiler optimisations are turned on? And what do you mean by “VM?” If the functions are not small, what is the problem with saving and restoring registers? – fuz May 17 '19 at 15:23
  • The calling code assumes that the registers are preserved. If you don't preserve them, the calling function (or its caller, or its caller's caller, ...) may behave erratically. – Raymond Chen May 17 '19 at 15:30
  • @fuz: the OP means whole functions *using* intrinsics, not that they're getting this around each individual intrinsic. But yes, inlining is the solution, so you have a few big functions amortizing the cost of save/restore. The OP is defeating that with function pointers because they're apparently writing an interpreting virtual machine (VM). – Peter Cordes May 17 '19 at 15:34
  • @fuz perhaps assume I'm not a moron? Of course optimizations are on. While the overall impact of this register restoring is small, it is completely pointless in my case, so I'd like to remove it – Froglegs May 17 '19 at 16:09
  • @Froglegs I do not make any assumptions about your mental capacity. But given that you provided zero details and no [MCVE] in your question, I have to guess what the actual situation is. Perhaps try to ask a better question next time instead of feeling insulted by other people's attempts to help you. – fuz May 17 '19 at 16:16
  • Most interpreters don't do what your code is doing, because calling hundreds of possible functions from a single call site often results in almost every call being mispredicted by branch prediction, leading to huge stalls. So most interpreters at least try to partially inline functions and use tricks like so-called "threaded code" using computed gotos to improve prediction. Worrying about the cost of these registers being saved is probably premature at this point. – Ross Ridge May 17 '19 at 16:24
  • @Ross Ridge MSVC does not support computed goto, unfortunately. "Huge stalls" is relative; in my case these functions internally loop N (32) times to amortize the cost of the function call and stay in the uop cache, etc. I'm well aware that this register restoring is a minor cost, but I'd still like to remove it at some point. – Froglegs May 17 '19 at 16:31
  • @RossRidge: modern branch predictors do a lot better with a single indirect-branch dispatch than in the past. ITTAGE (Haswell) can spread the prediction for a single branch over many bits in the BTB because it indexes based on branch history. I think AMD's perceptron predictors are also good? See this paper from 2015: [Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore](https://hal.inria.fr/hal-01100647/document) But yes in the past this was a huge problem: [X86 prefetching optimizations: "computed goto" threaded code](//stackoverflow.com/q/46321531) – Peter Cordes May 17 '19 at 16:51
  • (Partial) inlining probably still has benefits. Of course full JIT can be even better, especially if you do any actual optimization between the blocks you paste together. – Peter Cordes May 17 '19 at 16:53

1 Answer

Use the x86-64 System V calling convention for the helper functions that you want to piece together via function pointers. In that calling convention, all of xmm/ymm0..15 and zmm0..31 are call-clobbered, so even helper functions that need more than 6 vector registers don't have to save/restore any.

The outer interpreter function that calls them should still use Windows x64 fastcall or vectorcall, so from the outside it fully respects that calling convention.

This will hoist all the save/restore of XMM6..15 into that caller, instead of each helper function. This reduces static code size and amortizes the runtime cost over multiple calls through your function pointers.


AFAIK, MSVC doesn't support marking functions as using the x86-64 System V calling convention, only fastcall vs. vectorcall, so you'll have to use clang.

(ICC is buggy and fails to save/restore XMM6..15 around a call to a System V ABI function).

Windows GCC is buggy with 32-byte stack alignment for spilling __m256, so it's not in general safe to use GCC with -march= set to anything that includes AVX.


Use __attribute__((sysv_abi)) or __attribute__((ms_abi)) on function and function-pointer declarations.

I think ms_abi is __fastcall, not __vectorcall. Clang may support __attribute__((vectorcall)) as well, but I haven't tried it. Google results are mostly feature requests/discussion.

void (*helpers[10])(float *, float*) __attribute__((sysv_abi));

__attribute__((ms_abi))
void outer(float *p) {
    helpers[0](p, p+10);
    helpers[1](p, p+10);
    helpers[2](p+20, p+30);
}

compiles as follows on Godbolt with clang 8.0 -O3 -march=skylake. (gcc/clang on Godbolt target Linux, but I used explicit ms_abi and sysv_abi on both the function and function-pointers so the code gen doesn't depend on the fact that the default is sysv_abi. Obviously you'd want to build your function with a Windows gcc or clang so calls to other functions would use the right calling convention. And a useful object-file format, etc.)

Notice that gcc/clang emit code for outer() that expects the incoming pointer arg in RCX (Windows x64), but pass it to the callees in RDI and RSI (x86-64 System V).

outer:                                  # @outer
        push    r14
        push    rsi
        push    rdi
        push    rbx
        sub     rsp, 168
        vmovaps xmmword ptr [rsp + 144], xmm15 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 128], xmm14 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 112], xmm13 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 96], xmm12 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 80], xmm11 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 64], xmm10 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 48], xmm9 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 32], xmm8 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 16], xmm7 # 16-byte Spill
        vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
        mov     rbx, rcx                            # save p 
        lea     r14, [rcx + 40]
        mov     rdi, rcx
        mov     rsi, r14
        call    qword ptr [rip + helpers]
        mov     rdi, rbx
        mov     rsi, r14
        call    qword ptr [rip + helpers+8]
        lea     rdi, [rbx + 80]
        lea     rsi, [rbx + 120]
        call    qword ptr [rip + helpers+16]
        vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
        vmovaps xmm7, xmmword ptr [rsp + 16] # 16-byte Reload
        vmovaps xmm8, xmmword ptr [rsp + 32] # 16-byte Reload
        vmovaps xmm9, xmmword ptr [rsp + 48] # 16-byte Reload
        vmovaps xmm10, xmmword ptr [rsp + 64] # 16-byte Reload
        vmovaps xmm11, xmmword ptr [rsp + 80] # 16-byte Reload
        vmovaps xmm12, xmmword ptr [rsp + 96] # 16-byte Reload
        vmovaps xmm13, xmmword ptr [rsp + 112] # 16-byte Reload
        vmovaps xmm14, xmmword ptr [rsp + 128] # 16-byte Reload
        vmovaps xmm15, xmmword ptr [rsp + 144] # 16-byte Reload
        add     rsp, 168
        pop     rbx
        pop     rdi
        pop     rsi
        pop     r14
        ret

GCC makes basically the same code. But Windows GCC is buggy with AVX.

ICC19 makes similar code, but without the save/restore of xmm6..15. This is a showstopper bug; if any of the callees do clobber those regs like they're allowed to, then returning from this function will violate its calling convention.

This leaves clang as the only compiler that you can use. That's fine; clang is very good.
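
For reference, a minimal sketch of what one of the helper functions could look like on the callee side (the name and body are invented; the point is the sysv_abi attribute):

#include <immintrin.h>

// Hypothetical AVX2 helper. With sysv_abi, all of xmm/ymm0..15 are
// call-clobbered, so clang never needs to spill xmm6..15 in the prologue,
// however much register pressure the body creates.
__attribute__((sysv_abi))
void scale_add(float *dst, float *src)
{
    const __m256 k = _mm256_set1_ps(1.5f);
    for (int i = 0; i < 32; i++) {               // loop a few times per call,
        __m256 v = _mm256_loadu_ps(src + 8*i);   // as in the question
        _mm256_storeu_ps(dst + 8*i, _mm256_fmadd_ps(v, k, k));
    }
}

Built with clang -O3 -march=skylake (as above), the generated code is free to use ymm6..15 as scratch without any save/restore; whether it actually touches them just depends on register pressure.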


If your callees don't need all the YMM registers, saving/restoring all of them in the outer function is overkill. But there's no middle ground with existing toolchains; you'd have to hand-write outer in asm to take advantage of knowing that none of your possible callees ever clobber XMM15 for example.


Note that calling other MS-ABI functions from inside outer() is totally fine. GCC / clang will (barring bugs) emit correct code for that too, and it's fine if a called function chooses not to destroy xmm6..15.
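
For instance, a hypothetical outer2() that mixes the two (log_stats is a made-up ordinary Windows-ABI function; helpers is the table from above):

extern void (*helpers[10])(float *, float *) __attribute__((sysv_abi));

__attribute__((ms_abi))
void log_stats(float *p);             // made-up ordinary Windows x64 function

__attribute__((ms_abi))
void outer2(float *p) {
    helpers[0](p, p + 10);            // System V call: may clobber xmm6..15
    log_stats(p);                     // MS-ABI call: preserves xmm6..15 itself
    helpers[1](p, p + 10);            // outer2 still restores xmm6..15 for its
}                                     // own caller before returning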

Peter Cordes