
I have noticed that GCC generates very different (and less efficient) code when it is given a union of a SIMD vector type and any other type of the same size and alignment that is not itself a vector type.

In particular, as can be seen in this Godbolt example, when an __m128 vector type is placed in a union with a non-vector type, the union is passed in two XMM registers per argument and then stored to the stack before being used with addps, instead of being passed in a single XMM register and used with addps directly. In the other two cases (a union containing only the __m128, and the bare __m128 itself), the arguments and return value are passed directly in XMM registers and no stack traffic is generated.

What causes this discrepancy? Is there a way to "force" GCC to pass the multi-element union in XMM registers?

With union:

#include <immintrin.h>
#include <array>

union simd
{
    __m128 vec;
    alignas(__m128) std::array<float, 4> values; 
};

simd add(simd a, simd b) noexcept
{
    simd ret;
    ret.vec = _mm_add_ps(a.vec, b.vec);
    return ret;
}
add(simd, simd):
        movq    QWORD PTR [rsp-40], xmm0
        movq    QWORD PTR [rsp-32], xmm1
        movq    QWORD PTR [rsp-24], xmm2
        movq    QWORD PTR [rsp-16], xmm3
        movaps  xmm4, XMMWORD PTR [rsp-24]
        addps   xmm4, XMMWORD PTR [rsp-40]
        movaps  XMMWORD PTR [rsp-40], xmm4
        movq    xmm1, QWORD PTR [rsp-32]
        movq    xmm0, QWORD PTR [rsp-40]
        ret

Without union:

__m128 add(__m128 a, __m128 b) noexcept
{
    return _mm_add_ps(a, b);
}
add(float __vector(4), float __vector(4)):
        addps   xmm0, xmm1
        ret

Note that the second case also applies when the __m128 vector is wrapped in an enclosing struct or union.
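A minimal illustration of that note (sketched from the statement above, not part of the original Godbolt example): wrapping the vector alone in a struct keeps the single-register behavior.

```cpp
#include <immintrin.h>

// A struct whose only member is the vector is classified like __m128 itself,
// so GCC passes and returns it in a single XMM register.
struct wrapped { __m128 vec; };

wrapped add(wrapped a, wrapped b) noexcept
{
    return { _mm_add_ps(a.vec, b.vec) };
}
```

Per the note above, GCC compiles this to the same direct addps/ret sequence as the bare __m128 version.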

  • It isn't passed in memory. The return value is passed in XMM0 and XMM1; note the two `movq` instructions. That's how four floats are normally returned in the SysV ABI, using the lower 64 bits of each of those two registers. – Homer512 Jan 02 '23 at 21:44
  • @Homer512 you're right, I will update the question – JustClaire Jan 02 '23 at 21:47
  • Related question: https://stackoverflow.com/questions/68369577/how-to-control-the-abi-for-unions – Homer512 Jan 02 '23 at 22:07
  • @Homer512 i have seen that question, but unfortunately it does not explain the reason behind the ABI difference – JustClaire Jan 02 '23 at 22:10
  • 2
    I think it follows from the [SysV calling convention](https://stackoverflow.com/questions/18133812/where-is-the-x86-64-system-v-abi-documented) and its different rules for ```__m128``` and composite types. But GCC seems to "coalesce" types like ```__m128``` and ```__m128i``` and then ignore the union. The ABI also doesn't spell out how conflicting union fields (same bytes being float and int) are handled.GCC prefers int. – Homer512 Jan 02 '23 at 23:11
  • I have tested it with MSVC as well, and it seems that MSVC does the same memory-add thing as GCC including for the single-element union unless `__vectorcall` attribute is used. – JustClaire Jan 02 '23 at 23:22
  • Using `union` for type-punning is not defined in C++ anyway (although it "usually works" with most compilers) – chtz Jan 03 '23 at 00:13
  • @chtz Well, I am working with architecture-specific intrinsics here, so we are way past the implementation-defined border :) – JustClaire Jan 03 '23 at 01:04
  • The args aren't passed on the stack, that's just GCC's horrible strategy for merging them, instead of 2x `movlhps` and `movhlps`. That's a massive missed optimization that will kill performance even further if this doesn't inline (and thus optimize remove the calling-convention overhead.) – Peter Cordes Jan 03 '23 at 01:48

1 Answer


As suspected by Homer512, the answer lies in the AMD64 calling convention.

As per the System V AMD64 ABI, section 3.2.3, each eightbyte (8-byte chunk) of an argument receives its own class; fields smaller than 8 bytes are grouped into a single eightbyte, and trailing space is padded.

For an argument to be passed in a single vector register, its first eightbyte must be of class SSE and every subsequent eightbyte of class SSEUP. The SSE class denotes the lower-order 64 bits of a register, while SSEUP denotes the bits above them.

__m128 and the other vector types, for instance, are treated as multi-eightbyte arguments classified as SSE followed by SSEUP, so they are passed in a single register. A scalar float, in turn, is assigned the SSE class and is passed in the lower portion of a register.

Argument classes for aggregate types (arrays, structs and classes) and unions, however, are determined based on their composition.

As such, given a union:

union simd
{
    __m128 vec;
    float vals[4];
};

The __m128 vec member falls under the special-case rule for vector types and is classified as SSE+SSEUP, which would allow passing it in a single register. So far so good. The float vals[4] array, however, consists of two independent eightbytes, and each eightbyte of two packed floats is assigned the SSE class, so the array is classified as SSE+SSE. That does not match the SSE+SSEUP pattern, so the array on its own would be passed in the lower halves of two separate XMM registers. When the members' classifications are merged eightbyte by eightbyte, SSEUP combined with SSE yields SSE, so the union as a whole ends up classified SSE+SSE, is treated as two eightbyte arguments, and is passed in two registers.

To put it shortly, the calling convention treats the array (and therefore the union) as two separate eightbyte arguments and passes it in two registers, while a standalone __m128 is a single argument passed in a single register.
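The classification above can be sketched as comments on the union itself. The compiler does not expose the classes, so the static_asserts below only verify the layout, not the classification:

```cpp
#include <immintrin.h>

union simd
{
    __m128 vec;    // eightbyte 0: SSE, eightbyte 1: SSEUP -> one XMM register
    float vals[4]; // eightbyte 0: SSE, eightbyte 1: SSE   -> two register halves
};

// Merging the members' classes per eightbyte: eightbyte 0 is SSE in both
// members, and for eightbyte 1 merging SSEUP with SSE yields SSE, so the
// union as a whole is SSE+SSE and is passed in the low halves of two
// XMM registers.
static_assert(sizeof(simd) == 16, "the union spans two eightbytes");
static_assert(alignof(simd) == 16, "vector alignment is preserved");
```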

This, curiously, means that the following union

union simd
{
    __m128 vec;
    float vals[2];
};

is, in fact, classified as SSE+SSEUP and is thus passed in a single register: the __m128 vec member is classified SSE+SSEUP as before, while float vals[2] occupies only the first eightbyte and contributes a single SSE class.

Unfortunately, it seems there is no way to explicitly specify (or hint) argument classes to the compiler.
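One partial workaround (a sketch of my own, not something the ABI mandates): since a struct containing only the vector keeps the SSE+SSEUP classification, store just the __m128 and materialize the scalar elements on demand instead of keeping an array member.

```cpp
#include <immintrin.h>

// Only the vector is a member, so by-value passing classifies as SSE+SSEUP
// and uses a single XMM register per argument.
struct simd
{
    __m128 vec;

    // Scalar access spills to a local array only where it is actually needed,
    // rather than baking the array into the type's ABI classification.
    float operator[](int i) const
    {
        float tmp[4];
        _mm_storeu_ps(tmp, vec);
        return tmp[i];
    }
};

simd add(simd a, simd b) noexcept
{
    return simd{ _mm_add_ps(a.vec, b.vec) };
}
```

The trade-off is that element access is no longer a plain member read, but the hot by-value call paths keep the single-register convention.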
