What is the most performant way to split one AVX (AVX2) register into two SSE (SSE2) registers and, in the other direction, to join (concatenate) two SSE registers into one AVX register?

I need this for all register types: integer, float, and double.

For example, I wrote the following code for the float case:

#include <immintrin.h>

__m128 avx_to_sse_ps(__m256 a, __m128 * hi) {
    *hi = _mm256_castps256_ps128(
        _mm256_permute2f128_ps(a, a, 0b0000'0001)
    );
    return _mm256_castps256_ps128(a);
}

__m256 sse_to_avx_ps(__m128 a, __m128 b) {
    return _mm256_permute2f128_ps(
        _mm256_castps128_ps256(a),
        _mm256_castps128_ps256(b),
        0b0010'0000
    );
}

int main() {}

Is it possible to make this code any faster? And what about the integer and double cases: will the optimal code for them be similar to this?

Arty
  • You want `vinsertf128` / `vextractf128` in asm. In C, `_mm256_set_m128(__m128 hi, __m128 lo)` should compile to that. For extract, yes, cast the low half but extract the high half. – Peter Cordes Jun 12 '21 at 19:02
  • Also note that `__m128` doesn't have to mean SSE; AVX provides VEX encodings of every 128-bit vector instruction, and even new AVX-only instructions like [`vpermilps`](https://www.felixcloutier.com/x86/vpermilps) are available in `__m128` operand size. – Peter Cordes Jun 12 '21 at 19:02
  • @PeterCordes Thanks for the answer! Here is [another question](https://stackoverflow.com/questions/67955549/) of mine, if you don't mind. – Arty Jun 13 '21 at 06:35
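
Following the suggestion in the comments, here is a minimal sketch of what the cast/extract and set-based versions might look like for float, double, and integer vectors. The helper names (`split_ps`, `join_ps`, and so on), the `AVX_FN` macro, and `roundtrip_ok` are hypothetical and only for illustration. The macro assumes GCC or Clang and exists only so the snippet builds without `-mavx`. Whether this is actually faster than the `vperm2f128` version depends on the compiler and CPU, so it is worth checking the generated asm.

```cpp
#include <immintrin.h>
#include <cstring>

// Assumption: GCC/Clang target attribute so this builds without -mavx.
// On other compilers (e.g. MSVC) the macro expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// Split: the low half is a free cast, the high half is one extract (vextractf128).
AVX_FN static inline __m128 split_ps(__m256 a, __m128 *hi) {
    *hi = _mm256_extractf128_ps(a, 1);
    return _mm256_castps256_ps128(a);
}
AVX_FN static inline __m128d split_pd(__m256d a, __m128d *hi) {
    *hi = _mm256_extractf128_pd(a, 1);
    return _mm256_castpd256_pd128(a);
}
AVX_FN static inline __m128i split_si(__m256i a, __m128i *hi) {
    *hi = _mm256_extractf128_si256(a, 1);  // AVX form; no AVX2 needed
    return _mm256_castsi256_si128(a);
}

// Join: note that the set intrinsics take (hi, lo), in that order.
AVX_FN static inline __m256 join_ps(__m128 lo, __m128 hi) {
    return _mm256_set_m128(hi, lo);
}
AVX_FN static inline __m256d join_pd(__m128d lo, __m128d hi) {
    return _mm256_set_m128d(hi, lo);
}
AVX_FN static inline __m256i join_si(__m128i lo, __m128i hi) {
    return _mm256_set_m128i(hi, lo);
}

// Splits and re-joins one vector of each type, then checks that both the
// two halves and the re-joined vector match the input bit for bit.
AVX_FN bool roundtrip_ok() {
    const float  f[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const double d[4] = {0, 1, 2, 3};
    const int    i[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float  f_halves[8], f_joined[8];
    double d_halves[4], d_joined[4];
    int    i_halves[8], i_joined[8];

    __m128 f_hi;
    __m128 f_lo = split_ps(_mm256_loadu_ps(f), &f_hi);
    _mm_storeu_ps(f_halves, f_lo);
    _mm_storeu_ps(f_halves + 4, f_hi);
    _mm256_storeu_ps(f_joined, join_ps(f_lo, f_hi));

    __m128d d_hi;
    __m128d d_lo = split_pd(_mm256_loadu_pd(d), &d_hi);
    _mm_storeu_pd(d_halves, d_lo);
    _mm_storeu_pd(d_halves + 2, d_hi);
    _mm256_storeu_pd(d_joined, join_pd(d_lo, d_hi));

    __m128i i_hi;
    __m128i i_lo = split_si(_mm256_loadu_si256((const __m256i *)i), &i_hi);
    _mm_storeu_si128((__m128i *)i_halves, i_lo);
    _mm_storeu_si128((__m128i *)(i_halves + 4), i_hi);
    _mm256_storeu_si256((__m256i *)i_joined, join_si(i_lo, i_hi));

    return std::memcmp(f, f_halves, sizeof f) == 0
        && std::memcmp(f, f_joined, sizeof f) == 0
        && std::memcmp(d, d_halves, sizeof d) == 0
        && std::memcmp(d, d_joined, sizeof d) == 0
        && std::memcmp(i, i_halves, sizeof i) == 0
        && std::memcmp(i, i_joined, sizeof i) == 0;
}
```

The three cases differ only in the intrinsic suffix (`_ps`, `_pd`, `_si256` / `_m128`, `_m128d`, `_m128i`). If a compiler lacks the `_mm256_set_m128*` intrinsics, `_mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1)` and its `_pd` / `_si256` counterparts express the same join.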
