We can use the following cast intrinsics; they do not generate any instructions:
_mm256_castps128_ps256 __m128 → __m256
_mm256_castps256_ps128 __m256 → __m128
_mm512_castps256_ps512 __m256 → __m512
_mm512_castps512_ps256 __m512 → __m256
There are similar intrinsics for other types, for example integer vectors:
_mm256_castsi256_si128 __m256i → __m128i
_mm256_castsi128_si256 __m128i → __m256i
Also, if we need to convert between various types of the same size, we can use the following typecast intrinsics:
_mm256_castps_si256 __m256 → __m256i
_mm256_castsi256_ps __m256i → __m256
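To make the distinction concrete: a cast intrinsic only changes the type the compiler sees, while a conversion intrinsic such as _mm256_cvtps_epi32 changes the bits. The following is a minimal sketch, not a definitive implementation. The names cast_vs_convert, cpu_has_avx and AVX_FN are hypothetical helpers introduced for illustration, and the per-function target attribute and __builtin_cpu_supports are GCC/Clang extensions, assumed here so that the file builds without -mavx.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical convenience macro (GCC/Clang only): enables AVX for one function.
#define AVX_FN __attribute__((target("avx")))

// Runtime check so callers can skip the demo on CPUs without AVX.
bool cpu_has_avx() { return __builtin_cpu_supports("avx"); }

// Writes lane 0 of the reinterpreted vector and lane 0 of the converted vector.
AVX_FN void cast_vs_convert(float value, std::int32_t *reinterpreted, std::int32_t *converted)
{
    const __m256 v = _mm256_set1_ps(value);
    const __m256i bits = _mm256_castps_si256(v); // same bits, viewed as integers
    const __m256i ints = _mm256_cvtps_epi32(v);  // numeric float -> int32 conversion

    std::int32_t tmp[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(tmp), bits);
    *reinterpreted = tmp[0];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(tmp), ints);
    *converted = tmp[0];
}
```

For value = 1.0f, the reinterpreted lane holds the IEEE 754 bit pattern 0x3F800000, while the converted lane holds 1.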
All cast intrinsics are listed in the Intel Intrinsics Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6371&cats=Cast
Therefore, the code to combine four xmm registers into one zmm register is the following:
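// x1 ends up in the lowest 128 bits of the result, x4 in the highest.
// Note: _mm512_insertf32x8 requires AVX512DQ.
__m512 m512_combine_m128x4(__m128 x4, __m128 x3, __m128 x2, __m128 x1)
{
    const __m256 h = _mm256_set_m128(x4, x3); // upper 256 bits
    const __m256 l = _mm256_set_m128(x2, x1); // lower 256 bits
    // The cast is free; the upper half it leaves undefined is overwritten by h.
    return _mm512_insertf32x8(_mm512_castps256_ps512(l), h, 1);
}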
It is translated into two vinsertf128 instructions and one vinsertf32x8 instruction.
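The lane order can be checked with a small round trip through memory. This is a sketch under stated assumptions: combine_demo, cpu_has_avx512dq and AVX512_FN are hypothetical names introduced for illustration, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions, assumed here so that the file builds without -mavx512f -mavx512dq and can skip the demo on CPUs that lack AVX512DQ.

```cpp
#include <immintrin.h>

// Hypothetical convenience macro (GCC/Clang only): enables AVX-512F/DQ per function.
#define AVX512_FN __attribute__((target("avx512f,avx512dq")))

// Runtime check so callers can skip the demo on CPUs without AVX512F/DQ.
bool cpu_has_avx512dq()
{
    return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq");
}

// Same body as the function above, with the target attribute added.
AVX512_FN static __m512 m512_combine_m128x4(__m128 x4, __m128 x3, __m128 x2, __m128 x1)
{
    const __m256 h = _mm256_set_m128(x4, x3);
    const __m256 l = _mm256_set_m128(x2, x1);
    return _mm512_insertf32x8(_mm512_castps256_ps512(l), h, 1);
}

// Loads four groups of 4 floats from in, combines them, stores 16 floats to out.
AVX512_FN void combine_demo(const float *in, float *out)
{
    const __m128 x1 = _mm_loadu_ps(in + 0);
    const __m128 x2 = _mm_loadu_ps(in + 4);
    const __m128 x3 = _mm_loadu_ps(in + 8);
    const __m128 x4 = _mm_loadu_ps(in + 12);
    _mm512_storeu_ps(out, m512_combine_m128x4(x4, x3, x2, x1));
}
```

If in holds 0, 1, ..., 15, then out receives the same sequence, which confirms that x1 lands in the lowest lane and x4 in the highest.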
With the cast intrinsics, this becomes straightforward. The code to split is similar and uses the casts listed above. To extract the lowest bits, you can simply cast from the wider type to the narrower one; the upper bits are discarded.
Here is an example that extracts four xmm registers from one zmm register, using the __m128i and __m512i data types. m512_split_m128x4 translates into 3 assembly instructions and produces all results directly from the input zmm register, without any intermediate instructions that would create a dependency chain, as suggested by Peter Cordes in a comment:
void m512_split_m128x4(__m512i r, __m128i &x4, __m128i &x3, __m128i &x2, __m128i &x1)
{
    x1 = _mm256_castsi256_si128(_mm512_castsi512_si256(r));       // bits 127:0, casts only
    x2 = _mm256_extracti128_si256(_mm512_castsi512_si256(r), 1);  // bits 255:128
    x3 = _mm256_castsi256_si128(_mm512_extracti32x8_epi32(r, 1)); // bits 383:256
    x4 = _mm512_extracti32x4_epi32(r, 3);                         // bits 511:384
}
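A quick way to confirm which 128-bit lane ends up in which output is to split a vector with known contents and store the parts back in order. This is a sketch under stated assumptions: split_demo, cpu_has_avx512dq and AVX512_FN are hypothetical names introduced for illustration, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions, assumed here so that the file builds without -mavx512f -mavx512dq.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical convenience macro (GCC/Clang only): enables AVX-512F/DQ per function.
#define AVX512_FN __attribute__((target("avx512f,avx512dq")))

// Runtime check so callers can skip the demo on CPUs without AVX512F/DQ.
bool cpu_has_avx512dq()
{
    return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512dq");
}

// Same body as the function above, with the target attribute added.
AVX512_FN static void m512_split_m128x4(__m512i r, __m128i &x4, __m128i &x3, __m128i &x2, __m128i &x1)
{
    x1 = _mm256_castsi256_si128(_mm512_castsi512_si256(r));
    x2 = _mm256_extracti128_si256(_mm512_castsi512_si256(r), 1);
    x3 = _mm256_castsi256_si128(_mm512_extracti32x8_epi32(r, 1));
    x4 = _mm512_extracti32x4_epi32(r, 3);
}

// Loads 16 int32 values, splits them, and stores x1..x4 back in ascending order.
AVX512_FN void split_demo(const std::int32_t *in, std::int32_t *out)
{
    const __m512i r = _mm512_loadu_si512(in);
    __m128i x4, x3, x2, x1;
    m512_split_m128x4(r, x4, x3, x2, x1);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 0), x1);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), x2);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 8), x3);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 12), x4);
}
```

If in holds 0, 1, ..., 15, then out receives the same sequence: x1 carries the lowest four elements and x4 the highest four.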
Here is a more complicated example that uses an intermediary function, m256_split_m128x2; however, it creates a dependency chain:
void m256_split_m128x2(__m256i r, __m128i &hi, __m128i &lo)
{
    hi = _mm256_extracti128_si256(r, 1);
    lo = _mm256_castsi256_si128(r);
}

void m512_split_m128x4(__m512i r, __m128i &x4, __m128i &x3, __m128i &x2, __m128i &x1)
{
    const __m256i h = _mm512_extracti32x8_epi32(r, 1);
    const __m256i l = _mm512_castsi512_si256(r);
    m256_split_m128x2(h, x4, x3);
    m256_split_m128x2(l, x2, x1);
}