
There was a general question about moving data between SSE and AVX512 registers. In contrast to that question, the present question is not about assembly instructions but about C intrinsics.

There is an intrinsic to insert two xmm registers into one ymm: __m256 _mm256_set_m128 (__m128 hi, __m128 lo), which corresponds to the vinsertf128 instruction. However, there is no similar intrinsic to insert two ymm registers into one zmm register, such as a hypothetical __m512 _mm512_set_m256 (__m256 hi, __m256 lo). With assembly instructions, when I write to a ymm register (e.g. with vinsertf128), the operation also implicitly clears the upper 256 bits of the corresponding zmm register; but what is the corresponding C intrinsic to typecast a ymm to a zmm? All intrinsics for the vinsertf32x8 instruction already require a 512-bit zmm register as input, whereas _mm256_set_m128 only returns a 256-bit ymm register.

What are the C intrinsics to extract four xmm registers from one zmm register, and vice versa? I cannot write an assembly function with fixed registers; I need the instructions to be inlined, using whatever registers are available.

Maxim Masiutin
  • An alternative you may want to consider is to put them in a union, e.g. `typedef union {__m512 AVX512; __m128 SSE[4];} vecunion_t;` – Simon Goater Apr 16 '23 at 14:58
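
A minimal sketch of that union approach might look as follows (the helper function and its name are illustrative, not part of the comment; reading a union member other than the one last written is well-defined in C and supported as an extension by GCC and Clang in C++):

#include <immintrin.h>

/* The union type from the comment above. */
typedef union {
    __m512 AVX512;
    __m128 SSE[4];
} vecunion_t;

/* Illustrative helper: read one 128-bit lane of a 512-bit vector.
   Compilers typically lower this through a stack spill and reload,
   i.e. it is the "go through memory" approach discussed in the answers below. */
static inline __m128 vecunion_lane(__m512 v, int i)
{
    vecunion_t u;
    u.AVX512 = v;
    return u.SSE[i];
}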

2 Answers


To extract, use

/* floating point domain */
__m128 _mm512_extractf32x4_ps (__m512 a, int imm8);
__m256 _mm512_extractf32x8_ps (__m512 a, int imm8);

/* integer domain */
__m128i _mm512_extracti64x2_epi64 (__m512i a, int imm8);
__m256i _mm512_extracti64x4_epi64 (__m512i a, int imm8);
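
For instance, a minimal usage sketch (the helper name is mine, not from the answer):

#include <immintrin.h>

/* Illustrative helper: extract the second 128-bit lane (elements 4..7)
   of a 512-bit float vector. Compiles to a single vextractf32x4; the
   lane index must be a compile-time constant. */
static inline __m128 extract_lane1(__m512 v)
{
    return _mm512_extractf32x4_ps(v, 1);
}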

To assemble, either go through memory (e.g. store into an array of four __m128 and then load from it; a sketch follows the list of intrinsics below) or use a sequence of insertion instructions. Note that insertions are cross-lane operations and as such quite slow. Going through memory might be faster; you should measure it.

/* floating point domain */
__m512 _mm512_insertf32x4 (__m512 a, __m128 b, int imm8);
__m512 _mm512_insertf32x8 (__m512 a, __m256 b, int imm8);

/* integer domain */
__m512i _mm512_inserti64x2 (__m512i a, __m128i b, int imm8);
__m512i _mm512_inserti64x4 (__m512i a, __m256i b, int imm8);
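
For comparison, a minimal sketch of the memory route mentioned above (the function name, the buffer, and the lane order are my own choices, not part of the answer):

#include <immintrin.h>

/* Hypothetical helper: assemble one __m512 from four __m128 values by
   storing them into a 64-byte-aligned buffer and reloading the whole
   vector. x0 becomes the lowest 128-bit lane, x3 the highest. Expect a
   store-forwarding stall on the 512-bit reload (see the comments below). */
static inline __m512 combine_via_memory(__m128 x0, __m128 x1, __m128 x2, __m128 x3)
{
    alignas(64) float buf[16];
    _mm_store_ps(buf +  0, x0);
    _mm_store_ps(buf +  4, x1);
    _mm_store_ps(buf +  8, x2);
    _mm_store_ps(buf + 12, x3);
    return _mm512_load_ps(buf);
}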
fuz
  • I cannot use memory; if going through memory were acceptable, the answer would be too obvious and I would not have asked the question. I need to avoid memory loads/stores. – Maxim Masiutin Apr 16 '23 at 12:42
  • Maybe you know how to do that with extended inline asm + qualifiers (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html)? – Maxim Masiutin Apr 16 '23 at 12:45
  • @MaximMasiutin I gave you the intrinsics you can use to do it without using memory. I'm just saying that using memory may perform better for the 4×128→512 case. At the end of the day, joining vectors is a bunch of cross-lane operations, which are inherently expensive. – fuz Apr 16 '23 at 13:23
  • I was aware of these intrinsics, which correspond to assembly instructions, and I had also read an earlier answer I referred to. I was not aware of the cast intrinsics, so I asked in my question how to "typecast ymm to zmm". Once I found the cast intrinsics, it became too easy. – Maxim Masiutin Apr 16 '23 at 13:37
  • @MaximMasiutin Sorry, your question was unclear in this regard. I understood only the last paragraph of it as being the question you want to have answered. – fuz Apr 16 '23 at 13:55
  • Lane-crossing shuffles aren't *that* slow. They have 1 clock throughput cost on Intel, 0.5c on Zen 4 (except for the ZMM version). So two pairs of `vinsertf32x4 ymm`, then combine those to 512 with `vinsertf32x8 zmm`. The critical path latency is then 6 cycles (2x3) on Intel, 2 on AMD (2x1). (@MaximMasiutin) Store/reload would have higher latency due to a store-forwarding stall. – Peter Cordes Apr 16 '23 at 20:59
  • Extracting via store/reload would be more reasonable, but you're competing against 3 shuffles that could all run in parallel. (Limited only by the resource conflict for the shuffle unit, although Zen 4 has good throughput for `vextractf32x8` (0.5c but can run on any of its four SIMD/FP ports) and `vextractf32x4 ymm` (0.25c), so only the highest 128-bit lane needs `vextractf32x4 zmm` (0.5c throughput on Zen 4). https://uops.info/, @MaximMasiutin) – Peter Cordes Apr 16 '23 at 21:02
  • @PeterCordes Main problem is that cross-lane shuffles can only run on port 5, so it adds up to 9 cycles in the end. Plus it blocks port 5, which is often busy with other things. – fuz Apr 17 '23 at 08:57
  • @fuz: shuffles are fully pipelined; merging 4 vectors has 3 cycles of *throughput* cost. The critical path latency from any single input to the output is 6 cycles (on Intel, 2 on AMD), if the other inputs are ready sooner so the other pair can merge first. Or if they're all ready at the same time, 7 cycle latency due to the resource conflict. But out-of-order exec can fill those gaps and keep port 5 busy with independent work for the 4 out of 7 cycles where it's not busy with these shuffles. 4 stores and a slow stall load does take pressure off port 5, if it was the bottleneck, though. – Peter Cordes Apr 17 '23 at 09:07
  • Or it would be 9 cycles worst-case critical-path latency if you did it wrong, like a single chain of 3 shuffles, starting with a pair including the one that tended to actually be on the critical path. If you know which input is usually the slowest to be ready, you could make it the last one to insert, easy if it's not the lowest lane, otherwise perhaps using `valignq` or `vshufi32x4` for the final shuffle if that works. – Peter Cordes Apr 17 '23 at 09:12
  • @PeterCordes The reason I recommend loads and stores is that you have more load/store ports than just the single port 5. So in my experience, load/store is often less crowded and the overall performance may be better. Still, this needs to be measured for each individual case. – fuz Apr 17 '23 at 09:23
  • Yeah, for sure, especially with AVX-512, since Intel CPUs shut down port 1 while 512-bit uops are in flight. But store/reload takes more front-end uops, so can also be worse if that's your throughput bottleneck. Sapphire Rapids further widening the pipeline (and adding a load port) but not providing additional shuffle throughput swings the balance farther towards store/reload, but beware that [store-forwarding stalls aren't pipelined with each other, only with successful store-forwarding and normal loads.](https://stackoverflow.com/a/69631247) – Peter Cordes Apr 17 '23 at 09:28
  • So store/reload is good in code that does a lot of vector ALU work, isn't bottlenecked on the front-end or store/load ports, doesn't need to merge more than about once per 15 cycles (TODO: test throughput of *independent* SF stalls), and the merge isn't part of a latency bottleneck critical path, like a loop-carried dep chain. – Peter Cordes Apr 17 '23 at 09:31

We can use the following intrinsics for casting; they do not generate any instructions:

_mm256_castps128_ps256    __m128  →  __m256
_mm256_castps256_ps128    __m256  →  __m128

_mm512_castps256_ps512    __m256  →  __m512
_mm512_castps512_ps256    __m512  →  __m256 

There are similar intrinsics for different types:

_mm256_castsi256_si128   __m256i  →  __m128i
_mm256_castsi128_si256   __m128i  →  __m256i

Also, if we need to reinterpret between different vector types of the same size, we can use the following typecast intrinsics:

_mm256_castps_si256    __m256   →  __m256i
_mm256_castsi256_ps    __m256i  →  __m256

All cast intrinsics are listed at https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6371&cats=Cast

Therefore, the code to combine four xmm registers into one zmm register is the following:

/* x1 becomes the lowest 128-bit lane of the result, x4 the highest. */
__m512 m512_combine_m128x4(__m128 x4, __m128 x3, __m128 x2, __m128 x1)
{
    const __m256 h = _mm256_set_m128(x4, x3);  /* upper 256 bits of the result */
    const __m256 l = _mm256_set_m128(x2, x1);  /* lower 256 bits of the result */
    /* The cast generates no instruction; its undefined upper half is
       immediately overwritten by the 256-bit insert into position 1. */
    return _mm512_insertf32x8(_mm512_castps256_ps512(l), h, 1);
}

It is translated into two vinsertf128 instructions and one vinsertf32x8 instruction.

Therefore, once we use the cast intrinsics, this becomes quite easy. The code to split is similar, using the casts mentioned above. To extract the lowest bits, you may simply cast from a wider type to a narrower type, discarding the upper part.

Here is an example that extracts four xmm registers from one zmm register using the __m128i and __m512i data types: m512_split_m128x4, which translates into three assembly instructions and produces all results directly from the input zmm register, without any intermediate instructions that would create a dependency chain, as suggested by Peter Cordes in the comments:

void m512_split_m128x4(__m512i r, __m128i &x4, __m128i &x3, __m128i &x2, __m128i &x1)
{
    x1 = _mm256_castsi256_si128(_mm512_castsi512_si256(r));        /* lane 0: casts only, no instruction */
    x2 = _mm256_extracti128_si256(_mm512_castsi512_si256(r), 1);   /* lane 1: vextracti128 */
    x3 = _mm256_castsi256_si128(_mm512_extracti32x8_epi32(r, 1));  /* lane 2: vextracti32x8 plus a free cast */
    x4 = _mm512_extracti32x4_epi32(r, 3);                          /* lane 3: vextracti32x4 */
}

Here is a more complicated example that uses an intermediate m256_split_m128x2 function; however, it creates a dependency chain:

void m256_split_m128x2(__m256i r, __m128i &hi, __m128i &lo) 
{
    hi = _mm256_extracti128_si256(r, 1);
    lo = _mm256_castsi256_si128(r);
}

void m512_split_m128x4(__m512i r, __m128i &x4, __m128i &x3, __m128i &x2, __m128i &x1) 
{
    const __m256i h = _mm512_extracti32x8_epi32(r, 1);
    const __m256i l = _mm512_castsi512_si256(r);
    m256_split_m128x2(h, x4, x3);
    m256_split_m128x2(l, x2, x1);
}
Maxim Masiutin
  • Why are you casting `__m512i` to `__m512` at all? Use the integer versions of the intrinsics, like `x2 = _mm256_castsi256_si128( _mm512_extracti32x8_epi32(v, 1) );` and `x3 = _mm512_extracti32x4_epi32(v, 3);`. Also, don't create a dependency chain for extracting x3; get it directly from the ZMM input, not from the top half of a `vextracti32x8` result, unless *maybe* if you're optimizing for Zen 4 and throughput matters more than latency. On Intel it's just worse with no benefit. You need 3 total shuffles either way. – Peter Cordes Apr 16 '23 at 21:06
  • (For insert of course it *is* better to go with a tree of dependencies like you're doing, with two separate YMM shuffles of pairs to feed one ZMM insert.) – Peter Cordes Apr 16 '23 at 21:10
  • @PeterCordes - thank you, I implemented your suggestions and updated the reply – Maxim Masiutin Apr 16 '23 at 22:59
  • There's a `_mm512_castsi512_si128` you could have used directly for lane 0. (Which you're calling `x1`). But the resulting asm should still be the same. https://godbolt.org/z/88YEsMPdT both GCC and clang are sub-optimal, but come close to doing what you asked for. GCC wasted code-size on an EVEX extract `xmm,ymm,1`, and also does a useless `vmovdqa xmm,xmm` (to zero-extend? or instead of just doing the other extracts into other regs.) Clang does use AVX2 `vextracti128`, but doesn't know that `vextracti32x4 xmm, zmm, 2` is more expensive than `vextracti32x4 ymm, zmm, 1` on Zen 4. – Peter Cordes Apr 16 '23 at 23:49
  • Thank you, @PeterCordes, however, we should not rely on this site (Godbolt), as real clang produces different (better) code; see the example at https://github.com/official-stockfish/Stockfish/pull/4477#issuecomment-1492582193 – Maxim Masiutin Apr 17 '23 at 00:07
  • I've used real `clang++` (version 17) and `objdump -d` and got the following code: `vmovaps %xmm0,(%rcx); vextractf32x4 $0x1,%zmm0,(%rdx); vextractf32x4 $0x2,%zmm0,(%rsi); vextractf32x4 $0x3,%zmm0,(%rdi)` – Maxim Masiutin Apr 17 '23 at 00:52
  • IDK why you think GCC and clang on Godbolt aren't "real". They sometimes have a different version of some standard header than a normal install, and are configured without some options on by default that most distros use (e.g. `-fPIE -fstack-protector-strong`). But other than that, they're actual GCC and clang running on Linux in an AWS instance. – Peter Cordes Apr 17 '23 at 00:57
  • This is what gcc generated: `vextracti32x8 $0x1,%zmm0,%ymm1; vextracti128 $0x1,%ymm0,0x20(%rsp); vmovdqa %xmm0,0x30(%rsp); vextracti32x4 $0x3,%zmm0,%xmm0; vmovdqa %xmm1,0x10(%rsp)` – Maxim Masiutin Apr 17 '23 at 00:57
  • Your asm is an AT&T version of what clang 16 `-O3 -march=znver4` produced for the non-inlined version of the function in my Godbolt link. But many use-cases for extracting 128-bit lanes will want to do further computations in registers, so if you actually look at my Godbolt link, you'll see I wrote a test caller that used the register results so we could see what GCC/clang would do for extracting to registers. That's what I was talking about. But yes, GCC's asm for extracting to memory is sub-optimal at least for Intel, as you can see it using `vextracti32x8` with a register destination. – Peter Cordes Apr 17 '23 at 00:59
  • @PeterCordes they are not real because they generate suboptimal code compared to the real compilers, as I showed in my link. For example, real clang generated the same code for all different versions of a function which implemented the same logic differently, but the site generated a `cmov` instruction which the real clang did not use. The real clang just used a bitshift without any conditional move or branching, which is obviously the fastest code. – Maxim Masiutin Apr 17 '23 at 01:00
  • https://github.com/official-stockfish/Stockfish/pull/4477 shows disassembly of GCC and clang output that exactly matches what you get from those GCC and clang versions on Godbolt. https://godbolt.org/z/aMdf4rsTP . No clue what you're talking about. – Peter Cordes Apr 17 '23 at 01:02
  • @PeterCordes - yes, you are right. Sorry. My bad. Godbolt produces the same code as the real compilers in my environment. I don't know how I messed up in the past so that Godbolt produced suboptimal code; maybe I didn't specify -O3 -march=znver4 on Godbolt while specifying it for the local command-line clang. Sorry again. – Maxim Masiutin Apr 17 '23 at 01:05