
Assuming the first argument arrives in ymm0 (with its low half in xmm0), this is the kind of code I want to produce.

psrldq xmm0, 1
vpermq ymm0, ymm0, 4eh
ret

I wrote this in intrinsics.

#include <immintrin.h>

__m256i f_alias(__m256i p) {
    *(__m128i *)&p = _mm_bsrli_si128(*(__m128i *)&p, 1);
    return _mm256_permute4x64_epi64(p, 0x4e);
}

This is the result from clang, and it is okay.

f_alias: #clang
        vpsrldq xmm1, xmm0, 1
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret

But gcc produces bad code.

f_alias: #gcc
        push    rbp
        vpsrldq xmm2, xmm0, 1
        mov     rbp, rsp
        and     rsp, -32
        vmovdqa YMMWORD PTR [rsp-32], ymm0
        vmovdqa XMMWORD PTR [rsp-32], xmm2
        vpermq  ymm0, YMMWORD PTR [rsp-32], 78
        leave
        ret

I tried a different version.

__m256i f_insert(__m256i p) {
    __m128i xp = _mm256_castsi256_si128(p);
    xp = _mm_bsrli_si128(xp, 1);
    p = _mm256_inserti128_si256(p, xp, 0);
    return _mm256_permute4x64_epi64(p, 0x4e);
}

clang produces the same code.

f_insert: #clang
        vpsrldq xmm1, xmm0, 1
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret

But gcc is too literal in translating the intrinsics.

f_insert: #gcc
        vpsrldq xmm1, xmm0, 1
        vinserti128     ymm0, ymm0, xmm1, 0x0
        vpermq  ymm0, ymm0, 78
        ret

What is a good way to write this operation in intrinsics? I'd like to make gcc produce good code like clang if possible.

Some side questions.

  1. Is it bad to mix PSRLDQ with AVX code? Is it better to use VPSRLDQ like clang did? If nothing is wrong with using PSRLDQ, it seems the simpler approach because it doesn't zero the upper YMM half the way the VEX-encoded version does.
  2. What is the purpose of having both F and I instructions which seem to do the same job anyway, for example, VINSERTI128/VINSERTF128 or VPERM2I128/VPERM2F128?
xiver77
  • Getting the compiler to generate the exact instructions you want is nearly impossible. Assuming you don't need inlining you'd probably be better off implementing it in a raw assembly file and then linking in the object file later. – vandench Jan 23 '22 at 17:07
  • Even if there was some way to express this with intrinsics, [mixing Legacy-encoded SSE instructions and 256-bit AVX instructions is a bad practice](https://stackoverflow.com/q/41303780/555045) (the exact consequences vary but it's for sure safer to just not do it) – harold Jan 23 '22 at 17:08
  • @vandench Well, since `clang` understands what I want to do, I was hoping there is some way to make `gcc` produce good code without using inline assembly. – xiver77 Jan 23 '22 at 17:11
  • @harold I'm okay with the `clang` way of using `vpsrldq`, but does the mixing also matter in this specific case? I don't think the partial dependency issue applies in this case because I do need the upper values too anyway. – xiver77 Jan 23 '22 at 17:12
  • There is no way to make someone else's compiler produce exactly what you want, if Clang produces what you want then use Clang; if you want to use GCC then either accept its worse code gen or manually write the assembly in a separate file, never use inline assembly. Clang is able to handle mixing SSE and AVX efficiently because LLVM has great native support for vector operations, the IR doesn't even rely on calling the x86 intrinsic functions. – vandench Jan 23 '22 at 17:23
  • @vandench It's good to know that Clang is strong in that area. Anyway I was really stupid not noticing that Clang was actually giving me the answer. See my own short answer. – xiver77 Jan 23 '22 at 18:10

2 Answers


Optimal asm on Skylake would use legacy SSE psrldq xmm0, 1, with the effect of leaving the rest of the vector unchanged handled as a data dependency (on a register the instruction reads anyway, since this isn't movdqa or something). But that would be disastrous on Haswell or on Ice Lake, both of which have a costly transition to a "saved uppers" state when a legacy-SSE instruction writes an XMM register while any YMM register has a "dirty" upper half. I'm unsure how Zen1 or Zen2/3/4... handle it.


Nearly as good on Skylake, and optimal everywhere else, is to copy-and-shift then vpblendd to copy in the original high half, since you don't need to move any data between 128-bit lanes. (The _mm256_permute4x64_epi64(p, 0x4e); in your version is a lane-swap separate from the operation you asked about in the title. If that's something else you also want, then keep using vperm2i128 to merge as part of that lane-swap. If not, it's a bug.)

vpblendd is more efficient than any shuffle, able to run on any 1 of multiple execution ports, with 1 cycle latency on Intel CPUs. (Lane-crossing shuffles like vperm2i128 are 1 uop / 3 cycle latency on mainstream Intel, and significantly worse on AMD, and on the E-cores of Alder Lake. https://uops.info/) By contrast, variable blends with a vector control are often more expensive, but immediate blends are very good.
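For instance, here is a minimal sketch (the function names are mine, not from the original) contrasting the two kinds of blend; the immediate form compiles to vpblendd, while the variable form needs its control in a vector register and compiles to vpblendvb:

#include <immintrin.h>

// Immediate blend: compiles to vpblendd, 1 uop that can run on several ports.
__m256i blend_imm(__m256i a, __m256i b)
{
    return _mm256_blend_epi32(a, b, 0x0F);     // low 4 dwords from b, high 4 from a
}

// Variable blend: control lives in a register and compiles to vpblendvb,
// which is more expensive on some CPUs.
__m256i blend_var(__m256i a, __m256i b, __m256i mask)
{
    return _mm256_blendv_epi8(a, b, mask);     // per-byte select by the mask's sign bit
}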

And yes, it is more efficient on some CPUs to use an XMM (__m128i) shift instead of shifting both halves and then blending with the original. Shifting the whole YMM would take less typing than the version with cast intrinsics, but unless compilers optimized the wasted work away you'd be spending extra uops on Zen1 and on Alder Lake E-cores, where each half of vpsrldq ymm takes a separate uop.

__m256i rshift_lowhalf_by_1(__m256i v)
{
    __m128i low = _mm256_castsi256_si128(v);
    low = _mm_bsrli_si128(low, 1);
    return _mm256_blend_epi32(v, _mm256_castsi128_si256(low), 0x0F);
}

gcc/clang compile it as written (Godbolt), with xmm byte-shift and YMM vpblendd. (Clang flips the immediate and uses opposite source registers, but same difference.)
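For comparison, the "shift both halves, then blend" alternative mentioned above would look something like this sketch (the function name is mine); it relies on the compiler noticing that the high-lane shift result is discarded:

#include <immintrin.h>

// Shift both 128-bit lanes, then blend the original high half back in.
// The high-lane shift result is thrown away by the blend, so on Zen1 or
// Alder Lake E-cores this wastes a uop unless the compiler optimizes it away.
__m256i rshift_lowhalf_by_1_alt(__m256i v)
{
    __m256i shifted = _mm256_bsrli_epi128(v, 1);     // vpsrldq ymm: shifts each lane
    return _mm256_blend_epi32(v, shifted, 0x0F);     // low half from shifted, high half from v
}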

vpblendd is 2 uops on Zen1, because it has to process both halves of the vector. The decoders don't look at the immediate for special cases like keeping a whole half of the vector. And it can still copy to a separate destination, not necessarily overwriting either source in-place. For a similar reason, vinserti128 is also 2 uops, unfortunately. (vextracti128 is only 1 uop on Zen1; I was hoping vinserti128 was going to be only 1, and wrote the following version before checking uops.info):

// don't use on any CPU *except* Zen1, or an Alder Lake pinned to an E-core.
__m256i rshift_alder_lake_e(__m256i v)
{
    __m128i low = _mm256_castsi256_si128(v);
    low = _mm_bsrli_si128(low, 1);
    return _mm256_inserti128_si256(v, low, 0);  // still 2 uops on Zen1 or Alder Lake, same as vpblendd
    // clang optimizes this to vpblendd even with -march=znver1.  That's good for most uarches, break-even for Zen1, so that's fine.
}

There may be a small benefit on Alder Lake E-cores, where vinserti128 latency is listed as [1;2] instead of a flat 2 for vpblendd. But since any Alder Lake system will have P-cores as well, you don't actually want to use vinserti128 because it's much worse on everything else.


What is the purpose of having both VINSERTI128/VPERM2I128 and VINSERTF128/VPERM2F128?

vinserti128 with a memory source only does a 128-bit load, while vperm2i128 does a 256-bit load which might cross a cache line or page boundary for data you're not even going to use.

On AVX CPUs where load/store execution units only have 128-bit wide data paths to cache (like Sandy/Ivy Bridge), that's a significant benefit.
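As an illustration (the function name is mine), a merge whose new 128-bit half comes from memory lets the compiler fold just a 16-byte load into vinserti128's memory operand:

#include <immintrin.h>

// Replace the high 128 bits of v with 16 bytes loaded from p.
// The 128-bit load can fold into vinserti128, so only 16 bytes are touched;
// doing the same merge with vperm2i128 would need a full 256-bit load of
// memory you don't even use.
__m256i set_high_from_mem(__m256i v, const void *p)
{
    __m128i chunk = _mm_loadu_si128((const __m128i *)p);
    return _mm256_inserti128_si256(v, chunk, 1);
}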

On CPUs where shuffle units are only 128 bits wide (like Zen1, as discussed in this answer), vperm2i128's two full source inputs and arbitrary shuffling make it a lot more expensive (unless, I guess, you had smarter decoders that emitted a variable number of uops to move halves of the vector depending on the immediate).

e.g. Zen1's vperm2i/f128 is 8 uops, with 2c latency, 3c throughput! (Zen2, with its 256-bit execution units, improves that to 1 uop, 3c latency, 1c throughput.) See https://uops.info/


What is the purpose of having both F and I instructions which seem to do the same job anyway

Same as always (dating back to stuff like SSE1 orps vs. SSE2 pxor / orpd), to let CPUs have different bypass-forwarding domains for SIMD-integer vs. SIMD-FP.

Shuffle units are expensive so it's normally worth sharing them between FP and integer (and the way Intel does that these days results in no extra latency when you use vperm2f128 between vpaddd instructions).

But blend, for example, is simple, so there probably are separate FP and integer blend units, and there is a latency penalty for using blendvps between paddd instructions. (See https://agner.org/optimize/)
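As a rough rule of thumb (this sketch and its names are mine, not part of the original answer), pick the flavour that matches the domain of the surrounding instructions:

#include <immintrin.h>

// FP data flowing between FP instructions: use the F-flavoured insert.
__m256 merge_low_ps(__m256 hi, __m128 lo)
{
    return _mm256_insertf128_ps(hi, lo, 0);      // vinsertf128
}

// Integer data flowing between integer instructions: use the I-flavoured insert.
__m256i merge_low_epi32(__m256i hi, __m128i lo)
{
    return _mm256_inserti128_si256(hi, lo, 0);   // vinserti128
}

On current Intel CPUs the shared shuffle unit means the choice often costs nothing extra, as noted above, but matching domains is the habit the two encodings exist to support.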

Peter Cordes
  • Thank you for a detailed answer! However, if I'm not mistaken, the immediate value in `vpermq ymm0, ymm0, 4eh` is `01 00 11 10` in binary, and that intends to pull down the upper half and pull up the lower half, so I believe data is being moved between 128-bit lanes. In `vperm2i128 ymm0, ymm0, ymm1, 33`, the immediate is `0x21` in hex, and this intends to select the upper half of `ymm0` as the lower half, and the lower half of `ymm1` as the upper half, so data is still crossing the 128-bit lane boundary. At least, that was my intention. – xiver77 Jan 24 '22 at 05:15
  • @xiver77: If you want something equivalent to `psrldq xmm`, there *shouldn't* be any data movement across lanes. If there is, that's a bug in that version. The low half gets byte-shifted, the high half stays the same. – Peter Cordes Jan 24 '22 at 05:22
  • Yes, that is exactly what I want to achieve, only shift the low half and switch the whole 128-bit blocks. – xiver77 Jan 24 '22 at 05:23
  • @xiver77: Then that explains why clang's shuffle optimizer still used `vperm2i128` instead of just `vpblendd` https://godbolt.org/z/9dh3qvxds. Your orig src uses a pointer-cast to `_mm_bsrli_si128` the low half (as you can see in the asm, ymm store, then xmm store to the *same* addr), but then you use `_mm256_permute4x64_epi64(p, 0x4e);` for no apparent reason, swapping low/high lanes. If you *want* that as a separate operation from the low-lane byte shift, then yeah it's worth looking at how they can optimize into each other (like by using `vperm2i128` to merge and swap at the same time)... – Peter Cordes Jan 24 '22 at 05:28
  • @xiver77: Updated my answer with that and an answer to the bonus question about why `vinserti128` exists. – Peter Cordes Jan 24 '22 at 05:41
  • To explain why this operation could be useful, think about a hypothetical 4-bit machine and I want the following 7-cycle pattern. `1101 1001` -> `0110 1100` -> `0011 0110` -> `1001 1011` -> `...`. If I right shift once on each side of `1101 1001`, the result is `.110 .100`. Where do I get the upper bit? One way is to first make a copy and do right shift 1 on just one side to get `1101 1001` -> `1101 .100`. Then, permute to get `.100 1101` and do left shift 3 to get `0... 1...`. – xiver77 Jan 24 '22 at 05:56
  • Finally, a bitwise OR with `.110 .100`, and you get the next pattern `0110 1100`. I'm not sure if this is the most efficient way, but it's one way to achieve this rotation. – xiver77 Jan 24 '22 at 05:56
  • I made an edit to the OP to clarify what the "bonus question" was about. Sorry, my wording was bad and made you explain what I didn't ask, but everything written here is still very informative and helpful. – xiver77 Jan 24 '22 at 06:04
  • @xiver77: Ah, updated with that, too. – Peter Cordes Jan 24 '22 at 06:11

I was a fool. clang gave me the answer; why didn't I notice it?

vpsrldq xmm1, xmm0, 1
vperm2i128      ymm0, ymm0, ymm1, 33
ret

This sequence is simply,

__m256i f_____(__m256i p) {
    __m128i xp = _mm256_castsi256_si128(p);
    xp = _mm_bsrli_si128(xp, 1);
    __m256i _p = _mm256_castsi128_si256(xp);
    return _mm256_permute2x128_si256(p, _p, 0x21);
}

and indeed gcc also produces efficient code.

f_____:
        vpsrldq xmm1, xmm0, 1
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret
xiver77