
The quirky instruction (v)pmovmskb, dating back to SSE, takes the most significant bit of each byte in an mm, xmm, or ymm register and moves these bits into a general-purpose register. This is very useful for classifying vector elements or performing SWAR operations on individual bits. Specifically, I have used this instruction in a previous answer to compute a positional population count.

Unfortunately, this instruction has not been extended to ZMM registers and is surprisingly absent from the AVX-512 roster. How can I emulate its effect efficiently for ZMM registers? What similar/other options do I have?

    [_mm512_movepi8_mask](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512&expand=5902,3883,3883,3883&text=mm512_movepi8_mask) ? – Paul R Aug 07 '20 at 13:00
  • @PaulR How could I have missed that... embarrassing. Would you mind writing this up as an answer so we can get over this? – fuz Aug 07 '20 at 13:10
  • Ha - no problem - am currently on a mobile device so if you’d like to go ahead and self-answer that’s fine with me. – Paul R Aug 07 '20 at 13:12

1 Answer


There is an instruction for that in AVX512BW, just with a different name: _mm512_movepi8_mask / vpmovb2m k, zmm. It's available for every element size from byte to qword (AVX512BW for the B and W versions, AVX512DQ for the D and Q versions).

There's also the mask->vector inverse movemask, vpmovm2b (again available in all element sizes).
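
A minimal sketch of both directions in C intrinsics, assuming AVX512BW; the helper names are made up for illustration:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: the ZMM analogue of pmovmskb. One bit per byte,
// taken from each byte's sign bit, returned in a general-purpose register.
static inline uint64_t zmm_movemask_epi8(__m512i v)
{
    __mmask64 k = _mm512_movepi8_mask(v);   // vpmovb2m k, zmm
    return _cvtmask64_u64(k);               // kmovq r64, k
}

// Hypothetical helper for the inverse: expand a 64-bit mask back into a
// vector of 0x00 / 0xFF bytes.
static inline __m512i zmm_mask_to_bytes(__mmask64 k)
{
    return _mm512_movm_epi8(k);             // vpmovm2b zmm, k
}
```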


AVX512 of course also has various compare- and test-into-mask instructions, so with a set1_epi8(1<<n) vector you can grab any bit position into a mask register with vptestmb k2{k1}, zmm2, zmm3/m512 (_mm512_test_epi8_mask). Note that unlike vpmovb2m, it supports zero-masking into the destination, effectively ANDing with another k mask for free, so it might be worth using even if you just want the high bit.

There's also a NAND version, vptestnmb, which sets a mask bit where the AND result is zero. The D and Q versions of these support a broadcast memory source operand, but the B and W versions don't.

With 8 different mask constants, you can extract different bits in an unrolled loop without spending any shift instructions. Or you can extract different bits from different elements.
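
For example, a sketch of that trick, again assuming AVX512BW (the helper names are hypothetical): bit n of every byte goes into a mask register without any shift instructions.

```c
#include <immintrin.h>

// Hypothetical helper: one mask bit per byte, taken from bit n of each byte,
// using vptestmb against a set1 constant instead of a shift + vpmovb2m.
static inline __mmask64 zmm_extract_bit(__m512i v, int n)
{
    __m512i bit = _mm512_set1_epi8((char)(1 << n));
    return _mm512_test_epi8_mask(v, bit);           // vptestmb k, zmm, zmm
}

// The zero-masked form ANDs the result with an existing mask k1 for free.
static inline __mmask64 zmm_extract_bit_and(__mmask64 k1, __m512i v, int n)
{
    __m512i bit = _mm512_set1_epi8((char)(1 << n));
    return _mm512_mask_test_epi8_mask(k1, v, bit);  // vptestmb k{k1}, zmm, zmm
}
```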

These are all AVX512BW, available on AVX512 CPUs since Skylake-AVX512, but not Xeon Phi (KNL / KNM).

Peter Cordes