
As the title reads, if a 256-bit SIMD register is:

| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

How can I efficiently get the index of the first non-zero element (i.e. the index 2 of the first 1)? The most straightforward way is to store it to memory and check element by element, but that may cost too much. Is there any clever trick to do this?

MarZzz

2 Answers

  • PCMPEQB/W/D/Q against an all-zero register to get a vector with elements that are all-ones for the zero elements and all-zero for the non-zero elements.
  • PMOVMSKB to turn the vector of all-ones or all-zero into an integer bitmask. (Or movmskps or pd to get 1 bit per dword or qword, instead of per byte, if that makes your bit-scan -> index calculation more efficient, like if you want an element offset instead of a byte offset.)
  • invert that (C ~ operator, asm NOT instruction) to get 1s in the bitmap for elements that were non-zero
  • TZCNT or BSF that integer to find the first (lowest) set bit. Beware of BSF's behaviour if its input is all-zero. But fortunately that's not a problem when the input is an int ~bitmask - the high 16 zero bits become 1s. (An AVX2 version of this with vpmovmskb ymm that filled a whole uint32_t with possibly-1 bits could use ~(uint64_t)bitmask, or just use tzcnt since AVX2 CPUs also have BMI1.)

For example with intrinsics:

#include <immintrin.h>

int first_nonzero_byte(__m128i v){
    //__m128i v = _mm_loadu_si128((const __m128i*)p);  // for a pointer arg
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    unsigned bitmask = _mm_movemask_epi8(vcmp);
#ifdef __GNUC__
    return __builtin_ctz(~bitmask);
#else
    return _tzcnt_u32( ~bitmask );
#endif
   // returns 16 if v was all zero so ~bitmask is 0xFFFF0000
}

Compiles on https://godbolt.org/z/Y8vYbsW69 to

# GCC11.2 -O3 -msse4.1
        movdqa  xmm1, xmm0      # missed optimization, should zero XMM1 instead
        pxor    xmm0, xmm0
        pcmpeqb xmm0, xmm1
        pmovmskb        eax, xmm0
        not     eax
        rep bsf eax, eax        # tzcnt on new CPUs, BSF on old
        ret

In GNU C where _tzcnt_u32 won't compile without -march=haswell or something, we use __builtin_ctz. As I said, ~bitmask is guaranteed to be non-zero. tzcnt is encoded as rep bsf; old CPUs will execute it as bsf, producing the same result for non-zero inputs. New CPUs will execute it as tzcnt, which is more efficient on AMD (2 uops instead of 7). Intel executes either as single-uop. GCC uses rep bsf aka tzcnt if you don't tell it a specific CPU to tune for.

A related function like the one shown in JATothrim's answer can be done with only 4 single-uop instructions (actually 2 uops for tzcnt on AMD) instead of 8 instructions including a pblendvb (2 uops on Intel). The shuffle/horizontal-reduction idea in that answer could possibly be useful if you want the element index as a shuffle control vector for vpermilps, but it seems sub-optimal vs. this when you actually want a scalar int.

int equal_first_dword_bitscan(__m128i x, __m128i y)
{
    __m128i vcmp = _mm_cmpeq_epi32(x,y);
    unsigned bitmask = _mm_movemask_ps(_mm_castsi128_ps(vcmp));
    bitmask |= 1<<4;    // return 4 if the low 4 bits are all 0
#ifdef __GNUC__
    return __builtin_ctz(bitmask);
#else
    return  _tzcnt_u32( bitmask );  // runs as BSF on old CPUs, don't skip the OR
#endif
}

MSVC doesn't have __builtin_ctz, but will compile _tzcnt_u32 even if you haven't told it the target CPU supports BMI1. If you're definitely only running on CPUs with BMI1, you can leave out the bitmask |= 1<<4; so it will return 32 for not-found.

If you use trailing-zero count in multiple functions, it's best to wrap that ifdef stuff in a helper function instead of repeating it at each use-case.
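
For example, a wrapper could look like this (a minimal sketch of my own, not part of the original answer; the name tzcnt_nonzero is made up). It assumes the caller guarantees a non-zero input, as both functions above do:

#include <immintrin.h>

// Trailing-zero count for a value the caller guarantees is non-zero.
static inline int tzcnt_nonzero(unsigned x)
{
#ifdef __GNUC__
    return __builtin_ctz(x);   // undefined for x == 0, but callers ensure x != 0
#else
    return _tzcnt_u32(x);      // MSVC compiles this even without a BMI1 target
#endif
}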


If there's only one possible non-zero value (like 1), PCMPEQB against a vector of that so you don't need to invert it later.
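
As a sketch of that variant (my addition, not from the original answer; it assumes BMI1 is available so tzcnt returns 32 for an all-zero input):

#include <immintrin.h>

// Byte index of the first element equal to 1, when 0 and 1 are the only
// possible values. No inversion needed: the compare mask already has set
// bits exactly at the elements we're looking for.
int first_one_byte(__m128i v){
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_set1_epi8(1));  // 0xFF where v == 1
    unsigned bitmask = _mm_movemask_epi8(vcmp);          // one bit per byte
    return _tzcnt_u32(bitmask);                          // 32 if no byte was 1
}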

If that's the case, consider storing your data in a bitmap in the first place, to decrease your cache footprint by a factor of 8. Then you just TZCNT 64-bit chunks of the array.
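
Scanning such a bitmap could look like this (my sketch; the function name and the multiple-of-64 length assumption are mine; __builtin_ctzll is the GNU C spelling, or _tzcnt_u64 with BMI1):

#include <stdint.h>
#include <stddef.h>

// Bit index of the first set bit, or -1 if the bitmap is all zero.
// Assumes nbits is a multiple of 64.
int64_t first_set_bit(const uint64_t *bitmap, size_t nbits){
    for (size_t i = 0; i < nbits / 64; i++) {
        if (bitmap[i])
            return (int64_t)(i * 64) + __builtin_ctzll(bitmap[i]);
    }
    return -1;
}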

Or for a larger array of data, search for the first non-zero vector with SIMD, then TZCNT the first non-zero element of it, if you expect there to be multiple qwords of zeros before the first set bit. Like memcmp does for finding the mismatching byte position.
See Efficiently find least significant set bit in a large array? and How to find the first nonzero in an array efficiently?
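
A sketch of that whole-vector-first search over an array of bytes (my addition; it assumes GNU C for __builtin_ctz and a length that's a multiple of 16):

#include <immintrin.h>
#include <stddef.h>

// Byte position of the first non-zero byte in buf, or -1 if all bytes are zero.
ptrdiff_t first_nonzero_byte_pos(const unsigned char *buf, size_t len){
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(buf + i));
        unsigned zmask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (zmask != 0xFFFF)                                // some byte is non-zero
            return (ptrdiff_t)(i + __builtin_ctz(~zmask));  // position within this vector
    }
    return -1;
}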


BTW, the asm instruction ref manual lists the relevant C intrinsics at the bottom of each entry, and you can search Intel's intrinsics finder by asm mnemonic. (See the tag wiki for links).

Peter Cordes
  • Thank you @Peter. I think you mean `LZCNT` instead of `TZCNT`. Actually the asm instructions are better, and thanks for the intrinsics information anyway. Just as you mentioned, there is only one possible non-zero value, but any idea how to implement it at the assembly level regarding the `cache footprint` issue? – MarZzz Oct 14 '16 at 04:27
  • @MarZzz: The high bit of element 0 (last arg to `_mm_set_epi8`, first arg to `_mm_setr_epi8`) goes into the LSB of the integer mask. TZCNT / BSF look at the low bit first, so using them scans from low address to high address (if the vector was loaded from memory). If you want to scan in the other direction, use LZCNT or BSR (which give different results). – Peter Cordes Oct 14 '16 at 06:17
  • @MarZzz: What's not obvious about implementing a bitmap in asm? For this use-case, `tzcnt rax, [my_bitmap + rsi]` or whatever to see if there are any hits in the 64 bits starting at 8*rsi (since memory is still byte-addressed, unless you use the BT/BTR/BTS instructions, but don't because they're super-slow with memory operands, see http://agner.org/optimize/) – Peter Cordes Oct 14 '16 at 06:20
  • Thanks for clearing the TZCNT issue, but I am confused regarding the cache issue. Do you mean to store the 256-bit data into a bitmap first, without `PCMPEQ` or `PMOVMSKB`, and then TZCNT every 64-bit (i.e. 4 TZCNT instructions are executed) of the bitmap? If so, TZCNT is executed 4 times, will this be faster? and why is the `cache footprint` decreased by a factor of 8? – MarZzz Oct 14 '16 at 12:10
  • @MarZzz: No, I mean instead of having vectors where every byte is either 0 or 1, pack them down to bits ahead of time. If you don't need your data in the expanded format for something else, store it in a packed bitmap in the first place. I was assuming you had a big array of elements that you were operating one vector at a time, in which case that has 8 times the cache footprint of an equivalent bitmap. – Peter Cordes Oct 14 '16 at 17:02
  • If I understood correctly, the whole process is (1) get 1 bit of every 8 bits in the 256-bit data and store in a bitmap (i.e. a 32-bit bitmap for 256-bit data), (2) repeatedly process all the data, (3) `TZCNT` every 64-bit of the bitmap, (4) divide the zero count by 4 to get the index. If so, since values are either 0 or 1 initially in a vector register, is there an SIMD instruction to gather all the `LSB`s of every 32-bit data element (i.e. get an 8-bit bitmap for 256-bit data)? – MarZzz Oct 15 '16 at 00:49
  • If not, is this an efficient way: (1) rotate left by 1 bit of the 256-bit data to move `LSB` to `MSB`; (2) `VMOVMSKPS` the rotated data. – MarZzz Oct 15 '16 at 00:53
  • @MarZzz: Well IDK how you're generating your data. I assumed you were reading it from memory, so I was suggesting that you change whatever writes it to write a bitmap instead. Is generating it performance-critical as well as searching it? And are you generating it with something that naturally produces SIMD vectors? What is your element size? 32-bit? – Peter Cordes Oct 15 '16 at 01:22
  • @MarZzz: Yes, shift and VMOVMSKPS should be a better choice even for integer data, vs. the amount of scalar code you'd need to extract every 4th bit of a VPMOVMSKB result. But you actually need a left shift by 31 to move the LSB to the MSB. And AVX doesn't have rotates (until AVX512). Note that element boundaries don't matter; you can use any shift larger than your element size. (e.g. VPSLLW on 8-bit data and then VPMOVMSKB) – Peter Cordes Oct 15 '16 at 01:28
  • Anyway, you're mixing up your left and right again, like you were with TZCNT. Shifting left multiplies by 2. Little-endian doesn't change the fact that MSB = leftmost bit. – Peter Cordes Oct 15 '16 at 01:29
  • Thanks for pointing out the mistake. Yes, the data is 32-bit and generated in an SIMD manner, so they are naturally in vector regs before searching. That's why I'd need an SIMD instruction to gather all the LSBs of every 32-bit data element. If there is no such instruction, I'm afraid I have to use the shift + VMOVMSKPS approach. – MarZzz Oct 15 '16 at 04:41
  • @MarZzz: yes, use a shift + movemask. It's one extra instruction, with 1c latency and two per clock throughput on Haswell, IIRC. x86's gather-a-bit-from-each-element instructions always take the MSB. At least x86 *has* an instruction like that. [Apparently ARM NEON doesn't](http://stackoverflow.com/questions/39506490/arm-neon-store-n-th-positions-of-non-zero-bytes-in-a-8-byte-vector-lane?noredirect=1&lq=1#comment66369748_39522540), so this would be very inconvenient there. – Peter Cordes Oct 15 '16 at 05:30
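
To wrap up the comment thread: the shift + VMOVMSKPS idea could look like this (a minimal sketch of my own, not from the discussion; it assumes AVX2, BMI1, and 32-bit elements that are all 0 or 1):

#include <immintrin.h>

// Element index of the first dword equal to 1, for a vector of 0s and 1s.
int first_one_epi32_avx2(__m256i v){
    __m256i hi = _mm256_slli_epi32(v, 31);    // move each element's LSB up to its sign bit
    unsigned mask = _mm256_movemask_ps(_mm256_castsi256_ps(hi));  // one bit per dword
    return _tzcnt_u32(mask);                  // element index, or 32 if all elements were 0
}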

I have lately been writing a bunch of "get index of X" SIMD algorithms. So far the most generic way to extract an index out of e.g. a compare mask has been via a horizontal minimum over indices.

Here is an (unsigned) integer horizontal minimum:

#include <smmintrin.h>  // SSE4.1 for _mm_min_epu32 / _mm_extract_epi32

int horizontal_min(__m128i x) {
    x = _mm_min_epu32(x, _mm_shuffle_epi32(x, 0b01001110));  // min with the 64-bit halves swapped
    x = _mm_min_epu32(x, _mm_shuffle_epi32(x, 0b11100001));  // min with the two low elements swapped
    return _mm_extract_epi32(x,0);
}

Now do the following:

int equal_first(__m128i x, __m128i y) {
    const __m128i index = _mm_set_epi32(3,2,1,0);  // element i holds i, matching the element order of the bit-scan version
    // Compute mask
    __m128i mask = _mm_cmpeq_epi32(x,y);
    // Select indices.
    mask = _mm_blendv_epi8(_mm_set1_epi32(-1), index, mask);
    // mask = index | (~mask);
    // pick the smallest index.
    return horizontal_min(mask);
}

The advantage of this code is that you don't need any bit-scanning instructions and it is all done on the FPU (SIMD unit).

Tip: It becomes very efficient with 16-bit indices if you use the phminposuw instruction (_mm_minpos_epu16) to compute the minimum index.
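
For instance, a 16-bit version might look like this (my sketch, not JATothrim's code; it folds the index into the candidate value so PHMINPOSUW finds the smallest matching index directly):

#include <smmintrin.h>

// Index of the first 16-bit element where x == y, or 0xFFFF if none are equal.
int equal_first_epi16(__m128i x, __m128i y){
    const __m128i idx = _mm_setr_epi16(0,1,2,3,4,5,6,7);    // element i holds i
    __m128i eq = _mm_cmpeq_epi16(x, y);                     // 0xFFFF where equal
    __m128i noteq = _mm_xor_si128(eq, _mm_set1_epi16(-1));  // ~eq
    __m128i cand = _mm_or_si128(idx, noteq);  // index where equal, 0xFFFF where not
    __m128i m = _mm_minpos_epu16(cand);       // element 0 = minimum of the 8 candidates
    return _mm_extract_epi16(m, 0);
}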

EDIT: Peter's analysis pointed out that my solution is slower, unless you need the result in a SIMD register.

Another case is a reduction loop where you want the index of said element in an array. In the loop, you accumulate e.g. the min/max element indices in a SIMD register. The now-unordered indices may point anywhere in the source array, so at the end you have to use horizontal_min() to tell where the min/max element was.
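
A concrete sketch of that pattern (my addition, not JATothrim's code; it assumes SSE4.1, n a multiple of 4 and at least 4, and reuses horizontal_min() from above):

#include <smmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Index of the first occurrence of the minimum element of a[0..n-1].
unsigned argmin_epu32(const uint32_t *a, size_t n){
    __m128i minval = _mm_loadu_si128((const __m128i*)a);
    __m128i minidx = _mm_setr_epi32(0, 1, 2, 3);
    __m128i idx    = minidx;
    for (size_t i = 4; i < n; i += 4) {
        idx = _mm_add_epi32(idx, _mm_set1_epi32(4));
        __m128i v = _mm_loadu_si128((const __m128i*)(a + i));
        // keep_old is true where minval <= v (unsigned), so ties keep the earlier index
        __m128i keep_old = _mm_cmpeq_epi32(_mm_min_epu32(minval, v), minval);
        minidx = _mm_blendv_epi8(idx, minidx, keep_old);
        minval = _mm_min_epu32(minval, v);
    }
    // Lanes whose value isn't the overall minimum get index 0xFFFFFFFF,
    // then the smallest surviving index is the first position of the minimum.
    __m128i gmin   = _mm_set1_epi32(horizontal_min(minval));
    __m128i is_min = _mm_cmpeq_epi32(minval, gmin);
    __m128i cand   = _mm_blendv_epi8(_mm_set1_epi32(-1), minidx, is_min);
    return (unsigned)horizontal_min(cand);
}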

JATothrim
  • blendv is 2 uops on most CPUs. Better would be `_mm_or_si128` with a compare that's false (0) in the elements you want to keep, and true (0xFFFFFFFF) in the elements you want to reject. Like in this case where it's looking for any non-zero element, `_mm_cmpeq_epi32(x, _mm_setzero_si128());` `_mm_or_si128(index, mask);` Then you can drop the `set1(4)` vector constant, too. (It already could have been `-1`, the highest unsigned number, which is cheaper for compilers to materialize with `pcmpeqd xmm0,xmm0`.) `pminud` is also SSE4.1 so avoiding blendv doesn't make this work with just SSE2. – Peter Cordes Apr 01 '22 at 21:06
  • Anyway, interesting idea for wideish elements like dword or qword where horizontal reduction doesn't take *too* many shuffle/min steps, but it's still not faster. I don't see what's so great about avoiding bit-scanning, unless you actually want the result as a vector, e.g. as a shuffle control for `vpermilps`. – Peter Cordes Apr 01 '22 at 21:17
  • Your `equal_first` function can be implemented (https://godbolt.org/z/Ks46E81Gh) with cmp/movemask/bit-scan in 3 single-uop instructions if you have tzcnt, otherwise 4 if it might run on CPUs with only `bsf`. (`__builtin_ctz(bitmask | (1<<4))` gives `4` if none of the low bits were set.) `bsf` is slowish on AMD CPUs, but `rep bsf` runs as `tzcnt` on CPUs that support it, `bsf` on CPUs that don't, and gives the same results for non-zero inputs. Hence the OR trick to make sure it's always non-zero. Yours takes 8 instructions, 9 uops, after optimizing with `set1(-1)` instead of 4. – Peter Cordes Apr 01 '22 at 21:19
  • So for throughput, mine is much better (fewer front-end and back-end uops). For latency, yours has 7 cycles of latency of SIMD instructions, plus `movd` being about 2 or 3 c latency like `pmovmskb`. So about 9 cycles. Mine is 1 (pcmpeqd) + 3 (pmovmskb) + 1 (or) + 3 on Intel (tzcnt/bsf) = 8 cycles on Intel. Or 7 on AMD Zen where `tzcnt` is 2 uops, 2c latency (probably bit-reverse and lzcnt). (Yours is also a cycle faster on AMD, where `vpblendvb` is single-uop 1c latency.) – Peter Cordes Apr 01 '22 at 21:30
  • @PeterCordes Nice analysis! Yes, this code is likely biased towards AMD CPUs since it is what I have. – JATothrim Apr 02 '22 at 12:12