
In the question Optimizing Array Compaction, the top answer states:

SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB.

Is this possible with Haswell (AVX2)? Or does it require one of the flavors of AVX512?

I've got an AVX2 vector containing int32s, and a corresponding vector holding the result of a compare. I want to shuffle it somehow so that the elements whose corresponding mask element has its MSB set (compare true) are contiguous at the low end of the vector.

The best I can see is to get a bit mask with _mm256_movemask_ps/vmovmskps (no *d variant?) and then use that as the index into a 256-entry lookup table of AVX2 vectors to get a shuffle mask for the cross-lane _mm256_permutevar8x32_epi32/vpermd.
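A rough, untested sketch of that idea (the names build_lut and compact8_lut are just placeholders, and the table could equally be generated offline):

#include <immintrin.h>
#include <stdint.h>

// 256-entry table: for each 8-bit mask, the indices of its set bits packed low.
static uint32_t lut[256][8];

static void build_lut(void) {
    for (int m = 0; m < 256; m++) {
        int k = 0;
        for (int j = 0; j < 8; j++)
            if (m & (1 << j)) lut[m][k++] = j;
        for (; k < 8; k++) lut[m][k] = 0;   // don't-care slots
    }
}

// Pack the int32 elements whose compare result is all-ones toward the low end.
static __m256i compact8_lut(__m256i v, __m256i cmp) {
    int m = _mm256_movemask_ps(_mm256_castsi256_ps(cmp)); // one bit per 32-bit lane
    __m256i idx = _mm256_loadu_si256((const __m256i *)lut[m]);
    return _mm256_permutevar8x32_epi32(v, idx);           // vpermd, cross-lane
}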

Eloff
  • As you've noticed, AVX/AVX2 is something of a kludge and it's pretty hard to do anything which crosses 128 bit lanes - you might find that it's just as easy to stick with SSE, and there may be little or no difference in performance anyway. – Paul R Aug 01 '14 at 08:39
  • The best answer probably depends on the statistics of the distribution. You will probably get the best result if it's either sparse or dense. However, if it's random there might not be any benefit using SIMD. Can we assume the distribution is sparse or dense? – Z boson Aug 01 '14 at 08:58
  • @Zboson, yes most of the time it will either be sparse or dense. – Eloff Aug 01 '14 at 16:10
  • I didn't know this question existed when I wrote my BMI2 + AVX2 generate-the-shuffle-mask-on-the-fly answer for [this question](http://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask). The question does seem to be a duplicate, and I think the answers on the other question are better (I included an AVX512 answer, because there is a new instruction for this in AVX512). For arrays with runs of 100% and 0% density, @ZBoson's answer here probably wins. You could branch that way and just use my AVX2+BMI2 function for `compact()`. – Peter Cordes Oct 23 '16 at 00:33

1 Answer


The first thing to do is find a fast scalar function. Here is a version which does not use a branch.

inline int compact(int *x, int *y, const int n) {
    int cnt = 0;
    for(int i=0; i<n; i++) {
        int cut = x[i]!=0;   // 1 if the element is kept, 0 if it is dropped
        y[cnt] = cut*x[i];   // unconditional store; it only "sticks" when cnt advances
        cnt += cut;
    }
    return cnt;
}

The best result with SIMD probably depends on the distribution of zeros. The following code should work well for distributions which are sparse or dense, i.e. long runs of zeros and non-zeros. If the distribution is more even I don't know if this code will have any benefit, but it will give the correct result anyway.

Here is an AVX2 version I tested.

int compact_AVX2(int *x, int *y, int n) {
    int i = 0, cnt = 0;
    for(i=0; i<n-8; i+=8) {
        __m256i x4 = _mm256_loadu_si256((__m256i*)&x[i]);
        __m256i cmp = _mm256_cmpeq_epi32(x4, _mm256_setzero_si256());
        int mask = _mm256_movemask_epi8(cmp);
        if(mask == -1) continue;             // all eight elements are zero: skip
        if(mask) {                           // mixed zeros and non-zeros: scalar fallback
            cnt += compact(&x[i], &y[cnt], 8);
        }
        else {                               // no zeros: copy the whole vector
            _mm256_storeu_si256((__m256i*)&y[cnt], x4);
            cnt += 8;
        }
    }
    cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 8
    return cnt;
}

Here is the SSE2 version I tested.

int compact_SSE2(int *x, int *y, int n) {
    int i =0, cnt = 0;
    for(i=0; i<n-4; i+=4) {
        __m128i x4 = _mm_loadu_si128((__m128i*)&x[i]);
        __m128i cmp = _mm_cmpeq_epi32(x4, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(cmp);
        if(mask == 0xffff) continue; //all zeroes
        if(mask) {
            cnt += compact(&x[i],&y[cnt], 4);
        }
        else {
            _mm_storeu_si128((__m128i*)&y[cnt], x4);
            cnt +=4;
        }       
    }
    cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 4
    return cnt;
}

Here is a full test.

#include <stdio.h>
#include <stdlib.h>
#if defined (__GNUC__) && ! defined (__INTEL_COMPILER)
#include <x86intrin.h>                
#else
#include <immintrin.h>                
#endif

#define N 50

inline int compact(int *x, int *y, const int n) {
    int cnt = 0;
    for(int i=0; i<n; i++) {
        int cut = x[i]!=0;
        y[cnt] = cut*x[i];
        cnt += cut;
    }
    return cnt;
}

int compact_SSE2(int *x, int *y, int n) {
    int i = 0, cnt = 0;
    for(i=0; i<n-4; i+=4) {
        __m128i x4 = _mm_loadu_si128((__m128i*)&x[i]);
        __m128i cmp = _mm_cmpeq_epi32(x4, _mm_setzero_si128());
        int mask = _mm_movemask_epi8(cmp);
        if(mask == 0xffff) continue; //all zeroes
        if(mask) {
            cnt += compact(&x[i], &y[cnt], 4);
        }
        else {
            _mm_storeu_si128((__m128i*)&y[cnt], x4);
            cnt += 4;
        }
    }
    cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 4
    return cnt;
}

int compact_AVX2(int *x, int *y, int n) {
    int i =0, cnt = 0;
    for(i=0; i<n-8; i+=8) {
        __m256i x4 = _mm256_loadu_si256((__m256i*)&x[i]);
        __m256i cmp = _mm256_cmpeq_epi32(x4, _mm256_setzero_si256());
        int mask = _mm256_movemask_epi8(cmp);
        if(mask == -1) continue; //all zeros
        if(mask) {
            cnt += compact(&x[i],&y[cnt], 8);
        }
        else {
            _mm256_storeu_si256((__m256i*)&y[cnt], x4);
            cnt +=8;
        }       
    }
    cnt += compact(&x[i], &y[cnt], n-i); // cleanup for n not a multiple of 8
    return cnt;
}

int main() {
    int x[N], y[N];
    for(int i=0; i<N; i++) x[i] = rand()%10;
    //int cnt = compact_SSE2(x,y,N);
    int cnt = compact_AVX2(x,y,N);
    for(int i=0; i<N; i++) printf("%d ", x[i]); printf("\n");
    for(int i=0; i<cnt; i++) printf("%d ", y[i]); printf("\n");
}
Z boson
  • In my case, I think it will be very rare that the _mm256_storeu_si256 branch will be taken. The scalar code given here causes a store-forwarding stall, and is not likely to outperform the lookup table + permutevar8x32 solution. This is even more true for me, because I actually have two AVX2 registers to do this shuffle with, so the table lookup to get the shuffle mask can be re-used. There's no way to know without measuring, but I'm leaning in that direction. – Eloff Aug 01 '14 at 16:54
  • @Eloff, I fixed my code to handle the sparse case as well (I just had to add the line `if(mask == -1) continue; //all zeros`). I also tested the AVX2 version and it works. – Z boson Aug 02 '14 at 09:25
  • @Eloff, I'm not sure why you say that it will be "very rare that the _mm256_storeu_si256 branch will be taken". Is this because it's more sparse than dense? I fixed the code to handle both the sparse and dense case now. This should work well for long runs of zeros and non-zeros. – Z boson Aug 02 '14 at 09:27
  • @Eloff, why do you say the scalar code will have a store-forwarding stall? As far as I understand that happens when reading from an array you just wrote to. But in my code I only read from the x array and only write to the y array, so I don't read and write to the same array. Maybe you're referring to something in your code? If so maybe you can post some code? – Z boson Aug 02 '14 at 09:29
  • Ah, sorry I see, you're correct. I thought you were treating the AVX "register" as an array, which will cause a store-forwarding stall. I can read from the original input array as well, so I'll compare it against the vpermd code and see if it's an improvement. – Eloff Aug 03 '14 at 16:08
  • In case you're interested, I got the idea for this from my answer here: https://stackoverflow.com/questions/20927710/quickly-count-number-of-zero-valued-bytes-in-an-array/20933337#20933337. In that case SSE was still better even for distributions which were even, which I did not expect. – Z boson Aug 04 '14 at 09:37
  • BMI2 allows us to efficiently generate a mask on the fly from the VMOVMSKPS result, see my [`compress256()`](http://stackoverflow.com/a/36951611/224132) function (which could easily be adapted for int32 vectors, using VPERMD instead of VPERMPS). Using that in place of `compact()` and keeping the branching could be a win for arrays with runs of perfect density or sparseness. For other sizes of integers, 64-bit PDEP will still work for 16 elements of 16-bits each, but there's no lane-crossing word or byte shuffle until AVX512. 64-bit integers or double are easy, of course. – Peter Cordes Oct 23 '16 at 00:41
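A minimal, untested sketch of the BMI2 + AVX2 shuffle-mask-on-the-fly idea mentioned in the last two comments, adapted for int32 elements from the linked compress256 answer (pack_left256_epi32 is a placeholder name; requires BMI2 and 64-bit mode):

#include <immintrin.h>
#include <stdint.h>

// Move the int32 elements of v selected by the low 8 bits of mask to the low
// end of the vector; the remaining elements are don't-care.
static __m256i pack_left256_epi32(__m256i v, unsigned mask) {
    // Spread each of the 8 mask bits to bit 0 of its own byte, then multiply
    // by 0xFF to broadcast it across the byte: 0x00 or 0xFF per element.
    uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;
    // Extract the byte indices 0..7 of the selected elements, packed low.
    uint64_t indices = _pext_u64(0x0706050403020100ULL, expanded);
    // Widen the 8 byte indices to 8 dwords and do the cross-lane shuffle.
    __m256i shufmask = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)indices));
    return _mm256_permutevar8x32_epi32(v, shufmask);
}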