0

I'm trying to convert this to AVX2:

// parallel arrays
int16_t* Nums = ...
int16_t* Capacities = ...
int** Data = ...

int* freePointer = ...

for (int i = 0; i < n; i++)
{
    if (Nums[i] == 0)
        Capacities[i] = 0;
    else
    {
        Data[i] = freePointer;
        freePointer += Capacities[i];
    }
}

But didn't get too far:

for (int i = 0; i < n; i += 4) // 4 as Data is 64 bits
{
    const __m256i nums = _mm256_loadu_si256((__m256i*)&Nums[i]);
    const __m256i bZeroes = _mm256_cmpeq_epi16(nums, ZEROES256);
    const __m256i capacities = _mm256_loadu_si256((__m256i*)&Capacities[i]);
    const __m256i zeroedCapacities = _mm256_andnot_si256(bZeroes, capacities);
    _mm256_storeu_si256((__m256i*)&Capacities[i], zeroedCapacities);


}

Stuck at the else branch, not sure how to add (prefix sum?...) Capacities into freePointer and assign the "serial" results to Data in the same 256-bit SIMD register. My terminology is probably off, I hope the code gets across what I'm trying to accomplish.

lane0: freePointer
lane1: freePointer + Capacities[i + 0]
lane2: freePointer + Capacities[i + 0] + Capacities[i + 1]
lane3: freePointer + Capacities[i + 0] + Capacities[i + 1] + Capacities[i + 2]

Basically this is what I want to do in as few instructions as possible, if at all possible. Target is AVX2.

GlassBeaver
  • 196
  • 4
  • 15
  • 2
    That's called a "prefix sum". [parallel prefix (cumulative) sum with SSE](https://stackoverflow.com/q/19494114) shows that for 8x 32-bit FP elements, or 4x 32-bit FP elements with SSE. You could maybe adapt the shuffle for 64-bit integer elements, but the shuffle latency might be a problem. (See links in [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) for hsums, which needs some of the same shuffling.) – Peter Cordes Sep 24 '21 at 15:09
  • 1
    Oh, and you need to mask your incoming data on the fly, but I think you have that figured out with `cmpeq_epi16` / `andnot`. But yeah, then widen to 64-bit pointer width with `vpmovzx` I guess. (Or 32-bit array offsets if you want better data density.) – Peter Cordes Sep 24 '21 at 15:15
  • Thanks! So you're saying there's a good chance it might be too slow even if I figured out how to do it, correct? – GlassBeaver Sep 24 '21 at 15:21
  • 2
    It's worth trying to vectorize, especially if you can change `Data[]` to an array of `uint32_t` offsets relative to an array. (Or if pointers are 32-bit, e.g. `gcc -mx32` or actual 32-bit mode). But even with 64-bit pointers, you could maybe do some of the prefix summing in groups of 4 elements using fast 128-bit shuffles, and only at the end widen to 64-bit. – Peter Cordes Sep 24 '21 at 15:42

1 Answers1

0

You can find a lot of details here: https://stackoverflow.com/a/69452433/5021064

Here you can plug in any type instead of T and U see the resulting asm for x86 and arm

Denis Yaroshevskiy
  • 1,218
  • 11
  • 24