I'm trying to convert this to AVX2:
// parallel arrays
int16_t* Nums = ...
int16_t* Capacities = ...
int** Data = ...
int* freePointer = ...
for (int i = 0; i < n; i++)
{
    if (Nums[i] == 0)
        Capacities[i] = 0;
    else
    {
        Data[i] = freePointer;
        freePointer += Capacities[i];
    }
}
But I didn't get very far:
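// NOTE: each 256-bit load/store below covers 16 int16_t values, while i advances by 4,
// so iterations overlap and the last ones touch memory past index n - 1.
for (int i = 0; i < n; i += 4) // 4 as Data is 64 bits
{
    const __m256i nums = _mm256_loadu_si256((__m256i*)&Nums[i]);
    const __m256i bZeroes = _mm256_cmpeq_epi16(nums, ZEROES256);
    const __m256i capacities = _mm256_loadu_si256((__m256i*)&Capacities[i]);
    // Keep a capacity only where the matching Nums entry is non-zero.
    const __m256i zeroedCapacities = _mm256_andnot_si256(bZeroes, capacities);
    _mm256_storeu_si256((__m256i*)&Capacities[i], zeroedCapacities);
}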
I'm stuck at the else branch: I'm not sure how to accumulate Capacities into freePointer (a prefix sum?) and assign the resulting "serial" values to Data within the same 256-bit SIMD register.
My terminology is probably off, but I hope the code gets across what I'm trying to accomplish. For each group of four elements, the 64-bit lanes to be stored to Data would hold:
lane0: freePointer
lane1: freePointer + Capacities[i + 0]
lane2: freePointer + Capacities[i + 0] + Capacities[i + 1]
lane3: freePointer + Capacities[i + 0] + Capacities[i + 1] + Capacities[i + 2]
Basically this is what I want to do in as few instructions as possible, if at all possible. Target is AVX2.
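One possible direction, as a minimal sketch rather than a tuned implementation: widen the four masked capacities to 64-bit byte offsets, build an in-register prefix sum with one byte shift plus one cross-lane permute, and add the broadcast freePointer. Everything named here is an assumption for illustration: the function `assign_blocks_avx2`, the `TARGET_AVX2` macro (only there so GCC/Clang can compile it without `-mavx2`), 64-bit pointers, `sizeof(int) == 4`, and a scalar loop for the last `n % 4` elements. A masked store leaves Data[i] untouched where Nums[i] == 0, matching the scalar loop.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: lets GCC/Clang build this one function for AVX2 without -mavx2.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

static_assert(sizeof(int*) == 8, "sketch assumes 64-bit pointers");
static_assert(sizeof(int) == 4, "sketch assumes 4-byte int");

// Hypothetical helper: same effect as the scalar loop, returns the advanced freePointer.
TARGET_AVX2
int* assign_blocks_avx2(const int16_t* Nums, int16_t* Capacities,
                        int** Data, int* freePointer, int n)
{
    assert(n >= 0);
    const __m256i zero = _mm256_setzero_si256();
    // Running pointer value, broadcast to all four 64-bit lanes.
    __m256i base = _mm256_set1_epi64x((long long)(intptr_t)freePointer);

    int i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // Load 4 x int16_t (64 bits) from each array.
        const __m128i nums = _mm_loadl_epi64((const __m128i*)&Nums[i]);
        __m128i caps = _mm_loadl_epi64((const __m128i*)&Capacities[i]);

        // Zero the capacities whose Nums entry is 0 and write them back.
        const __m128i isZero = _mm_cmpeq_epi16(nums, _mm_setzero_si128());
        caps = _mm_andnot_si128(isZero, caps);
        _mm_storel_epi64((__m128i*)&Capacities[i], caps);

        // Widen to 64 bits and scale element counts to byte offsets.
        const __m256i bytes = _mm256_slli_epi64(_mm256_cvtepi16_epi64(caps), 2);

        // Inclusive prefix sum: [a, a+b, c, c+d], then add a+b to the upper two lanes.
        __m256i incl = _mm256_add_epi64(bytes, _mm256_slli_si256(bytes, 8));
        const __m256i lowTotal = _mm256_permute4x64_epi64(incl, _MM_SHUFFLE(1, 1, 1, 1));
        incl = _mm256_add_epi64(incl, _mm256_blend_epi32(zero, lowTotal, 0xF0));

        // Exclusive sum = inclusive sum minus each lane's own value.
        const __m256i ptrs = _mm256_add_epi64(base, _mm256_sub_epi64(incl, bytes));

        // Store only the lanes where Nums != 0.
        const __m256i keep = _mm256_cmpeq_epi64(_mm256_cvtepi16_epi64(isZero), zero);
        _mm256_maskstore_epi64((long long*)&Data[i], keep, ptrs);

        // Advance the running pointer by the group total (lane 3) and rebroadcast.
        base = _mm256_permute4x64_epi64(_mm256_add_epi64(base, incl),
                                        _MM_SHUFFLE(3, 3, 3, 3));
    }

    freePointer = (int*)(intptr_t)_mm_cvtsi128_si64(_mm256_castsi256_si128(base));

    // Scalar tail for the last n % 4 elements.
    for (; i < n; i++)
    {
        if (Nums[i] == 0)
            Capacities[i] = 0;
        else
        {
            Data[i] = freePointer;
            freePointer += Capacities[i];
        }
    }
    return freePointer;
}
```

Keeping the running pointer in a vector avoids a vector-to-scalar round trip in every iteration, but the additions still form a serial dependency chain from one group of four to the next, so whether this beats the scalar loop would need measuring.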