
I have done this with SSSE3; now I'm wondering if it could be done with AVX2 for better performance.

I'm padding 24-bit RGB with one zero byte per pixel, using the code from Fast 24-bit array -> 32-bit array conversion?:

    // Expand each 3-byte pixel to 4 bytes, writing zero into the 4th byte.
    static const __m128i mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);
    for (size_t row = 0; row < height; ++row)
    {
        for (size_t column = 0; column < width; column += 16)
        {
            // 16 pixels per iteration: 48 input bytes (3 aligned loads), 64 output bytes (4 aligned stores).
            const __m128i *src = reinterpret_cast<const __m128i *>(in + row * in_pitch + column + (column << 1));
            __m128i *dst = reinterpret_cast<__m128i *>(out + row * out_pitch + (column << 2));
            __m128i v[4];
            v[0] = _mm_load_si128(src);
            v[1] = _mm_load_si128(src + 1);
            v[2] = _mm_load_si128(src + 2);
            // The palignr calls realign the data so each pshufb sees 4 whole pixels in its low 12 bytes.
            v[3] = _mm_shuffle_epi8(v[0], mask);
            _mm_store_si128(dst, v[3]);
            v[3] = _mm_shuffle_epi8(_mm_alignr_epi8(v[1], v[0], 12), mask);
            _mm_store_si128(dst + 1, v[3]);
            v[3] = _mm_shuffle_epi8(_mm_alignr_epi8(v[2], v[1], 8), mask);
            _mm_store_si128(dst + 2, v[3]);
            v[3] = _mm_shuffle_epi8(_mm_alignr_epi8(v[2], v[2], 4), mask);
            _mm_store_si128(dst + 3, v[3]);
        }
    }

The problem is that _mm256_shuffle_epi8 shuffles the high and low 128-bit lanes separately, so the mask can't simply be replaced with

    _mm256_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1, 12, 13, 14, -1, 15, 16, 17, -1, 18, 19, 20, -1, 21, 22, 23, -1);

Also, _mm_alignr_epi8 would need to be replaced with a combination of _mm256_permute2x128_si256 and _mm256_alignr_epi8.
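To see the in-lane behaviour concretely, here is a tiny standalone demo (my own, separate from the conversion code): _mm256_shuffle_epi8 uses only the low 4 bits of each control byte, within that byte's own 128-bit lane, so an index of 16 does not reach into the other lane.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        alignas(32) uint8_t bytes[32];
        for (int i = 0; i < 32; ++i)
            bytes[i] = static_cast<uint8_t>(i);
        const __m256i v = _mm256_load_si256(reinterpret_cast<const __m256i *>(bytes));

        // Control byte 16: bit 7 is clear and only the low 4 bits are used
        // within each lane, so this selects byte 0 of *each* lane.
        const __m256i ctrl = _mm256_set1_epi8(16);

        alignas(32) uint8_t out[32];
        _mm256_store_si256(reinterpret_cast<__m256i *>(out), _mm256_shuffle_epi8(v, ctrl));
        std::printf("%u %u\n", unsigned(out[0]), unsigned(out[16]));  // prints "0 16", not "16 16"
        return 0;
    }

Compile with AVX2 enabled (e.g. -mavx2 or /arch:AVX2).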

Peter Cordes
Wiki Wang
  • Have you made an attempt at this yet? If so, please post your code so far. If not, then maybe you could post your existing SSE code as a starting point? – Paul R Feb 11 '18 at 13:20
  • What exactly does '24bit to 32bit' mean? Adding an alpha component? Extending each channel's 8 bits to 11,10,11 or 10,12,10? – Jongware Feb 11 '18 at 13:41
  • Do you mean 24-bit RGB padded with one zero byte? For that you do not need SSE3, nor anything else, because it is the same thing. – BalticMusicFan Feb 11 '18 at 22:46
  • Thanks for the comments, guys, I edited the question to add details. Please let me know if it's still not clear enough. – Wiki Wang Feb 12 '18 at 01:24
  • The in-lane nature of AVX2 means that SSSE3 `pshufb` might still be the best choice. But you should consider doing unaligned loads instead of using `_mm_alignr_epi8`, because modern Intel will bottleneck on one shuffle per clock before it bottlenecks on one store per clock with your code, which does multiple shuffles per store. – Peter Cordes Feb 12 '18 at 07:12 (sketched below)
  • @PeterCordes One thing I've been wondering about for a while, but never cared enough to test, is whether there is a throughput penalty for partially overlapping writes. And if so, whether AVX512 masking can avoid it. This is something I often do to deal with odd-sized writes in place of manual alignment. My guess is that the answer is no to a throughput penalty if you don't intend to load it back anytime soon, and no in that masking will not let you avoid forwarding stalls in the case where you do need to load it back. – Mysticial Feb 12 '18 at 20:28
  • @Mysticial - my tests have shown no penalty for overlapping writes. Of course there may be penalties when unaligned writes overlap a cache line, but there is no particular penalty for a write that overlaps an earlier write. This means that if you want to merge together a bunch of small byte segments that have odd sizes, a series of overlapping writes is a good strategy and runs at 1 segment per cycle (if each segment fits in a register, plus some penalties for the inevitable cache line crossings). – BeeOnRope Feb 12 '18 at 23:09
  • @Mysticial: my limited testing has found the same thing as Bee's: no throughput penalty for overlapping writes other than at cache-line boundaries. I'm pretty confident that store-forwarding still works well from the final store, regardless of whether other earlier stores overlap. – Peter Cordes Feb 13 '18 at 03:48
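A sketch of Peter Cordes' suggestion above (my own code, untested): keep the SSSE3 pshufb, but replace palignr with overlapping unaligned loads so that each 16-byte store costs exactly one shuffle. As in the question's code, width is assumed to be a multiple of 16 pixels, and the last load of each group reads 4 bytes past its 48 input bytes, so each input row needs that much readable slack (or a scalar tail).

    #include <emmintrin.h>
    #include <tmmintrin.h>   // SSSE3
    #include <cstdint>
    #include <cstddef>

    void rgb24_to_rgb0_ssse3_unaligned(const uint8_t *in, uint8_t *out,
                                       size_t width, size_t height,
                                       size_t in_pitch, size_t out_pitch)
    {
        const __m128i mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                           6, 7, 8, -1, 9, 10, 11, -1);
        for (size_t row = 0; row < height; ++row)
        {
            for (size_t column = 0; column < width; column += 16)
            {
                const uint8_t *src = in  + row * in_pitch  + column * 3;
                uint8_t       *dst = out + row * out_pitch + column * 4;
                // Four overlapping loads, 12 bytes apart; one pshufb and one store each.
                for (int k = 0; k < 4; ++k)
                {
                    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + 12 * k));
                    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + 16 * k), _mm_shuffle_epi8(v, mask));
                }
            }
        }
    }

This trades the three palignr per 64 output bytes for overlapping loads; loads can run at 2 per clock on recent Intel CPUs, so the single pshufb per store is the only shuffle on the critical throughput path.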

2 Answers


You can process 8 pixels at a time (24 input bytes and 32 output bytes) reasonably efficiently with AVX2.

You just have to align your loads so that the 24-byte block of pixels you will process is centered in the middle of the 32-byte load, rather than the usual approach of aligning the load to the start of the block of pixels2. This means that the lane boundary will fall between pixels 4 and 5, and you'll have the bytes for exactly 4 pixels in each lane. Combined with the appropriate shuffle mask, this should be twice as efficient as SSE.

For example:

Given an input pointer uint8_t input[], you handle the first four pixels with non-SIMD code1 and then do your first 32-byte load at input[8], so that the low-order lane (bytes 0-15) gets the 12 payload bytes for pixels 4,5,6,7 in its high-order bytes, followed immediately by the 12 payload bytes for the next 4 pixels in the high lane. Then you use pshufb to expand the pixels into their correct positions (you need a different mask for each lane, since you are moving the pixels in the low lane towards lower positions, and those in the high lane to higher positions, but this doesn't pose a problem). The next load would be at input[32] (24 bytes later), and so on.

You should get about 8 pixels per cycle throughput with this approach, for perfectly cached input/output - limited by 1/cycle store throughput and 1/cycle shuffle throughput. Luckily, this approach is compatible with always-aligned stores (since the store increment is 32 bytes). You will have some misaligned loads, but those can still occur at 1/cycle so shouldn't be a bottleneck.

It's worth noting that this type of approach "only works once" in terms of SIMD instruction set widening: it works when you have 2 lanes, but not more (so the same idea wouldn't apply to the 512-bit AVX512 with 4 128-bit lanes).
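As a concrete illustration, here is a rough sketch of that approach (my own code, not the answer author's, and untested): a 4-pixel scalar prologue and epilogue keep the off-by-4 loads inside the buffer, each 32-byte load starts 4 bytes before an 8-pixel group, and a single _mm256_shuffle_epi8 with different control bytes in each lane expands both halves at once.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Expand packed 24-bit RGB to RGB0 (fourth byte = 0), 8 pixels per vector iteration.
    void rgb24_to_rgb0_avx2(const uint8_t *in, uint8_t *out, size_t pixel_count)
    {
        size_t i = 0;

        // Scalar prologue: first 4 pixels, so the first vector load (at in + 3*i - 4)
        // never reads before the start of the buffer.
        for (; i < 4 && i < pixel_count; ++i)
        {
            out[4*i+0] = in[3*i+0];
            out[4*i+1] = in[3*i+1];
            out[4*i+2] = in[3*i+2];
            out[4*i+3] = 0;
        }

        // Low lane: its 4 pixels sit in the lane's high 12 bytes (lane bytes 4..15).
        // High lane: its 4 pixels sit in the lane's low 12 bytes (lane bytes 0..11).
        const __m256i mask = _mm256_setr_epi8(
             4,  5,  6, -1,  7,  8,  9, -1, 10, 11, 12, -1, 13, 14, 15, -1,
             0,  1,  2, -1,  3,  4,  5, -1,  6,  7,  8, -1,  9, 10, 11, -1);

        // Stop 2 pixels early so the 4 bytes read past each 24-byte group stay in bounds.
        for (; i + 10 <= pixel_count; i += 8)
        {
            const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(in + 3*i - 4));
            _mm256_storeu_si256(reinterpret_cast<__m256i *>(out + 4*i), _mm256_shuffle_epi8(v, mask));
        }

        // Scalar epilogue for whatever is left.
        for (; i < pixel_count; ++i)
        {
            out[4*i+0] = in[3*i+0];
            out[4*i+1] = in[3*i+1];
            out[4*i+2] = in[3*i+2];
            out[4*i+3] = 0;
        }
    }

With one load, one shuffle and one store per 8 pixels, the shuffle port and the store port are each used once per 32 output bytes, which is where the "about 8 pixels per cycle" estimate above comes from.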


1This avoids reading out of bounds before the input array: if you know this is safe, you can avoid this step.

2That is, if you load from addr it is addr + 16 that should be at a pixel boundary ((addr + 16) % 12 == 0), not addr.

BeeOnRope
  • Thanks for the answer. I actually gave this one a go, and while the resulting output is indeed correct, it's about 50% slower than the SSE version the OP is referencing. I'm guessing it's due to a combination of the fact that it inevitably does most loads unaligned and that it essentially throws away 25% of all loaded data. I don't recommend this approach. – Kumputer Feb 03 '19 at 08:11
  • FWIW, I was actually able to get about a 40% perf increase by using the same SSSE3 code and simply enabling the compiler's AVX option (which turns on VEX encoding). It seems that's going to give us the best win unless someone else has a true AVX2 port of the SSSE3 version. We can blame Intel for AVX's lane boundary restrictions. – Kumputer Feb 03 '19 at 08:26
  • @Kumputer - what type of machine and can you share your code and benchmark? – BeeOnRope Feb 03 '19 at 15:04
  • More details in the answer below. – Kumputer Feb 04 '19 at 21:39

Here is the original SSSE3 code, with some of my own dispatching thrown in.

void DspConvertPcm(f32* pOutBuffer, const s24* pInBuffer, size_t totalSampleCount)
{
    constexpr f32 fScale = static_cast<f32>(1.0 / (1<<23));

    size_t i = 0;
    size_t vecSampleCount = 0;

#if defined(SFTL_SSE2)
    if (CpuInfo::GetSupports_SIMD_I32x8())
    {
        //  Advance i past the samples already converted by the AVX2 path,
        //  so the scalar tail below doesn't redo them.
        i = vecSampleCount = DspConvertPcm_AVX2(pOutBuffer, pInBuffer, totalSampleCount);
    }
    else
    if (CpuInfo::GetSupports_SSE3())
    {
        const auto vScale = _mm_set1_ps(fScale);
        //  Put each 24-bit sample in the top 3 bytes of its 32-bit lane (low byte zeroed);
        //  the arithmetic right shift by 8 below then sign-extends it to a full s32.
        const auto mask = _mm_setr_epi8(-1, 0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11);

        constexpr size_t step = 16;
        vecSampleCount = (totalSampleCount / step) * step;

        for (; i < vecSampleCount; i += step)
        {
            const auto* pSrc = reinterpret_cast<const __m128i*>(pInBuffer + i);
            auto* pDst = pOutBuffer + i;

            const auto sa = _mm_loadu_si128(pSrc + 0);
            const auto sb = _mm_loadu_si128(pSrc + 1);
            const auto sc = _mm_loadu_si128(pSrc + 2);

            const auto da = _mm_srai_epi32(_mm_shuffle_epi8(sa, mask), 8);
            const auto db = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask), 8);
            const auto dc = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sc, sb,  8), mask), 8);
            const auto dd = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sc, sc,  4), mask), 8);

            //  Convert to float and store
            _mm_storeu_ps(pDst + 0,  _mm_mul_ps(_mm_cvtepi32_ps(da), vScale));
            _mm_storeu_ps(pDst + 4,  _mm_mul_ps(_mm_cvtepi32_ps(db), vScale));
            _mm_storeu_ps(pDst + 8,  _mm_mul_ps(_mm_cvtepi32_ps(dc), vScale));
            _mm_storeu_ps(pDst + 12, _mm_mul_ps(_mm_cvtepi32_ps(dd), vScale));
        }
    }
#endif

    for (; i < totalSampleCount; i += 1)
    {
        pOutBuffer[i] = (static_cast<s32>(pInBuffer[i])) * fScale;
    }
}

If AVX2 is present, it will call into DspConvertPcm_AVX2, which looks like so:

size_t DspConvertPcm_AVX2(f32* pOutBuffer, const s24* pInBuffer, size_t totalSampleCount)
{
    SFTL_ASSERT(CpuInfo::GetSupports_SIMD_I32x8());

    constexpr f32 fScale = static_cast<f32>(1.0 / (1 << 23));
    const auto vScale = _mm256_set1_ps(fScale);

    auto fnDo16Samples = [vScale](f32* pOutBuffer, const s24* pInBuffer)
    {
        const auto vScaleSSE = _mm256_castps256_ps128(vScale);
        const auto mask = _mm_setr_epi8(-1, 0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11);

        const auto* pSrc = reinterpret_cast<const __m128i*>(pInBuffer);
        auto* pDst = pOutBuffer;

        const auto sa = _mm_loadu_si128(pSrc + 0);
        const auto sb = _mm_loadu_si128(pSrc + 1);
        const auto sc = _mm_loadu_si128(pSrc + 2);

        const auto da = _mm_srai_epi32(_mm_shuffle_epi8(sa, mask), 8);
        const auto db = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask), 8);
        const auto dc = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask), 8);
        const auto dd = _mm_srai_epi32(_mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask), 8);

        //  Convert to float and store
        _mm_storeu_ps(pDst +  0, _mm_mul_ps(_mm_cvtepi32_ps(da), vScaleSSE));
        _mm_storeu_ps(pDst +  4, _mm_mul_ps(_mm_cvtepi32_ps(db), vScaleSSE));
        _mm_storeu_ps(pDst +  8, _mm_mul_ps(_mm_cvtepi32_ps(dc), vScaleSSE));
        _mm_storeu_ps(pDst + 12, _mm_mul_ps(_mm_cvtepi32_ps(dd), vScaleSSE));
    };

    //  First 16 samples SSE style
    fnDo16Samples(pOutBuffer, pInBuffer);

    //  Remaining samples use AVX2; each 32-byte load carries 4 slack bytes at its start and 4 at its end
    constexpr size_t step = 16;
    const size_t vecSampleCount = ((totalSampleCount / step) * step) - 16;
    {
        //  NOTE: _mm256_shuffle_epi8 only uses the low 4 bits of each control byte within its own
        //  128-bit lane, so the 16..27 entries here select bytes 0..11 of the high lane.
        const auto mask = _mm256_setr_epi8(-1, 4, 5, 6, -1, 7, 8, 9, -1, 10, 11, 12, -1, 13, 14, 15, -1, 16, 17, 18, -1, 19, 20, 21, -1, 22, 23, 24, -1, 25, 26, 27);
        for (size_t i = 16; i < vecSampleCount; i += step)
        {
            const byte* pByteBuffer = reinterpret_cast<const byte*>(pInBuffer + i);
            auto* pDst = pOutBuffer + i;

            //  Each load starts 4 bytes before its 24 payload bytes so that each 128-bit lane
            //  holds exactly 4 whole samples.
            const auto vs24_00_07 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pByteBuffer -  4));
            //  Samples i+8..i+15 start 24 bytes into this block, so this load sits at +24 - 4 = +20.
            const auto vs24_07_15 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(pByteBuffer + 20));

            const auto vf32_00_07 = _mm256_srai_epi32(_mm256_shuffle_epi8(vs24_00_07, mask), 8);
            const auto vf32_07_15 = _mm256_srai_epi32(_mm256_shuffle_epi8(vs24_07_15, mask), 8);

            //  Convert to float and store
            _mm256_storeu_ps(pDst + 0, _mm256_mul_ps(_mm256_cvtepi32_ps(vf32_00_07), vScale));
            _mm256_storeu_ps(pDst + 8, _mm256_mul_ps(_mm256_cvtepi32_ps(vf32_07_15), vScale));
        }
    }

    //  Last 16 samples SSE style
    fnDo16Samples(pOutBuffer + vecSampleCount, pInBuffer + vecSampleCount);

    return vecSampleCount;
}

Note that I manually unrolled the AVX2 main loop once (two 256-bit vectors per iteration) to try to speed it up a bit, but it didn't make much of a difference.

With a timer strapped on just before the call to DspConvertPcm, which processes 1024 samples at a time, the average processing time with the AVX2 code path enabled varied between 2.6 and 3.0 microseconds, while with the AVX2 code path disabled it hovered around 2.0 microseconds.

On the other hand, enabling VEX encoding using /arch:AVX2 didn't give me the consistent performance boost that I claimed previously, so that must have been a fluke.

This test was performed on a Haswell core i7-6700HQ @ 2.6 GHz using the default MSVC compiler on Visual Studio 15.9.5 with optimizations enabled for speed and using /fp:fast.

Kumputer
  • i7-6700HQ is Skylake, not Haswell. Anyway, you're *sign*-extending 24-bit to 32-bit. That's a different problem from RGB to RGBA or RGB0, where you want to fill the A byte with a fixed value (in this case 0). This code looks useful, but posted under the wrong question. Although I suspect @BeeOnRope's method of doing unaligned loads so you only need one shuffle per `srai` would be better, especially on Haswell/Skylake (1/clock shuffle throughput, down from 2 on IvyBridge, but still 2/clock load throughput). – Peter Cordes Feb 04 '19 at 10:08
  • You're already doing that for the AVX2 version, it looks like, but not SSSE3, so it will be slower on Haswell/Skylake Pentium/Celeron CPUs that only have SSE4.2. I haven't tried to figure out which will be faster on SnB, or Nehalem. Actual Core 2 with slow unaligned loads (even when not crossing a cache line boundary) will probably benefit from `palignr`, at least Penryn, maybe also Conroe even though Conroe has slow shuffles. And BTW, instead of scalar cleanup you could use overlapping vectors, unless the buffer size is potentially smaller than one unrolled inner loop. – Peter Cordes Feb 04 '19 at 10:13