
I am reading 16 bytes of data into a __m128i register and processing them as 8-bit elements.

Later I need to convert the 16x 8-bit elements into 16x 32-bit elements.

Obviously this requires 512 bits of storage. However, I presume it's best to avoid AVX-512 as it will reduce the CPU frequency?

Is it possible to convert 16x 8-bit elements from a __m128i to two __m256i registers, each containing 8x 32-bit elements?

user997112

1 Answer


The obvious way is two vpmovzxbd ymm, xmm instructions (or sx to sign-extend); see the asm manual entry. Unfortunately it can only read from the bottom of a vector, so it would take an extra shuffle to get the high 8 bytes of your input lined up. But before AVX-512, there are no lane-crossing shuffles with byte granularity. (AVX-512 VBMI vpermb could do this in one instruction, but it would need a vector constant and a mask constant.)

  __m256i lo = _mm256_cvtepu8_epi32(v);
  __m256i hi = _mm256_cvtepu8_epi32(_mm_unpackhi_epi64(v,v));  // vpunpckhqdq + vpmovzxbd

vpunpckhqdq can run on port 1 or 5 on Ice Lake, so it doesn't compete for the main shuffle unit on port 5 that vpmovzx* ymm, xmm needs. https://uops.info/.

AMD Zen doesn't handle vpmovzxbd ymm, xmm fully efficiently (as a single uop) until Zen 4. If you're mainly tuning for AMD, it might be worth doing vinserti128 and 2x vpshufb with two different vector constants, roughly as sketched below.
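
A minimal sketch of that AMD-oriented alternative (the shuffle-control constants here are my assumption, not spelled out in the answer): replicate v into both 128-bit lanes, then vpshufb each lane with a different control vector. pshufb zeroes a destination byte when the control byte's high bit is set, so -1 entries give the zero-extension for free.

   __m256i vv = _mm256_inserti128_si256(_mm256_castsi128_si256(v), v, 1);  // vinserti128
   const __m256i lo_ctrl = _mm256_setr_epi8(
        0,-1,-1,-1,  1,-1,-1,-1,  2,-1,-1,-1,  3,-1,-1,-1,    // low lane: bytes 0..3
        4,-1,-1,-1,  5,-1,-1,-1,  6,-1,-1,-1,  7,-1,-1,-1);   // high lane: bytes 4..7
   const __m256i hi_ctrl = _mm256_setr_epi8(
        8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1,    // low lane: bytes 8..11
       12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1);   // high lane: bytes 12..15
   __m256i lo = _mm256_shuffle_epi8(vv, lo_ctrl);   // 2x vpshufb, in-lane only
   __m256i hi = _mm256_shuffle_epi8(vv, hi_ctrl);

This trades the two vpmovzxbd shuffles for one vinserti128 plus two in-lane vpshufb, at the cost of keeping two 32-byte constants in registers.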


If you had just loaded this __m128i from memory, you could instead do a 128-bit broadcast-load, vbroadcasti128 ymm, [mem]. The intrinsics for this are inconsistently and poorly designed: the vbroadcastf128 intrinsics take a __m128 const * or __m128d const * pointer (not float*), but the vbroadcasti128 intrinsic takes a __m128i by value, despite the fact that the instruction only works with a memory source operand. So it relies on the compiler folding a _mm_loadu_si128 into the broadcast, and could tempt a compiler into spilling/reloading a __m128i, or into using vinserti128 to shuffle a register instead.

   const __m256i vbcst = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i*)addr));
   const __m128i v = _mm256_castsi256_si128(vbcst);  // try to convince the compiler to just use the low half for whatever you need

   __m256i lo = _mm256_cvtepu8_epi32(v);       // or with a different shuffle mask if optimizing for AMD
   __m256i hi = _mm256_shuffle_epi8(vbcst, mask);  // byte shuffle within each 128-bit half, which can also zero elements.
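
The mask used above isn't defined in the answer; one value that should work (my assumption) picks bytes 8..15 out of each 128-bit half of vbcst and zeroes the upper three bytes of each dword:

   const __m256i mask = _mm256_setr_epi8(
        8,-1,-1,-1,  9,-1,-1,-1, 10,-1,-1,-1, 11,-1,-1,-1,    // low lane -> elements 8..11
       12,-1,-1,-1, 13,-1,-1,-1, 14,-1,-1,-1, 15,-1,-1,-1);   // high lane -> elements 12..15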

> However, I presume it's best to avoid AVX-512 as it will reduce the CPU frequency?

The penalty is much smaller in Ice Lake than in previous generations, but any need for a frequency/voltage transition causes a significant stall when it happens. (Though if you keep using 512-bit vectors, it'll stay at the new speed/voltage for the rest of the program.) See SIMD instructions lowering CPU frequency.

Also, it's only 512-bit vector width that can cause a penalty, not a zero-masking vpermb ymm, which only requires AVX-512 VBMI + VL. (Getting a constant into a mask register would also cost a uop, so it might not save anything unless you're doing this in a loop.)

   __m256i lo = _mm256_cvtepu8_epi32(v);
#ifdef __AVX512VBMI__      // Ice Lake / Zen 4
   // Zero-masked vpermb: result byte 4*i takes v[8+i]; mask 0x11111111 zeroes the
   // other three bytes of each dword.  The undefined upper half of the cast is
   // never read, since every index value is < 16.
   __m256i hi = _mm256_maskz_permutexvar_epi8(0x11111111,
                    _mm256_setr_epi32(8,9,10,11,12,13,14,15),
                    _mm256_castsi128_si256(v));
#else
  ...
#endif

If you can't make effective use of 512-bit vectors for most of your program, yes, it's often better to only use 256-bit vectors. (Or if you're sharing a CPU with other work, by other threads or other processes.) But that shouldn't stop you from using AVX-512's new shuffles and features when they're useful with 256-bit vectors, if you can assume they're available. (If you'd have to make a different version of the function and dispatch based on run-time detection of AVX-512, it might not be worth it.)

Peter Cordes
  • Hi Peter. I am only coding for Intel. Yes I am populating the initial `__m128i` from unaligned bytes in memory. – user997112 Apr 03 '23 at 02:59
  • Regarding AVX512 performance degradation, what is the best way to measure this? I typically profile with `__rdtsc()` but this returns cycles. Consequently if the CPU frequency reduces, I doubt `__rdtsc()` would change. Is it best to use `chrono` high resolution clock? – user997112 Apr 03 '23 at 03:00
  • 1
    @user997112: `rdtsc` counts *reference* cycles that tick at a fixed frequency, independent of core clock cycles. See [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/q/13772567) On Linux for example, `perf stat` is a convenient way to count the `cycles` event for a whole program, so you can see both wall-clock time and clock cycles. (And thus also average CPU frequency while running it.) – Peter Cordes Apr 03 '23 at 03:26
  • @user997112: You say "Later I need to convert the 16x 8-bit elements in to 16x 32-bit elements", so I assumed that was the result of some computation. If it's the original bytes from memory that you need to unpack, then yeah you can use `vbroadcasti128` if you're doing this in a loop and having 1 or 2 registers holding shuffle-control vectors is a win vs. 1 extra shuffle uop. – Peter Cordes Apr 03 '23 at 03:28
  • "If it's the original bytes from memory that you need to unpack". Not 100% certain what this means, so i'll give an example of logic similar to what I am doing: I read 16 bytes in to an AVX register, let's say I then add an 8-bit integer to each element (no overflow). Then, in the next stage I multiple each 8-bit element by a 32-bit integer. Obviously now I need each 8-bit element to be expanded to 32-bits. – user997112 Apr 03 '23 at 03:56
  • @user997112: If you read that middle section of my answer about the idea of using `vbroadcasti128` when you load from memory, it should be clear what the restriction is. The byte values you want to unpack to 32-bit have to already be replicated in the high and low halves of a `__m256i`. If you can do the 8-bit integer add on replicated data (e.g. with a replicated constant?), that could work to save one instruction later when you're shuffling. Otherwise not. – Peter Cordes Apr 03 '23 at 04:19
  • Hi Peter, just revisiting this. So, due to `_mm256_cvtepu8_epi32` all we need to do is get the upper 64 bits of the 128 into the lower part of the 128. I looked at the first solution but I wasn't sure how the `_mm_unpackhi_epi64(v)` was intended, as it requires a second arg and does an interleave. Is there a way to achieve this using a shift? We just want to shift bytes 8 to 15, to 0 to 7? – user997112 May 07 '23 at 22:38
  • @user997112: I meant `_mm_unpackhi_epi64(v,v)`, thanks for spotting the typo. There are many other ways to shuffle the top 8 bytes to the bottom of a `__m128i`, given that we don't care what ends up in the upper 8 bytes of that `__m128i`. Possibilities include `psrldq` byte-shift, `pshufd` dword shuffle, `movhlps` (into any dummy destination, including itself), `palignr same,same,8` rotate, etc. `vpunpckhqdq dst, same,same` is one of the most efficient on modern Intel CPUs, for instruction size (no immediate, 2-byte VEX prefix) and for being able to run on multiple execution ports. – Peter Cordes May 07 '23 at 22:52
  • @user997112: unpackhi_epi64 interleaves 64-bit chunks, but there are only 2 total within the 128-bit lane, so it really just concatenates the high halves of the input vectors. Smaller-granularity shuffles like `_mm_unpackhi_epi16` would interleave the data you want with data you don't; that's why I chose `epi64`. – Peter Cordes May 07 '23 at 22:54
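
For reference, the byte-shift variant asked about in the last couple of comments would look something like this (a sketch; equivalent in effect to the `_mm_unpackhi_epi64(v,v)` version, since only the low 8 bytes matter afterwards):

   __m256i hi = _mm256_cvtepu8_epi32(_mm_srli_si128(v, 8));   // psrldq xmm,8 + vpmovzxbd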