
I'm using inline assembly; my code looks like this:

__m128i inl = _mm256_castsi256_si128(in);
__m128i inh = _mm256_extractf128_si256(in, 1); 
__m128i outl, outh;
__asm__(
    "vmovq %2, %%rax                        \n\t"
    "movzwl %%ax, %%ecx                     \n\t"
    "shr $16, %%rax                         \n\t"
    "movzwl %%ax, %%edx                     \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "xorw %4, %%cx                          \n\t"
    "xorw %4, %%dx                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%dx                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%edx, %%edx), %%edx        \n\t"
    "pxor %0, %0                            \n\t"
    "vpinsrw $0, %%ecx, %0, %0              \n\t"
    "vpinsrw $1, %%edx, %0, %0              \n\t"

: "=x" (outl), "=x" (outh)
: "x" (inl), "x" (inh), "r" (subkey)
: "%rax", "%rcx", "%rdx"
);

I've omitted some of the vpinsrw instructions here; this is just to show the principle. The real code uses 16 vpinsrw operations. But the output doesn't match what's expected.

b0f0 849f 446b 4e4e e553 b53b 44f7 552b 67d  1476 a3c7 ede8 3a1f f26c 6327 bbde
e553 b53b 44f7 552b    0    0    0    0 b4b3 d03e 6d4b c5ba 6680 1440 c688 ea36

The first line is the expected answer, and the second line is my result. The original C code is:

for(i = 0; i < 16; i++)
{
    arr[i] = (u16)(s16[arr[i]] ^ subkey);
    arr[i] = (arr[i] << 7) | (arr[i] >> 9);
    arr[i] = s16[arr[i]];
}

My task is to make this code faster.

In the older code, the data was moved from the ymm register to the stack, and then loaded from the stack into integer registers, like this. So I want to move the data directly from the ymm register into the integer registers.

__asm__(     

    "vmovdqa %0, -0xb0(%%rbp)               \n\t"

    "movzwl -0xb0(%%rbp), %%ecx             \n\t"
    "movzwl -0xae(%%rbp), %%eax             \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "xorw %1, %%cx                          \n\t"
    "xorw %1, %%ax                          \n\t"
    "rolw $7, %%cx                          \n\t"
    "rolw $7, %%ax                          \n\t"
    "movzwl s16(%%ecx, %%ecx), %%ecx        \n\t"
    "movzwl s16(%%eax, %%eax), %%eax        \n\t"
    "movw %%cx, -0xb0(%%rbp)                \n\t"
    "movw %%ax, -0xae(%%rbp)                \n\t"
Aki Suihkonen
Bai
  • What is the value of `in` at the beginning? – fuz Aug 11 '17 at 07:54
  • Oh, I'm sorry, the variable `in` is 256 bits stored in a ymm register. – Bai Aug 11 '17 at 08:00
  • maybe I should align the variable? – Bai Aug 11 '17 at 08:02
  • 1
    I need to know the input you provided so I can make sure that the output is correct. You just posted the output of the function but not the input you gave it. – fuz Aug 11 '17 at 09:37
  • 1
    Also, leaving off half the code in your question is not a good way to get a decent answer. – fuz Aug 11 '17 at 09:37
  • You don't have to add "\n\t" for multiple strings iirc. – huseyin tugrul buyukisik Aug 11 '17 at 09:56
  • 2
    @huseyintugrulbuyukisik: It makes the compiler-generated asm output (`gcc -O3 -S`) a lot more readable than just using `;` to separate instructions and letting C string-literal concatenation pack everything into one line. You can omit the `\n\t` at the end of the last line, but the OP's inline asm formatting style is good. (Instruction-choice OTOH.... not so much.) – Peter Cordes Aug 13 '17 at 02:14
  • @PeterCordes I was using perf tool. Thank you, I will add `\n\t` too just in case I need asm output. – huseyin tugrul buyukisik Aug 13 '17 at 02:17

2 Answers

5

On Skylake (where gather is fast), it might well be a win to chain two gathers together, as in Aki's answer. That lets you do the rotate very efficiently with vector-integer stuff.

On Haswell, it might be faster to keep using your scalar code, depending on what the surrounding code looks like. (Or maybe doing the vector rotate+xor with vector code is still a win. Try it and see.)

You have one really bad performance mistake, and a couple of other problems:

"pxor %0, %0                            \n\t"
"vpinsrw $0, %%ecx, %0, %0              \n\t"

Using a legacy-SSE pxor to zero the low 128b of %0 while leaving the upper 128b unmodified will cause an SSE-AVX transition penalty on Haswell; about 70 cycles each on the pxor and the first vpinsrw, I think. On Skylake, it will only be slightly slower, and have a false dependency.

Instead, use vmovd %%ecx, %0, which zeros the upper bytes of the vector reg (thus breaking the dependency on the old value).

Actually, use

"vmovd        s16(%%rcx, %%rcx), %0       \n\t"   // leaves garbage in element 1, which you over-write right away
"vpinsrw  $1, s16(%%rdx, %%rdx), %0, %0   \n\t"
...

It's a huge waste of instructions (and uops) to load into integer registers and then go from there into vectors, when you could insert directly into vectors.

Your indices are already zero-extended, so I used 64-bit addressing modes to avoid wasting an address-size prefix on each instruction. (Since your table is static, it's in the low 2G of virtual address space (in the default code-model), so 32-bit addressing did actually work, but it gained you nothing.)

I experimented a while ago with getting scalar LUT results (for GF16 multiply) into vectors, tuning for Intel Sandybridge. I wasn't chaining the LUT lookups like you are, though. See https://github.com/pcordes/par2-asm-experiments. I kind of abandoned it after finding out that GF16 is more efficient with pshufb as a 4-bit LUT, but anyway I found that pinsrw from memory into a vector was good if you don't have gather instructions.

You might want to expose more ILP by interleaving operations on two vectors at once, or maybe even inserting into the low 64b of 4 vectors and combining with vpunpcklqdq, as sketched below. (vmovd is faster than vpinsrw, so it's pretty much break-even on uop throughput.)
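A minimal intrinsics sketch of that interleaving idea (combine8 and the r[] array are hypothetical; they just stand in for eight already-computed LUT results):

#include <immintrin.h>
#include <stdint.h>

// Build the low 64 bits of two vectors in independent dependency chains, then merge
// them with vpunpcklqdq so the two insert chains can run in parallel.
static inline __m128i combine8(const uint16_t r[8])
{
    __m128i a = _mm_cvtsi32_si128(r[0]);       // vmovd breaks the dependency on the old value
    a = _mm_insert_epi16(a, r[1], 1);          // vpinsrw
    a = _mm_insert_epi16(a, r[2], 2);
    a = _mm_insert_epi16(a, r[3], 3);
    __m128i b = _mm_cvtsi32_si128(r[4]);       // independent chain for the other half
    b = _mm_insert_epi16(b, r[5], 1);
    b = _mm_insert_epi16(b, r[6], 2);
    b = _mm_insert_epi16(b, r[7], 3);
    return _mm_unpacklo_epi64(a, b);           // vpunpcklqdq: elements 0..3 from a, 4..7 from b
}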


"xorw %4, %%cx                          \n\t"
"xorw %4, %%dx                          \n\t"

These can and should be xor %[subkey], %%ecx. 32-bit operand-size is more efficient here, and works fine as long as your input doesn't have any bits set in the upper 16. Use a [subkey] "ri" (subkey) constraint to allow an immediate value when it's known at compile-time. (That's probably better, and reduces register pressure slightly, but at the expense of code-size since you use it many times.)
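For example, a stand-alone sketch of a named operand with the "ri" constraint (the xor_with_subkey helper is hypothetical, just to show the syntax):

// gcc substitutes an immediate for %[subkey] when the value is a compile-time
// constant, otherwise it picks a register.
static inline unsigned xor_with_subkey(unsigned idx, unsigned subkey)
{
    __asm__("xorl %[subkey], %[idx]" : [idx] "+r" (idx) : [subkey] "ri" (subkey));
    return idx;
}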

The rolw instructions have to stay 16-bit, though.

You could consider packing two or four values into an integer register (with movzwl s16(...), %%ecx / shl $16, %%ecx / mov s16(...), %cx / shl $16, %%rcx / ...), but then you'd have to emulate the rotates with shifts, OR, and masking (see the sketch below), and unpack again to reuse them as indices.
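In plain C, that packed-rotate emulation for two 16-bit values in one 32-bit register could look like this (rol7_packed2 is a hypothetical helper, not from the original code):

#include <stdint.h>

// Rotate both 16-bit halves of a packed 32-bit value left by 7, without letting
// bits cross the boundary between the two halves.
static inline uint32_t rol7_packed2(uint32_t packed)
{
    uint32_t hi = (packed << 7) & 0xFF80FF80u;   // bits that stay within their own half
    uint32_t lo = (packed >> 9) & 0x007F007Fu;   // bits that wrapped around the top of each half
    return hi | lo;
}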

It's too bad the integer stuff (the xor and rotate) comes between the two LUT lookups; otherwise you could do it in a vector before unpacking, roughly as sketched below.
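If the values were sitting in a vector at that point (e.g. after a gather for the first lookup), the whole xor+rotate step would only be a few 16-bit SIMD instructions; a sketch:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: xor all sixteen 16-bit lanes with subkey, then rotate each left by 7.
static inline __m256i xor_rol7_epi16(__m256i v, uint16_t subkey)
{
    v = _mm256_xor_si256(v, _mm256_set1_epi16((short)subkey));
    return _mm256_or_si256(_mm256_slli_epi16(v, 7),   // (x << 7) | (x >> 9) per 16-bit lane
                           _mm256_srli_epi16(v, 9));
}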


Your strategy for extracting 16b chunks of a vector looks pretty good. movq from xmm to a GP register runs on port 0 on Haswell/Skylake, and shr/ror run on port 0 / port 6. So you do compete for ports some, but storing the whole vector and reloading each element would eat more load-port throughput.

It might be worth trying a 256b store, but still getting the low 64b from a vmovq so the first 4 elements can get started with less latency; see the sketch below.
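A sketch of that hybrid (extract_elements and buf are made up for illustration): do the 256b store once, but also vmovq the low 64 bits straight from the register:

#include <immintrin.h>
#include <stdint.h>

// Store the whole vector for elements 4..15, but also grab the low 64 bits directly
// so the first 4 lookups don't have to wait for store-to-load forwarding.
static inline void extract_elements(__m256i in, uint16_t buf[16], uint64_t *first4)
{
    _mm256_storeu_si256((__m256i *)buf, in);
    *first4 = (uint64_t)_mm_cvtsi128_si64(_mm256_castsi256_si128(in));   // vmovq
}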


As for getting the wrong answer: use a debugger. Debuggers work very well for asm; see the end of the x86 tag wiki for some tips on using GDB.

Look at the compiler-generated code that interfaces between your asm and what the compiler is doing: maybe you got a constraint wrong.

Maybe you got mixed up with %0 or %1 or something. I'd definitely recommend using %[name] instead of operand numbers. See also the inline-assembly tag wiki for links to guides.


C version that avoids inline asm (but gcc wastes instructions on it).

You don't need inline-asm for this at all, unless your compiler is doing a bad job unpacking the vector to 16-bit elements, and not generating the code you want. https://gcc.gnu.org/wiki/DontUseInlineAsm

I put this up on Matt Godbolt's compiler explorer where you can see the asm output.

// This probably compiles to code like your inline asm
#include <x86intrin.h>
#include <stdint.h>

extern const uint16_t s16[];

__m256i LUT_elements(__m256i in)
{
    __m128i inl = _mm256_castsi256_si128(in);
    __m128i inh = _mm256_extractf128_si256(in, 1);

    unsigned subkey = 8;
    uint64_t low4 = _mm_cvtsi128_si64(inl);  // movq extracts the first 4 elements
    unsigned idx = (uint16_t)low4;
    low4 >>= 16;

    idx = s16[idx] ^ subkey;
    idx = __rolw(idx, 7);
    // cast to a 32-bit pointer to convince gcc to movd directly from memory
    // the strict-aliasing violation won't hurt since the table is const.

    __m128i outl = _mm_cvtsi32_si128(*(const uint32_t*)&s16[idx]);

    unsigned idx2 = (uint16_t)low4;
    idx2 = s16[idx2] ^ subkey;
    idx2 = __rolw(idx2, 7);
    outl = _mm_insert_epi16(outl, s16[idx2], 1);

    // ... do the rest of the elements

    __m128i outh = _mm_setzero_si128();  // dummy upper half
    return _mm256_inserti128_si256(_mm256_castsi128_si256(outl), outh, 1);
}

I had to pointer-cast to get a vmovd directly from the LUT into a vector for the first s16[idx]. Without that, gcc uses a movzx load into an integer reg and then a vmovd from there. That avoids any risk of a cache-line split or page-split from doing a 32-bit load, but that risk may be worth it for average throughput since this probably bottlenecks on front-end uop throughput.

Note the use of __rolw from x86intrin.h. gcc supports it, but clang doesn't. It compiles to a 16-bit rotate with no extra instructions.

Unfortunately gcc doesn't realize that the 16-bit rotate keeps the upper bits of the register zeroed, so it does a pointless movzwl %dx, %edx before using %rdx as an index. This is a problem even with gcc7.1 and 8-snapshot.

And BTW, gcc loads the s16 table address into a register, so it can use addressing modes like vmovd (%rcx,%rdx,2), %xmm0 instead of embedding the 4-byte address into every instruction.

Since the extra movzx is the only thing gcc does worse than you could do by hand, you might consider just making a rotate-by-7 function in inline asm that gcc thinks takes a 32 or 64-bit input register. Use something like this to get a "half"-sized (16-bit) rotate:

// pointer-width integers don't need to be re-extended
// but since gcc doesn't understand the asm, it thinks the whole 64-bit result may be non-zero
static inline
uintptr_t my_rolw(uintptr_t a, int count) {
    asm("rolw %b[count], %w[val]" : [val]"+r"(a) : [count]"ic"(count));
    return a;
}

However, even with that, gcc still wants to emit useless movzx or movl instructions. I got rid of some zero-extension by using wider types for idx, but there are still problems (source on the compiler explorer). Having subkey be a function arg instead of a compile-time constant helps, for some reason.

You might be able to get gcc to assume something is a zero-extended 16-bit value with:

if (x > 65535)
    __builtin_unreachable();

Then you could completely drop any inline asm, and just use __rolw.

But beware that icc will compile that to an actual check and then a jump beyond the end of the function. It should work for gcc, but I didn't test.
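Putting those two pieces together, one hypothetical shape of the asm-free version (gcc-only because of __rolw; assume_u16 and lut_round are made-up names):

#include <x86intrin.h>
#include <stdint.h>

extern const uint16_t s16[];

static inline unsigned assume_u16(unsigned x)
{
    if (x > 65535)
        __builtin_unreachable();     // promise gcc the value already fits in 16 bits
    return x;
}

// One round for a single element; subkey is assumed to fit in 16 bits.
static inline unsigned lut_round(unsigned idx, unsigned subkey)
{
    idx = assume_u16(s16[idx] ^ subkey);
    idx = assume_u16(__rolw(idx, 7));   // 16-bit rotate; result still fits in 16 bits
    return s16[idx];
}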

It's pretty reasonable to just write the whole thing in inline asm if it takes this much tweaking to get the compiler not to shoot itself in the foot, though.

Peter Cordes
  • Are you aware of any published benchmarks showing how much faster gather is on Skylake? Last results I've seen (maybe from Broadwell) indicated you were better off doing multiple loads and inserts instead. – Jason R Aug 19 '17 at 15:00
  • @JasonR: Agner Fog's instruction tables indicate that Skylake `vgatherdps ymm` is down to 4 fused-domain uops, and runs at one per 5c throughput. That's still slightly worse than 2 loads per cycle you could get with scalar loads if you can avoid front-end bottlenecks (and much lower front-end issue cost). On Broadwell this algo might be about break-even for gather vs. unpack to scalar to chain the LUT lookups. On Skylake it's almost certainly a win to chain gathers, since scalar bottlenecks on the front-end (because of all the extra ALU work for the rotates). – Peter Cordes Aug 19 '17 at 15:08
  • Thank you very much. How do you know so much about Intel instructions? – Bai Sep 03 '17 at 03:29
  • @Bai: Computer architecture (how CPUs are designed and work internally) has always interested me. A lot of what I know about performance is from reading Agner Fog's guides, and from looking at compiler output to see if it's doing a good job. Also from trying to use that information to optimize real code (either in asm or [by tweaking the C to get the compiler to make better code](https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466)). When I wonder about something, I look it up in Intel's PDF manuals! – Peter Cordes Sep 03 '17 at 03:35
  • @Bai: Also, I've learned a *lot* just from answering SO questions. – Peter Cordes Sep 03 '17 at 03:36
  • @Bai: I sometimes notice something weird while optimizing, and if I have time I do some further experiments to figure out what's going on. For example, this HSW/SKL partial-register behaviour wasn't documented anywhere (and Agner Fog's microarch pdf was wrong), so it took some work to write this Q&A: https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to. But it was fun. Since that's the sort of thing I do for fun, I end up knowing a lot about asm instructions :P – Peter Cordes Sep 03 '17 at 03:38
  • Thank you, I think you're great, I will try to reach you, thank you very much! – Bai Sep 04 '17 at 01:46
  • Sorry to bother you with another question: I can't run my code when I use the my_rolw function; the error message is 'operand type mismatch for "rol"'. My gcc version is 5.4.0. Thanks! – Bai Sep 04 '17 at 08:59
  • And I also can't use the `[subkey] "ri" (subkey)` constraint to allow an immediate value when it's known at compile time. – Bai Sep 04 '17 at 09:02
  • @Bai: Oops, I forgot to use a modifier to get `%cl` instead of `%ecx`. If you look at the compiler-output asm (https://godbolt.org/g/NxPDZh) after it substitutes operands into the asm template, it's easy to see the problem: `rolw %ecx, %ax`. (Clang uses its built-in assembler by default, so it errors even in asm-output mode.) – Peter Cordes Sep 05 '17 at 05:30
  • @Bai: Why can't you use `"ri"`? What error message do you get? If the value isn't a compile-time constant, it will just use a register. Or is the 16-bit immediate a problem (LCP decoding stalls on Intel CPUs)? – Peter Cordes Sep 05 '17 at 05:39
2

The inline assembler somewhat resembles the C code, so I would be tempted to assume that the two are meant to do the same thing.

This is primarily an opinion, but I would suggest using intrinsics instead of extended inline assembler. Intrinsics allow register allocation and variable optimization to be done by the compiler, and they are portable: in the absence of the target instruction set, each vector operation can be emulated by a function.

The next issue is that the inline assembly appears to handle the substitution step arr[i] = s16[arr[i]] for only two indices i. With AVX2 this should be done either with two gather operations per substitution (since a ymm register can hold only 8 uint32_t offsets into the lookup table), or, when one is available, the substitution stage should be performed by an analytical function that can be computed in parallel.

Using intrinsics, the operation could look something like this.

#include <immintrin.h>
#include <stdint.h>

extern const uint16_t s16[];   // 65536-entry lookup table

__m256i function(uint16_t *input_array, uint16_t subkey) {
  // Gather reads 32 bits per index; note that at index 65535 it reads 2 bytes past the table.
  const int *LUT = (const int *)s16;
  __m256i mask16 = _mm256_set1_epi32(0xffff);
  __m256i vsubkey = _mm256_set1_epi32(subkey);
  __m256i array = _mm256_loadu_si256((__m256i*)input_array);
  // Split the 16 uint16_t elements into two vectors of 8 zero-extended 32-bit indices.
  __m256i even_sequence = _mm256_and_si256(array, mask16);
  __m256i odd_sequence = _mm256_srli_epi32(array, 16);
  // First substitution: scale 2 because the table entries are 2 bytes; each gather also
  // brings in the neighbouring entry in the high half, so mask it off.
  even_sequence = _mm256_and_si256(_mm256_i32gather_epi32(LUT, even_sequence, 2), mask16);
  odd_sequence = _mm256_and_si256(_mm256_i32gather_epi32(LUT, odd_sequence, 2), mask16);
  // xor with the subkey (the upper 16 bits of each lane stay zero)
  even_sequence = _mm256_xor_si256(even_sequence, vsubkey);
  odd_sequence = _mm256_xor_si256(odd_sequence, vsubkey);
  // rotate left by 7 within each 16-bit value
  __m256i hi = _mm256_slli_epi16(even_sequence, 7);
  __m256i lo = _mm256_srli_epi16(even_sequence, 9);
  even_sequence = _mm256_or_si256(hi, lo);
  // same for odd
  hi = _mm256_slli_epi16(odd_sequence, 7);
  lo = _mm256_srli_epi16(odd_sequence, 9);
  odd_sequence = _mm256_or_si256(hi, lo);
  // Another substitution
  even_sequence = _mm256_and_si256(_mm256_i32gather_epi32(LUT, even_sequence, 2), mask16);
  odd_sequence = _mm256_i32gather_epi32(LUT, odd_sequence, 2);
  // recombine -- shift odd by 16 and OR (the shift also discards odd's high-half garbage)
  odd_sequence = _mm256_slli_epi32(odd_sequence, 16);
  return _mm256_or_si256(even_sequence, odd_sequence);
}

With optimizations, a decent compiler will generate roughly one assembly instruction per statement; without optimizations, all the intermediate variables are spilled to the stack, which makes them easy to debug.
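For reference, a call site for the function above might look like this (the round16 wrapper is just an illustration):

// Apply one round to a 16-element block in place, using function() defined above.
void round16(uint16_t arr[16], uint16_t subkey)
{
  _mm256_storeu_si256((__m256i *)arr, function(arr, subkey));
}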

Aki Suihkonen
  • Thank you very much; after reading your answer I learned about the _mm256_i32gather_epi32 intrinsic. But my experimental environment is Haswell, so maybe, as Peter Cordes said, gather is only fast on Skylake. – Bai Aug 15 '17 at 07:23
  • "Fast" can be a misnomer on any architecture... Do you have any idea whether the substitution function is analytical (e.g. operations on GF(2^8)), which could benefit from the clmul instruction? – Aki Suihkonen Aug 15 '17 at 10:05
  • Thank you very much; my English is not very good. My task is to optimize the Kasumi algorithm. I use an array named s16 which has 65536 (2^16) elements, and it is uniformly distributed, so the array is larger than a cache line and I can't reach the performance requirement. The code above takes about 90 percent of the whole algorithm's time. Thanks! – Bai Aug 15 '17 at 14:54