2

I need to load values from uint8 array into 128 NEON register. There is a similar question. But there were no good answers.

My solution is:

uint8_t arr[4] = {1,2,3,4};

//load 4 of 8-bit vals into 64 bit reg
uint8x8_t _vld1_u8 = vld1_u8(arr);

//convert to 16-bit and move to 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);

//get low 64 bit and move them to 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);

//convert to 32-bit and move to 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);

This works fine, but it seems to me that this approach is not the fastest. Maybe there is a better and faster way to load 8bit data into 128 reg as 32bit ?

Edit:

Thanks to @FrankH. I've came up with the second version using some hack:

uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z;
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint8x16_t*)&z;

It boils down to this assembly (when compiler optimisations are on):

vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr   q9, q4, q4
vzip.8 q8, q9

So there are no redundant stores and it's pretty fast. But still it is about x1.5 slower then the first solution.

Community
  • 1
  • 1
Max
  • 16,679
  • 4
  • 44
  • 57

1 Answers1

1

You can do a "double zip" with zeroes:

uint16x4_t zero = 0;

uint32x4_t ld32x4 =
    vreinterpretq_u32_u16(
        vzipq_u8(
            vzip_u8(
                vld1_u8(arr),
                vreinterpret_u8_u16(zero)
            ),
            zero
        )
    );

Since the vreinterpretq_*() are no-ops, this boils down to three instructions. Don't have a crosscompiler around at the moment, can't validate that :(

Edit: Don't get me wrong there ... while vreinterpretq_*() isn't resulting in a Neon instruction, it's not a no-op; that's because it stops the compiler from doing the type of funky things you'd see if you'd instead use widerVal.val[0]. All it tells the compiler is, like:

"you've got a uint8x16x2_t but I want to use only half of that as a uint8x16_t, give me half the registers."

Or:

"you have a uint8x16x2_t but I want to use those regs as a uint32x4_t instead."

I.e. it tells the compilers to alias sets of neon registers - preventing stores/loads to/from the stack as you'd get if you do the explicit sub-set access through the .val[...] syntax.

In a way, the .val[...] syntax "is a hack" but the better method, the use of vreinterpretq_*(), "looks like a hack". Not using it results in more instructions and slower/inferior code.

FrankH.
  • 17,675
  • 3
  • 44
  • 63
  • But vzip_u8 returns uint8x8x2_t while vzipq_u8 requires uint8x16_t. – Max Jul 23 '13 at 13:20
  • Tried this one: ld32x4 = vzipq_u8(vzipq_u8(vld1q_u8(arr), q_zero).val[0], q_zero).val[0]; But it's about 30% slower then my variant. Thanks anyway! – Max Jul 24 '13 at 09:46
  • NO - don't do the `.val[...]` thing. That'll force store/reload. Use `vreinterpretq_*()` - that "converts" a `uint8x8x2_t` into a `uint8x16_t` and a `uint8x16x2_t` into a `uint8x32_t` etc ... depending on the types/sizes, all it does is tell the compiler to interpret the set of two/four neon regs in a different way. – FrankH. Jul 24 '13 at 13:39
  • I've tried to convert this: uint8x8x2_t in_v; uint8x16_t out_v = vreinterpretq_u8_u16(in_v); There is a compiler error: 'No matching function to call'. – Max Jul 24 '13 at 15:09
  • Ah ... I see; problem: http://stackoverflow.com/questions/13711407/data-type-compatibility-with-neon-intrinsics – FrankH. Jul 24 '13 at 17:18
  • Something like `uint8x8x2_t a; uint8x16_t b; __asm__("" : "=q"(b) : "0"(a));` _might_ do the conversion trick that `vreinterpretq_*()` doesn't like. Don't have a crosscompiler at hand to test :( – FrankH. Jul 24 '13 at 19:13