
I want to load two 64-bit integers into a single 128-bit NEON register, then use some right-shift instruction to, essentially, concatenate the two. I know that (u|s)shr and (u|s)shl exist, but according to all the text I've read, they only operate on elements of the register that are 64 bits or smaller.

For context, I'm trying to improve a bigint library by converting parts of it to assembly. Specifically, I want to improve the bigint_shift_right method.

  • So you want to shift with bit granularity across a whole `q` register, not just bytes? (As a special case of that, when the bit-shift count is a multiple of 8 it's just a byte shift. So you can use 16-byte SIMD byte shifts, or maybe do unaligned loads like memmove if you're copying parts of something larger than 16 bytes.) – Peter Cordes Apr 12 '22 at 02:40
  • I don't think you can shift the whole register in a single instruction; as you have read, the machine simply doesn't have instructions that cross lanes like that. Of course you can emulate it with an appropriate sequence of shifts on 64-bit elements. – Nate Eldredge Apr 12 '22 at 05:07
  • Is this ARM32 or ARM64? In ARM64, I think you could do pretty well on the scalar side with `extr`, which extracts an arbitrary 64-bit field from the 128-bit concatenation of two registers. On my Cortex-A72 it's 3 uops, throughput 1 per cycle, so if you unroll a bit, you may be able to keep it memory-bound. Unfortunately the shift count is an immediate, so you'll have to generate 63 separate copies of your function (well, 56 I guess, since as Peter says the multiples of 8 should be done by byte shuffling or unaligned accesses) but that still doesn't seem too awful. – Nate Eldredge Apr 12 '22 at 05:12
  • @NateEldredge ARM64. I'll see if `extr` works or not, thanks. – Ethan Parsons Apr 12 '22 at 19:53
  • Unsigned or signed? by a variable n or an immediate value? – Jake 'Alquimista' LEE Apr 13 '22 at 16:04
  • @Jake'Alquimista'LEE Unsigned, by a variable. @NateEldredge Unfortunately, `ext` doesn't work because it can only handle byte-sized chunks, whereas I need bit precision. – Ethan Parsons Apr 15 '22 at 20:00
  • @EthanParsons: I said `extr`, not `ext`. The `extr` instruction does extract by bits. That said, I'm not sure that it will be substantially better than the naive shift/or implementation. – Nate Eldredge Apr 16 '22 at 04:39
  • You could shift two 128-bit values in parallel with ARM64 NEON, but the performance gain would be so small that it's not worth the increase in power consumption, even for a very large array of 128-bit values. You don't even need assembly for that. In short, pointless. Stick to standard C. – Jake 'Alquimista' LEE Apr 16 '22 at 11:01

1 Answer


I faced the same problem recently, but with a fixed shift amount -- let's fix it at 1 for concreteness. I will illustrate with the least significant 2 x 128 bits (call them v0 and v1, least and most significant, respectively), and you can take it from there. We have the following bit indices, with a | separator between 64-bit lanes:

v1      = 255,254,...,193,192|191,190,...,129,128
v0      = 127,126,..., 65, 64| 63, 62,...,  1,  0

Note that I'm using a left-to-right, MSB-to-LSB convention.

Start by computing, outside of the loop, v0_shr = vshrq_n_u64(v0, 1). Denoting by --- an "empty space" (a shifted-in zero bit), we have:

v0_shr  = ---,127,..., 66, 65|---, 63,...,  2,  1

The rest is done in a loop -- I'll consider only the first iteration. Compute v1_shr1 = vshrq_n_u64(v1, 1) and v10_ext = vextq_u64(v0, v1, 1). We have:

v1_shr1  = ---,255,...,194,193|---,191,...,130,129
v10_ext  = 191,190,...,129,128|127,126,..., 65, 64

Now compute v0_res = vsliq_n_u64(v0_shr, v10_ext, 63). For ease of visualization, I will recall the value of v0_shr, then show the intermediate bit pattern produced by vsliq_n_u64 (which I'll denote by sli_int), and finally show the result of combining both, which produces the desired result:

v0_shr  = ---,127,..., 66, 65|---, 63,...,  2,  1
sli_int = 128,---,...,---,---| 64,---,...,---,---
v0_res  = 128,127,..., 66, 65| 64, 63,...,  2,  1

A complication, as mentioned in the comments, is that you need to shift by a variable. The above solution won't work directly since the SLI and EXT instructions are encoded with immediates.

First of all, deal with any multiple of 128 in your shift amount by simply starting from a different address in your original array, so that you only need to consider the shift amount modulo 128. (Indeed, since unaligned accesses are supported, the same trick could handle any multiple of 8, leaving only 8 cases, but unaligned accesses could limit performance.)

What to do next depends on whether you want to minimize execution time or code size. Note that the former may be pointless if you're memory bound.

To minimize execution time, you'll need copies of your routine with different immediate shift amounts. 128 copies would certainly work, but since there seem to be no penalties for 64-bit-aligned accesses on the CPUs considered, you could simplify that to 64 copies.

To minimize code size, you could preload the (negated) shift amount modulo 64 into a NEON register and use vshlq_u64 in place of vshrq_n_u64 (VSHL shifts right when the per-lane count is negative). You'd also have to replace vsliq_n_u64 with a vshlq_u64/veorq_u64 sequence (which requires preloading 64 minus the shift amount into another NEON register), costing an extra instruction per loop iteration.

swineone