I faced the same problem recently, but with a constant shift amount -- let's fix it at 1 for concreteness. I will illustrate with the least significant 2 x 128 bits (call them v0 and v1, least and most significant, respectively), and you can take it from there. We have the following bit indices, with a | separator between 64-bit lanes:
v1 = 255,254,...,193,192|191,190,...,129,128
v0 = 127,126,..., 65, 64| 63, 62,..., 1, 0
Note that I'm using a left-to-right, MSB-to-LSB convention.
Start by computing, outside of the loop, v0_shr = vshrq_n_u64(v0, 1). Denoting by --- an "empty space" (a shifted-in zero bit), we have:
v0_shr = ---,127,..., 66, 65|---, 63,..., 2, 1
The rest is done in a loop -- I'll consider only the first iteration. Compute v1_shr = vshrq_n_u64(v1, 1) (this will serve as the next iteration's v0_shr) and v10_ext = vextq_u64(v0, v1, 1). We have:
v1_shr  = ---,255,...,194,193|---,191,...,130,129
v10_ext = 191,190,...,129,128|127,126,..., 65, 64
Now compute v0_res = vsliq_n_u64(v0_shr, v10_ext, 63). For ease of visualization, I will recall the value of v0_shr, then show the intermediate bit pattern produced by vsliq_n_u64 (which I'll denote by sli_int), and finally the combination of the two, which is the desired result:
v0_shr = ---,127,..., 66, 65|---, 63,..., 2, 1
sli_int = 128,---,...,---,---| 64,---,...,---,---
v0_res = 128,127,..., 66, 65| 64, 63,..., 2, 1
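Put together, the walkthrough above could look something like this minimal sketch -- it assumes the number is stored as an array of uint64_t, least significant word first, is shifted in place, and that a zero block sits above the most significant one; the function name is mine:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Minimal sketch (names are mine): shift an n_blocks x 128-bit number,
       stored least significant 64-bit word first, right by 1 bit, in place. */
    static void shift_right_1(uint64_t *buf, size_t n_blocks)
    {
        uint64x2_t v0     = vld1q_u64(buf);        /* least significant 128 bits */
        uint64x2_t v0_shr = vshrq_n_u64(v0, 1);    /* per-lane shift, hole at bit 63 */

        for (size_t i = 1; i < n_blocks; i++) {
            uint64x2_t v1      = vld1q_u64(buf + 2 * i);
            uint64x2_t v1_shr  = vshrq_n_u64(v1, 1);
            uint64x2_t v10_ext = vextq_u64(v0, v1, 1);   /* the two straddling words */
            /* plug the holes of v0_shr with the low bits of the words above */
            vst1q_u64(buf + 2 * (i - 1), vsliq_n_u64(v0_shr, v10_ext, 63));
            v0     = v1;        /* v1_shr becomes next iteration's v0_shr */
            v0_shr = v1_shr;
        }

        /* most significant block: the block above it is zero */
        uint64x2_t top_ext = vextq_u64(v0, vdupq_n_u64(0), 1);
        vst1q_u64(buf + 2 * (n_blocks - 1), vsliq_n_u64(v0_shr, top_ext, 63));
    }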
A complication, as mentioned in the comments, is that you need to shift by a variable amount. The above solution won't work directly, since the SHR, SLI and EXT instructions are all encoded with immediates.
First of all, deal with any multiple of 128 in your shift amount by simply starting from a different address in your original array, so that you only have to handle the shift amount modulo 128 (indeed, since unaligned accesses are supported, the same trick could deal with any multiple of 8, leaving only 8 cases, but unaligned accesses could limit performance).
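For instance, something along these lines (a sketch; the names and the exact interface are mine):

    #include <stdint.h>

    /* Hypothetical setup: src points at the number, least significant 64-bit
       word first, and shift is the total right shift in bits. Whole 128-bit
       blocks are handled purely by offsetting the read pointer; they just
       shorten the result by the same number of blocks. */
    static const uint64_t *reduce_shift_128(const uint64_t *src, unsigned shift,
                                            unsigned *rem_bits)
    {
        const uint64_t *start = src + 2u * (shift / 128u);
        *rem_bits = shift % 128u;   /* 0..127, left for the NEON loop */
        return start;
    }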
What to do next depends on whether you want to minimize execution time or code size. Note that the former may be pointless if you're memory bound.
To minimize execution time, you'll need copies of your routine with different immediate shift amounts. 128 copies would certainly work, but from the previously-linked SO question, it seems there are no penalties for 64-bit aligned accesses on the CPUs considered, so you can handle multiples of 64 through the starting address as well and get away with 64 copies.
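One way to get those copies without writing them by hand is a macro parameterized on the immediate plus a small dispatcher -- a sketch under the same storage assumptions as above, with only two of the 63 instantiations shown and all names mine:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* One copy per immediate shift S (1..63), generated from a macro so the
       vshrq_n/vsliq_n immediates stay compile-time constants. */
    #define DEFINE_SHR(S)                                                      \
    static void shr_imm_##S(uint64_t *buf, size_t n_blocks)                    \
    {                                                                          \
        uint64x2_t v0     = vld1q_u64(buf);                                    \
        uint64x2_t v0_shr = vshrq_n_u64(v0, S);                                \
        for (size_t i = 1; i < n_blocks; i++) {                                \
            uint64x2_t v1      = vld1q_u64(buf + 2 * i);                       \
            uint64x2_t v1_shr  = vshrq_n_u64(v1, S);                           \
            uint64x2_t v10_ext = vextq_u64(v0, v1, 1);                         \
            vst1q_u64(buf + 2 * (i - 1),                                       \
                      vsliq_n_u64(v0_shr, v10_ext, 64 - (S)));                 \
            v0 = v1;                                                           \
            v0_shr = v1_shr;                                                   \
        }                                                                      \
        vst1q_u64(buf + 2 * (n_blocks - 1),                                    \
                  vsliq_n_u64(v0_shr, vextq_u64(v0, vdupq_n_u64(0), 1),        \
                              64 - (S)));                                      \
    }

    DEFINE_SHR(1)
    DEFINE_SHR(2)
    /* ... DEFINE_SHR(3) through DEFINE_SHR(63) ... */

    static void shr_dispatch(uint64_t *buf, size_t n_blocks, unsigned s)
    {
        switch (s) {          /* s = shift amount modulo 64, assumed nonzero */
        case 1: shr_imm_1(buf, n_blocks); break;
        case 2: shr_imm_2(buf, n_blocks); break;
        /* ... cases 3 through 63 ... */
        }
    }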
To minimize code size, you could preload the (negated) shift amount modulo 64 in a NEON register and use vshlq_u64 in place of vshrq_n_u64 (a negative per-lane shift count makes vshlq_u64 shift right). You'd also have to replace vsliq_n_u64 with a vshlq_u64/veorq_u64 sequence (this requires preloading 64 - shift amount in another NEON register), which costs an extra instruction per loop iteration.
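That variable-shift variant could look something like the sketch below, again under the same storage assumptions, for a residual shift s in 1..63 (names are mine); the vshlq_u64/veorq_u64 pair inside the loop stands in for the vsliq_n_u64 of the immediate version:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Variable-shift sketch (names are mine): one routine handles any
       residual shift s in 1..63; both per-lane shift counts are preloaded
       outside the loop. */
    static void shr_var(uint64_t *buf, size_t n_blocks, unsigned s)
    {
        int64x2_t neg_s  = vdupq_n_s64(-(int64_t)s);       /* vshlq by -s == shift right by s */
        int64x2_t comp_s = vdupq_n_s64((int64_t)(64 - s)); /* left shift for the carried-in bits */

        uint64x2_t v0     = vld1q_u64(buf);
        uint64x2_t v0_shr = vshlq_u64(v0, neg_s);

        for (size_t i = 1; i < n_blocks; i++) {
            uint64x2_t v1      = vld1q_u64(buf + 2 * i);
            uint64x2_t v1_shr  = vshlq_u64(v1, neg_s);
            uint64x2_t v10_ext = vextq_u64(v0, v1, 1);
            /* vshlq + veorq replaces vsliq_n: the two operands have no set
               bits in common, so XOR (or OR) merges them */
            uint64x2_t carry   = vshlq_u64(v10_ext, comp_s);
            vst1q_u64(buf + 2 * (i - 1), veorq_u64(v0_shr, carry));
            v0     = v1;
            v0_shr = v1_shr;
        }

        uint64x2_t top = vshlq_u64(vextq_u64(v0, vdupq_n_u64(0), 1), comp_s);
        vst1q_u64(buf + 2 * (n_blocks - 1), veorq_u64(v0_shr, top));
    }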