Problem description
I'm trying to write C code that unpacks an array A of uint32_t elements into an array B of uint32_t elements. Each element of A is unpacked into two consecutive elements of B, so that B[2*i] contains the low 16 bits of A[i] and B[2*i + 1] contains the high 16 bits of A[i] shifted right, i.e.,
    B[2*i]   = A[i] & 0xFFFFul;
    B[2*i+1] = A[i] >> 16u;
Note that the arrays are 4-byte aligned and have variable length, but A always contains a multiple of 4 uint32_t elements and the size is <= 32. B has sufficient space for the unpacked data, and the target is an ARM Cortex-M3.
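As a baseline for comparison, the mapping above can be written as a plain C loop. This is only a minimal reference sketch: the name `unpack_ref` and the element-count parameter `n` are hypothetical and not part of my real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference implementation (name and signature are assumptions).
 * For each of the n words in a, stores the low 16 bits in b[2*i] and the
 * high 16 bits, shifted down, in b[2*i + 1]. b must hold 2*n words. */
static void unpack_ref(const uint32_t *a, uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        b[2 * i]     = a[i] & 0xFFFFul;  /* low half  */
        b[2 * i + 1] = a[i] >> 16u;      /* high half */
    }
}
```

Any optimized version should produce the same contents of B as this loop for the same A.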
Current bad solution in GCC inline asm
As GCC is not good at optimizing this unpacking, I wrote unrolled C with inline asm to optimize it for speed while keeping code size and register usage acceptable. The unrolled code looks like this:
    static void unpack(uint32_t * src, uint32_t * dst, uint8_t nmb8byteBlocks)
    {
        switch(nmb8byteBlocks) {
            case 8:
                UNPACK(src, dst)
            case 7:
                UNPACK(src, dst)
            ...
            case 1:
                UNPACK(src, dst)
            default:;
        }
    }
where
    #define UNPACK(src, dst) \
        asm volatile ( \
            "ldm %0!, {r2, r4} \n\t" \
            "lsrs r3, r2, #16 \n\t" \
            "lsrs r5, r4, #16 \n\t" \
            "stm %1!, {r2-r5} \n\t" \
            : \
            : "r" (src), "r" (dst) \
            : "r2", "r3", "r4", "r5" \
        );
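It works until GCC's optimizer decides to inline the function (which I want) and reuses the registers holding src and dst in the code that follows. Because the ldm %0! and stm %1! instructions write the updated addresses back to their base registers, src and dst contain different addresses when leaving the switch statement, while GCC still assumes they are unchanged since they are declared as input-only operands.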
How to solve it?
I do not know how to inform GCC that the registers used for src and dst are invalid after the last UNPACK macro in the last case 1:.
I tried to pass them as output operands in all macros or only in the last one ("=r" (mem), "=r" (pma)), or to include them somehow in the inline asm clobbers, but this only made the register handling worse and produced bad code again.
The only solution I have found is to disable function inlining (__attribute__ ((noinline))), but in that case I lose the advantage that GCC can keep just the needed number of macros and inline them when nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)
Is there any way to solve this in inline assembly?
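For illustration, here is a minimal sketch of the read-write operand form ("+r"), which tells GCC that the asm both reads and updates the pointer registers. This differs from the write-only "=r" form mentioned above. The names `UNPACK_RW` and `unpack_rw` are hypothetical. The "cc" and "memory" clobbers, the two `uxth` instructions (added so the stored low halves match `A[i] & 0xFFFFul`), and the plain C fallback for non-Thumb-2 builds are assumptions of this sketch, not part of my original macro. It uses a loop instead of the unrolled switch to stay short, and I have not verified the code GCC generates for it on the Cortex-M3.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__thumb2__)
/* Thumb-2 path: "+r" marks src and dst as read AND written by the asm, so
 * GCC knows both pointers hold advanced addresses afterwards. "cc" is listed
 * because lsrs updates the flags; "memory" because the asm loads and stores. */
#define UNPACK_RW(src, dst)                          \
    __asm__ volatile (                               \
        "ldm  %0!, {r2, r4} \n\t"                    \
        "lsrs r3, r2, #16   \n\t"                    \
        "lsrs r5, r4, #16   \n\t"                    \
        "uxth r2, r2        \n\t"                    \
        "uxth r4, r4        \n\t"                    \
        "stm  %1!, {r2-r5}  \n\t"                    \
        : "+r" (src), "+r" (dst)                     \
        :                                            \
        : "r2", "r3", "r4", "r5", "cc", "memory"     \
    )
#else
/* Portable fallback (assumption, for host builds): same effect in plain C,
 * including advancing both pointers by one 8-byte source block. */
#define UNPACK_RW(src, dst)                          \
    do {                                             \
        (dst)[0] = (src)[0] & 0xFFFFul;              \
        (dst)[1] = (src)[0] >> 16u;                  \
        (dst)[2] = (src)[1] & 0xFFFFul;              \
        (dst)[3] = (src)[1] >> 16u;                  \
        (src) += 2;                                  \
        (dst) += 4;                                  \
    } while (0)
#endif

/* Hypothetical wrapper: unpacks `blocks` 8-byte blocks from src to dst. */
static inline void unpack_rw(const uint32_t *src, uint32_t *dst, uint8_t blocks)
{
    for (uint8_t i = 0; i < blocks; ++i) {
        UNPACK_RW(src, dst);
    }
}
```

If the upper half of the even destination words is ignored by whatever consumes B, the two `uxth` lines could be dropped to match the original instruction sequence.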