How to mark as clobbered input operands (C register variables) in extended GCC inline assembly?

Question

Problem description

I'm trying to design C code that unpacks an array A of uint32_t elements into an array B of uint32_t elements, where each element of A is unpacked into two consecutive elements of B, so that B[2*i] contains the low 16 bits of A[i] and B[2*i + 1] contains the high 16 bits of A[i] shifted right, i.e.,

B[2*i] = A[i] & 0xFFFFul;
B[2*i+1] = A[i] >> 16u;
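
For reference, the two assignments above can be wrapped in a plain C loop. This is only a minimal sketch of the intended behavior, not the optimized version; the function name unpack_ref and the element count n are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version: unpacks n words from a into 2*n words of b.
 * b must have room for 2*n elements. */
void unpack_ref(const uint32_t *a, uint32_t *b, size_t n)
{
    assert(a && b);
    for (size_t i = 0; i < n; i++) {
        b[2 * i]     = a[i] & 0xFFFFul; /* low 16 bits  */
        b[2 * i + 1] = a[i] >> 16u;     /* high 16 bits */
    }
}
```

A loop like this is handy as a baseline to compare the output of any hand-optimized variant against.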

Note that the arrays are aligned to 4 and have variable length, but A always contains a multiple of 4 uint32_t elements and its size is <= 32. B has sufficient space for the unpacked data, and the target is an ARM Cortex-M3.

Current bad solution in GCC inline asm

As GCC does not optimize this unpacking well, I wrote unrolled C with inline asm to optimize it for speed with acceptable code size and register usage. The unrolled code looks like this:

static void unpack(uint32_t * src, uint32_t * dst, uint8_t nmb8byteBlocks)
{
    switch(nmb8byteBlocks) {
        case 8:
            UNPACK(src, dst)
        case 7:
            UNPACK(src, dst)
        ...
        case 1:
            UNPACK(src, dst)
        default:;
    }
}

where

#define UNPACK(src, dst) \
    asm volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : \
        : "r" (src), "r" (dst) \
        : "r2", "r3", "r4", "r5" \
    );

It works until GCC's optimizer decides to inline the function (a wanted property) and reuse the register variables src and dst in the code that follows. Clearly, because of the ldm %0! and stm %1! instructions, src and dst contain different addresses when leaving the switch statement.

How to solve it?

I do not know how to inform GCC that the registers used for src and dst are invalid after the last UNPACK macro in the last case 1:.

I tried passing them as output operands in all macros or only in the last one ("=r" (mem), "=r" (pma)), or somehow including them in the inline asm clobbers (how?), but that only made the register handling worse and produced bad code again.

The only solution I found is to disable function inlining (__attribute__ ((noinline))), but then I lose the advantage that GCC can cut the proper number of macros and inline them if nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)

Is there any way to solve this in inline assembly?

Note that you're also missing a `"memory"` clobber; you haven't told GCC about the pointed-to memory also being an input or output, so you need both `volatile` and `"memory"` for this to be safe. [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) — Peter Cordes, Nov 07 '20 at 03:00
If you have NEON, I think it can do this shuffle more efficiently, using 64-bit or 128-bit loads and 128-bit stores. — Peter Cordes, Nov 07 '20 at 04:04

score 2 · Accepted Answer · edited Nov 08 '20 at 03:18

I think you are looking for the + constraint modifier, which means "this operand is both read and written". (See the "Modifiers" section of GCC's inline-assembly documentation.)

You also need to tell GCC that this asm reads and writes memory; the easiest way to do that is by adding "memory" to the clobber list. The lsrs instructions also clobber the condition codes, so a "cc" clobber is necessary as well. Try this:

#define UNPACK(src, dst) \
    asm volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : "+r" (src), "+r" (dst) \
        : /* no input-only operands */ \
        : "r2", "r3", "r4", "r5", "memory", "cc" \
    );
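
To see the effect in context, here is a hedged sketch of the macro inside a small unrolled function. The name unpack_demo, the cut-down limit of 4 blocks, and the portable #else branch are my own assumptions so that the example also builds on a non-ARM host; only the Thumb-2 branch is the macro from above. Note that the asm stores the even output words unmasked (r2 and r4 are written as loaded), so only their low 16 bits are meaningful on that path.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__thumb2__)
/* Thumb-2 path: "+r" tells GCC that both pointer registers are modified. */
#define UNPACK(src, dst) \
    __asm__ volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : "+r" (src), "+r" (dst) \
        : /* no input-only operands */ \
        : "r2", "r3", "r4", "r5", "memory", "cc" \
    );
#else
/* Hypothetical portable stand-in: same pointer updates, written in C. */
#define UNPACK(src, dst) \
    do { \
        (dst)[0] = (src)[0] & 0xFFFFu; \
        (dst)[1] = (src)[0] >> 16; \
        (dst)[2] = (src)[1] & 0xFFFFu; \
        (dst)[3] = (src)[1] >> 16; \
        (src) += 2; \
        (dst) += 4; \
    } while (0);
#endif

/* Shortened to at most 4 blocks for brevity (an assumption of this sketch). */
static void unpack_demo(uint32_t *src, uint32_t *dst, uint8_t nmb8byteBlocks)
{
    assert(nmb8byteBlocks <= 4);
    switch (nmb8byteBlocks) {
        case 4:
            UNPACK(src, dst)
            /* fall through */
        case 3:
            UNPACK(src, dst)
            /* fall through */
        case 2:
            UNPACK(src, dst)
            /* fall through */
        case 1:
            UNPACK(src, dst)
            /* fall through */
        default:;
    }
}
```

Because src and dst are by-value parameters, the asm only ever advances the function's own copies. With the "+r" operands GCC knows those copies changed, so after inlining it should keep the caller's original pointers in separate registers instead of reusing the advanced ones.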

(Micro-optimization: since you don't use the condition codes from the shifts, it's better to use lsr instead of lsrs. It also makes the code easier to read months later; future you won't be scratching your head wondering if there's some reason why the condition codes are actually needed here. EDIT: I've been reminded that lsrs has a more compact encoding than lsr in Thumb format, which is enough of a reason to use it even though the condition codes aren't needed.)

(I would like to say that you'd get better register allocator behavior if you let GCC pick the scratch registers, but I don't know how to tell it to pick scratch registers in a particular numeric order as required by ldm and stm, or how to tell it to use only the registers accessible to 2-byte Thumb instructions.)

(It is possible to specify exactly what memory is read and written with "m"-type input and output operands, but it's complicated and may not improve things much. If you discover that this code works but causes a bunch of unrelated stuff to get reloaded from memory into registers unnecessarily, consult How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

(You may get better code generation for what unpack is inlined into, if you change its function signature to

static void unpack(const uint32_t *restrict src,
                   uint32_t *restrict dst,
                   unsigned int nmb8byteBlocks)

I would also experiment with adding if (nmb8byteBlocks > 8) __builtin_trap(); as the first line of the function.)
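
Put together, the suggested signature and range check might look like the sketch below. The name unpack_checked is hypothetical, and the body is a plain C stand-in for the UNPACK sequence so the example stays portable; __builtin_trap() is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: restrict-qualified signature plus an up-front range check. */
static void unpack_checked(const uint32_t *restrict src,
                           uint32_t *restrict dst,
                           unsigned int nmb8byteBlocks)
{
    assert(src && dst);

    /* Abort on an out-of-range count; this also tells the optimizer
     * that the count is at most 8 from here on. */
    if (nmb8byteBlocks > 8)
        __builtin_trap();

    /* Plain C stand-in for the unrolled UNPACK macros:
     * each 8-byte block holds two source words. */
    for (unsigned int i = 0; i < 2u * nmb8byteBlocks; i++) {
        dst[2u * i]      = src[i] & 0xFFFFu;
        dst[2u * i + 1u] = src[i] >> 16;
    }
}
```

Whether the restrict qualifiers and the trap actually improve the generated code depends on the caller and the compiler version, so it is worth comparing the disassembly with and without them.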

`lsrs` is a 2-byte thumb instruction; leaving flags untouched usually requires 4-byte Thumb2 instructions, including `lsr`. — Peter Cordes, Nov 07 '20 at 04:06
@PeterCordes Oh, right. It didn't occur to me to worry about Thumb. — zwol, Nov 07 '20 at 15:08

score 0 · Answer 2 · answered Nov 06 '20 at 17:41

Many thanks zwol, this is exactly what I was looking for but couldn't find in the GCC inline assembly pages. It solved the problem perfectly: GCC now makes a copy of src and dst in different registers and uses them correctly after the last UNPACK macro. Two remarks:

I use lsrs because it compiles to the 2-byte native Cortex-M3 lsrs. If I use the flag-preserving lsr version, it compiles to the 4-byte mov.w r3, r2, lsr #16: the 16-bit Thumb-2 lsr sets the flags ('s') by default, and without the 's' the 32-bit Thumb-2 encoding must be used (I have to check it). Anyway, I should add "cc" to the clobbers in this case.
In the code above, I removed the nmb8byteBlocks value range check for clarity. But of course, your last sentence is a good point, and not only for C programmers.

This looks like a reply that should be a comment on zwol's answer, not a separate answer. — Peter Cordes, Nov 07 '20 at 04:07
