
I've been playing with the example from this presentation (slide 41).

As far as I can tell, it performs alpha blending.

MOVQ mm0, alpha      // 4 16-b zero-padding α
MOVD mm1, A          // move 4 pixels of image A
MOVD mm2, B          // move 4 pixels of image B
PXOR mm3, mm3        // clear mm3 to all zeroes
// unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3   // Because B - A could be
PUNPCKLBW mm2, mm3   // negative, need 16 bits
PSUBW mm1, mm2       // (B-A)
PMULHW mm1, mm0      // (B-A)*fade/256
PADDW mm1, mm2       // (B-A)*fade + B
// pack four words back to four bytes
PACKUSWB mm1, mm3

I want to rewrite it in C with inline assembler.

For now, I have something like this:

void fade_mmx(SDL_Surface* im1,SDL_Surface* im2,Uint8 alpha, SDL_Surface* imOut)
{
    int pixelsCount = imOut->w * im1->h;
    
    Uint32 *A = (Uint32*) im1->pixels;
    Uint32 *B = (Uint32*) im2->pixels;
    Uint32 *out = (Uint32*) imOut->pixels;
    Uint32 *end = out + pixelsCount;

    __asm__ __volatile__ (
            "\n\t movd  (%0), %%mm0"
            "\n\t movd  (%1), %%mm1"
            "\n\t movd  (%2), %%mm2"
            "\n\t pxor       %%mm3, %%mm3"
            "\n\t punpcklbw  %%mm3, %%mm1"
            "\n\t punpcklbw  %%mm3, %%mm2"
            "\n\t psubw      %%mm2, %%mm1"
            "\n\t pmulhw     %%mm0, %%mm1"
            "\n\t paddw      %%mm2, %%mm1"
            "\n\t packuswb   %%mm3, %%mm1"
    : : "r" (alpha), "r" (A), "r" (B), "r" (out), "r" (end)
    );
    __asm__("emms" : : );
}

When compiling, I get this message: Error: (%dl) is not a valid base/index expression, referring to the first line of the assembler block. I suspect it's because alpha is Uint8; I tried casting it, but then I get a segmentation fault. In the example they talk about "4 16-b zero-padding α", which is not really clear to me.

  • That `alpha` is 64 bit, as indicated by the `movq` that you wrongly copied as `movd`. It is 4 copies of a 16 bit alpha value. – Jester Dec 27 '20 at 21:32
  • That makes sense... So what I need to do is first cast `alpha` as `Uint16` and then put it like 4 times? How can I do this? –  Dec 27 '20 at 21:41
  • Cast `alpha` to `uint32` and remove the parentheses from `(%0)`. – fuz Dec 27 '20 at 21:53
  • I would try `Uint16 alphaArray[4] = { alpha, alpha, alpha, alpha }` or similar. – Jester Dec 27 '20 at 21:54
  • Is this supposed to be for 32 bit or for 64 bit code? – fuz Dec 27 '20 at 21:55
  • *Why* would you translate this? You can do *much* better today, since modern CPUs have wider vector units, and recent microarchitectures have reduced `mmx` execution units because the technology is completely obsolete. Also, [gcc](https://godbolt.org/z/n7ros9) *really* doesn't need any handwritten assembly for this. – EOF Dec 27 '20 at 21:57
  • @fuz 64 bit code –  Dec 27 '20 at 21:58
  • Consider using intrinsics instead of inline assembly if possible. This is less error prone and allows the compiler to make use of SSE instructions to implement the same code if possible. – fuz Dec 27 '20 at 22:06
  • Looks like you're doing `punpcklbw %%mm3, %%mm1` twice in a row, instead of into MM1 and MM2. The original intel-syntax source has 3 unpacks, one of which looks extra, like it got repeated to make room for a longer comment??!? – Peter Cordes Dec 27 '20 at 22:28
  • @PeterCordes right, thank you, I also have a mistake in the example, I'll fix this. –  Dec 27 '20 at 22:30
  • You're actually probably better off writing this code in plain C. GCC can autovectorize it pretty well. – Ross Ridge Dec 28 '20 at 03:01
  • Please, don't tag this question as `C` if your question is not about C language. You didn't tag it as `gcc` or `linux`, because it was not relevant, so do the same with that gcc extension, as the question is related to assembler code. BTW, check that the linux ABI isn't interfering with your register selection. But again, this is not a C issue. – Luis Colorado Jan 02 '21 at 15:07

2 Answers


You can broadcast alpha to 64-bit using scalar multiply with 0x0001000100010001ULL before copying to an MM reg. Another option would be to just zero-extend the 8-bit integer to 32-bit for movd, then pshufw to replicate it.

There were also various safety problems with your asm.

#include <SDL/SDL.h>
#include <stdint.h>

void fade_mmx(SDL_Surface* im1,SDL_Surface* im2,Uint8 alpha, SDL_Surface* imOut)
{
    int pixelsCount = imOut->w * im1->h;

    Uint32 *A = (Uint32*) im1->pixels;
    Uint32 *B = (Uint32*) im2->pixels;
    Uint32 *out = (Uint32*) imOut->pixels;
    Uint32 *end = out + pixelsCount;

    Uint64 alphas = (Uint64)alpha * 0x0001000100010001ULL;

    __asm__ __volatile__ (
            "\n\t movd  %0, %%mm0"
            "\n\t movd  %1, %%mm1"
            "\n\t movd  %2, %%mm2"
            "\n\t pxor       %%mm3, %%mm3"
            "\n\t punpcklbw  %%mm3, %%mm1"
            "\n\t punpcklbw  %%mm3, %%mm2"
            "\n\t psubw      %%mm2, %%mm1"
            "\n\t pmulhw     %%mm0, %%mm1"
            "\n\t paddw      %%mm2, %%mm1"
            "\n\t packuswb   %%mm3, %%mm1"
    : // you're probably going to want an "=m"(*something) memory output here
    : "r" (alphas), "m" (*A), "m" (*B), "r" (out), "r" (end)
    : "mm0", "mm1", "mm2", "mm3");
    __asm__("emms" : : );
}

The asm statement doesn't need to be volatile if the compiler knows about all the inputs and output, rather than relying on a "memory" clobber. (Like here, no outputs, and only reads registers and memory that are input operands.)

For 32 bit code, replace "r"(alphas) with "m"(alphas). Or use "rm"(alphas) to let the compiler pick. (But for 32-bit, definitely better to use pshufw instead of making the compiler store a 64-bit multiply result as 2 32-bit halves, then suffer a store-forwarding stall when reloading it with movq. Intrinsics would leave the decision up to the compiler with _mm_set1_epi16(alpha), although you only do that once outside the loop anyway).
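
If you do go the pshufw route, a minimal sketch of the broadcast could look like this (not code from the question; alpha32 is just an illustrative name, and pshufw needs SSE1):

Uint32 alpha32 = alpha;                  /* zero-extend Uint8 to 32 bits */
__asm__ ("movd   %0, %%mm0\n\t"          /* mm0 = 0 | 0 | 0 | alpha      */
         "pshufw $0, %%mm0, %%mm0"       /* replicate word 0 to all four */
         : : "r" (alpha32) : "mm0");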

Note that I have also added the necessary clobber lists to the code above and replaced register operands containing pointers you dereference with memory operands referring to the memory you dereference, thus allowing gcc to reason about what memory you access.

Please note that if you don't address these things, gcc is going to be unhappy and the behaviour of your code will be undefined, likely failing in mysterious and hard-to-debug ways. Do not use inline assembly unless you understand exactly what you are doing. Consider using intrinsic functions as a safer and potentially more efficient alternative (https://gcc.gnu.org/wiki/DontUseInlineAsm).

SSE2 with __m128i vectors makes it easy to do 4 pixels at once, instead of 2 or 1, which wastes half of your pack throughput by packing with zeros. (Use punpckhbw to complement punpcklbw to set up for this.) MMX is so obsolete that modern CPUs have lower throughput for the MMX versions of some instructions than for the equivalent 128-bit SSE2 XMM instructions.
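
As a rough illustration (not code from the question or from the asm above; the names are mine, and the blend is rearranged into the unsigned weighted sum a*fade + b*(256-fade) so no signed multiply is needed), one SSE2 step on 4 packed pixels could look like:

#include <emmintrin.h>

/* Sketch: crossfade 4 packed 8-bit-per-channel pixels at once.
 * fade is 0..256; the result is roughly b + (a-b)*fade/256. */
static inline __m128i fade4_sse2(__m128i a, __m128i b, unsigned fade)
{
    __m128i zero = _mm_setzero_si128();
    __m128i f    = _mm_set1_epi16((short) fade);
    __m128i g    = _mm_set1_epi16((short) (256 - fade));

    /* widen bytes to 16-bit words, low and high halves */
    __m128i alo = _mm_unpacklo_epi8(a, zero), ahi = _mm_unpackhi_epi8(a, zero);
    __m128i blo = _mm_unpacklo_epi8(b, zero), bhi = _mm_unpackhi_epi8(b, zero);

    /* (a*fade + b*(256-fade)) >> 8, which stays within unsigned 16 bits */
    __m128i lo = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(alo, f),
                                              _mm_mullo_epi16(blo, g)), 8);
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(ahi, f),
                                              _mm_mullo_epi16(bhi, g)), 8);

    return _mm_packus_epi16(lo, hi);     /* narrow back to bytes */
}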

Peter Cordes
fuz
  • This is still missing some important things to make it safe: telling the compiler that this reads pointed-to memory ([How can I indicate that the memory *pointed* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259)), that it clobbers mmx regs (and thus destroys x87 state) - probably just clobbers on `"mm0", "mm1"` etc would work. Also missing doing anything with the result in mm1; an `"=x"(vec)` output could work. – Peter Cordes Dec 27 '20 at 22:22
  • But seriously, https://gcc.gnu.org/wiki/DontUseInlineAsm - that's not the best way to start translating this. Use `_mm_unpacklo_pi16`, or better just use SSE2 for 1, 2, or 4 pixels at once since x86-64 guarantees that. (Of course using unpackhi as well as lo, and doing wide loads). Hmm, the actual asm looks buggy, doing `punpcklbw %%mm3, %%mm1` twice in a row; that's a separate issue for the OP to fix. – Peter Cordes Dec 27 '20 at 22:26
  • @PeterCordes I am aware. Given that the OP's snippet is woefully incomplete, I assume that this was just to make an MCVE and correct clobbers will be added later on. – fuz Dec 27 '20 at 22:57
  • I think it's a bad idea to let unsafe inline `asm` go un-commented-on on Stack Overflow. If we assume everyone looking at it will know that this is a minefield of silent problems, there will definitely be future readers who look at this as an example and just use it as written, maybe adding an output operand but *not* register or `"memory"` clobbers. – Peter Cordes Dec 27 '20 at 23:08
  • @PeterCordes I agree with your point here. Let me add some comments real quick. – fuz Dec 27 '20 at 23:22
  • I'd already started writing something along those lines myself since I was going to edit to apply the OP's `mm1` -> `mm2` change. Merged in a bit of what I was going to add that wasn't already covered. – Peter Cordes Dec 27 '20 at 23:39
  • @PeterCordes Note that gcc and clang have started to implement MMX intrinsics with SSE instructions where supported, so if OP just moves to intrinsics, good code will be generated in either case. – fuz Dec 28 '20 at 00:04

Your problem is that you're trying to use your alpha value as an address instead of as a value. The movd (%0), %%mm0 instruction says to use %0 as a location in memory, so you're telling it to load the value pointed to by alpha instead of alpha itself. Using movd %0, %%mm0 would solve that problem, but then you'd run into the problem that your alpha value only has an 8-bit type, and it needs to be a 32-bit type to work with the MOVD instruction. You can solve that problem, and also handle the fact that the alpha value needs to be multiplied by 256 and broadcast to all 4 16-bit words of the destination register for your algorithm to work, by multiplying it by 0x0100010001000100ULL and using the MOVQ instruction.
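
In other words, something roughly like this (just a sketch, assuming 64-bit code as in the question; alphas is an illustrative name):

Uint64 alphas = (Uint64) alpha * 0x0100010001000100ULL;  /* alpha*256 in each 16-bit word */
__asm__ ("movq %0, %%mm0" : : "r" (alphas) : "mm0");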

However, you don't need the MOVD/MOVQ instructions at all. You can let the compiler load the values into MMX registers itself by specifying the y constraint with code like this:

typedef unsigned pixel;

static inline pixel
fade_pixel_mmx_asm(pixel p1, pixel p2, unsigned fade) {
    asm("punpcklbw %[zeros], %[p1]\n\t"
        "punpcklbw %[zeros], %[p2]\n\t"
        "psubw     %[p2], %[p1]\n\t"
        "pmulhw    %[fade], %[p1]\n\t"
        "paddw     %[p2], %[p1]\n\t"
        "packuswb  %[zeros], %[p1]"
        : [p1] "+&y" (p1), [p2] "+&y" (p2)
        : [fade] "y" (fade * 0x0100010001000100ULL), [zeros] "y" (0));
    return p1;
}

You'll notice that there's no need for a clobber list here because there are no registers being used that weren't allocated by the compiler, and no other side effects that the compiler needs to know about. I've left out the necessary EMMS instruction, as you wouldn't want it executed on every pixel. You'll want to insert an asm("emms"); statement after the loop that blends the two surfaces.
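
For example, the surrounding loop could look roughly like this (only a sketch; fade_blend_mmx and the flat pixel buffers are my simplification, error handling omitted):

/* Sketch: crossfade two equally sized pixel buffers, then execute EMMS
 * once after the last MMX instruction. */
void fade_blend_mmx(pixel *dest, const pixel *a, const pixel *b,
                    unsigned fade, unsigned len)
{
    for (unsigned i = 0; i < len; i++)
        dest[i] = fade_pixel_mmx_asm(a[i], b[i], fade);
    asm("emms");   /* restore the x87 state for any later FPU code */
}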

Better yet, you don't need to use inline assembly at all. You can use intrinsics instead, and not have to worry about all the pitfalls of using inline assembly:

#include <mmintrin.h>

static inline pixel
fade_pixel_mmx_intrin(pixel p1, pixel p2, unsigned fade) {
    __m64 zeros = (__m64) 0ULL;
    __m64 mfade = (__m64) (fade * 0x0100010001000100ULL);
    __m64 mp1 = _m_punpcklbw((__m64) (unsigned long long) p1, zeros);
    __m64 mp2 = _m_punpcklbw((__m64) (unsigned long long) p2, zeros);

    __m64 ret;
    ret = _m_psubw(mp1, mp2);
    ret = _m_pmulhw(ret, mfade);
    ret = _m_paddw(ret, mp2);
    ret = _m_packuswb(ret, zeros);

    return (unsigned long long) ret;
}

Similarly to the previous example, you need to call _m_empty() after your loop to generate the necessary EMMS instruction.

You should also seriously consider just writing the routine in plain C. Autovectorizers are pretty good these days, and it's likely the compiler can generate better code using modern SIMD instructions than what you're trying to do with ancient MMX instructions. For example, this code:

static inline unsigned
fade_component(unsigned c1, unsigned c2, unsigned fade) {
    return c2  + (((int) c1 - (int) c2) * fade) / 256;
}

void
fade_blend(pixel *dest, pixel *src1, pixel *src2, unsigned char fade,
           unsigned len) {
    unsigned char *d = (unsigned char *) dest;
    unsigned char *s1 = (unsigned char *) src1;
    unsigned char *s2 = (unsigned char *) src2;
    unsigned i;
    for (i = 0; i < len * 4; i++) {
        d[i] = fade_component(s1[i], s2[i], fade);
    }
}

With GCC 10.2 and -O3 the above code results in assembly code that uses 128-bit XMM registers and blends 4 pixels at a time in its inner loop:

    movdqu  xmm5, XMMWORD PTR [rdx+rax]
    movdqu  xmm1, XMMWORD PTR [rsi+rax]
    movdqa  xmm6, xmm5
    movdqa  xmm0, xmm1
    punpckhbw       xmm1, xmm3
    punpcklbw       xmm6, xmm3
    punpcklbw       xmm0, xmm3
    psubw   xmm0, xmm6
    movdqa  xmm6, xmm5
    punpckhbw       xmm6, xmm3
    pmullw  xmm0, xmm2
    psubw   xmm1, xmm6
    pmullw  xmm1, xmm2
    psrlw   xmm0, 8
    pand    xmm0, xmm4
    psrlw   xmm1, 8
    pand    xmm1, xmm4
    packuswb        xmm0, xmm1
    paddb   xmm0, xmm5
    movups  XMMWORD PTR [rdi+rax], xmm0

Finally, even an unvectorized version of the C code may be near optimal, as the code is simple enough that you're probably going to be memory bound regardless of how exactly the blend is implemented.

Ross Ridge
  • `(__m64) (fade * 0x0100010001000100ULL)` would probably be better as `_mm_set1_pi16(fade)`. (https://godbolt.org/z/faoos9) When `fade` is known to be correctly zero-extended to 16 or 32 bits already, it's just `movd` + `pshufw` instead of a `movabs r64, imm64`/`imul`. clang and gcc miss this optimization if you write it as a multiply - they aren't looking at the number to turn it back into a shuffle. (clang assumes that the caller extends narrow args to 32-bit, unlike GCC). – Peter Cordes Dec 29 '20 at 03:15
  • I also played around with `_mm_cvtsi32_si64` (movd) to avoid zero-extension in integer regs before a `movq mm0, rdi`, but that makes clang `movd` to *xmm* then `movdq2q`. It does seem to help GCC, though, which uses XMM regs for the entire thing despite using MMX intrinsics. – Peter Cordes Dec 29 '20 at 03:16
  • But as you say, that's mostly moot: auto-vectorized is better than this hand-vectorized version that only uses half the pack throughput by not bothering with `punpckhbw`. – Peter Cordes Dec 29 '20 at 03:23
  • @PeterCordes Note that `pshufw` is from SSE. If SSE is not available, a pair of unpacks will do instead. – fuz Dec 29 '20 at 14:03
  • @fuz: Fair point in general, but this is an x86-64 question so SSE1 and SSE2 are guaranteed available. On 32-bit Pentium II (without SSE) a 32x32 => 64-bit multiply and then 2 stores to feed one movq reload would cause a store-forwarding stall, as well as having significant latency. (Or 2x movd + punpckldq is already comparable cost to 2 unpacks). On in-order P5 Pentium-MMX, IDK if store-forwarding stalls are a thing; in-order Atom doesn't have them. But a `mul r32` takes 9 cycles and isn't pairable, so that sucks a lot and 2 unpacks would definitely be better. So multiply isn't great. :/ – Peter Cordes Dec 29 '20 at 23:30