Your problem is that you're trying to use your alpha value as an address instead of as a value. The movd (%0), %%mm0 instruction says to use %0 as a location in memory, so you're loading the value pointed to by alpha instead of alpha itself. Using movd %0, %%mm0 would solve that problem, but then you'd run into the problem that your alpha value has only an 8-bit type, and it needs to be a 32-bit type to work with the MOVD instruction. You can solve that, along with the fact that the alpha value needs to be multiplied by 256 and broadcast to all four 16-bit words of the destination register for your algorithm to work, by multiplying it by 0x0100010001000100ULL and using the MOVQ instruction.
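A minimal sketch of that load (the names alpha and fade64 are assumptions, since your surrounding code isn't shown):
unsigned long long fade64 = alpha * 0x0100010001000100ULL;
/* Load alpha*256 into all four 16-bit words of %mm1. */
asm("movq %0, %%mm1" : : "m" (fade64) : "mm1");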
However, you don't need the MOVD/MOVQ instructions at all. You can let the compiler load the values into MMX registers itself by specifying the y constraint with code like this:
typedef unsigned pixel;
static inline pixel
fade_pixel_mmx_asm(pixel p1, pixel p2, unsigned fade) {
asm("punpcklbw %[zeros], %[p1]\n\t"
"punpcklbw %[zeros], %[p2]\n\t"
"psubw %[p2], %[p1]\n\t"
"pmulhw %[fade], %[p1]\n\t"
"paddw %[p2], %[p1]\n\t"
"packuswb %[zeros], %[p1]"
: [p1] "+&y" (p1), [p2] "+&y" (p2)
: [fade] "y" (fade * 0x0100010001000100ULL), [zeros] "y" (0));
return p1;
}
You'll notice that there's no need for a clobber list here, because there are no registers being used that weren't allocated by the compiler, and no other side effects that the compiler needs to know about. I've left out the necessary EMMS instruction, as you wouldn't want it executed on every pixel. You'll want to insert an asm("emms"); statement after your loop that blends the two surfaces.
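For example, the blend loop might look something like this (the function name fade_blend_mmx and its parameters are illustrative, not from your code):
void
fade_blend_mmx(pixel *dest, pixel *src1, pixel *src2, unsigned fade,
               unsigned len) {
    unsigned i;
    for (i = 0; i < len; i++)
        dest[i] = fade_pixel_mmx_asm(src1[i], src2[i], fade);
    asm("emms");   /* reset the FPU state after the last MMX use */
}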
Better yet, you don't need to use inline assembly at all. You can use intrinsics instead, and not have to worry about all the pitfalls of using inline assembly:
#include <mmintrin.h>
static inline pixel
fade_pixel_mmx_intrin(pixel p1, pixel p2, unsigned fade) {
    __m64 zeros = (__m64) 0ULL;
    __m64 mfade = (__m64) (fade * 0x0100010001000100ULL);  /* fade*256 in each word */
    __m64 mp1 = _m_punpcklbw((__m64) (unsigned long long) p1, zeros);  /* widen bytes to words */
    __m64 mp2 = _m_punpcklbw((__m64) (unsigned long long) p2, zeros);
    __m64 ret;
    ret = _m_psubw(mp1, mp2);      /* p1 - p2 */
    ret = _m_pmulhw(ret, mfade);   /* (p1 - p2) * fade / 256 */
    ret = _m_paddw(ret, mp2);      /* + p2 */
    ret = _m_packuswb(ret, zeros); /* narrow back to bytes */
    return (unsigned long long) ret;
}
Similarly to the previous example, you need to call _m_empty() after your loop to generate the necessary EMMS instruction.
You should also seriously consider just writing the routine in plain C. Autovectorizers are pretty good these days, and it's likely the compiler can generate better code using modern SIMD instructions than what you're trying to do with ancient MMX instructions. For example, this code:
static inline unsigned
fade_component(unsigned c1, unsigned c2, unsigned fade) {
    return c2 + (((int) c1 - (int) c2) * fade) / 256;
}
void
fade_blend(pixel *dest, pixel *src1, pixel *src2, unsigned char fade,
           unsigned len) {
    unsigned char *d = (unsigned char *) dest;
    unsigned char *s1 = (unsigned char *) src1;
    unsigned char *s2 = (unsigned char *) src2;
    unsigned i;
    for (i = 0; i < len * 4; i++) {   /* 4 color components per 32-bit pixel */
        d[i] = fade_component(s1[i], s2[i], fade);
    }
}
With GCC 10.2 and -O3 the above code results in assembly code that uses 128-bit XMM registers and blends 4 pixels at a time in its inner loop:
movdqu xmm5, XMMWORD PTR [rdx+rax]
movdqu xmm1, XMMWORD PTR [rsi+rax]
movdqa xmm6, xmm5
movdqa xmm0, xmm1
punpckhbw xmm1, xmm3
punpcklbw xmm6, xmm3
punpcklbw xmm0, xmm3
psubw xmm0, xmm6
movdqa xmm6, xmm5
punpckhbw xmm6, xmm3
pmullw xmm0, xmm2
psubw xmm1, xmm6
pmullw xmm1, xmm2
psrlw xmm0, 8
pand xmm0, xmm4
psrlw xmm1, 8
pand xmm1, xmm4
packuswb xmm0, xmm1
paddb xmm0, xmm5
movups XMMWORD PTR [rdi+rax], xmm0
Finally, even an unvectorized version of the C code may be near optimal, as the code is simple enough that you're probably going to be memory bound regardless of how exactly the blend is implemented.