
For example, with a function like this,

int fb(char a, char b, char c, char d) {
    return (a + b) - (c + d);
}

gcc's assembly output is,

fb:
        movsx   esi, sil
        movsx   edi, dil
        movsx   ecx, cl
        movsx   edx, dl
        add     edi, esi
        add     edx, ecx
        mov     eax, edi
        sub     eax, edx
        ret

Vaguely, I understand that the purpose of movsx is to remove a dependency on the previous value of the register, but honestly I still don't understand exactly what kind of dependency it is trying to remove. I mean, for example, whether or not there is movsx esi, sil: if some value is being written to esi, then any operation using esi has to wait; if a value is being read from esi, any operation modifying esi has to wait; and if esi isn't being used by any operation, the code keeps running. What difference does movsx make? I can't say the compiler is doing something wrong, because movsx or movzx is (almost?) always produced by any compiler when loading values smaller than 32 bits.

Apart from my lack of understanding, gcc behaves differently with floats.

float ff(float a, float b, float c, float d) {
    return (a + b) - (c + d);
}

is compiled to,

ff:
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

If the same logic were applied, I believe the output would be something like,

ff:
        movd    xmm0, xmm0
        movd    xmm1, xmm1
        movd    xmm2, xmm2
        movd    xmm3, xmm3
        addss   xmm0, xmm1
        addss   xmm2, xmm3
        subss   xmm0, xmm2
        ret

So I'm actually asking 2 questions.

  1. Why does gcc behave differently with floats?
  2. What difference does movsx make?
xiver77

1 Answer

  1. The return value is the same width as the args so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP integer and vector registers.)

    Except for an undocumented extension which clang depends on, where callers extend narrow args to 32-bit; clang will skip the movsx instructions in your char example. https://godbolt.org/z/Gv5e4h3Eh

    Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI? covers both the high garbage and the unofficial extension to the calling convention.

    Since you asked about false dependencies, note that compilers do use movaps xmm,xmm to copy a scalar, e.g. for GCC's missed optimization in (a-b) + (a-d), where we need to subtract from a twice. Subtraction is non-commutative, so we need a copy: https://godbolt.org/z/Tvx19raa3 (see the sketch after this list).
    Indeed, movss xmm1, xmm0 has a dependency on XMM1 where movaps doesn't, and it would be a false dependency if you didn't actually care about merging with the old high bytes.

    (Tuning for Pentium III or Pentium M might make sense to use movss because it was single-uop there, but current GCC -O3 -m32 -mtune=pentium3 -mfpmath=sse uses movaps, spending the 2nd uop to avoid a false dependency. It wasn't until Core2 that the SIMD execution units widened to 128-bit for P6 family, matching Pentium 4.)

  2. C integer promotion rules mean that a+b for narrow inputs is equivalent to (int)a + (int)b. In all x86 / x86-64 ABIs, char is a signed type (unlike on ARM for example), so it needs to be sign extended to int width, not zero extended. And definitely not truncated.

    If you truncated the result again by returning a char, compilers could, if they wanted, just do 8-bit adds. But actually they'll use 32-bit adds and leave whatever high garbage is there: https://godbolt.org/z/hGdbecPqv (see the sketch after this list). So GCC isn't doing the movsx for dep-breaking / performance, just for correctness.

    As far as performance, GCC's behaviour of reading the 32-bit reg for a char is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename low 8 separately from the rest of the reg (everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. Why doesn't GCC use partial registers?)
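
To make both points concrete, here's a rough sketch. The function names fb8 and fsub are made up for illustration (they're not in the Godbolt links), and the comments describe what GCC typically emits per the links above, so treat them as illustrative rather than exact.

/* Point 2: truncating the result by returning char. The operands are still
   promoted to int, so the inputs still get sign-extended with movsx, the
   math is done with 32-bit adds/subs, and whatever garbage ends up in the
   high bits of eax is simply left there (only al is the return value). */
char fb8(char a, char b, char c, char d) {
    return (a + b) - (c + d);
}

/* Point 1: a is used twice as the minuend, and subtraction isn't commutative,
   so a copy of a is needed. GCC makes that copy with movaps (a full-register,
   write-only copy, e.g. movaps xmm3, xmm0) rather than movss, which would
   merge into the destination and carry a false dependency on its old value. */
float fsub(float a, float b, float d) {
    return (a - b) + (a - d);
}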


PS: there's no such instruction as movd xmm0, xmm0, only a different form of movq xmm0, xmm0 which yes would zero-extend the low 64 bits of an XMM register into the full reg.

If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps, look at asm for __m128 foo(float f) { return _mm_set_ss(f); } in the Godbolt link above. e.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0. Otherwise, insertps or xor-zero and blendps.
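
For reference, that example written out (only the #include and the comment are added here; the behaviour it describes is what's explained above):

#include <immintrin.h>

/* _mm_set_ss requires elements 1..3 of the result to be zero, so the compiler
   has to spend instructions zero-extending the scalar (pxor + movss with plain
   SSE2, or insertps / xor-zero + blendps otherwise) instead of treating the
   upper elements as don't-care. */
__m128 foo(float f) {
    return _mm_set_ss(f);
}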

Peter Cordes
  • I saw your linked question. Actually read it many times trying to understand. I still don't understand, quoting from the selected answer, "without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism because reusing an 8 or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves).". – xiver77 Jan 18 '22 at 18:05
  • Rephrasing from my OP: if some value is being written to `esi`, for example, then any operation using `esi` has to wait; if a value is being read from `esi`, any operation modifying `esi` has to wait; and if `esi` isn't being used by any operation, the code continues to run. I can add the word *partial* everywhere and it still makes sense to me. What difference exactly does reading or writing partially make? – xiver77 Jan 18 '22 at 18:12
  • @xiver77: It's saying that `mov al, byte ptr [mem8]` in the example there would have a false dependency on the old value of AL. So yes, if a `char fc(char a)` used `mov al, dil` instead of `movzx eax, dil` or `mov eax, edi`, it would have a false dependency on RAX, where **otherwise RAX would be accessed write-only, never waiting for the old value**. But sign-extending within the *same* register isn't relevant for this; you're already reading part of the full register. – Peter Cordes Jan 18 '22 at 18:13
  • I think I'll have to gain more base knowledge to have a better understanding. One last question: if `mov al, dil` instead of `mov eax, edi` creates a false dependency on `rax`, wouldn't `movss xmm0, xmm1` instead of `movaps xmm0, xmm1` create a false dependency on `xmm0` the same way? However, compilers always seem to use `ss`/`sd` instructions for scalar values. – xiver77 Jan 18 '22 at 18:29
  • My use of *always* in the last comment was incorrect. Compilers do sometimes zero the upper values before load, but not always. `gcc` seems to do zeroing more often than `clang`. I didn't investigate when exactly it happens, and of course I cannot explain why, but just wanted to clarify. – xiver77 Jan 18 '22 at 18:52
  • @xiver77: Yes, `movss` would have a false dep. That's why compilers use `movaps` to copy a register, even when they only care about the low scalar. https://godbolt.org/z/8Toeasq7M shows 2 examples: `(a+b) - (a+d)` uses `a` twice. (Although a smarter compiler could still avoid any `mov` instructions because `addss` is commutative; it could rework it to `d+=a` / `a+=b` / `a-=d`. But GCC wastes two `movaps` instructions.) – Peter Cordes Jan 18 '22 at 19:08
  • @xiver77: Updated my answer with that and some other stuff. – Peter Cordes Jan 18 '22 at 19:19
  • Maybe the dependency issue is why Intel's SIMD intrinsics always zero the upper values when initializing with a single lower value, even when they have instructions that can set only the lowest element. I remember doing a memcpy from a float to a __m128 to forcibly keep the upper values uninitialized when I'm not using them, but probably I shouldn't have done it. – xiver77 Jan 18 '22 at 20:08
  • @xiver77: Maybe, although that's actually one of the worst parts of Intel's intrinsics design. Given a scalar `float`, there's no way to get an `__m128` with "don't care" upper bytes, for zero asm instructions if the float was already in an XMM register. (Like `_mm256_castps128_ps256` does for xmm->ymm). – Peter Cordes Jan 18 '22 at 20:17
  • See [How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?](https://stackoverflow.com/q/39318496) where the first step would be that, to set up for `_mm_move_ss` which takes two `__m128` args (for the reg-reg merging form, not the zext load). Unfortunately there's no form that takes `(__m128, float)` args, so in the source code you have to create a `__m128` out of the scalar first. Clang's shuffle optimizer can optimize away zero-extension or broadcast of `_mm_set` when the upper part's unused. – Peter Cordes Jan 18 '22 at 20:17