What is causing GCC to store a value on the stack when there are plenty of free registers?

Question

GCC 11.2 (-O3) compiles this

#include <stdint.h>

float flt_abs(float x) {
    return *(float *)&(uint32_t){*(uint32_t *)&x & INT32_MAX};
}

float f(float x) {
    return flt_abs(x - (int)x) * 4;
}

to this

flt_abs:
        movd    eax, xmm0
        and     eax, 2147483647
        movd    xmm0, eax
        ret
f:
        cvttss2si       eax, xmm0
        pxor    xmm1, xmm1
        cvtsi2ss        xmm1, eax
        mov     eax, 2147483647
        subss   xmm0, xmm1
        movd    DWORD PTR [rsp-4], xmm0
        and     eax, DWORD PTR [rsp-4]
        movd    xmm0, eax
        mulss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1082130432

(I know it's better to use fabsf here, but just for a test.)

There are plenty of call-clobbered registers free in f, but GCC uses none of them and instead stores the value to memory for some reason.

Is this a bug in GCC, or an intended result?

If it is intended, what makes storing to the stack better than using a call-clobbered register?

If it is a bug, how should I file this? I mean,

Is there a mailing list, or some website for filing bugs?
What kind of problem is this? Is there a known or related bug? I want to be more specific about the problem.

The question is about 64-bit, but it would still apply on 32-bit because ecx and edx are usable without push/pop in f.

I've just found that the problem is gone in the "trunk" version of GCC on Godbolt. The devs seem to have already noticed the problem and fixed it.

Maybe because most registers in 32-Bit are callee saved? If it would use another register it would have to push it to the stack either way to then pop the value back before returning. It wouldn't make a difference here, is what I am trying to say. The stack had to be used in either case. — cediwelli, Apr 03 '22 at 08:56
The question is about 64-bit (sorry, I didn't mention it), but in 32-bit `eax`, `ecx`, and `edx` are caller-saved on most platforms, so either `ecx` or `edx` could be used without push/pop in `f`. – xiver77, Apr 03 '22 at 09:01
I just assumed, because there is x86 and not x86_64 as tag and I overlooked `rsp` and `rip`. What I read out of the code is the following: It does the `x - (int)x` in the first 5 lines and then, what I assume is pushing the result of that onto the stack. Normally you would do this to pass it as an argument to a function (which would be `flt_abs` in this case) but it doesn't call the function, it does its `mulss xmm0, DWORD PTR .LC0[rip]` magic, which I don't quite understand and then leaves with `ret` — cediwelli, Apr 03 '22 at 09:06
`*(uint32_t *)&x` invokes UB due to the violation of strict aliasing rule and anything can happen — phuclv, Apr 03 '22 at 10:18
@phuclv The code using `memcpy` instead of pointer casting gets compiled to the exact same code (https://godbolt.org/z/x6afPs99x). I know about strict aliasing rules, but if you think it applies in this case, I'd appreciate it if you can answer [this question](https://stackoverflow.com/q/71724929/17665807). — xiver77, Apr 03 '22 at 10:36
@xiver77 obviously it's UB. GCC gives you the warning right away: https://godbolt.org/z/x6x8o78ha. [Always enable all warnings, read & fix them](https://stackoverflow.com/q/57842756/995714). UB means anything can happen, including the compiler producing the correct output – phuclv, Apr 03 '22 at 10:52
@xiver77 Use a union, not `memcpy` or pointer type punning for this sort of stuff. — fuz, Apr 03 '22 at 12:41
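
The alternatives to pointer casting suggested in the comments above might look like the following. This is only a sketch, assuming `float` is a 32-bit IEEE 754 type; the function names `flt_abs_memcpy` and `flt_abs_union` are hypothetical and do not appear in the original post. According to the comment above, the `memcpy` form compiled to the same code under GCC 11.2, so it settles the aliasing question without changing the store/reload behaviour being asked about.

```c
#include <stdint.h>
#include <string.h>

/* Type punning via memcpy: well-defined in C, and compilers
   typically optimize the copies away at -O1 and above. */
float flt_abs_memcpy(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read the float's bit pattern */
    bits &= UINT32_C(0x7FFFFFFF);     /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);      /* write the pattern back */
    return x;
}

/* Type punning via a union: permitted in C (C99 and later),
   but not in C++. Assumes sizeof(float) == sizeof(uint32_t). */
float flt_abs_union(float x) {
    union { float f; uint32_t u; } pun = { .f = x };
    pun.u &= UINT32_C(0x7FFFFFFF);    /* clear the sign bit */
    return pun.f;
}
```

Both variants can replace `flt_abs` in `f` unchanged, for example `flt_abs_memcpy(x - (int)x) * 4`.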

score 2 · Answer 1 · answered Apr 03 '22 at 09:26

The best option would have been to use andps directly on the vector register instead of doing the and in a GPR. Even if doing the and in a GPR were set in stone, moving the value through memory with movd is a bad way to accomplish that. Basically, GCC is being dumb. There is no good reason for doing it this way. Perhaps GCC has some excuse for its mistake, but that's the best it can hope for.

Clang gets it right:

f:                                      # @f
    cvttps2dq       xmm1, xmm0
    cvtdq2ps        xmm1, xmm1
    subss   xmm0, xmm1
    andps   xmm0, xmmword ptr [rip + .LCPI1_0]
    mulss   xmm0, dword ptr [rip + .LCPI1_1]
    ret

Clang even does the float-to-int-to-float conversions better, using packed conversions that stay in the vector registers. I hadn't even thought of that.
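
For readers who want to keep the masking in the vector domain explicitly, the andps idea from this answer can be written with SSE intrinsics. This is a rough sketch, assuming an x86-64 target (where SSE2 is baseline) and a compiler that provides `<immintrin.h>`; the names `flt_abs_sse` and `f_sse` are hypothetical. It does not guarantee any particular instruction sequence, and `fabsf` remains the simpler choice in real code.

```c
#include <immintrin.h>

/* Clear the sign bit with a bitwise AND on the vector register,
   so the value never needs to leave the XMM domain. */
float flt_abs_sse(float x) {
    __m128 v = _mm_set_ss(x);                                   /* x in lane 0, other lanes zero */
    __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF)); /* all bits set except the sign */
    return _mm_cvtss_f32(_mm_and_ps(v, mask));                  /* extract lane 0 */
}

/* Same computation as f in the question, using the helper above. */
float f_sse(float x) {
    return flt_abs_sse(x - (float)(int)x) * 4.0f;
}
```

For example, `f_sse(-1.25f)` takes the fractional part -0.25, clears its sign, and scales it by 4.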

answered Apr 03 '22 at 09:26

harold

Both compilers get it correct (and produce faster code) by using `fabsf` in the first place. But my question was about GCC using the stack instead of a register. The code is the smallest I could tear down from a larger function, while preserving the stack-using behaviour. – xiver77 Apr 03 '22 at 09:32
@xiver77 putting the value in `edx` instead of on the stack makes a bit more sense, that would be the proper solution in another context where the value actually needs to be moved. But that isn't the case here. If GCC had used `movd edx, xmm0`, I would still have called that a mistake, because that move doesn't need to exist at all. – harold Apr 03 '22 at 09:37
@xiver77: Going through memory for the XMM<->GPR round trip is perhaps related to GCC's bad `-mtune=generic` in general: [GCC bug 80820 _mm_set_epi64x shouldn't store/reload for -mtune=haswell](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820), although some (all?) of that has been fixed since I reported it, even for `-mtune=generic` – Peter Cordes Apr 03 '22 at 11:17
