
If I have some value (bit pattern, let's say all zeros) in rax, can I store it in st0 without storing and loading from memory?

BeeOnRope

2 Answers


I think you actually want this for your "Long latency instruction" question, where movd/movq + sqrtss or sqrtsd works more easily than ?? -> fsqrt.

Coupling an integer dependency into an x87 dependency can be done with fcmovcc through integer flags instead of by transferring a bit-pattern. On Skylake that's a 4-uop instruction, but it's 4 ALU uops.

In the other direction, use fcomi + setcc or cmovcc, or even just fcom + fnstsw ax.

As @Hadi points out, you can create values in x87 registers with fldz (1 uop) or fld1 (2 uops). Then make that dependent on whatever you want using fcmovne st0, st1 or similar.


Possible answer, not really fully researched since it's unlikely to be useful to you (probably involving microcoded emms: 10 uops on Skylake, 31 uops on Haswell.)

Maybe you can use movq mm0, rax and then emms to leave MMX state. This marks all the x87 register tags as "empty".

On current AMD, FEMMS is identical to EMMS, according to AMD's May 2018 PDF ISA reference manual, but I seem to recall reading that FEMMS on older AMD CPUs left the x87 registers undefined. Maybe that's usable for something. It does still set the tags, so the undefined contents are probably only relevant for code that expected to find the mm0..7 contents still there after EMMS and then ran another MMX instruction, or expected to find the data in fxsave state.

I think the 64-bit MMX registers alias the significands (mantissas) of the 80-bit x87 registers. The st0..7 stack maps onto those underlying 80-bit registers starting at the one indexed by the 3-bit TOP field in the x87 status word. (http://www.ray.masmcode.com/tutorial/fpuchap1.htm describes this nicely with the analogy of a revolver barrel.)
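The stack-to-register mapping described above can be modeled in a few lines of C. This is only a sketch of the indexing arithmetic, not code that touches the real FPU; the function names `x87_top` and `x87_physical_index` are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 3-bit TOP field (bits 11..13) from an x87 status word. */
static unsigned x87_top(uint16_t status_word) {
    return (status_word >> 11) & 7u;
}

/* Index of the underlying physical register that st(i) names for a
 * given TOP value. The stack wraps around modulo 8, like the revolver
 * barrel in the linked tutorial. */
static unsigned x87_physical_index(unsigned top, unsigned i) {
    assert(top < 8 && i < 8);
    return (top + i) & 7u;
}
```

For example, with TOP = 6, st(0) is physical register 6 and st(3) wraps around to physical register 1. The MMX names mm0..7, by contrast, refer to fixed physical registers rather than stack-relative ones, which is why TOP matters when mixing the two views.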

I'm not sure if this is really usable, but I don't think emms clears mm / st register contents, only x87 tags. Intel's vol.2 entry for emms says "Sets the values of all the tags in the x87 FPU tag word to empty (all 1s)", and it also warns:

If a floating-point instruction loads one of the registers in the x87 FPU data register stack before the x87 FPU tag word has been reset by the EMMS instruction, an x87 floating-point register stack overflow can occur that will result in an x87 floating-point exception or incorrect result.

It only says "can", not "will". Perhaps with the x87 metadata in a known state, you can mix MMX instructions and x87 instructions with some kind of consistent behaviour? At least on a specific microarchitecture.

You can't read an x87 register whose tag is 11 (meaning empty), and there's no "funfree" instruction, just ffree to set a single register's tag to 11 (without affecting the top-of-stack pointer or the contents).

You'd have to fstenv and modify the tag word in the saved x87 state metadata (28 bytes including padding in 32/64-bit mode), then fldenv.

Or you could have a predefined x87 "environment" ready for fldenv with some tag words set to not-in-use. (But Agner Fog doesn't even time that instruction. It's obviously going to be microcoded, and probably slow.) You might be able to use that without emms, but it's still one microcoded instruction.
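The tag-word edit between fstenv and fldenv can be sketched in C as plain byte manipulation on the saved image. This assumes the 28-byte protected-mode layout, with the status word at byte offset 4 and the tag word at byte offset 8, and two tag bits per physical register starting at bit 0. The helper names and constants are hypothetical, and the sketch only edits a buffer; it does not execute fstenv or fldenv itself.

```c
#include <assert.h>
#include <stdint.h>

enum { X87_ENV_SIZE = 28, X87_ENV_FTW_OFFSET = 8 };   /* assumed layout */
enum { X87_TAG_VALID = 0, X87_TAG_ZERO = 1,
       X87_TAG_SPECIAL = 2, X87_TAG_EMPTY = 3 };

/* Return tag_word with the 2-bit tag of physical register `phys`
 * replaced by `tag`. Tags are indexed by physical register, not by
 * stack position. */
static uint16_t x87_set_tag(uint16_t tag_word, unsigned phys, unsigned tag) {
    assert(phys < 8 && tag < 4);
    unsigned shift = 2u * phys;
    return (uint16_t)((tag_word & ~(3u << shift)) | (tag << shift));
}

/* Patch the tag word inside a saved environment image in place.
 * Bytes are read and written explicitly in little-endian order, so
 * this does not depend on the host's endianness. */
static void x87_env_set_tag(unsigned char env[X87_ENV_SIZE],
                            unsigned phys, unsigned tag) {
    uint16_t ftw = (uint16_t)(env[X87_ENV_FTW_OFFSET] |
                              (env[X87_ENV_FTW_OFFSET + 1] << 8));
    ftw = x87_set_tag(ftw, phys, tag);
    env[X87_ENV_FTW_OFFSET]     = (unsigned char)(ftw & 0xFF);
    env[X87_ENV_FTW_OFFSET + 1] = (unsigned char)(ftw >> 8);
}
```

Starting from the all-empty tag word 0xFFFF that emms leaves behind, marking physical register 0 as valid gives 0xFFFC. Whether the register contents are then usable after fldenv is exactly the untested part of this idea.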

Related: "Query about legacy 3DNow! instruction set" has some links I dug up a while ago about femms and how little 3DNow! can interact with SSE. Hmm, apparently femms does still set the tag word to all unused, but doesn't preserve a known mapping between mm0..7 and x87 registers.

Peter Cordes

There are a number of instructions that you can use to load a constant onto the FPU register stack. I think the most useful ones here are FLD1 and FLDZ, which load +1.0 and +0.0 onto the stack, respectively. If the only possible integers that RAX may hold are 1 and 0, then you can conditionally execute FLD1 or FLDZ.

In general, any 64-bit integer in the unsigned or two's complement formats can be represented in the 80-bit extended-precision floating-point format. It's possible to load any integer value found in RAX onto the FPU stack with zero memory accesses. This can be achieved by using a series of x87 arithmetic instructions, possibly executed in a loop whose trip count is maintained in a GP register. One way would be to use FADD (or FSUB for negative integers) until st0 contains the desired integer value, although this can be very slow for large (absolute) values. One possible optimization that works for integers that are powers of 2 is to load the constant +1.0, add one using FADD to get the number 2, and then use FMUL until the desired power of 2 is reached. Another method that may be faster in some cases is to first factorize the integer in RAX, construct each of the prime factors using FADD, and then multiply the prime factors using FMUL.

All of these instructions are supported on all Intel and AMD x86 processors.
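As a rough model of the arithmetic, the sketch below builds an integer in a `long double` using only additions, starting from the constants 0.0 and 1.0. The first function is the repeated-FADD idea described above. The second is a double-and-add variant that I am substituting for the power-of-2 and factorization ideas, since it needs at most 64 doublings and 64 increments for any 64-bit value. Both function names are hypothetical. This is C, not x87 assembly: a compiler is free to spill to memory, and `long double` is only the 80-bit x87 format on some x86 targets, so treat it as an illustration of the operation count rather than of the register-only property.

```c
#include <assert.h>
#include <stdint.h>

/* Repeated addition: models FLDZ followed by n executions of FADD
 * with the constant from FLD1. Takes n additions, so it is only
 * practical for small n. */
static long double build_by_repeated_add(uint64_t n) {
    assert(n <= 1000000u);            /* guard against very long loops */
    long double acc = 0.0L;           /* models FLDZ */
    const long double one = 1.0L;     /* models FLD1 */
    for (uint64_t i = 0; i < n; i++)
        acc += one;
    return acc;
}

/* Double-and-add (a variant, not the answer's exact method): walk the
 * bits of n from most to least significant, doubling the accumulator
 * (acc + acc) and adding 1.0 where a bit is set. In assembly, the
 * bit test would be an integer instruction driving a branch or fcmov. */
static long double build_by_double_and_add(uint64_t n) {
    long double acc = 0.0L;
    const long double one = 1.0L;
    for (int bit = 63; bit >= 0; bit--) {
        acc += acc;
        if ((n >> bit) & 1u)
            acc += one;
    }
    return acc;
}
```

Both functions return the same value for any n that the first can handle; the second just gets there in a bounded number of steps.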

Hadi Brais