How did the legacy 3DNow! instruction set store results to memory or integer registers?

Question

Just for fun I'm reviewing legacy (deprecated) instructions from the 3DNow! set introduced by AMD, and I'm trying to understand how they were used. All instructions seem to be encoded following this pattern:

instruction destination_MMn_register_operand, source_MMn_register_or_memory_operand

where destinationRegister = destinationRegister -operation- source

Like, for instance, pfadd mm0, mmword ptr [rcx] (0F 0F 01 9E):

This would add the 2 packed floats in memory pointed to by rcx to the 2 packed floats stored in mm0 and keep the result in mm0.

So it seems like those 3DNow instructions always have an mm register as a destination.

But how were you supposed to get the results out of those mm registers?

In other words, there are no mov mmword ptr [rcx], mm0, or mov rax, mm0 instructions.

score 3 · Answer 1 · answered Aug 06 '18 at 20:58

Actually there are, namely movd and movq. These instructions are not part of 3DNow!, they were already present in MMX which 3DNow! is an extension to. That is also why 3DNow! includes a very incomplete-seeming set of integer operations.

harold


Peter Cordes · Accepted Answer · 2022-06-19T21:46:24.420

As @harold says, storing to memory or extracting to an integer register is already covered by MMX movq (both) or movd (low), or punpckhdq+movd to extract just the high float. (Or with MMXEXT introduced with SSE1, pshufw to copy-and-shuffle into another register, not destroying the original.) Similarly for loading.

 PF2ID  mm0, [esi]     ; 3DNow! load 2 floats and convert to 32-bit integer
; basic MMX instructions to use the result
; could do the same thing with 32-bit FP bit patterns
 movq  [edi], mm0      ; store both
 movd  eax, mm0        ; extract low half
 punpckhdq  mm0, mm0   ; broadcast high half
 movd  edx, mm0        ; extract high half
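For readers without a 3DNow!-capable CPU to try this on, the data movement in that sequence can be modelled in portable C. This is only a sketch under stated assumptions: `mm_reg` and the `model_*` / `float_bits` names are hypothetical helpers invented for illustration, with a `uint64_t` standing in for one mm register, and nothing here executes real MMX or 3DNow! instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Hypothetical model: a uint64_t stands in for one 64-bit mm register. */
typedef uint64_t mm_reg;

/* Bit pattern of a float, without any numeric conversion. */
static inline uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Models movq mm, [mem] on two adjacent floats: src[0] lands in the low half. */
static inline mm_reg model_load_2floats(const float src[2]) {
    return ((uint64_t)float_bits(src[1]) << 32) | float_bits(src[0]);
}

/* Models movd r32, mm: copies the low 32 bits to an integer. */
static inline uint32_t model_movd(mm_reg mm) {
    return (uint32_t)mm;
}

/* Models punpckhdq mm, mm with the same register as both operands:
   the high dword ends up in both halves. */
static inline mm_reg model_punpckhdq_self(mm_reg mm) {
    uint64_t hi = mm >> 32;
    return (hi << 32) | hi;
}
```

With `{1.5f, -2.0f}` loaded, `model_movd` on the register yields the bit pattern `0x3FC00000`, and `model_movd` after `model_punpckhdq_self` yields `0xC0000000`, matching what `eax` and `edx` would hold if the asm above operated on FP bit patterns.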

I used 32-bit addressing modes so this code can work in 32-bit mode, for compatibility with CPUs before K8. In 64-bit mode you have SSE2, which makes 3DNow! mostly pointless, except for working with exactly 2 floats at a time on CPUs like K8 where 128-bit SIMD instructions like addps run as 2 uops, or if you have existing code developed for 3DNow! that you haven't ported to SSE2 yet. 64-bit mode does have movq rax, mm0, just like movq rax, xmm0.

The one thing you can't do is turn a 3DNow! float into an x87 80-bit float without a store/reload.

What might have been potentially useful is a version of EMMS that expands a 32-bit float into an 80-bit x87 long double in st0, along with setting the FPU back into x87 mode instead of MMX mode¹. Or maybe even do that for multiple mm registers into multiple x87 registers?

i.e. it would be a shortcut for movd dword [esp], mm0 / emms / fld dword [esp] to set up for further scalar FP after a SIMD reduction.

Remember that these are IEEE754 floats; you normally don't want them in integer registers unless you're picking apart their bit-fields (e.g. for an exp or log implementation), but you can do that with MMX shift/mask instructions.
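As a rough illustration of what "picking apart their bit-fields" means, here is a scalar C sketch of the shift/mask steps for one binary32 value. The struct and function names are hypothetical; an MMX version would apply the same shifts and masks to both 32-bit lanes at once with instructions like psrld and pand.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Hypothetical container for the three IEEE 754 binary32 fields. */
typedef struct {
    uint32_t sign;            /* 1 bit */
    uint32_t biased_exponent; /* 8 bits, bias 127 */
    uint32_t fraction;        /* 23 stored bits of the significand */
} f32_fields;

static inline f32_fields split_f32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret, no conversion */

    f32_fields f;
    f.sign            = bits >> 31;            /* like a logical right shift by 31 */
    f.biased_exponent = (bits >> 23) & 0xFFu;  /* shift, then mask 8 bits */
    f.fraction        = bits & 0x7FFFFFu;      /* mask the low 23 bits */
    return f;
}
```

For example, `split_f32(-6.5f)` gives sign 1, biased exponent 129 and fraction `0x500000`, since 6.5 is 1.625 × 2^2.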

PF2ID or PF2IW to convert to 32-bit or 16-bit integer of course give you integer data in MMX registers, at which point you're in normal MMX territory.

But movd and fld are cheap, so they didn't bother making a special instruction just to save the reload latency. Also, it might have been slow to implement as a single instruction. Even though x86 is not a RISC ISA, having one really complex instruction is often slower than multiple simpler instructions, especially before decoding to multiple uops was fully a thing. Look at in-order P5 Pentium for an example of how using a RISCy subset of x86 was more efficient there, allowing it to pipeline and pair better if you avoid instructions like push/pop. (That's all changed; push/pop and memory-destination ALU instructions are fine if you need the load/store anyway, and don't have a use for the value in a register.)

3dNow!'s femms leaves the MMX/3dNow! register contents undefined, only setting the tag words to unused instead of preserving the mapping from MMX registers to/from x87 register contents. See http://refspecs.linuxbase.org/AMD-3Dnow.pdf for an official AMD manual. IDK if AMD's microarchitectures just dropped the register-renaming info or what, but probably making store / femms / x87-load the fast way saves a lot of transistors.

Or perhaps even FEMMS is still somewhat slow, so AMD didn't want to encourage coders to leave and re-enter MMX/3dNow! mode often.

Fun fact: 3dNow! PREFETCHW (prefetch with write intent) is still used, and has its own CPUID feature bit.

See my answer on What is the effect of second argument in _builtin_prefetch()?

Intel CPUs soon added support for decoding it as a NOP (so software like 64-bit Windows can use it without checking), but Broadwell and later actually prefetch with an RFO to get the cache line in MESI Exclusive state, rather than Shared, so it can flip to Modified without additional off-core traffic.

The CPUID feature bit indicates that it really will prefetch.
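From C, the usual way to ask for a write-intent prefetch is the GNU `__builtin_prefetch(addr, rw, locality)` builtin with `rw = 1`. The sketch below assumes GCC or Clang; `increment_all` and the 16-element prefetch distance are arbitrary choices made for illustration, not tuned values. Whether the compiler actually emits prefetchw depends on the target options (for example enabling the PRFCHW feature); otherwise it may fall back to an ordinary prefetch, and either way the hint never changes program results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add 1 to every element, hinting that lines a bit
   further ahead will be written soon. */
static inline void increment_all(int *a, size_t n) {
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n) {
            /* rw = 1: prefetch with write intent; locality = 3: keep in cache */
            __builtin_prefetch(&a[i + 16], 1, 3);
        }
        a[i] += 1;
    }
}
```

A simple sequential loop like this is usually handled well by hardware prefetchers anyway, so treat it purely as a demonstration of the syntax.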

Footnote 1:

Remember that the MMX registers alias the x87 registers, so no new OS support was needed to save/restore architectural state on context switches. It wasn't until SSE that we got new architectural state. So it wasn't until SSE2+3dNow! that a 3dNow! float to SSE2 double could make sense without switching back to x87 mode. And you could movq2dq xmm0, mm0 + cvtps2pd xmm0, xmm0.

They could have had a float->double in a mm register, but the fld / fst hardware was only designed for float or double->80-bit and 80-bit->float or double. And the use-case for that is limited; if you're using 3dNow!, just stick to float.

Thanks for the info. Very interesting. Btw, I noted after perusing AMD documentation (linked in my answer) that they refer to single-precision floating point numbers that they use for 3Dnow instructions as having a 24-bit significand. But from what I understand the Intel's traditional 32-bit floats use a 23-bit mantissa. Are 3Dnow's packed floats using a different floating-point format than Intel? — MikeF, Aug 07 '18 at 01:18
@MikeF: I don't think so. Almost certainly just a terminology issue; AMD is counting the implicit bit in the significand. Wikipedia has a nice article (https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_single-precision_binary_floating-point_format:_binary32) which describes IEEE binary32 as 24 bit precision, 23 stored. (For subnormal numbers, the first bit of the significand is 0, instead of the usual 1. So an all-zero or not exponent implies the leading bit.) BTW, "significand" is the preferred terminology, but mantissa is more widely used. It's the same thing. — Peter Cordes, Aug 07 '18 at 01:54
Oh, OK. I was just curious. Who knows, AMD could've invented their own floating point format. It doesn't matter at this point though, as 3DNow is almost 99% dead anyway. I think the only thing that still remains from it is the `prefetchw` instruction. [Windows uses it](https://i.imgur.com/kAXzPqO.png) for pretty much every kernel-mode call to preload some kernel structure there. As for `FEMMS` instruction that you pointed out, I think it's AMD only. It was #UD'ing on all of my Intel systems. — MikeF, Aug 07 '18 at 02:03
@MikeF: Yes, `femms` is part of 3dNow!, and wasn't adopted as an MMX/SSE extension. I mentioned it because CPU vendors design their instruction-set extensions to be convenient for their own current microarchitectures. (Another example of that: Intel's SSE `cvtsi2ss xmm0, eax` leaves the upper bytes of XMM0 unmodified, probably so it can be single uop on Pentium III, which splits 128-bit vector ops into 2. But that short-sighted false dependency has led to gcc choosing to `pxor xmm0,xmm0` first, to avoid the risk of creating a loop-carried dep chain or coupling to a slow dep chain.) — Peter Cordes, Aug 07 '18 at 02:03
Oh, Peter, also meant to bring up. Does Intel support `prefetch` instruction (the one without fetching for "write")? It seems like it's AMD-only, a legacy from the 3Dnow set, but Intel documentation is surprisingly silent about it. — MikeF, Aug 07 '18 at 02:06
@MikeF: regular `prefetch` has been supported on Intel for a long time. http://felixcloutier.com/x86/PREFETCHh.html doesn't even list an ISA extension, so it may even predate MMX? The NASM manual's appendix lists when insns were new, even back to 186. https://www.nasm.us/doc/nasmdocb.html says `PREFETCH` was new in Pentium, while `PREFETCHT0/1/2` / `PREFETCHNTA` were new in Katmai (first-gen PIII, so I guess with SSE). IDK what they mean about plain `prefetch`; maybe there was an earlier version of the opcode that ignored the `/r` field in ModRM and just prefetched. — Peter Cordes, Aug 07 '18 at 02:11
Hah, very interesting. Thanks for sharing. I was also curious about that strange behavior of the `cvtsi2ss` instruction. Although IMO, it's one of those CISC instructions that you'd be hard pressed to find in your average code. As for the prefetch, then no, I was referring to an even more ancient instruction. The one with the `0F 0D modR/M` encoding, the one that doesn't even specify which cache level to use: L1, L2. — MikeF, Aug 07 '18 at 02:22
@MikeF: Compilers use `cvtsi2sd` all the time in code that uses FP and integer values together (`double` is more common than `float`). e.g. https://godbolt.org/g/M52CXv. SSE2 kept the unnecessary-dependency behaviour for that instruction too, so gcc emits two dep-breaking `pxor` instructions. (Clang is more optimistic). Fun trick: with AVX you can use the same zeroed register as a no-dependency source, like `vcvtsi2sd xmm0, xmm7, eax`, not destroying the zeroed register. Clang uses this sometimes. Lots of code doesn't get auto-vectorized but is still perf-relevant. — Peter Cordes, Aug 07 '18 at 02:50
@MikeF: Looks like my first guess was insufficient for `0F 0D` prefetch, because clearly it's not the same encoding as the SSE prefetches. IDK why the NASM manual flags that as `PENT,3DNOW`. (Note that `0F 0D /1` is prefetchw, so the /r field must matter for the non-write prefetch on CPUs that actually implement it.) — Peter Cordes, Aug 07 '18 at 02:53
pshufw won't work with just MMX as it was an instruction added with SSE1. Your first sentence seems to suggest pshufw works with MMX. — Michael Petch, Jun 19 '22 at 20:16
@MichaelPetch: It seems I wrote this before I knew that some instructions on MMX registers were new with SSE1, with Katmai P3, not Pentium MMX. Thanks. — Peter Cordes, Jun 19 '22 at 20:23
Yeah, I figured as much from the date of the answer. No problem. — Michael Petch, Jun 19 '22 at 20:28
@MichaelPetch: Ok, finished updating, that wasn't the only thing worth revising in the answer. And BTW, Intel's manuals aren't very explicit about CPU features required, e.g. https://www.felixcloutier.com/x86/pshufw doesn't mention anything about SSE1, unfortunately. But ECM's instruction-set reference, derived from NASM's appendix, is accurate: (https://pushbx.org/ecm/doc/insref.htm) — Peter Cordes, Jun 19 '22 at 20:47
