1

If we consider two registers ax and bx, how do we swap their contents in Intel IA-32 just by using push and pop? I'm not allowed to use xchg.

This is not question for a homework, I'm revising for an exam.

duplode
  • 33,731
  • 7
  • 79
  • 150
Sorin Cioban
  • 2,237
  • 6
  • 30
  • 36
  • 1
    I'm sorry I posted a poor question, but I didn't know I could pop the stack directly to a register. I knew how to work with the stack, in Java, but didn't know it worked the same way here. – Sorin Cioban Apr 04 '11 at 15:21
  • There's no reason you'd want to ever do this in real life. `xchg ax, bx` is always more efficient. – Peter Cordes Aug 23 '19 at 02:10
  • @PeterCordes is it more efficient though? I've recently heard that swapping 2 registers with push-mov-pop (using the stack as a temporary) and mov-mov-mov (all reg-to-reg moves, using a 3rd register as temporary) have the same performance, and that they are faster than xchg... Any thoughts? – Dada Apr 28 '22 at 09:30
  • @Dada: store/reload would never be better performance. Extra store-forwarding latency on modern CPUs. And extra memory accesses and larger code-size on ancient CPUs like 8088 where total memory accesses (including code-fetch) was the major bottleneck. As for `xchg` vs. 3x `mov`: [Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?](https://stackoverflow.com/q/45766444) explains the full details. For some modern CPUs (Intel especially), 3 `mov` instructions with 32 or 64-bit operand-size could be faster, allowing some zero-latency mov-elimination, but AX is 16-bit. – Peter Cordes Apr 28 '22 at 09:46
  • @Dada: So `mov ax, cx` can't benefit from mov-elimination, so the non-eliminatable 3 uops `xchg ax, bx` decodes to on Intel are equally fast in the back-end. On CPUs before Sandybridge (with its uop cache), decoding a 3-uop instruction could hurt front-end throughput some depending on where it falls, so possible minor win for `mov` on say Pentium III or Nehalem. On AMD since Bulldozer IIRC, `xchg eax, ebx` can decode to only 2 uops, both eliminated `mov`-like operations with zero latency, just copying physical registers without needing a temporary, so major win for `xchg` 32 or 64-bit. – Peter Cordes Apr 28 '22 at 09:49
  • @Dada: Are you maybe thinking of `xchg` with *memory*? That has an implicit `lock` prefix on 386 and later, making it an atomic RMW and full memory barrier, thus much slower. Yes, you should always use a temporary in that case, or to swap memory with memory do 2 loads / 2 stores, not load / xchg / store. But if someone told you that `push`/`mov`/`pop` was faster than `xchg`, I can't imagine a reasonable case where that's accurate. – Peter Cordes Apr 28 '22 at 09:51
  • @PeterCordes Thanks for your reply, and thanks for the link to the post about `xchg`; I'm reading it right now. I've heard that xchg with registers is slower than using mov for swapping, which is why V8 does not use `xchg` (see how [register swapping](https://source.chromium.org/chromium/chromium/src/+/main:v8/src/compiler/backend/x64/code-generator-x64.cc;l=5113-5117) is done on x64 for instance)... It's possible that this choice of V8 was done years ago and isn't valid for modern CPUs... – Dada Apr 28 '22 at 09:56
  • @Dada: It's a pretty reasonable choice; it costs a bit more code size but is otherwise not slower. And can be faster, especially if part of a critical path where mov-elimination is helpful. The lack of any possibility of [mov-elimination](https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770) for 16-bit partial registers is a big part of the difference between `xchg ax, bx` and `xchg eax, ebx` on a modern CPU (in any mode). – Peter Cordes Apr 28 '22 at 10:04
  • @Dada: Also for a compiler, if there are any later optimization passes, one of the 3 mov operations could get optimized into something else. But yeah, on the whole `xchg` is a good peephole optimization for swaps on modern CPUs with uop caches (since Sandybridge and Zen) to reduce the front-end impact. Except on Alder Lake E-cores. Turns out to be pretty slow there, like 3 cycle throughput probably because of decoding a multi-uop instruction. Fun fact: GCC `-Os` will use `xchg` as a peephole for swapping registers: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47949 – Peter Cordes Apr 28 '22 at 10:21
  • @PeterCordes Oh right, I didn't realize that mov-elimination wasn't possible for 16-bit registers, but given that mov zeroes the upper part of the registers, if makes sense that mov-elimination is not possible. Regarding "Extra store-forwarding latency on modern CPUs": wouldn't it be possible that the CPU (say Skylake or more recent) recognizes the "pop-mov-push" sequence and performs mov-elimination (assuming 64-bit registers are involved)? – Dada Apr 28 '22 at 10:24
  • @PeterCordes In the case of V8 though, the swap that I linked is part of the code generation phase, and no optimizations happen after. – Dada Apr 28 '22 at 10:27
  • @PeterCordes Re "But if someone told you that push/mov/pop was faster than xchg, I can't imagine a reasonable case where that's accurate.": I'm looking at a benchmark result (comparison xchg, mov-mov-mov and push-mov-pop for 64 bit registers) that reports (on Skylake), that xchg is 1.5 cycles, mov-mov-mov is 1 cycle and push-mov-pop is 1 cycle... Maybe I should redo this benchmark myself and ask a new question on SO if I'm able to reproduce that. – Dada Apr 28 '22 at 10:32
  • 1
    @Dada: writing an 8 or 16-bit register merges with the old contents, [unlike 32-bit](https://stackoverflow.com/questions/11177137/why-do-x86-64-instructions-on-32-bit-registers-zero-the-upper-part-of-the-full-6), that's why mov-elimination isn't possible. (And why it has an output dependency on the destination, except [on old Intel CPUs that rename partial registers](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers).) – Peter Cordes Apr 28 '22 at 10:32
  • @Dada: It's interesting that mov-elimination *is* possible for 32-bit `mov`, not just 64-bit, at least on Intel. IIRC, Bulldozer-family could only mov-eliminate XMM copies, not integer regs. (Not even 64-bit integer regs.) But Zen can mov-eliminate 32 and 64-bit `mov`. Intel can also eliminate `movzx eax, bl`, zero-extension of 8->32 (and thus implicitly to 64). – Peter Cordes Apr 28 '22 at 10:34
  • @Dada: Ok, so yeah, V8 makes the same code-gen choice as GCC `-O2` / `-O3`. It's only `-Os` that looks for `xchg` as a peephole. If not for the slowness on Silvermont-family and especially Alder Lake E-cores, it would be a reasonable choice for big-core modern CPUs, with advantages (code size) and disadvantages (all 3 uops need to be in the same uop-cache line, and latency). – Peter Cordes Apr 28 '22 at 10:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244286/discussion-between-peter-cordes-and-dada). – Peter Cordes Apr 28 '22 at 10:36

3 Answers3

3

You can either push once and use a mov instruction, or push twice. First one goes like:

push ax
mov ax, bx
pop bx

If you want to push twice, it is (as answered by others):

push ax
push bx
pop ax
pop bx
vhallac
  • 13,301
  • 3
  • 25
  • 36
1
push ax
push bx
pop ax
pop bx

?

ajuc
  • 590
  • 6
  • 12
1

Should be standard use of a stack. Push A, Push B, Pop to A, Pop to B.

This works for IA-32 because its pop doesn't just pop the stack, it also delivers the value it pops. This is not always the case. The Standard Template Library for C++ has a pop that just manipulates the stack and you need a different command to access the top of the stack

Tom Murphy
  • 305
  • 3
  • 12
  • Well you can't compare an assembly operation with a generic method in a library of some high-level language. – kennytm Apr 04 '11 at 15:16
  • 1
    I believe I did. I was referring to the variety of ways pop is represented out in the wild. It was a cautionary advice. – Tom Murphy Apr 04 '11 at 15:58