swapping 2 registers in 8086 assembly language(16 bits)

Question

Does someone know how to swap the values of 2 registers without using another variable, register, stack, or any other storage location? thanks!

Like swapping AX, BX.

[XOR swap](http://en.wikipedia.org/wiki/XOR_swap_algorithm) – Michael Oct 20 '14 at 15:23
There is an `XCHG` instruction... – Jester Oct 20 '14 at 15:24
http://felixcloutier.com/x86/XCHG.html – Peter Cordes Oct 01 '17 at 18:50

Peter Cordes · Answer 1 · 2022-03-14T18:07:02.277

8086 has an instruction for this:

xchg   ax, bx

If you really need to swap two regs, xchg ax, bx is in most cases the most efficient way on all x86 CPUs, modern and ancient, including 8086. (You could construct a case where multiple single-uop instructions might be more efficient because of some other weird front-end effect due to surrounding code. Or for 32-bit operand size, where zero-latency mov made a 3-mov sequence with a temporary register better on Intel CPUs.)

For code-size, xchg-with-ax only takes a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode¹. Exchanging any other pair of registers takes 2 bytes for the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
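To make the single-byte encoding concrete, here is a minimal C sketch that computes the opcode byte of the short xchg-with-ax form as 0x90 plus the register number. The enum and function names are made up for illustration; only the register numbering and the 0x90 base come from the x86 encoding.

```c
#include <stdint.h>

/* 16-bit register numbers as they appear in x86 instruction encodings. */
enum reg16 { REG_AX, REG_CX, REG_DX, REG_BX, REG_SP, REG_BP, REG_SI, REG_DI };

/* Opcode byte of the short form `xchg ax, r16`: 0x90 plus the register number.
   (Hypothetical helper name, not part of any assembler or library.) */
static uint8_t xchg_ax_opcode(enum reg16 r)
{
    return (uint8_t)(0x90 + r);
}
```

With this numbering, `xchg_ax_opcode(REG_BX)` gives 0x93 for xchg ax, bx, and `xchg_ax_opcode(REG_AX)` gives 0x90, the NOP byte.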

On an actual 8086 or especially 8088, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.

Footnote 1: (In 64-bit mode, xchg eax, eax would truncate RAX to 32 bits, so 0x90 is explicitly a nop instruction, not also a special case of xchg).

Swapping 8-bit halves of the same 16-bit register with a rotate

On 8086, xchg al, ah is good. On modern CPUs, that xchg is 2 or 3 uops, but rol ax, 8 is only 1 uop with 1 cycle latency (thanks to the barrel shifter). This is one of the exceptions to the rule that xchg is generally best.
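As a rough illustration of why a rotate by 8 swaps the two halves, here is a small C model of rol ax, 8 on a 16-bit value. The function name is an assumption for the example, and uint16_t stands in for the register.

```c
#include <stdint.h>

/* Models `rol ax, 8`: rotating a 16-bit value left by 8 bits moves the low
   byte (AL) into the high byte (AH) and vice versa. */
static uint16_t swap_halves(uint16_t ax)
{
    return (uint16_t)((ax << 8) | (ax >> 8));
}
```

For example, `swap_halves(0x12AB)` yields `0xAB12`. Applying it twice returns the original value.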

For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination where xchg can't on current Intel CPUs. xchg is 3 uops on Intel, all of them having 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more microarchitectural details about how current CPUs implement it.

On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2 uop instruction, but with 1c latency each way.

xor-swaps or add/sub swaps or any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2- and 3-cycle latencies for the two results, and larger code-size. The only thing that's worth considering is mov instructions.

Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.

(If you're writing in C, modern compilers can save you from yourself, untangling xor swaps so they can potentially optimize through them, or at least implement them with xchg (at -Os) or mov instructions. See: Why is the XOR swap optimized into a normal swap using the MOV instruction?)

Swapping a register with memory

Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all but code-size does (e.g. in a bootloader), or you need it to be atomic and/or a full memory barrier, because it's both.

(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you do lock xchg, so you can use it efficiently. But modern CPUs even in 16-bit mode do treat xchg mem, reg the same as lock xchg)

So normally the most efficient thing to do is use another register:

     ; emulate  xchg [mem], cx  efficiently for modern x86
   movzx  eax, word [mem]
   mov    [mem], cx
   mov    cx, ax

If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem], or first spilling the register to a 2nd scratch memory location before loading+storing the memory operand.)

The lowest latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or only needs to be reloaded (not saved in the first place, because the value's already in memory or can be recalculated from other registers with an ALU instruction).

; spill/reload another register
push  edx            ; save/restore on the stack or anywhere else

movzx edx, word [mem]    ; or just mov dx, [mem]
mov   [mem], ax
mov   eax, edx

pop   edx            ; or better, just clobber a scratch reg

Two other reasonable (but much worse) options for swapping memory with a register are:

not touching any other registers (except SP):

  ; using scratch space on the stack
  push word [mem]      ; [mem] can be any addressing mode, e.g. [bx]
  mov  [mem], ax
  pop  ax              ; dep chain = load, store, reload.

or not touching anything else:

  ; using no extra space anywhere
  xor  ax, [mem]
  xor  [mem], ax        ; read-modify-write has store-forwarding + ALU latency
  xor  ax, [mem]        ; dep chain = load+xor, (parallel load)+xor+store, reload+xor

Using two memory-destination xor and one memory source would be worse throughput (more stores, and a longer dependency chain).

The push/pop version only works for operand-sizes that can be pushed/popped, but xor-swap works for any operand-size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code-size and speed.
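To see why the three-xor sequence with a memory operand ends up exchanging the values, here is a hedged C model that traces each step. `ax` and `*mem` are stand-ins for the register and the memory word, and the function name is made up for this sketch. It only models the arithmetic, not latency or store-forwarding.

```c
#include <stdint.h>

/* Models the xor-swap of a register value with a memory word.
   Let a = initial register value, m = initial memory value. */
static uint16_t xor_swap_with_mem(uint16_t ax, uint16_t *mem)
{
    ax   ^= *mem;   /* xor ax, [mem] : ax  = a ^ m           */
    *mem ^= ax;     /* xor [mem], ax : mem = m ^ (a ^ m) = a */
    ax   ^= *mem;   /* xor ax, [mem] : ax  = (a ^ m) ^ a = m */
    return ax;      /* new register value (the old memory word) */
}
```

Calling it with `ax = 0x1234` and a memory word holding `0xBEEF` returns `0xBEEF` and leaves `0x1234` in memory.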

score 0 · Accepted Answer · edited Jan 30 '15 at 01:24

You can do it using some mathematical operation. I can give you an idea. Hope it helps!

I have followed this C code:

int i=10, j=20;
i=i+j;
j=i-j;
i=i-j;

mov ax,10
mov bx,20
add ax,bx  
; mov command to copy data from accumulator to ax (I forgot the statement); now ax=30
sub bx,ax ; accumulator will be 10
; mov command to copy data from accumulator to bx (I forgot the statement)
sub ax,bx ; accumulator will be 20
; mov command to copy data from accumulator to ax (I forgot the statement)
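One caveat when porting the C steps: x86's two-operand sub bx, ax computes bx - ax, so the step j = i - j does not map onto a single sub. A possible fix is to add a neg. The C model below is only a sketch of that idea: the `regs` struct and function name are illustrative assumptions, and uint16_t is used so that overflow wraps around the way a 16-bit register does.

```c
#include <stdint.h>

/* Hypothetical model of two 16-bit registers. */
struct regs { uint16_t ax, bx; };

/* Add/sub swap written as the register operations would run.
   Let a = initial ax, b = initial bx; all arithmetic wraps mod 2^16. */
static void addsub_swap(struct regs *r)
{
    r->ax = (uint16_t)(r->ax + r->bx);  /* add ax, bx : ax = a + b            */
    r->bx = (uint16_t)(r->bx - r->ax);  /* sub bx, ax : bx = b - (a + b) = -a */
    r->bx = (uint16_t)(0u - r->bx);     /* neg bx     : bx = a                */
    r->ax = (uint16_t)(r->ax - r->bx);  /* sub ax, bx : ax = (a + b) - a = b  */
}
```

Starting from ax = 10, bx = 20, this leaves ax = 20, bx = 10. It takes four dependent operations, which is another reason to prefer xchg.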

edited Jan 30 '15 at 01:24

Neeku


answered Oct 20 '14 at 15:42

ZAZ


The assembler code by far doesn't represent the C-code! Moreover, why would you want to **copy data from accumulator to ax** when AX is the accumulator?? – Sep Roland Jan 29 '15 at 18:17

Why suggest something so complex when you can just use xchg? – prl Sep 24 '17 at 00:39

Having this as the accepted answer despite https://stackoverflow.com/a/47021804/552683 below is quite misleading! – Davor Cubranic Sep 09 '20 at 18:22
@DavorCubranic: To be fair, this inefficient answer had been accepted for 3 years before I wrote the answer below. But the OP is still active on SO and could change their accept vote at any time. – Peter Cordes Oct 16 '20 at 11:07
