Does someone know how to swap the values of 2 registers without using another variable, register, stack, or any other storage location? thanks!
Like swapping AX, BX.
8086 has an instruction for this:
xchg ax, bx
If you really need to swap two regs, xchg ax, bx is in most cases the most efficient way on all x86 CPUs, modern and ancient, including 8086. (You could construct a case where multiple single-uop instructions might be more efficient because of some other weird front-end effect due to surrounding code. Or for 32-bit operand size, where zero-latency mov made a 3-mov sequence with a temporary register better on Intel CPUs.)
For code size, xchg-with-ax takes only a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode (see footnote 1). Exchanging any other pair of registers takes 2 bytes for the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
On an actual 8086 or especially 8088, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.
Footnote 1: In 64-bit mode, xchg eax, eax would truncate RAX to 32 bits, so 0x90 is explicitly a nop instruction, not also a special case of xchg.
To swap the two bytes of a 16-bit register: on 8086, xchg al, ah is good. On modern CPUs, that xchg is 2 or 3 uops, but rol ax, 8 is only 1 uop with 1 cycle latency (thanks to the barrel shifter). This is one of the exceptions to the rule that xchg is generally best.
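To see why a rotate by 8 does the same job as xchg al, ah, here is a minimal C sketch of the same operation. The function name swap_bytes16 is made up for illustration, and whether a compiler turns this into a single rol depends on the compiler and flags.

```c
#include <stdint.h>

/* Rotate a 16-bit value left by 8 bits: the C counterpart of `rol ax, 8`.
   The low byte moves to the high half and the high byte to the low half,
   which is the same result as `xchg al, ah`. */
static inline uint16_t swap_bytes16(uint16_t x)
{
    /* x is promoted to int, so the shift cannot overflow;
       the cast drops the bits above bit 15. */
    return (uint16_t)((x << 8) | (x >> 8));
}
```

For example, swap_bytes16(0x1234) gives 0x3412, and applying it twice returns the original value.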
For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination where xchg can't on current Intel CPUs. xchg is 3 uops on Intel, all of them having 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" for more microarchitectural details about how current CPUs implement it.
On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2 uop instruction, but with 1c latency each way.
xor-swaps or add/sub swaps or any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2 and 3 cycle latency, and larger code-size. The only thing that's worth considering is mov instructions.
Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.
(If you're writing in C, modern compilers can save you from yourself, untangling xor swaps so they can potentially optimize through them, or at least implement them with xchg (at -Os) or mov instructions. See "Why is the XOR swap optimized into a normal swap using the MOV instruction?")
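For reference, this is roughly what the two C forms look like side by side. It is only a sketch: the names swap_tmp and swap_xor are made up here, and the code a compiler emits for either one depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Plain swap through a temporary: the form compilers handle best. */
static void swap_tmp(uint32_t *a, uint32_t *b)
{
    uint32_t t = *a;
    *a = *b;
    *b = t;
}

/* XOR swap: no temporary in the source, but three dependent operations.
   Without the guard it would zero the value when a and b point to the
   same object, because x ^ x == 0. */
static void swap_xor(uint32_t *a, uint32_t *b)
{
    if (a == b)
        return;
    *a ^= *b;   /* a = a0 ^ b0 */
    *b ^= *a;   /* b = b0 ^ a0 ^ b0 = a0 */
    *a ^= *b;   /* a = a0 ^ b0 ^ a0 = b0 */
}
```

Both functions leave the two values exchanged. The temporary version has no aliasing special case, which is one more reason to prefer it.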
Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all, but code-size does. (e.g. in a bootloader). Or if you need it to be atomic and/or a full memory barrier, because it's both.
(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you do lock xchg, so you can use it efficiently. But modern CPUs, even in 16-bit mode, do treat xchg mem, reg the same as lock xchg.)
So normally the most efficient thing to do is use another register:
; emulate xchg [mem], cx efficiently for modern x86
movzx eax, word [mem]
mov [mem], cx
mov cx, ax
If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem]), or first spilling the register to a 2nd scratch memory location before loading+storing the memory operand.
The lowest latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or only needs to be reloaded (not saved in the first place, because the value's already in memory or can be recalculated from other registers with an ALU instruction).
; spill/reload another register
push edx ; save/restore on the stack or anywhere else
movzx edx, word [mem] ; or just mov dx, [mem]
mov [mem], ax
mov eax, edx
pop edx ; or better, just clobber a scratch reg
Two other reasonable (but much worse) options for swapping memory with a register are:
not touching any other registers (except SP):
; using scratch space on the stack
push [mem] ; [mem] can be any addressing mode, e.g. [bx]
mov [mem], ax
pop ax ; dep chain = load, store, reload.
or not touching anything else:
; using no extra space anywhere
xor ax, [mem]
xor [mem], ax ; read-modify-write has store-forwarding + ALU latency
xor ax, [mem] ; dep chain = load+xor, (parallel load)+xor+store, reload+xor
Using two memory-destination xor and one memory source would be worse throughput (more stores, and a longer dependency chain).
The push/pop version only works for operand-sizes that can be pushed/popped, but xor-swap works for any operand-size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code-size and speed.
You can do it using some mathematical operation. I can give you an idea. Hope it helps!
I followed this C code:
int i=10, j=20;
i=i+j;
j=i-j;
i=i-j;
mov ax,10
mov bx,20
add ax,bx ; ax = 10 + 20 = 30 (the result is written to ax directly, no extra mov needed)
sub bx,ax ; bx = 20 - 30 = -10 (sub computes destination minus source)
neg bx ; bx = 10, the original value of ax
sub ax,bx ; ax = 30 - 10 = 20, the original value of bx
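One caveat about the C version of this idea: with int, i+j can overflow, which is undefined behaviour in C. A hedged sketch with unsigned 16-bit values (the same width as AX and BX) avoids that, because the arithmetic is reduced modulo 2^16 when stored back. The name swap_addsub is made up for illustration.

```c
#include <stdint.h>

/* Add/sub swap on unsigned 16-bit values.
   The casts truncate each intermediate result to 16 bits, so the
   arithmetic wraps modulo 65536 just like 16-bit registers do.
   The guard is needed because the trick zeroes the value when
   a and b point to the same object. */
static void swap_addsub(uint16_t *a, uint16_t *b)
{
    if (a == b)
        return;
    *a = (uint16_t)(*a + *b);   /* a = a0 + b0 */
    *b = (uint16_t)(*a - *b);   /* b = (a0 + b0) - b0 = a0 */
    *a = (uint16_t)(*a - *b);   /* a = (a0 + b0) - a0 = b0 */
}
```

This still works when the sum wraps around, e.g. for 0xFFFF and 2. It remains slower and longer than a plain swap through a temporary.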