6

In x86 assembly language, is there any way to obtain the upper half of the EAX register? I know that the AX register already contains the lower half of the EAX register, but I don't yet know of any way to obtain the upper half.

I know that `mov bx, ax` would move the lower half of eax into bx, but I want to know how to move the upper half of eax into bx as well.

nrz
Anderson Green
  • 4
    Just shift it down by 16 bits. – Mysticial Mar 05 '13 at 17:28
  • Either do `shr eax, 16` followed by the move (which destroys `eax`), or do `mov ebx, eax` and `shr ebx, 16` (which zeros the upper half of `ebx`). I'm not sure whether operations on `bx` automatically zero the upper half of `ebx` anyway. If they do, you might as well go with the latter method. – Mysticial Mar 05 '13 at 17:32
  • 3
    `ror eax,16` `mov bx,ax` `ror eax,16` if you want to leave eax/the upper part of ebx untouched – user786653 Mar 05 '13 at 17:44
  • 2
    @Mysticial Doing operations on `bx` does not automatically zero upper half of `ebx`. However, in x86-64 doing operations with `ebx` (but not `bx`, `bl` or `bh`) as dest zeroes the top 32 bits of `rbx`. – nrz Mar 05 '13 at 20:23
  • @nrz Ah. That's good to know. I'm aware of the zeroing behavior on x64. I just wasn't sure if that behavior already existed during the earlier days. – Mysticial Mar 05 '13 at 20:25
  • 3
    `bswap` or some rotate, for example. Also, avoid it if possible. – Jester Jan 24 '18 at 22:59
  • 1
    If you have BMI2, `rorx edx, eax, 16` will copy+rotate efficiently. – Peter Cordes Jan 24 '18 at 23:43
  • 1
    As @PeterCordes (I think) pointed out elsewhere if you don't have BMI2 you can also use `shld ecx, eax, 16` to copy & get the top 16-bits into the lower 16-bits. It's efficient on Intel (1 cycle tput, 3 cycles latency) but sucks on Ryzen (6 !! mops). – BeeOnRope Jan 25 '18 at 23:20

6 Answers

13

If you want to preserve EAX and the upper half of EBX:

rol eax, 16
mov bx, ax
rol eax, 16

If you have a scratch register available, this is more efficient (and doesn't introduce extra latency for later instructions that read EAX):

mov ecx, eax
shr ecx, 16
mov  bx, cx

If you don't need either of those, `mov ebx, eax` / `shr ebx, 16` is the obvious way and avoids any partial-register stalls or false dependencies.
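As a rough cross-check of the sequences above, here is a minimal C sketch that models the registers as `uint32_t` values. The helper names (`rol32`, `upper_half_shift`, `upper_half_rotate`) are made up for illustration, and the C only mirrors the arithmetic, not the timing or partial-register behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models `rol r32, n` for 0 < n < 32. */
uint32_t rol32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);   /* keeps both shift counts well defined in C */
    return (v << n) | (v >> (32 - n));
}

/* Models `mov ebx, eax` / `shr ebx, 16`: returns the new EBX. */
uint32_t upper_half_shift(uint32_t eax)
{
    return eax >> 16;
}

/* Models `rol eax, 16` / `mov bx, ax` / `rol eax, 16`.
   Returns the new EBX; *eax ends up with its original value. */
uint32_t upper_half_rotate(uint32_t *eax, uint32_t ebx)
{
    *eax = rol32(*eax, 16);                        /* upper half is now in AX */
    ebx = (ebx & 0xFFFF0000u) | (*eax & 0xFFFFu);  /* write BX only           */
    *eax = rol32(*eax, 16);                        /* rotate back             */
    return ebx;
}
```

With EAX = 0x141F2D72 and EBX = 0xAABBCCDD, the rotate version leaves EAX unchanged and yields EBX = 0xAABB141F, while the shift version yields 0x0000141F.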

Peter Cordes
Alexey Frunze
8

If you don't mind shifting the original value of bx (the low 16 bits of ebx) into the high 16 bits of ebx, you need only 1 instruction:

shld ebx,eax,16

This does not modify eax.
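To make the effect on both halves of ebx concrete, here is a small C sketch of what `shld` with an immediate count computes. The function name `shld32` is hypothetical, and the model ignores flags and counts outside 1..31.

```c
#include <assert.h>
#include <stdint.h>

/* Models `shld dst, src, count` on 32-bit registers for 0 < count < 32:
   dst is shifted left and the vacated low bits are filled from the
   top bits of src. Returns the new dst; src is not modified. */
uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
    assert(count > 0 && count < 32);
    return (dst << count) | (src >> (32 - count));
}
```

With EBX = 0xAABBCCDD and EAX = 0x141F2D72, `shld32(ebx, eax, 16)` gives 0xCCDD141F: BX holds the upper half of EAX, and the old BX has moved into the upper half of EBX.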

nrz
  • Or with BMI2, [`rorx ebx, eax, 16`](http://felixcloutier.com/x86/RORX.html) to set `ebx = word_swap(eax)`. Unlike `shld`, this has no dependency on the old value of EBX, and is a single uop with 1c latency on all CPUs that support it. (http://agner.org/optimize/) – Peter Cordes May 15 '18 at 17:35
4

I would do it like this:

mov ebx,eax
shr ebx, 16

ebx now contains the top 16 bits of eax, zero-extended.

Jason
2

IMO the best approach is to shift right by 8 bits at a time with shr and read AL after each shift. Intel's optimization manual discourages using the AH register:

3.5.1.12 Zero-Latency MOV Instructions

In processors based on Intel microarchitecture code name Ivy Bridge, a subset of register-to-register move operations are executed in the front end (similar to zero-idioms, see Section 3.5.1.7). This conserves scheduling/execution resources in the out-of-order engine. Most forms of register-to-register MOVZX are hence zero-latency for reg32, reg8 (if not AH/BH/CH/DH).

movzx esi, al ; esi = eax & 0xff
shr eax, 8    ; eax >>= 8;
movzx ecx, al
shr eax, 8
movzx ebx, al
shr eax, 8

You will have the highest byte in eax, the next one in ebx, the one after that in ecx, and the lowest byte (the one that was originally in al) in esi. Also, this is NASM syntax; I am not familiar with MASM, so you may need some tweaks.
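The same unpacking can be sketched in C to check which byte lands in which register. The struct and function names here are hypothetical, and the sketch only models the values, not the zero-latency behavior the manual describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct naming the four destination registers. */
struct unpacked {
    uint32_t esi, ecx, ebx, eax;
};

struct unpacked unpack_bytes(uint32_t eax)
{
    struct unpacked r;
    r.esi = eax & 0xFFu;   /* movzx esi, al */
    eax >>= 8;             /* shr eax, 8    */
    r.ecx = eax & 0xFFu;   /* movzx ecx, al */
    eax >>= 8;             /* shr eax, 8    */
    r.ebx = eax & 0xFFu;   /* movzx ebx, al */
    eax >>= 8;             /* shr eax, 8    */
    assert(eax <= 0xFFu);  /* only the top byte is left */
    r.eax = eax;
    return r;
}
```

For an input of 0x141F2D72 this gives esi = 0x72, ecx = 0x2D, ebx = 0x1F and eax = 0x14.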

Antonin GAVREL
  • 4
    [Reading AH adds 1 cycle of latency on current Intel CPUs (Haswell/Skylake), but has no throughput penalty](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to). `movzx esi, al` / `movzx edi, ah` / `shr eax,16` / repeat is often good. Last I checked, gcc and clang read AH for unpacking the low 2 bytes, but don't use `shr eax,16` to get the next two. – Peter Cordes Jan 24 '18 at 23:41
2

Without knowing the exact purpose, it is hard to say which method is best, but as the other answers and comments show, there are a few different ways to skin this cat. I'm just going to share another method I've used quite often.

        push    ebp
        mov     ebp, esp
        mov     eax, 141f2d72H
        push    eax

Now the contents of memory pointed to by EBP-4 (or ESP) are:

72 2D 1F 14

Now there are plenty of combinations you can do to address the data as a byte or word.

        mov     al, [ebp-1]           ; AL = 14H
        mov     ax, [ebp-2]           ; AX = 141FH

I'm not claiming this is better than the other examples, just that it is a method I've found to work effectively for some of the stuff I do.
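The byte layout above follows from x86 being little-endian. As a minimal sketch, the store and the 16-bit reload can be modeled in C with an explicit byte array, so it behaves the same on any host. The names `store_le32` and `load_le16` are hypothetical, and `slot[0]` stands in for `[ebp-4]`.

```c
#include <assert.h>
#include <stdint.h>

/* Models `push eax`: stores a 32-bit value least significant byte first,
   as x86 does. slot[0] corresponds to [ebp-4], slot[3] to [ebp-1]. */
void store_le32(uint8_t slot[4], uint32_t value)
{
    for (int i = 0; i < 4; i++)
        slot[i] = (uint8_t)(value >> (8 * i));
}

/* Models a 16-bit little-endian load; offset 2 corresponds to
   `mov ax, [ebp-2]`. */
uint16_t load_le16(const uint8_t slot[4], unsigned offset)
{
    assert(offset <= 2);   /* both bytes must lie inside the slot */
    return (uint16_t)(slot[offset] | (slot[offset + 1] << 8));
}
```

Storing 0x141F2D72 produces the bytes 72 2D 1F 14, and `load_le16(slot, 2)` returns 0x141F, the upper half of the original value.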

Shift_Left
  • 2
    Store/reload has at least 5 cycle latency (or sometimes 4 on SKL). This is normally a bad option, unless you're bottlenecked on ALU throughput. You should definitely use `movzx` loads, or 8-bit / 16-bit ALU instructions with memory operands, though, not `mov ax, [ebp-2]` – Peter Cordes Jan 25 '18 at 23:58
1

For 16-bit mode, this is the smallest (not fastest) code: only 4 bytes.

push eax  ; 2 bytes: prefix + opcode 
pop ax    ; 1 byte: opcode
pop bx    ; 1 byte: opcode

It's even more compact than the single instruction `shld ebx,eax,16`, which occupies 5 bytes in 16-bit mode.

Peter Cordes
kiasari
  • Those instruction sizes are for 16-bit mode. In 32-bit code, it's 5 bytes total and `shld ebx,eax,16` is 4 bytes. (And `shld` is faster on many CPUs.) – Peter Cordes May 15 '18 at 17:27
  • Thanks Peter. `shld ebx, eax, 16` occupies 5 bytes (66 0F A4 D8 10). `shld r32, r32, cl` and `shld r16, r16, immediate` are 4 bytes. Why are push instructions 5 bytes? – kiasari May 17 '18 at 09:10
  • I said *in 32-bit mode*, where the default operand-size is 32, which is much more common than 16-bit mode on modern computers. (But still mostly obsoleted by 64-bit mode). In 32-bit mode, `pop ax` requires a `66` prefix but `push eax` doesn't, so you get a total of 1 + 2 + 2 = 5. Similarly, `shld r32, r32, imm8` doesn't require any prefixes in 32 or 64-bit mode. (And BTW, you don't want to use `shld r,r,cl`, because it's 4 uops on Skylake vs. 1 for `shld r,r,imm8`. http://agner.org/optimize/) – Peter Cordes May 17 '18 at 09:34