
I've been looking at Intel machine language, both in the generated code shown in an assembly listing and in a dump of the executable file itself, as generated from a program written in MASM. I can't figure out how the registers are referred to in the machine instructions. My PC (and obviously many others) has 16 registers, so 4 bits are needed to refer to all of them, 0 through 15. As an example I looked at the lea instruction, since it has a one-byte opcode and only one format. Here is the assembler source:

    lea     rax, data2
    lea     rcx, data2
    lea     rdx, data2

Data2 is at offset 5 in the data portion of the program. Here is the generated machine language:

    488D05FE 1F0000 
    488D0DF7 1F0000
    488D15F0 1F0000

I know that the hex 48 denotes a 64-bit register operand and 8D is the opcode, but the rest is still a mystery. What purpose does the 1F0000 serve? Is it a reference to the storage location, which is the same in all three instructions? If so, then 05FE, 0DF7, and 15F0 must represent the three registers, but in what notation?

I've spent a lot of time reading https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4, but I don't find it to be very helpful. For instance, it never numbers the bits and bytes in order to describe which bits of an instruction serve which function, and according to what scheme. It's full of details, but largely devoid of explanations.

Peter Cordes
  • It's in the manual somewhere, but you can use [this table](https://sandpile.org/x86/opc_rm.htm) as a handier reference for operand encoding. What you have here is 05, 0D, 15 (FE etc is part of the offset) – harold Jul 04 '20 at 00:08

1 Answer


Let's look at the first instruction: 48 8D 05 FE 1F 00 00

First byte: 48h, or 0100 1000 b (I insert a space to make it easier to read). That's a REX prefix byte; see page 535 of the manual you linked (aka Vol. 2A page 2-9, but I'll use absolute page numbers for convenience). It's identified by the top 4 bits being 0100. The remaining four bits are called W, R, X, B, in that order from high to low. Here only W is set, indicating a 64-bit operand size. We'll come back to the others later.
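
As a rough illustration of the bit layout just described, here is a minimal Python sketch that splits a REX byte into its four flags. The helper name `decode_rex` is made up for this example; it is not part of any library.

```python
def decode_rex(byte):
    """Split a REX prefix byte (0100WRXB) into its W, R, X, B flag bits."""
    if byte >> 4 != 0b0100:
        raise ValueError(f"{byte:#04x} is not a REX prefix")
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends ModRM r/m (or SIB base)
    }

print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```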

8D is the opcode for LEA, as you know; see page 1149. Since REX.W is set, it will store the effective address of its second operand m in its first operand r64. Looking at the second table on page 1149, the first operand is encoded in the reg field of a ModRM byte, and the second in the r/m field.

05, or 0000 0101 b, is a ModRM byte. See Figure 2-2 on page 530. This is combined with bits from the REX prefix byte as shown on page 535, Figure 2-4 (since we have no SIB byte). This encoding is for backward compatibility with 32-bit instructions.

  • mod is the top two bits of the ModRM byte: 00 for us.

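  • reg is the next three bits: 000. The R bit of the REX byte (not the X bit, which only extends the index field of a SIB byte) is tacked on as a high bit. In the notation of the osdev wiki linked below, where the leading X is just a placeholder for that extension bit, this yields X.REG = 0.000.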

  • r/m is the low three bits: 101. The B bit of the REX byte is tacked on as a high bit, yielding B.R/M = 0.101.
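
The three fields above can be pulled apart with a few shifts and masks. The sketch below is only illustrative: `split_modrm` is a hypothetical helper, and it assumes (following the REX description in the Intel manual) that REX.R supplies the high bit of reg and REX.B the high bit of r/m.

```python
def split_modrm(modrm, rex=0x40):
    """Return (mod, extended reg, extended r/m) for a ModRM byte.

    rex defaults to 0x40, a REX prefix with no flag bits set.
    """
    mod = modrm >> 6               # top two bits
    reg = (modrm >> 3) & 0b111     # middle three bits
    rm = modrm & 0b111             # low three bits
    rex_r = (rex >> 2) & 1         # high bit for reg
    rex_b = rex & 1                # high bit for r/m
    return mod, (rex_r << 3) | reg, (rex_b << 3) | rm

print(split_modrm(0x05, rex=0x48))  # (0, 0, 5)
print(split_modrm(0x0D, rex=0x48))  # (0, 1, 5)
```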

Now Intel's manual only really seems to explain the meaning of these fields for 32-bit mode; I couldn't find a good explanation there of the 64-bit mode case. So let's look elsewhere, e.g. https://wiki.osdev.org/X86-64_Instruction_Encoding.

The meaning of X.REG is explained here: 0.000, for a 64-bit mode instruction expecting a general-purpose register, is RAX.

For Mod and B.R/M, see this table. Mod is 00 so we look at the first row. B.R/M = 0.101 is marked "[RIP/EIP + disp32]". This means the next four bytes are a 32-bit displacement from the instruction pointer RIP; i.e. an offset from the address of the following instruction. See page 538 of the Intel manual. So that accounts for the last four bytes, which form the little-endian 32-bit number 00001FFEh. In other words, the memory operand is 1FFEh bytes after the address of the next instruction. That's presumably where data2 is located; your assembler or linker calculated the offset for you.
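
To make the little-endian reading concrete, this short Python sketch converts the four displacement bytes of the first instruction into a number. It treats the displacement as signed, since disp32 is sign-extended before being added to RIP.

```python
# The last four bytes of 48 8D 05 FE 1F 00 00, in the order they appear.
disp_bytes = bytes.fromhex("FE1F0000")

# Little-endian: the first byte is the least significant one.
disp = int.from_bytes(disp_bytes, byteorder="little", signed=True)

print(hex(disp))  # 0x1ffe
```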

Thus, the first operand was RAX and the second is [RIP+00001FFEh], so the overall instruction is

LEA RAX, [RIP+00001FFEh]
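
Putting the steps together, a toy decoder for just this one instruction form might look like the following Python sketch. It is not a general disassembler: `decode_lea_rip` and the `REG64` table are names invented for this example, and anything other than a REX.W prefix, opcode 8D, and a RIP+disp32 operand is rejected.

```python
import struct

# 64-bit general-purpose registers, indexed by the 4-bit register number.
REG64 = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
         "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"]

def decode_lea_rip(code):
    """Decode 'REX.W 8D /r' with a RIP-relative memory operand."""
    rex, opcode, modrm = code[0], code[1], code[2]
    if rex >> 4 != 0b0100 or not (rex >> 3) & 1:
        raise ValueError("expected a REX prefix with W set")
    if opcode != 0x8D:
        raise ValueError("expected the LEA opcode 8D")
    mod, rm = modrm >> 6, modrm & 0b111
    if mod != 0b00 or rm != 0b101:
        raise ValueError("only the RIP+disp32 form is handled")
    # REX.R supplies the high bit of the destination register number.
    reg = (((rex >> 2) & 1) << 3) | ((modrm >> 3) & 0b111)
    # Signed little-endian 32-bit displacement follows the ModRM byte.
    (disp,) = struct.unpack_from("<i", code, 3)
    sign = "+" if disp >= 0 else "-"
    return f"LEA {REG64[reg]}, [RIP{sign}{abs(disp):08X}h]"

print(decode_lea_rip(bytes.fromhex("488D05FE1F0000")))
# LEA RAX, [RIP+00001FFEh]
```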

Note the 1F0000 didn't have any significance of its own; it's just the top three bytes of the displacement of the memory operand.

Now the next one is similar: 48 8D 0D F7 1F 00 00. ModRM is now 00 001 101, so X.REG is 0.001, which encodes RCX. The Mod and B.R/M again encode RIP+disp32, and the displacement now is 00001FF7h. Note this is exactly 7 less than for the previous instruction: we have just executed a 7-byte instruction, so RIP has increased by 7, and a displacement that is 7 less ends up pointing to exactly the same place as before, namely data2.
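
The "7 less each time" argument can be checked numerically. The Python sketch below assumes the three instructions from the question's listing sit back to back, and it places the first one at address 0. That starting address is an arbitrary assumption, so only the fact that the three targets agree is meaningful, not the absolute value.

```python
# Machine code of the three LEA instructions from the question's listing.
listing = ["488D05FE1F0000", "488D0DF71F0000", "488D15F01F0000"]

address = 0  # hypothetical address of the first instruction
targets = []
for hex_bytes in listing:
    code = bytes.fromhex(hex_bytes)
    # Bytes 3..6 hold the signed little-endian disp32.
    disp = int.from_bytes(code[3:7], "little", signed=True)
    next_rip = address + len(code)   # RIP points at the following instruction
    targets.append(next_rip + disp)  # effective address of the memory operand
    address = next_rip

print([hex(t) for t in targets])  # ['0x2005', '0x2005', '0x2005']
```

All three displacements resolve to the same location, which is what you would expect if each instruction refers to data2.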

The last one you can do :)

Nate Eldredge
  • My long range goal is to write a disassembler, but I think it's retreating out of sight. I did the calculation to verify that the 32 bit offset in the instruction really does point to the correct data when added to the address of the following instruction. Not only does it not add up as expected, but the 32 bit offset is a larger number than the entire size of the executable file! Obviously there's something else going on, and no doubt there are some hints in the binary file about how to map it, but so far that also is a mystery. – Robert Watson Jul 07 '20 at 16:42
  • @RobertWatson: Yeah, the loader will adjust that offset at load time, based on where the code and data segments are placed in memory. What you see in the executable file is not what ends up being executed. You may like to try running the program under a debugger and seeing what the instruction looks like in memory after loading and relocation. – Nate Eldredge Jul 07 '20 at 16:53
  • The X in X.Reg is just a placeholder, not the X bit of REX, in https://wiki.osdev.org/X86-64_Instruction_Encoding unfortunately! (I think they wanted a placeholder because *X* can be REX.R (for modrm.reg) or REX.B (for modrm.rm) for register numbers). Note NASM encodes `mov r8d,eax` as `41 89 c0`. So the REX prefix has B=1, other bits clear. (And the other operand order is `44 89 c0 mov eax,r8d`, with REX.R=1). REX.X is unused except with memory operands with indexed addressing modes. – Peter Cordes Mar 14 '21 at 10:54
  • Oh, wrote an answer about that earlier: [What is the "X" part of the encoding of the r8-15 registers?](https://stackoverflow.com/q/64300321) – Peter Cordes Mar 14 '21 at 10:56