> so using rN registers have more code-size than using other registers
Yes, this is a well-known and well-documented fact. REX prefixes are one of the most important changes in x86-64 machine code vs. earlier modes, and the only part of the question really worth answering is the performance part (see below).
x86 machine code only has 3-bit fields for registers. The 4th bit, if non-zero, needs to come from a REX prefix.
This is what AMD64 repurposed the 0x4? opcode bytes for (in 32-bit machine code they're 1-byte inc/dec reg instructions).
To let x86-64 decoding run on the same transistors as 16/32-bit mode decoding, instead of needing a whole new decoder block in the front-end, AMD chose to mostly not redesign x86 machine code from scratch. So they were stuck with 3-bit fields for registers and had to use a prefix byte.
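As a sketch of how that prefix byte is put together (my own illustration, not from any manual — double-check the bit layout against Intel's vol.2): REX is `0100WRXB`, where W selects 64-bit operand size and R/X/B supply the 4th bit for the ModRM.reg, SIB.index, and ModRM.rm/SIB.base register fields respectively.

```python
def rex_byte(w, r, x, b):
    """Assemble a REX prefix byte (0100WRXB) from its four 1-bit fields."""
    assert all(bit in (0, 1) for bit in (w, r, x, b))
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

# Example: `add r8, rax` needs REX.W=1 (64-bit operand size) and
# REX.B=1 (r8 is register number 8, in the ModRM.rm field):
print(hex(rex_byte(w=1, r=0, x=0, b=1)))  # 0x49, as in `49 01 c0  add r8,rax`
```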
Read Intel's vol.2 manual for more about REX prefixes. Or https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix has some helpful stuff, including details of what the bits mean. It also explains:
> A REX prefix must be encoded when:
>
> - using 64-bit operand size and the instruction does not default to 64-bit operand size (most instructions default to 32-bit operand-size); or
> - using one of the extended registers (R8 to R15, XMM8 to XMM15, YMM8 to YMM15, CR8 to CR15 and DR8 to DR15); or
> - using one of the uniform byte registers SPL, BPL, SIL or DIL.
And a REX prefix can't be used at all when using AH, CH, BH or DH: any REX prefix, even with no bits set, changes the meaning of the encoding for AH to mean SPL, and so on.
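For example (bytes hand-assembled by me; worth verifying with an assembler), the same opcode + ModRM pair decodes differently with and without an "empty" REX:

```
88 c4        mov ah,al     # no REX: r/m field 100 means AH
40 88 c4     mov spl,al    # REX with no bits set: r/m field 100 now means SPL
```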
(Instructions with a VEX prefix (e.g. AVX and some BMI/BMI2) or EVEX (AVX512) use that instead of REX for extra register bits. A 2-byte VEX can encode X/YMM8..15 as a destination or first source, without needing to use the wider 3-byte VEX prefix.)
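To illustrate that last point (again hand-assembled by me, so double-check against an assembler): with xmm8 as the destination, the extra register bit fits in the 2-byte VEX's R bit, but xmm8 as the r/m source needs the B bit, which only exists in the 3-byte form.

```
c5 71 fe c2        vpaddd xmm8,xmm1,xmm2    # xmm8 as destination: 2-byte VEX (R bit)
c4 c1 71 fe c0     vpaddd xmm0,xmm1,xmm8    # xmm8 as r/m source: needs 3-byte VEX (B bit)
```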
> Second, besides the code-size problem, are there any other problems (cache, cycles, ...) if we use rN registers (r8, r9, r10, ...) instead of other registers?
No, just code-size (and, on some CPUs, the total number of prefixes). CPUs with a uop cache are mostly insulated from code-size directly, but indirect effects like a larger I-cache footprint (and less dense packing of the uop cache) are still a problem. And of course, at a larger scale, bigger binaries.
But some CPUs (notably Silvermont family) are slow to decode instructions with more than 3 prefixes, so for example any SSSE3 / SSE4 instruction with a REX prefix stalls the decoders. See Agner Fog's microarch pdf. On Silvermont, even the 0F opcode escape byte for 2-byte opcodes counts as one of the 3, along with mandatory prefixes for SIMD instruction encodings.
```
401000:  66 0f 38 00 07       pshufb xmm0,XMMWORD PTR [rdi]   # 3 prefixes before the 00 opcode
401005:  66 41 0f 38 00 00    pshufb xmm0,XMMWORD PTR [r8]    # 4 prefixes
```
The latter would be extra slow on Silvermont. Fine on other CPUs that have a 3-prefix limit though (some AMD IIRC); only Silvermont-family counts the 0F byte as a prefix.
Mainstream Intel CPUs can decode an arbitrary number of prefixes without stalling. They're subject only to the limits on how many bytes of machine code they can look at per clock cycle: the pre-decode stage that finds boundaries between instructions, and the main decode stage that turns up to 5 instructions (or more with macro-fusion) into up to 5 uops (Skylake). One of those stages has a limit of 16 bytes per cycle; IIRC it's pre-decode. Check Agner Fog's guide if it matters.