> so using rN registers have more code-size than using other registers
Yes, this is a well-known and well-documented fact. REX prefixes are one of the most important changes in x86-64 machine code vs. earlier modes, and the only part of the question really worth answering is the performance part (see below).
x86 machine code only has 3-bit fields for registers. The 4th bit, if non-zero, needs to come from a REX prefix.
This is what AMD64 repurposed the 0x4? opcode bytes for (in 32-bit machine code they're 1-byte inc/dec reg instructions).
To let x86-64 decoding run on the same transistors as 16/32-bit mode decoding, instead of needing a whole new decoder block in the front-end, AMD chose to mostly not redesign x86 machine code from scratch. So they were stuck with 3-bit fields for registers and had to use a prefix byte.
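As a sketch of how that prefix byte is put together (my own illustration, not from any manual — double-check the bit layout against Intel's vol.2): REX is `0100WRXB`, where W selects 64-bit operand size and R/X/B supply the 4th bit for the ModRM.reg, SIB.index, and ModRM.rm/SIB.base register fields respectively.

```python
def rex_byte(w, r, x, b):
    """Assemble a REX prefix byte (0100WRXB) from its four 1-bit fields."""
    assert all(bit in (0, 1) for bit in (w, r, x, b))
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

# Example: `add r8, rax` needs REX.W=1 (64-bit operand size) and
# REX.B=1 (r8 is register number 8, in the ModRM.rm field):
print(hex(rex_byte(w=1, r=0, x=0, b=1)))  # 0x49, as in `49 01 c0  add r8,rax`
```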
Read Intel's vol.2 manual for more about REX prefixes. Or https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix has some helpful stuff, including details of what the bits mean. It also explains:
> A REX prefix must be encoded when:
>
> - using 64-bit operand size and the instruction does not default to 64-bit operand size (most instructions default to 32-bit operand-size); or
> - using one of the extended registers (R8 to R15, XMM8 to XMM15, YMM8 to YMM15, CR8 to CR15 and DR8 to DR15); or
> - using one of the uniform byte registers SPL, BPL, SIL or DIL.
And a REX prefix can't be used at all when using AH, CH, BH or DH: any REX prefix, even with no bits set, changes the meaning of the encoding for AH to mean SPL, and so on.
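For example (bytes hand-assembled by me; worth verifying with an assembler), the same opcode + ModRM pair decodes differently with and without an "empty" REX:

```
88 c4        mov ah,al     # no REX: r/m field 100 means AH
40 88 c4     mov spl,al    # REX with no bits set: r/m field 100 now means SPL
```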
(Instructions with a VEX prefix (e.g. AVX and some BMI/BMI2) or EVEX (AVX512) use that instead of REX for extra register bits. A 2-byte VEX can encode X/YMM8..15 as a destination or first source, without needing to use the wider 3-byte VEX prefix.)
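To illustrate that last point (again hand-assembled by me, so double-check against an assembler): with xmm8 as the destination, the extra register bit fits in the 2-byte VEX's R bit, but xmm8 as the r/m source needs the B bit, which only exists in the 3-byte form.

```
c5 71 fe c2        vpaddd xmm8,xmm1,xmm2    # xmm8 as destination: 2-byte VEX (R bit)
c4 c1 71 fe c0     vpaddd xmm0,xmm1,xmm8    # xmm8 as r/m source: needs 3-byte VEX (B bit)
```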
> Second, besides the code-size problem, are there any other problems (cache, cycles, ...) if we use rN registers (r8, r9, r10, ...) instead of other registers?
No, just code-size (and, on some CPUs, the total number of prefixes). CPUs with a uop cache are mostly insulated from code-size directly, but indirect effects like a larger I-cache footprint (and less dense packing of the uop cache) are still a problem. And of course, at a larger scale, bigger binaries.
But some CPUs (notably Silvermont family) are slow to decode instructions with more than 3 prefixes, so for example any SSSE3 / SSE4 instruction with a REX prefix stalls the decoders. See Agner Fog's microarch pdf. On Silvermont, even the 0F opcode escape byte for 2-byte opcodes counts as one of the 3, along with mandatory prefixes for SIMD instruction encodings.
```
401000:  66 0f 38 00 07       pshufb xmm0,XMMWORD PTR [rdi]   # 3 prefixes before the 00 opcode
401005:  66 41 0f 38 00 00    pshufb xmm0,XMMWORD PTR [r8]    # 4 prefixes
```
The latter would be extra slow on Silvermont. Fine on other CPUs that have a 3-prefix limit though (some AMD IIRC); only Silvermont-family counts the 0F byte as a prefix.
Mainstream Intel CPUs can decode an arbitrary number of prefixes without stalling. They're subject only to the limits on how many bytes of machine code they can look at per clock cycle: the pre-decode stage that finds boundaries between instructions, and the main decode stage that turns up to 5 instructions (or more with macro-fusion) into up to 5 uops (Skylake). One of those stages has a limit of 16 bytes per cycle; IIRC it's pre-decode. Check Agner Fog's guide if it matters.