Which is more useful at an assembly level, 64 registers or three operand instructions?

Question

This question is in the context of writing a C compiler for a 16 bit homebrew CPU.

I have 12 bits of operand for ALU instructions (such as ADD, SUB, AND, etc.).

I could give instructions three operands from 16 registers or two operands from 64 registers.

e.g.

SUB A <- B - C  (registers r0-r15)

vs

SUB A <- A - B  (registers r0-r63)

Are sixteen registers, with three-operand instructions, more useful than 64 registers with two-operand instructions, to C compilers and their authors?
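
To make the two options concrete, here is a minimal sketch of how the 12 operand bits could be split either way. It assumes a 16-bit instruction word with a 4-bit opcode in the top bits, which the question does not actually state; the field order, the function names and the opcode value used below are made up for illustration.

```python
# Assumption: 16-bit instruction word = 4-bit opcode + 12 operand bits.
# The field layout and opcode numbering here are hypothetical.

def encode_3op(opcode, rd, ra, rb):
    """Pack three 4-bit register numbers (r0-r15): rd <- ra OP rb."""
    for r in (rd, ra, rb):
        if not 0 <= r < 16:
            raise ValueError("3-operand form only reaches r0-r15")
    return ((opcode & 0xF) << 12) | (rd << 8) | (ra << 4) | rb


def encode_2op(opcode, rd, rs):
    """Pack two 6-bit register numbers (r0-r63): rd <- rd OP rs."""
    for r in (rd, rs):
        if not 0 <= r < 64:
            raise ValueError("2-operand form only reaches r0-r63")
    return ((opcode & 0xF) << 12) | (rd << 6) | rs


SUB = 0x2  # hypothetical opcode number
word_3op = encode_3op(SUB, 1, 2, 3)   # SUB r1 <- r2 - r3
word_2op = encode_2op(SUB, 1, 63)     # SUB r1 <- r1 - r63
```

Both forms use exactly the same 12 bits; the only difference is whether they are cut into three 4-bit fields or two 6-bit fields.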

On just a first thought (in x86, sorry, it's the only assembly I know): most programs I've put through things like IDA usually use registers EAX through EDX, so that's 4. Then you have EBP and ESP, so 6. EIP should not need the ALU. EFLAGS (again no need for the ALU), ESI and EDI make 8. So just from a first thought I don't think most programs use more than 16 registers. I may be missing a few, but I think a good first sanity test for determining this would be looking at what gcc compiles and finding out if it even uses more than 16 registers on the ALU. — arduic, May 17 '16 at 12:47
@WeatherVane It's RISC - loads and stores are explicit operations with their own opcode. It's a homebrew-cpu - the only addressing modes for loads and stores are 8 bit immediate offsets from zero, from PC or from another register. — fadedbee, May 17 '16 at 13:08
16 is usually plenty, and 2-operand instructions are slightly annoying to do codegen for. — harold, May 17 '16 at 13:13
If RISC, you want lots of registers and lots of register-based instructions. x86 is CISC; don't use it as a design reference. — old_timer, May 17 '16 at 13:19
Have a look at something like Knuth's [MMIX](http://mmix.cs.hm.edu/). Might give you plenty of ideas, even if it's a clean 64-bit 'RISC' ISA. For a 16-bit CPU, you might look at Atmel's 8-bit AVR, to see how it handles code density. Unless you *have* to implement a 16-bit ISA, there are plenty of well-designed 32-bit ISAs designed from the ground up. e.g., PowerPC, MIPS, etc. — Brett Hale, May 17 '16 at 13:38
Also - just because you have named registers, doesn't mean an architecture isn't free to use register files internally. That's what modern x86[-64] does. — Brett Hale, May 17 '16 at 13:44
Uniform 3-register operations can certainly simplify compilation significantly. More registers would potentially (on some subset of programs) reduce spilling, but you'd still have to do register spills anyway. 2-reg compilation is significantly more difficult and may involve more scratch registers and therefore more spilling, so the tradeoff is questionable. I'd stick to 3-address instructions. — SK-logic, May 17 '16 at 14:19
P.S., if you really want more registers, consider using register file windows. It only adds a bit of complexity to the register allocation. — SK-logic, May 17 '16 at 14:30
@BrettHale: pretty much every high performance out-of-order design renames the architectural registers onto a larger physical register file. [Wikipedia says POWER1 was the first microprocessor to do it, in 1990](https://en.wikipedia.org/wiki/Register_renaming#History). MIPS and Alpha also had renaming early on. e.g. Alpha 21264 renamed 32 architectural integer registers onto 80 physical regs. Renaming lets you break dependency chains when code reuses the same register with a write-only instruction. It's prob. not worth renaming if you don't implement OOO execution though. — Peter Cordes, May 17 '16 at 16:12

score 4 · Accepted Answer · edited May 23 '17 at 12:30

16 registers with non-destructive 3-operand instructions is probably better.

However, you should also consider doing something else interesting with those instruction bits. For homebrew, you probably don't care about reserving any for future extensions, and don't want to add a ton of extra opcodes (like PPC does).

ARM takes the interesting approach of having one operand of most data-processing instructions go through the barrel shifter, so each of those is a "shift-and-whatever" instruction for free. Thumb-2's 32-bit encodings keep a constant-shift form of this; the 16-bit Thumb encodings, which cover the most common instructions, use separate shift instructions instead. (ARM mode has the traditional RISC 32-bit fixed instruction size. It dedicates 4 of those bits to predicated execution for every instruction.)

I remember seeing a study on the perf gains from doubling the number of registers in a theoretical architecture, for SPECint or something. 8->16 was maybe 5 or 10%, 16->32 was only a couple %, and 32->64 was even smaller.

So 16 integer registers is "enough" most of the time, unless you're working with int32_t a lot, since each such value will take two 16 bit registers. x86-64 only has 16 GP registers, and most functions can keep a lot of their state live in registers pretty comfortably. Even in loops that make function calls, there are enough call-preserved registers in the ABI that spill/reload often doesn't have to happen in the loop.
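
As a rough illustration of why each int32_t ties up two 16-bit registers, here is a small model of a 32-bit add done as two 16-bit adds. It assumes the CPU has a carry flag and an add-with-carry instruction, which the question doesn't say; the function names are hypothetical.

```python
MASK16 = 0xFFFF


def add16(a, b, carry_in=0):
    """Model one 16-bit ALU add; returns (16-bit result, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32_on_16bit_regs(a_lo, a_hi, b_lo, b_hi):
    """32-bit add using four input registers and two ALU instructions."""
    lo, carry = add16(a_lo, b_lo)        # ADD: low halves, sets carry
    hi, _ = add16(a_hi, b_hi, carry)     # ADC (assumed): high halves + carry
    return lo, hi


# 0x0001FFFF + 0x00000001 = 0x00020000
lo, hi = add32_on_16bit_regs(0xFFFF, 0x0001, 0x0001, 0x0000)
```

Two 32-bit inputs and one 32-bit result can occupy up to six registers at once, so a function full of int32_t values eats through 16 registers much faster than one working on int.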

The gains in code size and instruction count from 3-operand instructions will be bigger than from saving the occasional spill / reload. On x86, where most integer instructions are 2-operand, gcc output has to mov all the time, and use lea as a non-destructive add / shift.
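
To put a number on the extra copies, here is a minimal sketch of a naive lowering from three-address operations to a destructive 2-operand form. The tuple format, the `lower_to_two_operand` name and the use of r15 as a reserved scratch register are all assumptions for illustration; a real register allocator would coalesce some of these copies away.

```python
COMMUTATIVE = {"add", "and", "or", "xor"}


def lower_to_two_operand(three_addr, scratch="r15"):
    """Lower (op, dst, a, b) meaning dst <- a OP b into 2-operand
    instructions (op, dst, src) meaning dst <- dst OP src."""
    out = []
    for op, dst, a, b in three_addr:
        if dst == a:
            out.append((op, dst, b))            # already in 2-operand shape
        elif dst == b and op in COMMUTATIVE:
            out.append((op, dst, a))            # swap the sources
        elif dst == b:
            # Copying a into dst would destroy b, so go via the scratch reg.
            out.append(("mov", scratch, a))
            out.append((op, scratch, b))
            out.append(("mov", dst, scratch))
        else:
            out.append(("mov", dst, a))         # extra copy to preserve a
            out.append((op, dst, b))
    return out


program = [
    ("sub", "r1", "r2", "r3"),   # r1 <- r2 - r3
    ("add", "r2", "r2", "r1"),   # r2 <- r2 + r1
    ("sub", "r3", "r1", "r3"),   # r3 <- r1 - r3
]
lowered = lower_to_two_operand(program)
```

With this sample, three 3-operand instructions become six 2-operand ones, and one register has to be held back as scratch.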

If you want to optimize your CPU for software-pipelining to hide memory load latency (which is simpler than full out-of-order execution), more registers are great, esp. if you don't have register renaming. However, I'm not sure how good compilers are at static instruction scheduling. It's not a hot topic anymore, since all high performance CPUs are out-of-order. (OTOH, a lot of software that people actually use is running on in-order ARM CPUs in smartphones.) I don't have experience trying to get compilers to optimize for in-order CPUs, so IDK how viable it is to depend on that.

If your CPU is so simple that it can't do anything else while a load is in-flight, this probably doesn't matter. (This is getting really hand-wavy because I don't know enough about what's practical for a simple design. Even "simple" in-order modern CPUs are pipelined.)

64 registers is getting into "too many" territory, where saving/restoring them takes a lot of code. The amount of memory is probably still negligible, but since you can't loop over registers, you'd need 64 instructions.

If you're designing an ISA from scratch, have a look at Agner Fog's CRISC proposal and the resulting discussion. Your goals are very different (high performance / power budget 64bit CPU vs. simple 16 bit), so your ISAs will of course be very different. However the discussion may get you to think of things you hadn't considered, or ideas you want to try.

Very interesting to see Fog distill his knowledge into an architectural concept. Be nice if he could formalize it to the point where simulators could be realized, like Knuth's MMIX. Along with cache / debug / fault registers, etc. It still appears to lack a definitive document... — Brett Hale, May 17 '16 at 16:38
@BrettHale: I haven't looked over the current version of the proposal. One of the recent posts on the discussion thread was that Agner is working on assembler and simulator support for it and stuff like that, but that he doesn't have much time to spend on that work. x86 might not last forever, and it would be really neat if an "open source" architecture with vectors designed in from the start took over. — Peter Cordes, May 17 '16 at 16:40

score 2 · Answer 2 · answered May 17 '16 at 15:17

Regarding the number of registers: in general I think most C can compile to good, efficient machine code when only 16 general-purpose registers are available (as on AMD64). However, it might be beneficial to have a couple of registers dedicated to function arguments and some marked as volatile, meaning they can be used inside any function but could be clobbered by any called function. Increasing to 32 registers might be beneficial, but I doubt a lot will improve if you have 64 general-purpose registers on a regular 16-bit CPU. You will have to save the original content of most registers you are going to use in your C function to the stack anyway. Limiting a function to only 7 registers in use simultaneously (rather than 37) might still be more (stack) efficient for a C compiler, even when there are a lot more registers available.

A lot depends on the C calling convention you will be using: which registers are used to pass values from caller to callee, which registers are considered volatile, what it costs to push to / pop from the stack, etc. You might win more by using a register window for managing your registers and stack usage across function calls. Sun SPARC, for example, has a register window of 8 completely "local" registers, 8 registers that are shared with the caller and 8 registers that will be shared with any callee function. (Furthermore, 8 global registers can be addressed as well.) That way you don't have to worry about pushes to the stack: every function call effectively does a single push of 16 registers at the same time as it changes the execution pointer, and every return does a 16-register pop. Intel IA-64 has something similar, but with a configurable register window size.
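
A toy model may help show how the overlap works. This is only a sketch: the register numbering (r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins) is modeled loosely on SPARC, while the window count, the direction the window pointer moves and the class and method names are assumptions. It also ignores what happens when the windows run out, where real hardware has to trap and spill a window to memory.

```python
NWINDOWS = 8  # assumption: number of windows in the physical file


class WindowedRegFile:
    def __init__(self):
        self.global_regs = [0] * 8
        self.windowed = [0] * (16 * NWINDOWS)   # 16 new registers per window
        self.cwp = 0                            # current window pointer

    def _slot(self, r):
        """Map a windowed register number (8..31) to a physical slot."""
        if 8 <= r < 16:      # outs: alias the next window's ins
            idx = 16 * (self.cwp + 1) + (r - 8)
        elif 16 <= r < 24:   # locals: private to this window
            idx = 16 * self.cwp + 8 + (r - 16)
        else:                # ins: shared with the caller's outs
            idx = 16 * self.cwp + (r - 24)
        return idx % len(self.windowed)

    def read(self, r):
        return self.global_regs[r] if r < 8 else self.windowed[self._slot(r)]

    def write(self, r, value):
        if r < 8:
            self.global_regs[r] = value
        else:
            self.windowed[self._slot(r)] = value

    def call(self):
        self.cwp = (self.cwp + 1) % NWINDOWS    # overflow handling omitted

    def ret(self):
        self.cwp = (self.cwp - 1) % NWINDOWS


rf = WindowedRegFile()
rf.write(8, 42)     # caller puts an argument in its first "out" register
rf.write(16, 7)     # caller keeps a value in a local
rf.call()
arg = rf.read(24)   # callee sees the argument in its first "in" register
rf.write(16, 99)    # callee's local does not touch the caller's local
rf.ret()
```

After the return, the caller's local still holds 7 without any store or load having been executed, which is the whole appeal; the cost is a much larger physical register file than the 32 names the instructions can address.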

However, SUB C,A,B has only a slight advantage over SUB A,B, and only when preserving intermediate results is really important (A often needs to be preserved) and a simple register-to-register copy is relatively expensive. This seems unlikely in most cases.

And will you be using separate floating or fixed point registers?
