17

I have a basic question about assembly.

Why do we bother doing arithmetic operations only on registers if they can work on memory as well?

For example, both of the following cause (essentially) the same value to be calculated:

Snippet 1

.data
    var dd 00000400h

.code

    Start:
        add var,0000000Bh
        mov eax,var
        ;breakpoint: var = 0000040B
    End Start

Snippet 2

.code

    Start:
        mov eax,00000400h
        add eax,0000000Bh
        ;breakpoint: eax = 0000040B
    End Start



From what I can see, most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?

Cody Gray - on strike
Cam
  • 4
    I'm not an expert (hence only a comment), but as registers are CPU-internal, they are faster than memory operations. Faster not by a small margin, but by something like 1000x. – Michael Stum Mar 02 '10 at 05:00
  • 4
    Some (typically RISC) architectures *don't* have arithmetic instructions that operate directly on the contents of a memory address, because in reality they're used relatively rarely. (x86, of course, has them, as well as every other possible strange feature, because it's crazy old x86.) – bobince Mar 02 '10 at 07:26
  • Note it is not uncommon to implement the registers in a register file, which is another term for an SRAM. The registers themselves are just an on-chip SRAM that has an address bus, a data bus, and control signals. It is as fast as the processor can go, though; and see the answer below: RAM, as in off-chip or off-processor-core RAM, is often very, very slow, particularly DRAM. – old_timer Aug 09 '17 at 17:19
  • There are processors that are mostly RAM-based: stack-based processors. Basically, think small C or Pascal, Java, Python - stack-based virtual machines - but there are real machines built with similar instruction sets. – old_timer Aug 09 '17 at 17:20

11 Answers

29

If you look at computer architectures, you find a series of levels of memory. Those that are close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow, and cheap memory devices. In a modern computer, these are typically something like:

 CPU registers (slightly complicated, but on the order of 1KB per core - there
                are different types of registers. You might have 16 64-bit
                general-purpose registers plus a bunch of registers for special
                purposes)
 L1 cache (64KB per core)
 L2 cache (256KB per core)
 L3 cache (8MB)
 Main memory (8GB)
 HDD (1TB)
 The internet (big)

Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.

There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency at each step away from the CPU. For example, an HDD might be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can be read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).

The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache.

If it weren't for these principles of locality, any location in memory would be equally likely to be read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world would not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.
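
As a concrete sketch of both kinds of locality (the labels array and len are hypothetical, not from the question), a simple summing loop reuses the same few registers on every iteration (temporal locality) and reads consecutive addresses (spatial locality), so after one cache miss the next several elements come from the already-fetched cache line:

        ; array and len are hypothetical labels declared in .data
        mov esi, offset array   ; pointer to the first element
        mov ecx, len            ; element count
        xor eax, eax            ; running sum = 0
    SumLoop:
        add eax, [esi]          ; consecutive reads: spatial locality
        add esi, 4              ; step to the next dword
        dec ecx
        jnz SumLoop             ; esi/ecx/eax reused every pass: temporal locality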

One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64-bit addresses, you are going to have some long instructions!
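
To make that concrete, here is a sketch assuming 32-bit code and a variable the assembler has placed at address 404000h (the address is made up for illustration):

        add eax, ebx                    ; encodes as 01 D8                 (2 bytes)
        add dword ptr [404000h], 0Bh    ; encodes as 83 05 00 40 40 00 0B  (7 bytes)

The register form packs each operand into 3 bits of the ModRM byte, while the memory form has to carry the full 32-bit address inside the instruction.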

David Johnstone
14

Because RAM is slow. Very slow.

Registers are placed inside the CPU, right next to the ALU, so signals can travel almost instantly. They're also the fastest memory type, but they take significant space, so we can have only a limited number of them. Increasing the number of registers increases:

  • die size
  • distance needed for signals to travel
  • work to save the context when switching between threads
  • number of bits in the instruction encoding

Read If registers are so blazingly fast, why don't we have more of them?

More commonly used data will be placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays caches are usually on the same die as the CPU. Caches are constructed from SRAM cells, which are smaller than register cells but may be tens or hundreds of times slower.

Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers; hence we can't work with only DRAM in a high-performance system. However, some embedded systems do make use of the register file as main memory, so the registers are also the main memory.

More information: Can we have a computer with just registers as memory?

phuclv
  • That's about as complete an answer mixed with history as could be wanted. – David C. Rankin Aug 09 '17 at 04:25
  • 5
    Having a small number of always-used memory locations such as registers is also good for keeping down the instruction size - the original 8 x86 GP registers could be indexed in 3 bits; compare that with specifying a full memory address every time. OTOH, this is just one possible compromise; if you don't mind going slow (or if your CPU is as slow as the RAM anyway) there are other possibilities. Take the 6502: it effectively has one 8-bit accumulator and two index registers - period; but it has a compact form of memory-access instructions for the first 256 RAM locations, which can then be used a bit like registers. – Matteo Italia Aug 09 '17 at 06:00
  • @MatteoItalia: related: AVR (RISC microcontroller with 32 8-bit registers) has its registers [aliased to memory locations (in its internal SRAM)](http://www.avr-tutorials.com/general/avr-memory-map). (Also described [as part of this silly question](https://stackoverflow.com/questions/26915731/cpus-with-addressable-gpr-files-address-of-register-variables-and-aliasing-bet)). That probably reflects the internal implementation, but a high-perf implementation would be possible with a separate fast register file, and detecting memory reads/writes to those addresses and handling with a fallback. – Peter Cordes Aug 10 '17 at 02:16
  • A "register file" doesn't mean registers are part of main memory! The SRAM for the register file is separate from the cache/memory subsystem except in rare cases like AVR where registers are actually mapped into memory. For example, [Intel Sandybridge uses a 160-entry integer physical register file and a separate 144 entry FP physical register file](http://www.realworldtech.com/sandy-bridge/5/), with register-renaming to map architectural registers onto those physical registers. David Kanter's write-up describes the change from Nehalem, which kept OOOe result in the ROB. – Peter Cordes Aug 10 '17 at 02:24
  • I think one of the more important differences between registers and memory is that indexed addressing means the location isn't always known right away. So forwarding over bypass networks can't always work as well. I wrote an answer discussing that: https://stackoverflow.com/questions/2360997/assembly-why-are-we-bothering-with-registers/45603798#45603798 – Peter Cordes Aug 10 '17 at 03:15
  • 1
    @PeterCordes: now that you make me think about it, I wonder for how many low-end MCUs registers are significantly faster than main memory, given that they all generally have little embedded SRAM (possibly similar/the same as the one of the register file?) as main memory and don't do particularly smart optimizations such as register renaming. Maybe using the same type of memory is common, but just AVR exposes it so explicitly? – Matteo Italia Aug 10 '17 at 05:49
  • 1
    @MatteoItalia: I wouldn't be surprised if loads/stores had only 1c latency on hardware with built-in SRAM, but EEPROM / flash is much slower, as I understand it. Of course registers are still important even when you have fast RAM, because you need a pointer in a register before you can dereference it. (As you pointed out, this is how register machines keep their machine-code encoding simple and/or compact.) – Peter Cordes Aug 10 '17 at 05:57
  • Hmm, *reading* built-in flash should be decently fast, given that it's used as memory for instructions. Actually, looking at AVR docs, I see that `mov` (reg <- reg) and `ldi` (reg <- imm) are 1c, `lds` (reg <- mem) is 2c (notice that it cannot go straight to SRAM, as "mem" can refer to the whole data space, which also includes IO ports; the SRAM-only `lds` for ATtiny10 is 1c), and `lpm` (reg <- flash) is 3c. Given that AVRs in general have quite low clock speeds anyway, I think those latencies are due more to slow internal decoding/routing logic than to speed limitations of the memory. – Matteo Italia Aug 10 '17 at 06:24
  • (all these are for *direct* loads, although interestingly indirect loads seem to take 2c as well; stores are the same, except for flash, which says "depends on operation", requires rewriting whole pages and is not available on many devices) – Matteo Italia Aug 10 '17 at 06:32
10

Registers are much faster, and the operations that you can perform directly on memory are far more limited.

Tronic
5

In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM of which the first 64 are exposed as 32 16-bit registers while at the same time being accessible as addressable RAM. Or, another example: the MOS Technology 6502's "zero page" (the RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, made necessary by the small number of real registers in the CPU. But this scales poorly to larger setups.

The advantages of registers are the following:

  1. They are the fastest. In a typical modern system they are faster than any cache, and far faster than DRAM. (In the example above, the RAM is likely SRAM. But a few gigabytes of SRAM is unaffordably expensive.) And they are close to the processor. The difference in access time between a register and DRAM can reach a factor of 200 or even 1000. Even compared to L1 cache, register access is typically 2-4 times faster.

  2. Their number is limited. A typical instruction set would become too bloated if every memory location could be addressed explicitly.

  3. Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve the role of special registers, as e.g. on zSeries, this requires special remapping of that service area in absolute addresses, separately for each core.)

  4. In the same manner as (3), registers are specific to each thread, with no need to adjust locations in the code per thread.

  5. Registers (relatively easily) allow specific optimizations, such as register renaming; see the sketch after this list. This would be too complex if memory addresses were used.
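
Here is a minimal sketch of point 5 (a, b, c and d are hypothetical variables): both halves reuse eax, but register renaming can give the second half a fresh physical register, so the two dependency chains can run in parallel:

        ; a, b, c and d are hypothetical dword variables
        mov eax, [a]    ; first dependency chain
        add eax, 1
        mov [b], eax
        mov eax, [c]    ; renamed to a fresh physical register,
        add eax, 2      ; so this chain is independent of the first
        mov [d], eax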

Additionally, there are registers that could not be implemented in a separate block RAM, because accessing RAM itself requires them to change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction fetch phase", "instruction decoding phase", "ALU phase", "data writing phase" and so on, and the equivalents of this register in more complicated (pipelined, out-of-order) designs; also the various buffer registers for bus access, and so on. But such registers are not visible to the programmer, so you likely did not mean them.

Netch
4

x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine1. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high-performance.

https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines, which are also obsolete now, although it omits CISCs like x86, which can be either load-store or register-memory. x86 instructions can actually be reg,mem; reg,reg; or even mem,reg. (Or with an immediate source.)
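
For instance (var, var1 and var2 being hypothetical variables in .data), all of the following are valid single x86 instructions, while the memory-to-memory form is not:

        add eax, [var]        ; reg, mem
        add eax, ebx          ; reg, reg
        add [var], eax        ; mem, reg
        add [var], 0Bh        ; mem, imm
        ; add [var1], [var2]  ; mem, mem - no such encoding exists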

Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.


Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.

Current x86 CPUs can read/write many more registers per clock cycle than memory locations.

For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).

Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.

Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).

These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine code, so they're available right away.)
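
As a sketch, in a back-to-back dependent pair like the following, the second instruction receives the first one's result over the bypass network on the very next cycle, without waiting for the write-back to the register file:

        add eax, 1      ; result leaves the ALU...
        add ebx, eax    ; ...and is forwarded straight into this input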

As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.


See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)

Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.

Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.

See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).


The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and loads to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)

Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.

The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
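
A sketch of that independence (using a stack slot as the reused location): each reload is forwarded from its own store-buffer entry, so the second pair does not have to wait for the first:

        mov [esp], eax  ; store #1 (eax may still be in flight)
        mov ebx, [esp]  ; reload #1: forwarded from store #1
        mov [esp], ecx  ; store #2 to the same address
        mov edx, [esp]  ; reload #2: forwarded from store #2, independent of pair #1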

Peter Cordes
  • Minor nitpick: The short form of "kilo binary bytes" or "kibi bytes" is "KiB" with a capital K. – ecm Dec 08 '20 at 20:32
  • @ecm Really? That looks silly / weird to me, but [wikip](https://en.wikipedia.org/wiki/Kibibyte) confirms you're correct. Thanks. Ah, apparently there's some history of using just capital K (before the Ki prefix and the ridiculous "kibi" pronunciation was a thing). https://en.wikipedia.org/wiki/Binary_prefix#Main_memory – Peter Cordes Dec 08 '20 at 21:15
3

Registers are accessed way faster than RAM, since you don't have to go over the "slow" memory bus!

naivists
1

We use registers because they are fast. Usually, they operate at the CPU's speed. Registers and CPU cache are made with different technology / fabrics, and they are expensive. RAM, on the other hand, is cheap and 100 times slower.

Nick Dandoulakis
1

Generally speaking, register arithmetic is much faster and much preferred. However, there are some cases where direct memory arithmetic is useful. If all you want to do is increment a number in memory (and nothing else, at least for a few million instructions), then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.

Also, if you are doing complex array operations, you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.
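
For example (counter being a hypothetical dword variable), the memory-destination form does the whole read-modify-write in one instruction and leaves every register untouched:

        ; counter is a hypothetical dword variable
        add dword ptr [counter], 1  ; one read-modify-write instruction

        mov eax, [counter]          ; the equivalent load/add/store,
        add eax, 1                  ; which also clobbers eax
        mov [counter], eax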

James Anderson
0

Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.

Rob Lourens
0

Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc.

Jeff
-2

It's just that the instruction set will not allow you to do such complex operations:

add [0x40001234],[0x40002234]

You have to go through the registers.
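
One workaround, as a minimal sketch, is a round trip through a register:

mov eax,[0x40002234]
add [0x40001234],eax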

Nicolas Viennot
  • 1
    There are lots of CPU architectures that will permit exactly those kinds of instructions. The issue is speed, not what operations are permitted. The limited operations come about because nobody in their right mind would do them RAM to RAM anyway. – JUST MY correct OPINION Mar 02 '10 at 05:32
  • 1
    The question was using the IA32 instruction set. And in IA32, it doesn't exist. You just cannot do it. – Nicolas Viennot Mar 02 '10 at 16:28