
There have been CPUs with exposed branch delays, such as early MIPS (see: What was the first CPU with exposed pipeline?).

(Later MIPS implementations kept the delay slots of the early MIPS, though by that time it wasn't about exposing hardware pipeline stages, which would have meant increasing the number of delay slots, but about keeping compatibility with the early ones.)

Did any CPU ever expose load delays? That is, if you load a register from memory and then use the contents of that register, instead of the CPU pausing until the new value is loaded, for the next N clock cycles you just get the previous value that was in the register?
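
For concreteness, a hypothetical sketch in MIPS-style assembly (the register names and the single delay cycle are just assumptions for illustration):

    li   $t0, 123        # $t0 = 123
    lw   $t0, 0($a0)     # start loading $t0 from memory
    move $t1, $t0        # in the load delay: would copy the OLD value 123
    move $t2, $t0        # one cycle later: sees the newly loaded value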

rwallace
  • Does overclocking count? In early versions of the 65816 processor, when overclocked past 4 MHz or so, the m/x processor status bits didn't fully propagate before the instruction finished. Since they affected instruction decoding, that generally resulted in a crash. Some people did throw in a NOP to give it time to finish. Later models were redesigned to eliminate the problem. – Kelvin Sherlock Jan 10 '21 at 01:28
  • @njuffa Not so much delay slots in the MIPS sense; rather, the operations using the results of the loads had to be scheduled to account for load delays, otherwise the performance would suffer substantially. What rwallace is asking about is lack of interlock, and in the presence of a data cache, having no load interlocks is nonsense, as the load delay could not be predicted. – Leo B. Jan 10 '21 at 04:34
  • As a matter of fact, violating the load delay slot requirement did not guarantee the previous value in the register; it was undefined behavior: if there was an external interrupt just after the load instruction, the load was going to finish, and the next instruction would see the new value. – Leo B. Jan 10 '21 at 04:44
  • IIRC the Mill (which unfortunately doesn't seem to have made progress lately) exposes load delays in the sense that the final-stage compiler/loader is aware of them, and can move load instructions forward to spread the delay over other code, but it will still stall when the loaded value is accessed early, rather than read the previous value instead. – dirkt Jan 10 '21 at 06:49
  • @LeoB. I understood the question alright (lack of interlock) but could not clearly remember the details thirty years later ... – njuffa Jan 10 '21 at 07:53
  • @dirkt "Moving load instructions forward" sounds like what every out-of-order CPU does (such as the one in your PC), if you consider that there's no difference between moving one instruction forward, and moving other instructions back. – user253751 Jan 10 '21 at 13:53
  • @dirkt This doesn't need a 'new' CPU. It's a behaviour all pipelined CPUs have shown since the 1970s. After all, wait cycles for a data load are only inserted when the next instruction is waiting for that data, so reordering has been a thing for compilers for a good 40+ years. (BTW, 'The Mill' is a 6809 card for the Apple II :)) SCNR) – Raffzahn Jan 10 '21 at 14:42
  • @user253751 Even OoO execution does benefit from compilers generating code with fewer dependencies between successive instructions. – Raffzahn Jan 10 '21 at 14:46
  • @Raffzahn Please watch this, it looks like you have completely misunderstood what I was talking about. – dirkt Jan 10 '21 at 15:05
  • VLIW and other software-scheduled designs expose all delays. – user3528438 Jan 10 '21 at 15:09
  • @dirkt: I thought of The Mill as well, while editing Raffzahn's answer to describe what I recall reading the R2000 did if you violate the load-delay. There's now a link in the answer to https://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2/ - the section on Loads describes it pretty well. – Peter Cordes Jan 10 '21 at 17:50
  • @Raffzahn okay but nobody was talking about compilers in this thread? dirkt said the Mill CPU could do it by itself. I pointed out that all OoO CPUs can. – user253751 Jan 10 '21 at 21:23

1 Answer


Yes, it happened. MIPS I (R2000/R3000) did suffer from load delay slots. In practice on the R2000, on a cache hit the next instruction would typically see the old value of the load-result register, while on a cache miss it would see the load result. So you couldn't usefully take advantage of it, and there were no on-paper guarantees of anything, unlike on the Mill (*1).
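
For illustration, a minimal MIPS I style sketch of the hazard (register names assumed):

    lw   $t0, 0($a0)      # load word; the result is not ready for the next instruction
    addu $t1, $t0, $t2    # load delay slot: on R2000 typically the STALE $t0 on a
                          # cache hit, the NEW one on a miss - don't read $t0 here
    addu $t3, $t0, $t2    # one instruction later: guaranteed the loaded value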

Other than on MIPS, using load delay slots was never seriously considered, mostly because inserting a delay does not really save much hardware. Stalling the pipeline only needs one (or two) rather simple comparators checking whether the previous instruction's target register is used as a source by the next instruction.

Detecting dependencies with this little bit of logic kills two birds with one stone: any code, independent of ordering, will run correctly, while at the same time 'intelligent' instruction ordering can still be used to hide the load delay.

This method had already been used by (e.g.) the /360 in the 70s, and assembly programmers did take care of it (or not, depending on their quality).


For RISC CPUs this logic can, for all practical purposes, be seen as an automatic 'insertion' of a NOP, something the compiler otherwise had to do whenever it couldn't find an instruction to move into the load slot.

A simple addition like A += B; would create at least two load slots (plus maybe a store slot), which more often than not can't be filled:

    L   R1,VAL1        load first operand
    NOP                load delay slot, nothing useful to fill it with
    A   R1,VAL2        add second operand from memory
    NOP                delay slot again before the result can be stored
    ST  R1,VAL1        store the result

This is by no means a rare kind of code - programs are full of such lines.
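
For comparison, when independent work happens to be at hand, say a second, unrelated addition C += D, a compiler can interleave the two and fill every slot (same illustrative style as above; VAL3/VAL4 are assumed operands):

    L   R1,VAL1        load A
    L   R2,VAL3        load C, filling the first delay slot
    A   R1,VAL2        A += B
    A   R2,VAL4        C += D, filling the second delay slot
    ST  R1,VAL1
    ST  R2,VAL3

Two independent operations sitting conveniently side by side are the exception rather than the rule, though.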

Bottom line: Load slots bring almost no improvement but add considerable code bloat. (*2)


The whole situation with load delays differs a lot from branch delays: a load delay affects data access in linear code execution, it does not stall code fetch. In contrast, a (conditional) jump instruction always needs at least one additional code fetch cycle that can't be avoided. More exactly, the CPU would have to discard the already fetched next instruction without using that cycle for operation. A branch delay slot means nothing else than that this already fetched instruction gets executed anyway. This situation does not exist in linear, non-branching code.
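
To make the contrast concrete, a minimal MIPS-style branch example (labels and registers assumed):

    beq   $a0, $zero, done   # conditional branch; the next fetch is already under way
    addiu $v0, $v0, 1        # branch delay slot: the already fetched instruction
                             # executes whether the branch is taken or not
    sw    $v0, 0($a1)        # reached only on the fall-through path
    done:
    jr    $ra                # return ...
    nop                      # ... with a delay slot nothing useful could fill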

So one can only speculate why the R2000 had that load delay slot. IMHO they tried to make it simpler than it had to be. They even named the ISA after this fact: MIPS = Microprocessor without Interlocked Pipeline Stages, meaning it doesn't stall for anything, except unpredictably long stalls (cache miss/RAM access).

Later, when the load delay slot was removed (MIPS II / R6000), existing R2000/R3000-compatible code was full of NOPs in all the cases where the compiler (or optimizing assembler) couldn't fill the load-delay slot with useful work. These cases were not as rare as the MIPS architects may have hoped; the NOPs helped nothing and only bloated the code.
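
A sketch of the difference (registers assumed, MIPS-style syntax):

    # MIPS I (R2000/R3000): the slot must be filled, with a NOP if need be
    lw   $t0, 0($a0)
    nop
    addu $t1, $t0, $t2

    # MIPS II and later: the interlock inserts the wait cycle itself
    lw   $t0, 0($a0)
    addu $t1, $t0, $t2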


*1 - On the Mill, a load instruction can indicate how many instructions later the result should be ready, allowing use of the old value for up to a few instructions, and reuse of the address register. The Mill is still a paper architecture, though, not a CPU that ever actually existed.

*2 - The same is true for branch slots as well, except they occur less often.

Raffzahn
  • Right. The advantage of a load delay would be not so much saving the little bit of hardware to implement the interlock, as to improve IPC by making it possible for the CPU to do other things while waiting for the load to complete. (An out of order CPU can also do that, but the out of order hardware is quite expensive, whereas the load delay would have slightly negative cost.) – rwallace Jan 09 '21 at 23:25
  • Of course the disadvantage is that as soon as you implement the next CPU of the same architecture – or even a clock speed bump of the same CPU – the number of delay cycles the hardware wants, no longer matches existing code, so you get a mess. Still, it could make sense for some embedded applications, or game consoles. – rwallace Jan 09 '21 at 23:27
  • @rwallace Not really. Especially in embedded, the cost of a load delay slot is even higher, as there are many situations where the slot can't be filled with a useful instruction, forcing the inclusion of an explicit NOP and resulting in bloated code. Not really something embedded developers love to see. – Raffzahn Jan 09 '21 at 23:33
  • I think I recall the MIPS 2000 assembler reordering instructions to populate the load-delay slot (it certainly did for branch-delay slots). – dave Jan 10 '21 at 00:15
  • MIPS originally stood for Microprocessor without Interlocked Pipeline Stages (What is an "interlocked pipeline" as in the MIPS acronym?). It had to be able to stall for cache misses, and in some cases reading lo/hi mult results, but original MIPS I (R2000 / R3000) couldn't stall for "normal" stuff. This may have simplified more than just another comparator? That's partly why mult and div put their results in special registers with restrictive rules about reading them, although that might just have been to reduce comparators. (ping @rwallace) – Peter Cordes Jan 10 '21 at 15:20
  • R3000 was also MIPS I. It was MIPS II that removed the load delay slot (apparently R6000 was the first CPU of that ISA revision: https://www.linux-mips.org/wiki/Instruction_Set_Architecture#MIPS_II. R4000 was MIPS III. Out-of-order naming of in-order pipelines... hmm.) – Peter Cordes Jan 10 '21 at 15:26
  • I made a relatively large edit; you might want to take out some of it if you think it's too much of a tangent, or of course rearrange it. – Peter Cordes Jan 10 '21 at 15:59
  • @PeterCordes Thank you. I reduced the too specific sidenotes a bit. – Raffzahn Jan 10 '21 at 17:04
  • @PeterCordes It is really just a set of comparators (or multiple sets with longer pipelines, but the MIPS I one wasn't such). The special handling of Mult/Div is the result of a schizophrenic situation. On one hand, the goal was to create a true single-cycle-instruction CPU, while also wanting to include Mult/Div as basic instructions, even though they do not fit that scheme. RISC, and most notably MIPS in its early stages, had more in common with a cult preaching its sermon than with engineering. The dogma is absolute; they'd rather shoot themselves in the foot than acknowledge that real life is different. – Raffzahn Jan 10 '21 at 17:17
  • Yeah, RISC philosophical purity taken to extremes. At least MIPS has the excuse that it literally started as an academic project to test the validity of the "RISC = good" hypothesis. Some later machines made engineering decisions while drinking more or less of the kool-aid. (e.g. ARM takes the good ideas but maintains high code density.) – Peter Cordes Jan 10 '21 at 17:42
  • It doesn't really make sense to say that The Mill model allows reuse of the address register or continued use of the old value, as The Mill is not a register machine, so you aren't loading the value into a slot in the same way. It does mean you can allow the address value to fall off the belt, and that the loaded value isn't pushed on for a certain number of cycles, but how that impacts what you can do in the assembly is very different as it impacts everything on the belt, not just two registers. – user1937198 Jan 10 '21 at 21:44
  • @user1937198 It doesn't matter whether one calls it a 'belt' or not; when implemented it's a register, and fetching content to fill it will take time. (BTW, it is not implemented as a belt, but as a register file. The 'movement' is done by constant renaming.) – Raffzahn Jan 11 '21 at 21:22
  • @Raffzahn The key difference is that in a register machine, other instructions executed during the delay don't have to affect the address register. In the belt design the only way to keep the address in the same place on the belt would be to execute no-ops, and in the case of a delay of more than a couple of cycles, actively keeping the value on the belt. And no, it's not implemented as a register file; it's expected to be implemented as a vector of shift registers, because every value has a maximum lifetime. (Unless you have a source other than https://millcomputing.com/docs/belt/?) – user1937198 Jan 11 '21 at 22:04
  • And with regard to using the old value of a register: the concept of a belt means that all Mill instructions have the property of not replacing the previous value, but pushing off whatever is at the end of the belt at the point of retirement. So how do you say which is the 'old' value on the belt? – user1937198 Jan 11 '21 at 22:09
  • I think the big difference between the usefulness of load-delay slots and branch-delay slots stems from the fact that a processor core will typically have many registers, but one logical program counter. When performing a delayed branch, the instruction(s) immediately following the branch effectively behave as though they used a program counter separate from the one loaded by the branch, allowing them to be processed usefully even though the program counter being written by the branch will effectively be "busy" for a cycle or two. – supercat Jan 11 '21 at 23:17