66

Thinking about machines such as the PDP-7, PDP-10, CDC-160 and so on1, these are machines which do not have a stack pointer which can quickly and easily change to point to a new stack frame, like the x86 or ARM machines we have today.

The C standard does not say anything about a stack, since C actually doesn't need one. I can see that local variables may in most cases be placed somewhere in memory for most functions. The functions for which that doesn't work are the recursive ones and the mutually recursive ones, (which cannot be converted to iterative routines by an optimisation stage). But the linker won't know anything about that, will it?

If there is a known compiler which does anything other than use or implement a stack, my question is to see a description of what it does "instead", ideally with a link to source code.


1: Yes, the Parallax Propeller also fits in this category of machines, and I'm sure many microcontrollers do, and so this question is not purely about Retrocomputing. But still I think the question belongs here because early C porting efforts will have the answer to the question.

Omar and Lorraine
  • 38,883
  • 14
  • 134
  • 274
  • 1
    Comments are not for extended discussion; this conversation has been moved to chat. – Chenmunka Aug 08 '18 at 14:26
  • 1
    MIPS for example doesn't have any hardware stack support, and the stack is manipulated manually by load/store relatively to a pointer. So the title is actually incorrect or at least unclear. It should be something like "... had no indirect load/store instruction" since if you can store a memory address and load from that you can have a stack – phuclv Aug 09 '18 at 02:48
  • Just to make sure: by "hardware stack" you're just talking about a stack pointer, not something like a stack machine? – rakslice Apr 29 '19 at 21:01

10 Answers10

76

How was C ported to architectures that had no hardware stack?

Simply by implementing a dynamic memory management for subroutine calling, parameter passing and local memory.

If there is a known compiler which does anything other than use or implement a stack,

Now this is a completely different question than the one made in the headline.

Implementing a stack isn't anything special to porting C, nor is a stack at all an invention of C. C inherited its usage from ALGOL, which was first implemented (*1) by Friedrich Bauer and Klaus Samelson (among others) - the same men who 'invented' the stack in 1955 (*2) and got a related patent in 1957.

A CPU stack as we know it today (and is used by C) is just a very primitive implementation of the stack principle defined by them. It's a variant optimized for subroutine return address handling - not necessarily good for parameter passing or local variables.

All languages that allow local variables and/or reentrant subroutine calling (*3) use some form of stack (*4). This includes such dinosaurs as Fortran or COBOL (*5). Bauer was eventually the most influential single person in early compiler development, writing several papers that defined the foundation for almost every language developed thereafter.

In the beginning the stack was mainly seen as a software issue: a rather complex data structure that needs to suit the language, nothing to be supplied by the hardware. How much this is true can be illuminated by the /360 design. While designed entirely after the stack had been introduced in programming languages and finished in 1965, it does not feature a stack pointer or any auto-indexing instructions that could be used for a stack. The hardware developers didn't think it would be of any good to implement such, especially as it would have been a quite complex instruction (*6), considering the different ways stacks were handled by different languages.

my question is to see a description of what it does "instead", ideally with a link to source code.

In addition, focusing on the stack here is in fact not just about the data structure, but the wider area of calling conventions. So let's split this up a bit between storage and parameter passing.

The /360 is a prime example for a widely used architecture without, so let's use it.

Local Storage for Register Saving and Variables.

Static Storage

The closest that comes to an 'instead' are maybe calling convention used on IBM /360. They were ubiquitous and survived with assembler code way into the 1980s.

Here each subroutine also had a piece of private memory where all register content of the caller (*7) as well as local variables were stored. The register storage area was traditionally called SAVEAREA. There was no separate handling of the return address, as it is stored in one of the /360 registers. On return the registers were restored and program flow was transferred back to the caller. As a side effect, in these implementations all local variables are static and available for the next program run.

Stack Oriented Storage

(To understand this, it is useful to remember that the narrowed down version of a hardware supported word (or byte) orientated stack is just a special case of the general definition.)

Another, more elaborated was the usage of a stack. For example, the COLUMBUS implementation (*8). Here a register, usually R13, called R@STACK, was pointing to the frame of the actual nesting level. It had to be word aligned (*9). The first (lowest) 16 words (64 bytes) where kept free as storage for the register set of the next level, again called SAVEAREA, while everything above was used for local variables (*10). The stack was handled by the callee, not the caller. Thus a subroutine (function) call was simple:

L    R15,=V(FUBAR)           * Linkable address of the function to be called
BALR R14,R15                 * Subroutine call (*11)

The subroutine's entry code depended on the way stack underflow was handled. For this example we assume the usage of guard pages (*12) to issue an addressing error when exceeding the available memory. So everything to be done is saving the caller's register content and establishing a new stack frame:

STM  R14,R13,0(R@STACK)      * Store all registers in the save area (At R13)
SH   R@STACK,=Y(LLOCAL+64)   * Move the stack register down the the amount
                             * needed local storage plus room for a new SAVEAREA

Now the subroutine follows.

At the end, all cleanup is done automagically by restoring the registers of the caller:

LM  R14,R13,LLOCAL+64(R13)   * Restore registers (*13)
BR  R14                      * And return

And we're back.

Parameter Passing

Static Procedures

Conventions for parameter passing and return codes where rather sparse on this and next to every program made up their own variations. In general R0 and R1 where used for parameter passing, often pointing to parameter blocks (*14), then again it was quite common to have a different list of parameters loaded into many registers with no structure at all.

Similar, there were no conventions about parameter returns. Often R15 was used, but not really consistent.

Dynamic, Stack Using

While the COLUMBUS package did introduce a stack to assembler (and other languages), its calling convention was not necessarily stack based. In general, parameters where passed to a function as a memory reference. It was up to the programmer to build up the parameter list (*15), and pass its address in R1.

L     R1,PARLIST_FOR_FUBAR

The called function now can move it wherever needed or address the parameters relative to R1.

Much like the stack frame itself, the parameter list was free form. If it was dynamically build, one would usually reserve some stack space (local variable) to hold it. If static, it could as well be located in some constant area (unless it includes a return value). Or as a combination of both - having a preformatted list in a constant area and move it over to the stack when needed and modified before calling

Not relying on a push/pop mechanic offered a lot more flexibility and performance. In many function calls several, or maybe even all parameters are constant. With preformatted parameter blocks, only real variable parameters need to be stored at their position, reducing the amount of needed instructions thus runtime. The most high performance instruction is still one that doesn't get executed at all :))

If the function was supposed to give a return code, this had to end in R15 of the caller - which was simply done by storing the return code into the save location of R15 within the SAVEAREA. It's like having an automatic declared local return code variable that can be manupulated thruout the function and will be automatic loaded on return without any additional space requirement or runtime cost :=)


Remarks:

In real programming the assembly programmer never touched these instructions, as all was covered in macros. A program's structure looked more like this:

@ENTR TYPE=M,NAME=MAIN
...
@PASS NAME=FUBAR,PLIST=(FCB,RECORD)
@IF   NE
LTR   R15,R15         * Return Code Not Zero?
@THEN
@EXIT RC=#UNSATISFIED * Finish Program with Return Code UNSATISFIED
@BEND
...
@EXIT RC=#HAPPYENDING * Finish Program with Return Code HAPPYENDING
@END

@ENTR TYPE=L,NAME=FUBAR,PLIST=(FCB,RECORD) ... (do whatever needed with FCB and RECORD) ... @EXIT RC=0 * Finish Function with Return Code Zero @END

And yes, that's assembly language - real one that is :)))


*1 - First named ZMMD on a Zuse Z22 in 1958

*2 - Back then called Kellerspeicher, which may be literally translated as basement-storage. I always guess an American would have called it rather attic-storage as they fill the same niche in the US, where basements are way less common.

*3 - Meaning the same subroutine can be called more than once without prior exiting it.

*4 - With only static variables and non-reentrant calling conventions it is possible to come up with calling conventions using static memory. Here each subroutine gets its own memory to save registers and handle local variables - usually as part of the program memory, right after the code. Which is how FORTRAN or COBOL started out, before quite early switching for stack usage.

*5 - Around 1980 Siemens implemented two different sets of stack instructions in their /370 compatible mainframes (X-CPU). Nothing like a simple wordwise PUSH/PULL and one of them more like hardware implementations of linked list.

*6 - As comments by James Large and T.E.D. have underlined, the situation is a bit more complex, especially with FORTRAN. Recursion and thus reentrant enabled functions and non static local variables (i.e.on a stack) where not part of the language definition until FORTRAN 90 came along. In fact, many programs did rely on function variables to be static between calls. FORTRAN further issued a compile time error if a function was called within itself.

But even before FORTRAN 90 there were compilers that did allow recursion and dynamic allocated variable space. Just not as standard.

Then again, at least since FORTRAN 77, recursion was possible by using function pointers. Here the compiler couldn't check the calling tree and no run time checks did hinder the usage. I have a bookmark of a nice page describing how it was done and what a programmer had to keep in mind.

*7 - At least the ones that were to be overwritten.

*8 - There were others - after all, software is ... well ... soft and allows many ways of implementation.

*9 - At least. Aligning to doublewords of better was standard - and there's a nice story how alignment here did improve performance in a macroscopic way - but that's a different story.

*10 - Depending on the configuration another 32 bytes were reserved for debugging information. In this case the entry consisted of a runtime call where frame information was checked, the new frame was build and debug information like the name of the called function was stored in the additional fields. Having function names as readable EBCDIC at the begin of each stack frame was just great when reading a dump.

We had this feature enabled by default, as it also allows a function to return to some known caller within the calling sequence, bypassing all intermediate ones. Extremely handy in error handling. That way, one doesn't have to schlep some error from a function way down through all the whole calling sequence, with (more or less) specific error handling in each of them.

*11 - Well, BAL R15,FUBAR could as well, if FUBAR is within the address range of the current module. In everything but small demo programs it isn't.

*12 - That's simply one read protected page above and below the available stack page. Accessing this as in case of over- or underrun will result in an address violation which in turn can be handles by the runtime according to what is intended. Like extending the stack memory, or cancelling the program run, or whatsoever.

*13 - This implementation allows only a maximum stack size of 4 KiB minus 64 bytes. Which is natural as stack addressing based on the stack register can only address 4 KiB anyway. There are other implementations that allow more stack, but then again will also require more complex addressing. If a procedure needs that much memory it is more useful to get it from the heap anyway.

*14 - Like having R0 point to a FCB and R1 to a record buffer when doing a READ or WRITE call for a file.

*15 - There where support functions for building parameter lists from an argument list, but that doesn't change the basic workings, so let's skip that for simplicity.

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • 4
    Re, "This includes such dinosaurs as FORTRAN..." FORTRAN is a family of languages. I spent some time writing programs in FORTRAN II, which allowed me to declare variables with local scope, but it statically allocated function activation records (i.e., using the System/360 calling convention that you describe above.) Recursive calls and other reentrant calls were not possible. – Solomon Slow Aug 06 '18 at 18:26
  • 4
    About #3 and 4: What is written is technically correct with the caveats given. However, Fortran is of a similar vintage as Algol and the stack itself (53-57), and thus initially did not assume or typically use a stack, and indeed its subroutines were generally not reentrant. Due to the extreme importance of backward compatibility, I'm not sure there was official support in the language until Fortran 90. – T.E.D. Aug 06 '18 at 18:29
  • You're both right (@jameslarge , T.E.D.), it's a complex issue with a lot of variables (sic!) . Fortran didn't directly allow recursion before FORTRAN90, still recursion was possible by using function pointers. More important here, there where compilers prior to FORTRAN90 that did allow the use of reentrand functions and stack managed variables. Just not standard. – Raffzahn Aug 06 '18 at 18:50
  • @jameslarge: From what I've seen, newer versions of Fortran seem like they would be superior to "only-define-behaviors-mandated-by-the-Standard C" for many purposes. I've not used any version of Fortran since FORTRAN-77, but Fortran seems to have been becoming suitable for more purposes while "only-define-behaviors-mandated-by-the-Standard C" has gone the other way. – supercat Aug 06 '18 at 21:26
  • ZMMD appears to have been an implementation of IAL (Algol 58), not Algol 60. Since there was no block structure and no recursive procedure activation (that was an Algol 60 innovation), it's not clear why ZMMD would require a stack. – dave Nov 09 '18 at 13:28
  • @dave not sure what your point is. Read again and you'll notice that nothing you note has been claimed. Also, IAL had blocks, as well as it did not deny the recursion for any implementation. In fact, Samelsons 1958 IAL report already includes recursion. – Raffzahn Nov 09 '18 at 14:42
  • 1
    I don't think runtime stack was needed for IAL programs - all storage could be static. Since then, I read Bauer's oral history transcript (now at CHM) and I think he was talking about stack-based approaches for compiling expressions. I remain convinced there were no blocks in the Algol 60 sense (compound statements, yes - but declarations within compound statements had procedure scope). However, nested procedure bodies may have sneaked the concept in unwittingly. – dave Nov 09 '18 at 22:44
  • There is no need for a stack to make local scope restricted to a block within a procedure, as these mamory hirarchy does not contain any run time dynamic, only fencing during compile time. But without a stack, no dynamic memory for procedures (which includes all blocks) is possible. To my understandign these are unrelated concepts. – Raffzahn Nov 09 '18 at 23:16
  • I agree no procedure dynamic memory without stack - but there's no requirement that I see that an IAL procedure to do that. It could use static allocation a la Fortran. – dave Nov 09 '18 at 23:34
  • Right, but since procedures scope does need dynamic storage, indices are good it was - I think one of the problems here is - as so often with early documentation - that the vocabulary hasn't been made and writers wheren't ware of the need to specity things that we assume as mandatory today. Part of the fun of reading old books, isn't it? And yes, they could have done it like fortran, but considering that it was made exactly by the same who made the stack a core pice of their contemporaty work makes it more likely to be dynamic, don't you think so? – Raffzahn Nov 09 '18 at 23:53
18

I've seen three common approaches to handling that:

The first approach is to have the compiler statically allocate an area of memory for use as a stack, and maintain a global pointer to the current "stack pointer". On a platform with a subroutine call/return stack that's too small to accommodate expected function nesting, something like:

int foo(int a, int b);
...
foo(1,2);
...
void bar(int);
int foo(int a, int b);
{
  int temp=a+b;
  a>>=1;
  bar(a);
  bar(temp);
}

may be handled, absent optimization as:

    *compilerSP++ = 1; // Push first arg
    *compilerSP++ = 2; // Push second arg
    CALL _foo   // Stores return address in a special register
    compilerSP -= 2; // Remove pushed args
    ...
_foo:
    *compilerSP++ = RETURN_ADDR;  // Get called function
    compilerSP++; // Reserve space for temp
    compilerSP[-1] = compilerSP[-4]+compilerSP[-3]; // temp = a+b
    compilerSP[-4] >>+ 1; // a>>=1;
    *compilerSP++ = compilerSP[-4]; // Push a
    call _bar
    compilerSP--; // Remove pushed arg
    *compilerSP++ = compilerSP[-1]; // Push temp
    call _bar
    compilerSP--; // Remove pushed arg
    compilerSP--; // Remove space for temp
    RETURN_ADDR = *++compilerSP;
    RETURN;

Performance of this approach is generally pretty bad, but it will support recursion and re-entrant functions without difficulty.

A second approach which is not conforming but is generally more useful is to forbid recursion, compute what the worst-case stack usage would be for each function, and then have the linker statically allocate space for all arguments, local variables, and any temporary variables the compiler needed to make things work. Beyond the fact that such an approach greatly improves performance on systems without stacks, it has a second major advantage even for systems with stacks - such a great advantage, in fact, that I think the Standard should have recognized a subset of C without recursion. If recursion is disallowed, programs can be statically guaranteed never to have any kind of "stack overflow" at run-time. If the linker can't produce a static allocation that's guaranteed to work, it will reject program at link time without it running at all. Although some kinds of program need support for recursion and re-entrancy, waiving such support would allow implementations for safety-critical systems to guarantee that combination of inputs could bomb the stack--something that would be useful even on systems that use stacks.

A third approach which is conforming, but still performs better than the first, is to statically allocate space for local variables and arguments whose address isn't taken, but also keep for each function a flag to indicate its level of nested invocations. If the function is being invoked when it is called again, the second invocation should copy all of its associated automatic objects to a compiler-maintained "pseudo-stack" like the one used in the first approach before allowing it to be invoked recursively (variables whose address is taken would need to be allocated on the pseudo-stack whether code is invoked recursively or not). When a function returns after having been invoked recursively, it should restore all saved automatic objects from that pseudo-stack. This approach, unlike the second, conforms to the Standard, but will run slower because of the need to check whether functions are being invoked recursively. Although I've used a compiler (Archimedes V4) that could be configured to employ this approach, I've never enabled it since it adds extra overhead to all code whether or not it uses recursion. The overhead isn't as severe as with the first approach, but unless an application needs recursion, there's no reason to accept any overhead.

supercat
  • 35,993
  • 3
  • 63
  • 159
  • "A third approach which is conforming" i'm not convinced this is conforming, in particular it seems like it would be broken by taking pointers to local variables. – Peter Green Aug 07 '18 at 20:15
  • 1
    @PeterGreen: Hmm... I hadn't thought of that. A compiler could handle that by allocating pseudo-stack space for any automatic object whose address is taken, at the cost of making accesses to that object much slower than they otherwise would be. As I've said, I would have preferred to see the Standard make recursion optional, especially since it doesn't specify any level of nesting for which it has to work without blowing the stack. Recursion support is sufficiently impractical on some platforms that even if an implementation could support it, programmers would avoid using it at all costs. – supercat Aug 07 '18 at 20:33
  • 3
    @PeterGreen: On a platform like the 8051 where I've seen such an approach used, code which is adapted to fit the limitations of the CPU can often be smaller by a factor of 3 or more, and faster by an order of magnitude, than code which is written in more "typical" fashion. Tolerating the limitations of the CPU in exchange for such huge improvements in efficiency seems like a useful trade to me. – supercat Aug 07 '18 at 20:37
  • 1
    @PeterGreen: As a simple example, if x is an automatic char whose address has not been taken, x++ would be inc FUNCTIONNAME?x, a 2-byte 1-cycle instruction. If it were an automatic char whose address had been taken, code would likely be mov a,_stacklo / add #offX / mov dpl,a / mov a,_stackhi / adc #0 / mov dph,a / movx a,@dptr / inc a / movx @dptr,a. 15 bytes and 11 cycles. – supercat Aug 07 '18 at 20:45
  • The approach of static allocation with copying on detection of recursive activation was used by at least one historical Algol 60 compiler (whose name I have inconveniently forgotten). There, of course, there was no issue with pointers. – dave Nov 09 '18 at 13:31
  • I can't believe we're still discussing in the 21st century whether recursive activation should be optional in a programming language. :-) And, of course, with a language that supports modular compilation (i.e., has a link phase), you can't determine at compile time whether something is recursively called. Nevertheless, an implementation is perfectly free to provide #pragma nonrecursive if it thinks it has a customer base that would want it. – dave Nov 09 '18 at 13:38
  • @dave: It seems crazier to me that programs that don't need recursion are processed by tools that can't statically measure or validate stack usage. If a program's call graph is non-cyclic, it's possible to statically allocate all automatic objects in the same amount of storage as a stack would require in worst case if all edges on the graph were reachable and arbitrary combinations of interrupts could be invoked at any point. Unlike when using a stack, however, there would be no need to worry about whether the allocation was larger or smaller than necessary. – supercat Nov 09 '18 at 16:18
  • @dave: Even programs that use recursion could be statically validated if external functions were tagged with their stack requirements and there were a __statically_safe intrinsic which would yield zero unless the system could guarantee that returning non-zero would not cause a stack overflow. The stack space required to safely call a function that starts with int foo(void) { if (!__statically_safe) return signal_error(whatever); ... } would be just over that required to call signal_error even if the main body of the function invoked itself recursively, and thus... – supercat Nov 09 '18 at 16:28
  • ...a compiler could render such a function as int foo(void) { if (__stack_pointer < __LINKER_SUPPLIED_VALUE) return signal_error(whatever); ...} while giving the linker the information necessary to compute the value in question. While "hope for the best" stack semantics may have made sense in the 1980s, it seems crazy that in the decades since there's not been any serious work put into static validation. – supercat Nov 09 '18 at 16:30
10

Funny enough, some contemporary architectures, like ARMv4 do not have a hardware stack either. Any subroutine call/return is done through the link register that temporarily holds the return address.

However, what would be called a 'software stack' is still present: some other register is assigned the role of stack pointer: you save the link register where it points as soon as the subroutine is entered, you push and pop registers there through the use of multi-register moves that also pre- or post- increment or decrement register used as a stack pointer (however those instructions use equally any other register as a pointer).

Also worth noting is that a 'hardware stack' has also the meaning of a truly hardware, limited-size stack (made of shift registers or small memory block) that is used nowadays in 'big' CPUs as the predictor of the return address within the speculative execution framework.

Peter Mortensen
  • 260
  • 2
  • 7
lvd
  • 10,382
  • 24
  • 62
  • 1
    I think it's a matter of perspective. R13 is designated for use as a stack pointer, with the ARM docs noting, "In many instructions, you can use R13 as a general-purpose register, but the architecture deprecates this use of R13 in most instructions. For more information see the ARM Architecture Reference Manual." I guess the distinction you're making is that ARMv4 has no instructions that implicitly use SP, just general-purpose instructions that implement a stack, which by convention target R13 when used for procedure entry/exit. – fadden Aug 07 '18 at 14:51
  • 2
    @fadden: What changed with the ARM was that older versions of the architecture handled interrupts by having more than one R14 and R15, and the ability to switch between them, but newer ones push store registers to memory at the descending address using R13. – supercat Aug 07 '18 at 18:03
  • Arm has hardware stacks (indirect auto inc/dec), but not hardware call stack. The link register is used to avoid bus contention (two accesses to memory at the same time) – ctrl-alt-delor Aug 08 '18 at 06:51
7

On very early computers there was no hardware support for a stack, because it was not until they started to program it that they realised that they needed one. The solution was the Wheeler jump — invented by David Wheeler.

The Wheeler jump

This involves modifying the program as it runs. So will work on a Von Neumann architecture, but not Harvard. (Harvard came later, and will always have mechanism to implement a stack).

Before a subroutine is jumped to (branch/goto, type jump), the last instruction of the routine that is called is changed to a branch to the next instruction. This last instruction was left blank, for this use. The programmer, or loading tools, need to know the length of the routine that is called. It is a big pain.

Recursion is impossible, because there is only one place (per routine) where the return address is stored. However it is possible to have a subroutine, call as subroutine, call a subroutine. This is all that most programmers do anyway, and most recursion can be replaced by iteration.


This answer is about earlier times, than those mentioned in the question, because it gives some context, and shows what can be done, in relation to the question, with poor hardware support.

ctrl-alt-delor
  • 241
  • 1
  • 4
  • 1
    Can you point to a C implementation which used the Wheeler jump? I can't picture it. – Omar and Lorraine Aug 07 '18 at 13:10
  • 2
    This was way before C. Everything was done in assembler. They had a loader called initial instructions, it would do some translation of the program as it was loaded, including relocation. I don't know if it did wheeler jump adaptation. After this every CPU had stack support, I am pretty sure that the ones that you listed, do as well. – ctrl-alt-delor Aug 07 '18 at 15:29
  • 1
    As you say, recursion doesn't work with the Wheeler jump, so it really doesn't replace a stack - it replaces e.g. a subroutine jump where the current PC is stored in a register. That's still far from a stack. The "callee stores return address in local space" technique stayed common for a long time, e.g. for the PDP-8, and, as shown above, also for the IBM/360. I also doubt that the "next computer" (which? EDSAC II?) "they" (who?) built had a hardware stack - there wasn't enough memory to make that work effectively. Do you have any references for that claim? – dirkt Aug 07 '18 at 16:49
  • @Wilson Well, the SC/MPs subroutine handling is (somewhat) similar to a Wheeler jump, and ther is - marvel of todays hacking - a C-compiler produceing SC/MP code. Caveat, I didn't check how it handles calls. But that's most definitly an architecture far from standard call stack conventions :)) – Raffzahn Aug 07 '18 at 18:11
  • 1
    @Raffzahn I'd be interested if you could share a link. – Omar and Lorraine Aug 07 '18 at 18:12
  • This is not an answer to the question - still I will not flag it, as it's some valuable information about subroutine handling. Two points are missleading: a) The stack, and thus computers using it wasn't unvented until much later and b) Harward (as in seperate data and programm address space) do not need a stack either. All what's needed is a way to perform a calculated jump. A subroutine can now be build by storing the address where to return to in a common data location, jumping to the subroutine, which in turn wll retrive the address from it's variable/register/whatsoever and jump back. – Raffzahn Aug 07 '18 at 18:13
  • Ooops, sorry, @Wilson, totally forgot that: https://sourceforge.net/projects/c-compiler.scmp-emulator.p/ My bad. – Raffzahn Aug 07 '18 at 18:14
  • @Wilson there are some good videos on computerphile. They did two on Wheeler jump. – ctrl-alt-delor Aug 08 '18 at 06:15
  • 1
    If you try hard enough you can replace all recursion with iteration, not just most. It is mostly just not considered practically useful. This is the essence of the Church-Turing Thesis – Tim Seguine Aug 11 '18 at 14:14
  • @TimSeguine You are correct, with iteration (and Turing completeness), you can implement a stack, and simulate recursion. But in software everything is simulation, therefore a simulation of recursion is recursion, so we have not avoided recursion. – ctrl-alt-delor Aug 12 '18 at 09:52
  • 1
    @ctrl-alt-delor That's a pointless meta argument. You can reverse usage of the terms iteration and recursion without changing its substance. The entire point of the Church-Turing thesis is that you can do whatever suits your purposes without a loss of generality. You are trying to split an invisible hair. – Tim Seguine Aug 12 '18 at 13:00
  • The point about the Wheeler jump was that it provided a way to get back from a subroutine. i.e., what was really being invented was the "closed subroutine". Wheeler codified a sequence of EDSAC instructions to accomplish call and return. This is of course such a basic concept that it's hard to realize today that it had to be invented. The alterative was the "open subroutine", which got copied into every place it was needed (today we call that a "macro"). – dave Nov 09 '18 at 23:07
6

A program stack is a very simple data structure that requires only

  • indexed or indirect memory addressing (pointers)
  • indirect or computed jumps
  • a designated register or global variable holding the stack pointer/index
  • basic arithmetic, i.e. increment and decrement operations


  • Pushing a value means decrement the index, then store.
  • Popping is loading a value at the index, then increment the index.

(Of course one can swap decrement and increment to have an upward growing stack)

  • Calling a subroutine is pushing the return address (program counter plus some offset). If there is no other way to access the PC, the compiler could emit a direct register load followed by a push, and the linker can fix up the loaded value with the return address.
  • Returning from a subroutine is popping the return address from the stack, and jump to it.

Modern processors tend to implement these operations in single instructions simply because they are often used together, and doing so would save code space and speed up execution.

  • 2
    No, modern processors implement these because old processors did. These instructions are pretty much useless for modern compilers, because matching their effects against the abstract representation of the program is nontrivial when you also want to be able to reorder instructions for better efficiency, keep the invariant that the stack pointer points to an unused address (for exception handling) and emit usable debug information for the program that allows the debugger to find any variable on the stack. If you see push/pop emitted, that is a hardcoded pre-/postamble bypassing the optimizer. – Simon Richter Aug 06 '18 at 17:16
  • @SimonRichter - I think the answer was more referring to instructions like the 80186 ENTER and LEAVE rather than PUSH or POP, although it's still true that push and pop are commonly used (although generally only for argument passing and not temporary storage, for the reasons you cite). – Jules Aug 06 '18 at 18:36
  • 2
    I suspect that berendi and @SimonRichter has different ideas about when the divide between "old" and "modern" processors occurred. (About 1970 vs. about 2000?) The answer would be better if it explicitly stated what it mean by "modern processors" – Stig Hemmer Aug 08 '18 at 07:50
  • @StigHemmer exactly, I was thinking something like "modern processors like the 8080..." – followed Monica to Codidact Aug 08 '18 at 10:58
  • @SimonRichter: all modern compilers targeting x86 use call instead of push return_addr / jmp, and use ret instead of pop rcx / jmp rcx, because it's smaller and much faster. (There's hardware support to make it efficient, like return-address predictor stack for matched call/ret. See http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/. And for stack ops in general, a stack-engine that eliminates the updates to RSP in the out-of-order core, avoiding dependency chains through RSP and making push/pop single-uop. They were slow and avoided on P5 Pentium, but no longer.) – Peter Cordes Aug 08 '18 at 12:16
  • @SimonRichter: In fact, on current x86-64 hardware, push a dummy register is used instead of sub rsp,8 because it's faster on Intel (avoids a stack-sync uop). Why does this code pop into the same register twice?. You seem to be describing gcc -faccumulate-outgoing-args which used mov instead of push for stack-args calling conventions, but that hasn't been the default tuning for a while now. See https://godbolt.org/g/iFv2UH for calling printf with -O3 -march=pentium2 vs. -O3 -march=haswell vs. x86-64 with generic tuning. – Peter Cordes Aug 08 '18 at 12:22
  • @PeterCordes: How does the cost of call X / jmp Y compare with push Y / jmp X? – supercat Aug 01 '19 at 17:45
  • 1
    @supercat: push / jmp instead of call guarantees a branch mispredict when the target function returns. All x86 CPUs for a long time have had a return-address predictor stack that tracks call/ret instructions and assumes they're matched, because ret is an indirect branch that's otherwise hard to predict. http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/ – Peter Cordes Aug 01 '19 at 19:11
4

The 6502 has only a stack of 256 bytes, so cc65 implements a stack in software.

Polluks
  • 465
  • 3
  • 7
3

Even nasty recursive stuff can be converted to iteration. For example, take a look at this:

Which does exactly that on a platform where recursion is not allowed probably due to the same reasons you are asking about (not sure if a GPU have stack pointer or not).

Anyway, if a computer has memory access with variable addressing (like by register) then you can implement a stack on your own. It is easy, just a few lines of code ... For example, you can do basic stack emulation stuff in Z80 assembly like this:

StackPtr: dw 0

push_a:
 ld hl,(StackPtr)
 dec hl
 ld (StackPtr),hl
 ld (hl),a

pop_a:
 ld hl,(StackPtr)
 ld a,(hl)
 inc hl
 ld (StackPtr),hl

If you do not have addressing by register then you need to do it by self-modifying code where you overwrite the memory access instruction address on a fixed memory position which is then used ... The same goes for call,ret jumping.

Ackermann function

As I mentioned before, even nasty recursive stuff can be converted to an iterative process. Here for example, the Ackermann function in C++:

//---------------------------------------------------------------------------
// https://en.wikipedia.org/wiki/Ackermann_function
//---------------------------------------------------------------------------
DWORD ackermann_r(DWORD m,DWORD n)
    {
    if (!m) return n+1;
    if (!n) return ackermann_r(m-1,DWORD(1));
            return ackermann_r(m-1,ackermann_r(m,n-1));
    }
//---------------------------------------------------------------------------
DWORD ackermann_i(DWORD m,DWORD n)
    {
    const int sz=128;               // stack size
    int i=0;                        // stack pointer
    DWORD a[sz];                        // stack
    for (;;)
        {
        if (!m)
            {
            n++;
            if (!i) return n;       // final result
            i--; m=a[i];            // previous recursion level
            continue;
            }
        if (!n) { m--; n=1; continue; }
        if (i>=sz) return 0;        // recursion too deep
        a[i]=m-1; i++;              // new recursion level
        n--;
        }
    }
//---------------------------------------------------------------------------

The _r means recursive and _i means iterative. Of course, without bignum arithmetics we are limited to very small m, n input here comparison of the two functions:

[recursive Ackermann(m,n)]
  m\n      0      1      2      3      4 
    0      1      2      3      4      5 
    1      2      3      4      5      6 
    2      3      5      7      9     11 
    3      5     13     29     61    125 

[iterative Ackermann(m,n)]
  m\n      0      1      2      3      4 
    0      1      2      3      4      5 
    1      2      3      4      5      6 
    2      3      5      7      9     11 
    3      5     13     29     61    125 

The DWORD is an unsigned 32 bit integer... As you can see, I am using my own stack/heap implementation inside the iterative function (but that does not make it recursive!!!) nor is it using stack-related instructions, just memory access ...

The drawback is that you are limited with heap/stack size (this stuff grows really fast), but if you realize you can have a global heap/stack for all the functions even this limitation partially disappears ... But do not befouled; even recursion is limited in the same way as the stack/heap of a commonly compiled program is usually limited too...

Another option is to compute/estimate the needed stack size for example by some series) and use dynamic allocation ... It's also doable without the estimation if we use a dynamic-linked list instead...

While testing, beware this stuff grows crazy. For example:

ackermann_i(4,1) = 65533

took 9.550399 seconds to compute and used sz=65531 DWORDs (while the recursive version stack overflowed already as iterative functions usually need less space than their recursive counterparts).

Peter Mortensen
  • 260
  • 2
  • 7
Spektre
  • 7,278
  • 16
  • 33
  • 1
    Not all recursion can be converted to iteration, though it is rare to come across it. – ctrl-alt-delor Aug 07 '18 at 12:40
  • @ctrl-alt-delor that is not entirely true (at least to my knowledge) as you can always write iterative parser/interpreter and emulate recursion even then ... I also saw few times that recursion was hardcoded/compiled as "unrolled" code instead of recursive subroutines. – Spektre Aug 07 '18 at 12:43
  • I agree, unfortunately I can not correct your knowledge, as I do not remember where I read it. It was very academic. There may not be any real world situation where these algorithms are used. However in most practical cases you are correct. Though in is only tail recursion that can be easily unrolled into iteration. Arbitrary recursion can get practically impossible. I think that is what the paper was about. To answer the question, is it just practically impossible, or are there situations that is is impossible. They found a class of algorithm where is was impossible. – ctrl-alt-delor Aug 07 '18 at 12:55
  • 2
    The 'self-modifying' remark in the answer holds true about the way PDP-8 made its subroutine calls. The calling instruction stored the return address in the first word of the subroutine body and continued execution from the second word. – lvd Aug 07 '18 at 14:13
  • @lvd: That was exactly how the Prime R-mode call instruction worked too. – Martin Bonner supports Monica Aug 07 '18 at 15:19
  • @lvd I started with computers much much latter when stack was taken for granted already but IIRC first time I encountered this techniques was in ZX assembly to speed up Bresenham and also in some calling conventions from higher level languages (on MCUs I think). Later on I used that in x86 assembly for hardcoded string subroutines like in here Graphics mode in assembly 8086 look for call printl usage (although the technique uses modifying stack there... but the original code used modifying jump) – Spektre Aug 07 '18 at 15:51
  • I found found a function that can not be implemented recursively “Ackermann function” The Ackermann function is the simplest example of a well-defined total function which is computable but not primitive recursive. — https://en.wikipedia.org/wiki/Ackermann_function There is also a set of videos on it here https://www.youtube.com/watch?v=HXNhEYqFo0o – ctrl-alt-delor Aug 08 '18 at 06:35
  • 1
    Note the Z80 has a hardware stack. And that code is not stack emulation, it is an implementation of a stack. A stack is a data structure, not a hardware feature. – ctrl-alt-delor Aug 08 '18 at 06:39
  • 1
    @ctrl-alt-delor was curious so I tried to encode the Ackermann function both recursively and iteratively and both works so it looks I was right (until someone create another horribility that really can not be converted to iteration but I am afraid such stuff would require construct/syntax even nowadays recursive languages do not allow for now... – Spektre Aug 08 '18 at 08:08
  • @ctrl-alt-delor I repaired the emulation like you suggested. btw I just realised you wrote function that can not be implemented recursively “Ackermann function” Until now I memory read it as function that can not be implemented iteratively “Ackermann function” which is what you most likely meant in the first place :) ... its refreshing to see that I am not the only one making "typos" like that – Spektre Aug 08 '18 at 08:52
  • @Spektre yes my bad (not really a typo, more of just me mixing the words up). – ctrl-alt-delor Aug 08 '18 at 10:36
  • 2
    Note that ackermann_i is recursive, it just does not use recursive function calls. However it implements its own recursion. This shows you don't need recursive function calls, but you do need to be able to implement a stack. – ctrl-alt-delor Aug 08 '18 at 10:41
  • @ctrl-alt-delor: exactly. You need some kind of stack for Ackermann, or for in-order tree traversal (without modifying the nodes in-place to record state). I used the same example in an answer to What are good examples that actually motivate the study of recursion?. It's easiest (in C) to use recursion but asm can conveniently use the call-stack (if there is one) as a stack data structure. BTW, Ackermann is computable but not "primitive recursive" in formal CS terminology, but unfortunately IDK what "primitive recursive" means :/ – Peter Cordes Aug 08 '18 at 12:01
  • In order to resolve recursion into iteration, the compiler has to recognize it as such. This is not always possible, especially if recursion occurs between translation units. – tofro Aug 08 '18 at 16:39
  • @tofro That is an artificial restriction compiler writers imposed upon themselves for scalability. – Tim Seguine Aug 11 '18 at 14:18
  • 1
    @TimSeguine I would really hesitate to call the possibility to modularize programs an artificial restriction. – tofro Aug 11 '18 at 14:29
  • Re "9.550399 seconds": That is 7 significant digits. Are you sure that is warranted? – Peter Mortensen Aug 11 '18 at 20:07
  • @tofro there are other ways to facilitate modularity. That design choice was not strictly about modularity, but about separate compilation, which is sufficient but not necessary for modularity. We're in the weeds though. The point I was making is that they can't see recursion because they made an (arguably valid) design choice that limits their ability to detect recursion. I called it artificial because they didn't have to make that choice. – Tim Seguine Aug 12 '18 at 12:43
2

Firstly the PDP-7 had autoincrement capability for some of its memory registers (addressing was direct or memory indirect); the PDP-11 also has autoincrement registers. Note that many CPUs of that era did not have a stack for subroutine calls but instead used a "frame pointer", saving the old PC as an index to the calling routine's parameter data. Second, hardware features can be implemented in software, both in the original Assembly language version of Unix as well as Unix written in the C language. The Unix clone, Minix, was written for the i8086 originally and has a software Virtual Memory Manager (most newer CPUs have a hardware one). PDP-7 instruction set: http://simh.trailing-edge.com/docs/architecture18b.pdf

Steve J
  • 141
  • 1
  • 1
  • 2
    Yes, but how does it answer the question? – Omar and Lorraine Aug 07 '18 at 04:53
  • is that a link-register, or frame-pointer? – ctrl-alt-delor Aug 07 '18 at 12:39
  • Note that the most popular ≥32bit CPU of this era (the Arm). Does not have a hardware call stack, it too has a “link register” (not a frame pointer, a flame pointer is used to automate stack un-rolling, is stored on the stack, and can be used with or without a link register). In CPUs that use a link register, it is usually possible to push the link-register on to a stack. – ctrl-alt-delor Aug 08 '18 at 06:46
  • @ctrl-alt-delor: Some RISC ISAs don't have a dedicated "stack register", but the ISA (with plenty of regs and convenient instructions) makes it efficient to dedicate any of the GPRs as a stack pointer. So by software convention you do have a normal call stack on MIPS or PowerPC, for example, even if they never modify it on their own even for exceptions / interrupts. (I know ARM has register banks, but I thought there were some interrupt conditions that did make it automatically use the stack asynchronously, at least on some flavours of ARM. ARM is less RISCy than the full-on RISC ISAs.) – Peter Cordes Aug 08 '18 at 11:49
  • On Arm (at least originals), there are 14 general purpose registers. Plus program-counter, and link-register. The link-register & one other register are banked, depending on mode (the intention is that you use this other banker register as your call stack, but there is nothing else special about it.). Interrupts put the return address onto the link-register of interrupt mode, interrupts are then disabled, the interrupt handler must save the link-register [on the stack], and re-enable interrupts. (so very similar to mips/power). I think only non-risk instructions are load/store multiple. – ctrl-alt-delor Aug 08 '18 at 14:14
  • Just to say that a very early high level language using the auto-increment capability was FORTH which has some great capabilities and unusual effects, because for the most part it was simply a list of addresses to call, without the need for a 'CALL' instruction. That said it also uses a separate data 'stack'. https://www.forth.com/resources/forth-programming-language/ for some history (I used it from 1976..). – Philip Oakley Aug 08 '18 at 15:21
1

Looking at the manual for PDP-7 (1964), it has:

  • Jump and copy program counter to accumulator (jsp)
  • It also has memory indirect addresses, dac i deposit indirect, lac i load indirect.
  • With up to 8 of these being auto incrementing (depending on memory size).

These last two can be used to implement a stack.

The first uses the accumulator as a link register (the arm has a dedicated link register). The link register (accumulator) can be copied to/from the stack.

Peter Mortensen
  • 260
  • 2
  • 7
ctrl-alt-delor
  • 241
  • 1
  • 4
  • Can you point to an implementation which does this? – Omar and Lorraine Aug 07 '18 at 17:59
  • @Wilson I have only looked at the manual. I did wonder after writing this, how would a pop be done. This needs decrement, so possibly with more instructions. – ctrl-alt-delor Aug 08 '18 at 06:18
  • @ctrl-alt-delor why two answers? I would merge them into one ... – Spektre Aug 08 '18 at 08:46
  • What do you mean by "the arm has a dedicated link register"? Do you mean "the RAM has a dedicated link register"? – Peter Mortensen Aug 11 '18 at 20:13
  • @PeterMortensen No, it is not in RAM. It is one of the core 16 registers (R14). Most of these 16 are general purpose, expect LR=R14, PC=R15, SP=R13 (R13 is less specialist, just that the OS may save your context on this stack. And like R14=LR and the condition code register , it is hardware banked across all modes.) – ctrl-alt-delor Aug 12 '18 at 10:09
-4

There is no such thing as a "hardware stack"; what you're calling a hardware stack is presumably just push/pop (and possibly call/ret) instructions that operate on a specific fixed register and memory addressed by it.

Unlike some other answers claim, no "dynamic memory management" is needed in the absence of such instructions. You just do the same thing you always do to implement call frames with a moving stack pointer, choosing a particular register for use as the stack pointer (at least at call time; if you need to support interrupts/asynchronous events, it probably has to be permentantly-reserved rather than just at call time. Then you just increment/decrement it and store/load at addresses based on it just like you would with a register that the ISA specifically designates as a "stack pointer".

  • 1
    "Hardware stack: A common use of stacks at the architecture level is as a means of allocating and accessing memory." https://en.wikipedia.org/wiki/Stack_(abstract_data_type)#Hardware_stack – Bruce Abbott Aug 07 '18 at 06:05
  • @BruceAbbott: As far as I can tell from reading the article, it's exactly what I said: a misleading term for a dedicated-purpose register and CISCy stack manipulation instructions in the ISA. – R.. GitHub STOP HELPING ICE Aug 07 '18 at 22:58
  • 1
    Not misleading. A hardware stack is maintained by hardware with minimal (ie. invoking single opcodes) to no (eg. interrupt vectors) software involvement. A RISC machine that requires sequences of several instructions to implement a stack is using software to do it. For example the RCA CDP1802 needs ~26 instructions to implement standard Call and Return functions that are single instructions in other CPUs (which may use random logic or dedicated microcode to perform the actual stack manipulations, but that is hardware not software). – Bruce Abbott Aug 08 '18 at 09:23
  • 1
    Some ISAs, like x86, do in fact have a stack as part of the architecture. Interrupt handling asynchronously pushes stuff onto the kernel stack (What happens in the x86 architecture when an interrupt occurs?), so it's not optional to have the kernel RSP pointing to valid memory. You can implement functions without using call/ret or push/pop, have a red-zone for userspace, or even use a different register as the user-space stack pointer, but you can't avoid having a valid kernel stack pointer in RSP unless you run with interrupts disabled. – Peter Cordes Aug 08 '18 at 12:07
  • 1
    But sure, for the purposes of this question, the important thing is having enough registers to dedicate one as a stack pointer, and efficient instructions for manipulating it in software. It doesn't actually matter if HW uses it implicitly as a stack pointer or not, because x86-64 and MIPS aren't fundamentally different wrt. how C is implemented. Non-leaf callees pushing a link register achieves the same goal as call pushing a return address. – Peter Cordes Aug 08 '18 at 12:10