18

I'm currently messing with 6502 assembly on a C64, and I don't understand why the JSR instruction is so weird.

According to the instruction table, JSR is a 3-byte instruction and only operates in absolute mode. However, JSR only increments the PC by 2 before pushing it on the stack, which means the return address points to the last byte of the JSR instruction itself. It seems that RTS pops the value from the stack and increments it before setting the PC to the corrected value.
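
For example (the addresses here are made up just for illustration):

    C000  20 34 12   JSR $1234   ; 3-byte instruction occupying $C000-$C002;
                                 ; pushes $C0, then $02 -> return address $C002,
                                 ; the address of the JSR's own last byte
    C003  EA         NOP         ; where I'd expect execution to resume after RTS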

My question is: Why? Why not just let JSR increment the PC by 3 instead of 2, and let RTS just pop and jump? This looks like a far more logical approach. Any reason for making this so complicated?

user3840170
Jeroen Jacobs
  • Maybe to have RTS account for crossing a page boundary. – Brian H Apr 10 '21 at 15:14
  • @BrianH Can you explain what you mean by this? – Jeroen Jacobs Apr 10 '21 at 15:16
  • @Jeroen Jacobs: The 6502 does not have a linear address space. Instead, it's organized in 256 pages of 256 bytes. Jumps that cross page boundaries need a "correction" cycle that increases the PCH register so the address doesn't wrap inside the page. – Janka Apr 10 '21 at 15:18
  • @Janka Nope. That correction is only needed for relative jumps. Reading program code is done via the PC, which can be incremented across page borders without penalty. – Raffzahn Apr 10 '21 at 16:22
  • You're implying that doing it the 'logical way' is better, but by what metric? JSR/RTS work perfectly, so looking superficially unusual is no problem - it's not a beauty contest. When you explore how the CPU carries out the instructions, this allows simpler operation with simpler circuitry. So the actual metric of 'better' is lower transistor count, either to lower the price, improve manufacture, or use those transistors for other functions. – TonyM Apr 10 '21 at 18:37
  • @TonyM The metric I compare to is "what makes logical sense to me personally": having the return address of the next instruction on the stack (which is common today), instead of the address of part of the previous instruction. I'm not attacking the design of the 6502, and I don't know how CPUs are designed. I was just wondering why it was implemented in a way that seems awkward to me. That question has been answered, so I'm not going to debate this. I never even said the way I am familiar with is "better". – Jeroen Jacobs Apr 10 '21 at 20:07
  • @JeroenJacobs Well, what is 'common today' or not is open to a lot of discussion. After all, many modern architectures (MIPS, SPARC, PowerPC, ARM, RISC-V) do not (have to) store the return address in memory at all - at least not as part of the subroutine jump - and likewise do not need to read it again. Doing so is a bottleneck for performance, to be avoided whenever possible. – Raffzahn Apr 11 '21 at 05:47
  • Having the stack receive the address of the next instruction would allow RTS to be processed in 5 cycles rather than 6. I'd say that would make such behavior more "logical", all else being equal. The 6502 designers probably expected that the design they chose would save cost, though it would have been impractical for them to determine the costs of both approaches, and determine whether the savings would be meaningful, before committing to one approach or the other. – supercat Apr 12 '21 at 16:40
  • @TonyM: [see above]. I don't know if there's any practical way to experiment with tweaks to the 6502 design to see how variations would have made things cheaper or more expensive, but there are a number of places where it might have been practical to improve functionality at little or no cost (and in some cases end up with things being cheaper). For example, given how much unused opcode space there is, how would the cost of having separate instructions for all 8 combinations of binary/decimal add/subtract with/without carry have compared with the cost of the decimal flag as well as... – supercat Apr 12 '21 at 16:48
  • ...the sed/cld/sec/clc/clv instructions, the first two of which would be rendered obsolete, and the latter three of which could have been accomplished--when still needed--via "add #0" and "sub #0". While those would be a byte longer than "sec" and "clc", so many uses of "sec" and "clc" could be eliminated that almost all programs would become smaller. – supercat Apr 12 '21 at 16:52
  • @supercat, when we talk about 6502 cost, we can only look at: chip development, manufacturing (inc. yield, test) and field returns (warranty). And the development costs need to be recouped across first 'n' years of sales. The chip layout was all done by hand (accounts are an interesting read, though I imagine you've read them many times) and so was pretty inflexible and needed lots of paper and thoughtful hours. – TonyM Apr 12 '21 at 17:21
  • @supercat (cont'd), So the costs of integrating those changes into the partly-complete design, or to first architect the layout further and delay layout, were considerable in their situation, where they had little money. We all know that hindsight is the only exact science and it's fun to look back on these things and what seems clear once the storm has gone and the dust settled. But I think your note of what could be 'practical to improve functionality at little or no cost' may bear little resemblance to the team's actual situation and the costs such changes would really incur. – TonyM Apr 12 '21 at 17:25
  • @TonyM: I don't think the 6502 team expected the chip to be as successful as it was right out of the gate, and I acknowledged that it would have been impractical to fully assess the costs of multiple alternative approaches before committing to one. Still, I think it interesting to look at how designs were affected for better or worse by the need to commit to certain aspects early in the development cycle. Among other things, an important aspect of producing good designs is knowing which parts to lock down at what point in the development process, and what parts should be left flexible. – supercat Apr 12 '21 at 19:22
  • @TonyM: I also find myself curious how something like the CPU in the Nintendo Entertainment System might have been different had it sought to borrow the general design of the 6502 without simply copying the artwork. Some of Nintendo's other products use CPUs that are very much like existing designs but not machine-code-compatible, so if someone were tasked with making a CPU that would seem familiar to a 6502 programmer, but need not be machine-code compatible, it's interesting to know what aspects of the 6502 might have been tweaked. – supercat Apr 12 '21 at 19:37

2 Answers

27

Why does the 6502 JSR instruction only increment the return address by 2 bytes?

Simply because the PC is already pushed before the second address byte (ADH) is read. That way, the CPU needs to buffer only the lower target address byte and can later read the higher one directly into the PC.


The workings are described in great detail in the original 1976 MCS 6500 Family Programming Manual (*1) in section 8.1 JSR - Jump to Subroutine on p.106..109. The same goes for how RTS resolves this, in 8.2 RTS - Return from Subroutine (p.109..112).

The six clock cycles of a JSR are essentially:

  1. Read Opcode ($20); Increment PC
  2. Read ADL; Increment PC
  3. Buffer ADL
  4. Push PCH; Decrement S
  5. Push PCL; Decrement S
  6. Read ADH

And interleaved with the next instruction:

  1. Load PC with ADH/ADL; Fetch next OP with new PC

Step 3 is another result of making the 6502 as small as possible.


It seems the RTS pops the value from the stack and increments it again before setting the PC to the corrected value.

Yes, it does so during the last cycle of an RTS. In fact, in doing so, it reads the last byte of the JSR instruction again (and discards it).
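
With made-up numbers: if the JSR sat at $C000, the stack holds $C0/$02 and the subroutine ends with

    RTS             ; pulls $02, then $C0 -> PC = $C002;
                    ; the final cycle does a dummy read of $C002 (the JSR's last byte)
                    ; while incrementing PC to $C003, where the next opcode is fetched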

My question is: Why?

The main reason is to save circuitry. Operating this way avoids the need to buffer the upper address byte; otherwise a whole additional 8-bit register would have been needed to hold that value. The 6502 has only 16 registers total, so adding one more would be a considerable cost.

It's worth keeping in mind that the main success criterion for the 6502 wasn't its inherent beauty or the friendly smile of its developers. It was being dirt cheap. Not just a few percent cheaper, but up to ten times cheaper than its competition. Woz selected it for exactly that reason for the Apple II, and so did Atari and others. It's also why Commodore bought MOS; it's well known how cost-aware Tramiel was ;)

Having to add a whole register just for a single function is, in that context, a no-go, especially if there's a way to do it in microcode.

Why not just let JSR increment the PC by 3 instead of 2, and let RTS just pop and jump?

It has to be incremented anyway, so no real gain here.

This looks like a far more logical approach.

By what logic? Maybe in CS-class ivory-tower logic, but real hardware is about implementing a concept in the best possible fashion, not 'as in the books'. The important thing is the function provided, which is the same in either case.

The resulting effect is that any function that uses the pushed address, such as for accessing parameters, will need to increment it (or use an offset of one). This is in line with the general 6500 philosophy of spending as little hardware as necessary to provide a function and letting everything else be done in software.
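
For illustration, here is a minimal sketch of such a parameter-access routine (the label names and the zero-page pair $FB/$FC are arbitrary choices; syntax follows common 6502 assemblers):

    ; Caller places one data byte directly after the JSR:
    ;         JSR GETBYTE
    ;         .BYTE $2A       ; inline parameter
    ;                         ; execution resumes here, parameter in X
    GETBYTE:
            PLA             ; return address, low byte (points at the JSR's last byte)
            STA $FB
            PLA             ; return address, high byte
            STA $FC
            LDY #1          ; +1 compensates for the address being one byte short
            LDA ($FB),Y     ; fetch the inline parameter
            TAX             ; keep it in X (A is clobbered below)
            INC $FB         ; advance the saved address past the parameter...
            BNE NOCARRY
            INC $FC
    NOCARRY:
            LDA $FC
            PHA             ; ...and push it back, high byte first (as JSR does)
            LDA $FB
            PHA
            RTS             ; RTS adds the final +1, resuming after the data byte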

Quite RISC-like, isn't it?


*1 - Always a great first read, together with the MCS6500 Family Hardware Manual.

Raffzahn
  • Typo in the footnote? – Wayne Conrad Apr 10 '21 at 16:34
  • @WayneConrad ROTFL ... yeah, then again, not wrong either. It's a RTFM story :)) – Raffzahn Apr 10 '21 at 16:36
  • Thanks for this interesting explanation. The reason I find it complex and not very "logical" is that the return address on the stack does not point to the next instruction after return, but to the last byte of the JSR itself. This just feels totally weird to me (compared to the x86 architecture, where the return address pushed by CALL is the address of the actual next instruction and RET just pops and jumps) :) I'm not a hardware engineer, so I don't know the difficulties or costs of creating a CPU (certainly not back in those days). – Jeroen Jacobs Apr 10 '21 at 17:29
  • @JeroenJacobs Well, an 8086 has more than 10 times the gate count of a 6502. Also, more important here, it handles data in 16-bit chunks, so there's no need to juggle two bytes. In addition, this is handled by special circuitry of the BIU anyway. Its predecessor, the 8080/Z80 line, does in fact have a hidden register pair (WZ) to hold the read address through all of this. The 6502 simply improved thereon. – Raffzahn Apr 10 '21 at 18:04
  • From what I can tell, in machines of the 6502 era, the main cost of registers was the wiring to get data into and out of them. Inserting an extra 8-bit latch on the input to the upper program counter byte, between the bus that fed it and the register itself, would seem like it would not have been difficult, especially if a dynamic latch would have sufficed (if the CPU wrote the high byte of the PC of the third byte of the instruction on the cycle after it fetched that byte, then wrote the high byte of the PC of the instruction after that to the same address, ... – supercat Apr 10 '21 at 21:32
  • ...it could load the PC with the fetched PCH at that time, without having performed any read cycles between the time the new PCH was loaded and the time it's copied to the register. An interesting aspect of this approach is that it might have been possible to shave a cycle off the cost of any JSR that isn't immediately followed by a page boundary, since writing the PCH value for the third byte of the instruction would eliminate the need to write the PCH value for the following instruction. – supercat Apr 10 '21 at 21:35
  • I don't think I'd describe the last step as "interleaved" so much as "in preparation for", on the basis that the next instruction can't really start executing until it's fetched. Also, does PC get loaded with ADH:ADL, or is a byte fetched from ADH:ADL while PC gets loaded with (ADH:ADL)+1? – supercat Jul 04 '23 at 16:13
  • @supercat The address is handled during the first half while the fetch is during the second. That's the way 6500 pipelining has always worked. – Raffzahn Jul 04 '23 at 17:12
8

The answer with anything of that age is "to save silicon". That pushed address was never intended for programmer use.

I coded 6502 professionally for years, using every possible trick to push the limits of the metal, and that issue never really came up for me. It sounds like you're trying to do a JMP to a variable address by pushing the address on the stack and going RTS.

Consider the indirect JMP command JMP (ADDR). This doesn't force you to use a Zero Page address, but you can if you wanna: JMP ($00C8). So you can use the cheaper Zero Page store commands.
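
For instance (SOMEWHERE is an arbitrary label; < and > are the usual low/high-byte operators):

    LDA #<SOMEWHERE   ; low byte of the destination
    STA $C8
    LDA #>SOMEWHERE   ; high byte
    STA $C9
    JMP ($00C8)       ; jumps to whatever address $C8/$C9 currently hold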

I'm not sure why they don't have a zero-page JMP command like JMP ($C8), but it's probably used too infrequently to be worth spending the silicon on.

Do not use an indirect address ending in $FF unless you want to meet a bug: on the original NMOS 6502, JMP ($xxFF) fetches the high byte of the target from $xx00 instead of from the next page.

Or, if you really want to do an RTS jump and are loading an absolute address onto the stack, use an assembler expression to decrement it by 1 before pushing it, so that the assembler takes care of the -1 for you.
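
A minimal sketch of that pattern (TARGET is an arbitrary label; the assembler evaluates TARGET-1 at assembly time):

    LDA #>(TARGET-1)  ; push high byte first, matching the order JSR uses
    PHA
    LDA #<(TARGET-1)
    PHA
    RTS               ; pulls TARGET-1, adds 1, and continues at TARGET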

  • I noticed this behaviour because I was playing with inline arguments. This requires pulling the return address, putting it in a zero-page address, and then using indirect addressing. I was expecting to find my parameters there, but noticed the extra byte belonging to the original JSR instruction. I found this surprising, so that's why I asked here what the reason was. People seem to make more of this question than was my intention ;) – Jeroen Jacobs Apr 13 '21 at 08:46
  • True, but it's fun. Yeah in the past when I've done that (put args under the return address), I popped the return address and stored it, then after collecting my args, I put back those same values. I considered it "not my data to tamper with" and that a future rev of the 6502 or backward compatible Zilog Z65 or something, might handle the values differently, so don't break it lol. – Harper - Reinstate Monica Apr 13 '21 at 21:44
  • "Do not use an indirect address ending in $FF" - Ah yes, that bug :) (Though it was corrected in later 6502 variants, IIRC, so depending on your hardware you may miss out on the joy of meeting it!) – psmears Feb 08 '24 at 13:54