How are 6502 and 65C02 JMP(abs) processed internally

Question

An infamous "feature" of the 6502 is that the JMP(abs) opcode will fetch the lower byte of the jump target from the address in the instruction, and fetch the upper byte of the jump target from the address whose lower byte is one higher than the address in the instruction, wrapping after FF. Thus, JMP ($10FF) will fetch the target address from $10FF and $1000 (rather than $10FF and $1100). From my understanding, the 65C02 unconditionally inserts a cycle in which the ALU either increments or doesn't increment the upper byte of the address to fetch the PC.

Given that there is no other instruction that allows any kind of indirect memory operation from an absolute address, and given that the old program counter value will not be needed once the third byte of the instruction has been fetched, I'm curious how much circuitry was used to provide the instruction's unique operation sequence, and how that would compare with having JMP(abs) sequenced like JMP abs except that after the third cycle it would perform two more cycles which were just like the second and third (the second and fourth cycles would fetch a byte at PC, copy the bus to the data latch, and increment the PC; the third and fifth would fetch a byte at PC, and load PC from the data latch and the bus).
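The alternative sequencing described above can be modelled at register level. This is only a sketch under stated assumptions: the flat `mem` array, the `Cpu` struct, and the functions `fetch_pair_via_pc` and `jmp_indirect_via_pc` are hypothetical names for illustration, and nothing here reflects the real chip's control signals.

```c
#include <stdint.h>

static uint8_t mem[0x10000];   /* hypothetical flat 64 KiB memory */

typedef struct {
    uint16_t pc;   /* program counter */
    uint8_t  dl;   /* data latch holding the first byte of a pair */
} Cpu;

/* Two cycles, as in cycles 2-3 of JMP abs: fetch a byte at PC into the
   latch and increment PC, then fetch the next byte at PC and load PC
   from {bus, latch}. */
static void fetch_pair_via_pc(Cpu *cpu)
{
    cpu->dl = mem[cpu->pc];
    cpu->pc = (uint16_t)(cpu->pc + 1);          /* full 16-bit increment */
    uint8_t hi = mem[cpu->pc];
    cpu->pc = (uint16_t)((hi << 8) | cpu->dl);
}

/* On entry cpu->pc points at the operand (the byte after the opcode). */
static void jmp_indirect_via_pc(Cpu *cpu)
{
    fetch_pair_via_pc(cpu);   /* cycles 2-3: PC = pointer address */
    fetch_pair_via_pc(cpu);   /* cycles 4-5: PC = jump target     */
}
```

In this model, an operand of $10FF makes the second pair of reads come from $10FF and $1100, because the PC increment carries into the high byte.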

Do the 6502 and 65C02 manage to leverage other existing hardware in an interesting way so that their implementations of the instruction end up being surprisingly cheap, or is there something about the design of the 6502 and 65C02 that would make it impractical to change the PC mid-instruction (without, e.g., risking having an incorrect PC value stored if an interrupt occurs at an inopportune moment)?

The sequence of micro-ops that form the instructions in the 6502 are fired off from a PLA "decode ROM" built into the CPU - http://visual6502.org/wiki/index.php?title=6507_Decode_ROM - Not sure if anyone has dumped/examined the decode ROM for the 65C02 but probably there's a step missing in the gates that determine JMP ($XXXX)'s list of micro-ops. Interrupts on all these CPUs happen between instructions only except for RESET (that's why some undocumented instructions lockup the CPU because the decode ROM for them never hits the "finish instruction" step). — LawrenceC, Aug 12 '16 at 19:26
@LawrenceC: Thanks. I don't think I'd seen the decoded table; I'd seen the silicon, but no explanation of how the table works. Interesting that there are a number of decodes which handle both $4C and $6C [JMP addr and jmp(addr), respectively], one which is unique to 0x4C [JMP addr], and none which would be unique to 0x6C, except that one of the decodes only fires in the fifth cycle. — supercat, Aug 12 '16 at 19:39
I don't see why decoding JMP (abs) via the PC in the 4th and 5th cycle wouldn't work. After all, it's identical to the 2nd and 3rd cycles, and loading the PC is known to work because of JMP abs. The ultimate proof would be to modify the ROM/control logic in visual6502 to do that and see what happens. As for interrupts, I always thought interrupts could only happen on the opcode fetch cycle, because the 6502 can't resume mid-instruction, and the first opcode address isn't stored anywhere. — dirkt, Aug 13 '16 at 09:11
BTW, it's a lot easier to understand what's going on if you look at the block diagram (signal names are available in visual6502) and/or Balazs' schematics (with different signal names, translation table in the visual6502 wiki). All links available from the visual6502 pages. — dirkt, Aug 13 '16 at 09:15
@dirkt: I'll have to search for information about tinkering with the decode ROM I guess--that sounds fun. As for interrupts, I think they need to be acted upon in the last cycle of an instruction in order to influence the first cycle of what will be the next instruction. There'd be no way for an interrupt to be "properly" handled in the middle of something like JMP (abs), but that doesn't mean that logic that tries to treat "JMP (abs)" as a JMP followed by an effective repeat of the second and third cycles wouldn't need to ensure that the third cycle of "JMP(abs)" wasn't seen... — supercat, Aug 13 '16 at 17:06
...as the "end" of an instruction for purposes of interrupt logic. — supercat, Aug 13 '16 at 17:06
The "interrupts in last cycle of instruction" issue is handled automatically by assigning the T0 and T1 states correctly, see answer below. So that's not a problem. — dirkt, Aug 14 '16 at 07:42
I personally suspect it's a bug in the microcode not an intentional feature. — Joshua, Dec 06 '16 at 03:48
@Joshua: The behavior of JMP (abs) is almost certainly an unintentional quirk, but it's one which I would think could have been fixed without net added complexity if the instruction had been sequenced differently. I find it curious that the 65C02's fix was to add an extra cycle to JMP (abs) eliminating its timing advantage versus using a jump to a jump. — supercat, Dec 06 '16 at 13:54

dirkt · Accepted Answer · 2016-08-15T11:50:07.140

I can't answer the question "why doesn't the 6502/65C02 use the PC to do the indirection to execute JMP (abs)", but I can answer some of the other points.

Visual6502 is the best site to find out about the inner workings of the 6502.

I'll be using the "Random Control Logic" signal names as used in the simulator, which are the same names used in Hanson's block diagram. If you rotate the diagram by 90 degrees, it roughly corresponds to the silicon (actual size used on silicon is different, so take it with a grain of salt). The block diagram is sufficient to understand what is going on, and easier to read than the silicon.

For the ROM decoder, I'll use the simulator names, which are different from the 6507 mnemonics.

You can follow both types of signals by adding a loglevel of at least 6 to the simulator URL, or by pressing "Trace more" often enough in the expert simulator mode. Decoding has "don't care" values, so some of the signals may be accidental.

How do JMP abs and JMP (abs) work internally?

(What circuitry is used? Is the implementation cheap?)

In general, instructions are executed through several states (numbered T0 to T6). The 6502 has a kind of two-stage pipeline, where the next instruction is already fetched and decoded while the last stage(s) of the previous instruction still execute. In T0, the next opcode is prepared to be fetched, and in T1, the opcode is decoded and the byte after the opcode is prepared to be fetched. This is why every instruction execution ends with T0 and T1, and starts with T2 if more than two cycles are needed. (I'd assume T0 is also the phase where the 6502 checks for interrupts, and replaces the opcode with a BRK if it finds one, but I didn't verify this.)

So beginning in the T2 cycle of JMP abs, the low byte of the destination address is fetched. It is output on the DB bus (DL/DB), then stored in the B input register of the ALU (DB/ADD), added to a zero (SUMS, 0ADD) from the A input, and finally temporarily stored in the adder hold register. This is necessary because the T0 cycle already loads the high byte of the destination address onto the ADH bus, and both must be stored in PCL and PCH at the same time (ADD/ADL, ADL/PCL and DL/ADH, ADH/PCH).

For JMP (abs), the first cycles are very similar (the only different ROM signal is the absence of T2‑jmp‑abs), but in T3, the operand address is only output onto the address bus and not stored in the PC. The low byte of the final destination address that is fetched this way is eventually again stored in the B input register. But before that, the old value of the B input register is incremented by 1 using the carry, and this is used as the address of the high byte of the final destination address. Overflow for that is not handled, which explains the "infamous wrapping effect". The final transfer to the PC works similarly as for JMP abs.
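The wrapping effect described above can be sketched in C. This is a minimal model and not a transcription of the 6502's control signals: the flat `mem` array and the function name `jmp_indirect_nmos` are assumptions made for illustration. The point it shows is that only the low byte of the pointer passes through the adder, and the carry out is never applied to the high byte.

```c
#include <stdint.h>

/* Hypothetical flat 64 KiB memory for illustration. */
static uint8_t mem[0x10000];

/* Model of the NMOS 6502 JMP (abs) target fetch: the pointer's low byte
   is incremented in 8 bits, and the carry never reaches the high byte. */
static uint16_t jmp_indirect_nmos(uint16_t ptr)
{
    uint8_t  ptr_lo  = (uint8_t)(ptr & 0xFF);
    uint8_t  ptr_hi  = (uint8_t)(ptr >> 8);
    uint8_t  next_lo = (uint8_t)(ptr_lo + 1);          /* wraps $FF -> $00 */
    uint16_t hi_addr = (uint16_t)((ptr_hi << 8) | next_lo);

    uint8_t target_lo = mem[ptr];       /* low byte of the jump target  */
    uint8_t target_hi = mem[hi_addr];   /* high byte, same page as ptr  */
    return (uint16_t)((target_hi << 8) | target_lo);
}
```

For a pointer of $10FF, this model reads the high byte from $1000 rather than $1100, matching the behaviour described in the question.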

So both jumps use the datapath circuitry just like every other instruction; the implementation is neither particularly cheap nor particularly expensive.

How could one test a JMP (abs) implementation using the PC?

(Is there anything that would prevent it from working?)

You could modify the decode ROM and decode logic in the simulator. The simulator uses three files to store the silicon data in a simple format, which are available on github. There are already special signals T3-jmp and T4-jmp for the "indirect" phase, and you may need an additional signal (or re-use one of those two) to ensure loading of the PC in the T2 phase in absence of T2-jmp-abs. And of course you need to rewire the T3 and T4 phase.

You'll likely need some more space for transistors, so a tool to create horizontal or vertical space (increment all coordinates greater than some threshold) in the data files would be helpful.

Having T0 be regarded as part of the preceding instruction makes sense; I find it a bit surprising that T1 would be as well. I don't know if it would be worth asking a separate question, but I've also sometimes wondered how hard it would have been to add "one-cycle NOP" instructions which would act halfway like asserting the "ready" input (stay in T0, but allow PC to increment). A one-cycle NOP would not only be useful for getting cycle-accurate timing, but if there were a broad range of bit patterns that decoded that way, as well as perhaps a range of bit patterns that... — supercat, Aug 14 '16 at 18:49
I think the problem with 1-cycle ops is that the decoding circuitry (instruction register, decode ROM, control logic) is just too deep for that - you need 2 cycles (namely T0 and T1) before the datapath control signals are available and you can do anything with them. So even if the operation is trivial (NOP) or very simple (CLC), it still takes 2 cycles. — dirkt, Aug 14 '16 at 18:56
...would behave as either an unconditional branch or a two-byte NOP based upon the state of some wire (perhaps use /SO for that instead of setting the overflow flag) that would have made it easy for system designers to have external decoding for one-cycle bit-set, bit-clear, or bit-wait instructions, and 2-3 cycle bit-test. Even if the original 6502 didn't use such techniques, they would have seemed a nice approach for adding I/O to e.g. the 6510. — supercat, Aug 14 '16 at 18:56
Something like CLC needs to have a proper T0/T1 sequence to do anything. I'll have to study visual6502 to see how READY is handled, and whether opcode-based stalling would work. Otherwise, I've seen an article which suggested inserting logic between the data bus and 6502 to add externally-decoded instructions, and think the idea would have had a lot of merit if handled on-chip. Much as I like "opcodes" like SAX, LAX, and DCP, having xxxxxx11 as a recognized NOP would have made it easy for system designers to add fast I/O; even if it took two cycles, that would be better than... — supercat, Aug 14 '16 at 19:01
...having it take 3 cycles if one is willing to give up zero-page addressing space (using separate address for "set" and "clear") and 4 cycles otherwise. Or more cycles if one doesn't use separate set/clear addresses. — supercat, Aug 14 '16 at 19:02

score 2 · Answer 2 · answered Aug 22 '22 at 20:34

With regard to potential problems with changing PC in the middle of instruction execution, to my knowledge, interrupt generation, except for RESET, can only occur when the CPU receives a signal to reset the instruction timer, which typically happens during the final cycle of instruction operation. The only exception to resetting the timer would be the nefarious 'x2' instructions, in which no reset signal is ever sent because nothing triggers one, a behavior possibly also present in '80,' after which the timer 'expires' after cycle 6/7, thus leaving the CPU in virtual limbo.

If when the timer resets, ~IRQ|I*~NMI*~RST is low (I believe ~RST force resets the timer), normal execution is halted and BRK is executed without affecting PC, setting the B flag written to the stack to ~IRQ|I, I think, and setting A2 to ~NMI and A1 to ~RST, most likely (which would explain the behavior of all the hardware interrupts). I am mostly drawing this analysis from my limited examination of the 6502 schematic and the knowledgeable reports on 6502 behavior that I have read. I am not an expert on the actual functionality of the 6502. Note that there is no flip-flop for B in the CPU, its behavior is only present during interrupt execution (that is, B is always 'set' except when an IRQ is generated).

As for the behavior of JMP, I would guess that the signal line that is tied to JMP abs alone is the timer reset, which also loads the 16-bit address buffer to PC. It would then follow that the fourth cycle reads the low byte of the new PC value from that address and adds one to the buffer, but does so 8-bit only; then the cycle afterward, JMP timer cycle 5, behaves like the JMP abs only cycle. If this is the case, PC is still only changed when the instruction ends, probably because no one thought to change PC during instruction execution and use the natural behavior of PC to handle the indirect JMP memory read.

Since asking this question, I've studied the 6502 a bit more using visual 6502 and I find it curious how much logic seems to be dedicated to particular combinations of instruction and addressing mode. I wonder if there would have been any design problem with saying that every instruction with 00 in the bits 1..0 is a single-byte instruction, every remaining instruction with 000 in bits 4..2 is an immediate-mode operation, branch, or JSR, and remaining instruction would start by computing an effective address (ignoring everything but bits 4..2), and then performing an operation on that... — supercat, Aug 22 '22 at 21:44
...effective address (selected by the remaining five bits). I would think that treating e.g. ADC $1234,X as a combination of a "fetch from address $1234,X" semi-instruction followed by a "perform ADC with addressed value" semi-instruction would avoid the need to have different logic handle the EA calculation for e.g. "LDA $1234,Y" and "LDX $1234,Y". — supercat, Aug 22 '22 at 22:00
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review — Raffzahn, Aug 22 '22 at 22:32
@Raffzahn, I wasn't asking anything, I was just responding to supercat's comment about the possibility of an interrupt intercepting an instruction in his earlier answer. I don't have enough reputation to add comments to anything but my own answers yet, so I responded with an answer. Sorry if that breaks any rules. — G David, Aug 23 '22 at 16:33
@GDavid The text is what the system inserts when a post gets marked as being a comment. As hard as it may seem, you may need to wait and build up some rep before going into discussions - if at all, as comments aren't really meant for that either. — Raffzahn, Aug 23 '22 at 22:16

score 0 · Answer 3 · answered Oct 16 '18 at 21:49

This is the code from sim65:

/* Opcode $6C: JMP (ind) */
{
    unsigned PC, Lo, Hi;
    PC = Regs.PC;
    Lo = MemReadWord (PC+1);

    if (CPU == CPU_6502)
    {
        /* Emulate the 6502 bug */
        Cycles = 5;
        Regs.PC = MemReadByte (Lo);
        Hi = (Lo & 0xFF00) | ((Lo + 1) & 0xFF);
        Regs.PC |= (MemReadByte (Hi) << 8);

        /* Output a warning if the bug is triggered */
        if (Hi != Lo + 1)
        {
            Warning ("6502 indirect jump bug triggered at $%04X, ind addr = $%04X",
                     PC, Lo);
        }
    }
    else
    {
        Cycles = 6;
        Regs.PC = MemReadWord (Lo);
    }
}

Polluks


I think the 65C02 version should include a MemReadByte(something) before the MemReadWord(Lo), since it will perform a dummy read cycle where the NMOS 6502 doesn't. – supercat Oct 16 '18 at 21:52
@supercat does it definitely perform a dummy read there? I would have guessed i) opcode; ii) operand low; iii) operand high; iv) low byte of address; v) byte from where the 6502 would read it; vi) byte from the proper place, which may or may not be the same as (v). Just from the general tenor of the 6502. – Tommy Oct 16 '18 at 23:32
The 65C02 takes six cycles to perform a JMP (ind), and every cycle is either a read or a write. I don't remember where the 65C02 puts the extra cycle, and what address it reads, but the extra cycle means there must be a dummy read sometime. – supercat Oct 16 '18 at 23:46
The 65C02 also adds JMP (ind,X) which is also unconditionally 6 cycles. Logically, JMP (ind) could be implemented using the same microcode but with the index forced to zero. The 65816 has JMP (ind) with 5 cycles but JMP (ind,X) still takes 6; in the latter, the "internal operation" cycle is the fourth, just before reading the indirected address. It's reasonable to assume both modes do it this way on the 65C02. – Chromatix Jun 29 '19 at 02:22
