110

When reading some other questions about compiling C for the Z80, I am getting the impression that it is hard to compile C to Z80 and end up with well-optimised code. Is that the case, and why?

I know more about the 6502. Here are some examples that show why C fits the 6502 badly:

  • An array in C is indexed by an integer type. The 6502 is pretty quick at indexing arrays, but unfortunately only if the index is one byte wide. So something like strcmp or strlen might need to actually do a 16-bit add per character.

  • A stack is an ideal data structure for passing function parameters. But the 6502's hardware stack is limited to 256 bytes, and the 6502 has rather limited stack addressing modes compared to the PDP-11, so cc65 uses a second stack, implemented in software, to pass parameters, IIRC.

From what I understand of the Z80, these two examples do not apply: the Z80 has index registers and a much roomier stack. So what are the reasons C fits badly?


Summary of answers given thus far

Omar and Lorraine
  • When I saw my post in the wild, I realised I gave it a rather clickbaity title. It was not my intent! So if anyone has a better suggestion I'm all ears. – Omar and Lorraine Mar 28 '18 at 11:47
  • 2
    The big problem with a Z80 is there is no index register. You have to compute the address then do an xchg. Anything to do with pointers, dereferencing or arrays will have this problem. – cup Mar 28 '18 at 11:56
  • 3
    @cup right; the Z80 has IX & IY, but they can only be used with constant byte offsets, which means if you want to use them to index an array the array has to be defined in the first 256 bytes of memory, which is tricky to arrange in a C program... particularly as the Z80's startup address is 0 so many platforms have ROM in the first 256 bytes. – Jules Mar 28 '18 at 12:00
  • 17
    Your statement An array in C is indexed by a type the same width as a pointer. is wrong. It is an integer constant greater than zero and type can be any integer type - This includes bytes. – tofro Mar 28 '18 at 16:31
  • 9
    The A stack is an ideal data structure for passing function parameters statement is misleading as well - stacks are ideal to pass function parameters, yes. But C doesn't require its stack to be the same thing as the CPU stack. Just by coincidence, most platforms choose it to be, because it is convenient. You call that a "software stack (whatever that might be) - If you think a software stack is something where you have to "push" something to in software - The CPU stack is the same. – tofro Mar 28 '18 at 16:33
  • 11
    It must also be noted that you can't judge modern compilers against the older compilers. The older compilers were designed to run ON the Z80 with no memory, glacial CPUs and even worse FLOPPY disk drives. It's all they can do to produce crummy code, much less good code. Modern compilers have "unlimited" RAM, "unlimited" CPU, and "instant" persistent storage in comparison. I used a C compiler on the Atari 800 -- once. What a miserable experience. – Will Hartung Mar 29 '18 at 20:33
  • Comments are not for extended discussion (regardless of how on-topic or interesting); this conversation has been moved to chat. – wizzwizz4 Apr 01 '18 at 10:28
  • 1
    @WillHartung Yep, I was asking about compilers generally, but not really about old ones running on Z80s because of the reasons you suggest, which it could be argued are not really about compilers and Z80s, but about what you can do on a constrained system. – Omar and Lorraine Apr 04 '18 at 11:39
  • 2
    If I were to judge a question based on the quality of the answers it has generated, I'd have to vote this question up two or three times. – Wayne Conrad Jun 19 '18 at 22:53
  • Irrelevant to the discussion but as I read @cup answer, when I got to xchg, my brain triggered a smell from the 1980s when I was coding on a friends ZX Spectrum. – Neil May 12 '20 at 15:54
  • @WillHartung That experience is important to remember why Turbo Pascal was such a groundbreaking product. You could do the complete compile cycle in memory – Thorbjørn Ravn Andersen Nov 26 '20 at 11:05
  • There is a port of Clang to the Z80 that is able to optimise this into a memset, but then it seems to produce suboptimal memset code. https://github.com/jacobly0/llvm-project – Heath Mitchell Feb 16 '22 at 10:01
  • @HeathMitchell, excuse me, What do you mean by "optimise this into a memset"? – Omar and Lorraine Feb 16 '22 at 14:40
  • @OmarL The loop becomes "call void @llvm.memset.p0i8.i16(i8* noundef nonnull align 1 dereferenceable(10) getelementptr inbounds ([10 x i8], [10 x i8]* @c, i16 0, i16 0), i8 0, i16 10, i1 false)" in LLVM IR – Heath Mitchell Feb 16 '22 at 16:46
  • @OmarL Oops, I meant to reply to the answer with the loop example – Heath Mitchell Feb 16 '22 at 16:47

10 Answers

104

Quite often people don't know how to use the compilers or don't understand fully the consequences of code they write. There is optimization going on in the Z80 C compilers but it's not as complete as, say, GCC. And I often see people fail to turn up the optimization when they compile.

There is an example here in introspec's answer:

char i,data[10];

void main(void) 
{
  for (i=0; i<10; i++)
    data[i]=0;
}

There are lots of problems with this code that he is not considering. By declaring i as char, he's possibly making it signed (whether plain char is signed or unsigned is left to the compiler's discretion). That matters because, unless you write the code carefully, C's promotion rules mean the 8-bit value may be sign-extended and promoted to int before each comparison. And by making i global, he ensures the compiler cannot hold the for-loop index in a register inside the loop.

There are two C compilers in z88dk. One is sccz80 which is the most advanced iteration of Ron Cain's original compiler from the late 1970s; it's mostly C90 now. This compiler is not an optimizing compiler - its intention is to generate small code instead. So you will see many compiler primitives being carried out in subroutine calls. The idea behind it is that z88dk provides a substantial C library that is written entirely in assembly language so the C compiler is intended to produce glue code while the execution time is spent in hand-written assembler.

The other C compiler is a fork of sdcc called zsdcc. This one has been improved on and produces better & smaller code than sdcc itself does. sdcc is an optimizing compiler but it tends to produce larger code than sccz80 and overuses the Z80's index registers. The version in z88dk, zsdcc, fixes many of these issues and now produces comparable code size to sccz80 when the --opt-code-size switch is used.

This is what I get for the above when I compile using sccz80:

zcc +zx -vn -a -clib=new test.c

(the -O3 switch is for code size reduction but I prefer the default -O2 most of the time)

._main
    ld  hl,0    ;const
    ld  a,l
    ld  (_i),a
    jp  i_4
.i_2
    ld  hl,_i
    call    l_gchar
    inc hl
    ld  a,l
    ld  (_i),a
    dec hl
.i_4
    ld  hl,_i
    call    l_gchar
    ld  de,10   ;const
    ex  de,hl
    call    l_lt
    jp  nc,i_3
    ld  hl,_data
    push    hl
    ld  hl,_i
    call    l_gchar
    pop de
    add hl,de
    ld  (hl),#(0 % 256)
    ld  l,(hl)
    ld  h,0
    jp  i_2
.i_3
    ret

Here you see the subroutine calls for compiler primitives and the fact the compiler is forced to use memory to hold the for-loop index. l_lt is a signed comparison.

A zsdcc compile with optimization turned up:

zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node200000 test.c

_main:
    ld  hl,_i
    ld  (hl),0x00
l_main_00102:
    ld  hl,(_i)
    ld  h,0x00
    ld  bc,_data
    add hl,bc
    xor a,a
    ld  (hl),a
    ld  hl,_i
    ld  a,(hl)
    inc a
    ld  (hl),a
    sub a,0x0a
    jr  C,l_main_00102
    ret

By default char is unsigned in zsdcc, and the compiler notices that the comparison i<10 can be done in 8 bits. C's rules say both sides should be promoted to int, but it's okay not to do that if the compiler can prove the comparison can be done equivalently another way. When you don't specify that your chars are unsigned, this promotion can lead to insertion of sign-extension code.

If I now make the char explicitly unsigned and declare i inside the for loop:

unsigned char data[10];

void main(void) { for (unsigned char i=0; i<10; i++) data[i]=0; }

sccz80 does this:

zcc +zx -vn -a -clib=new test.c

._main
    dec sp
    pop hl
    ld  l,#(0 % 256)
    push    hl
    jp  i_4
.i_2
    ld  hl,0    ;const
    add hl,sp
    inc (hl)
.i_4
    ld  hl,0    ;const
    add hl,sp
    ld  a,(hl)
    cp  #(10 % 256)
    jp  nc,i_3
    ld  de,_data
    ld  hl,2-2  ;const
    add hl,sp
    ld  l,(hl)
    ld  h,0
    add hl,de
    ld  (hl),#(0 % 256 % 256)
    ld  l,(hl)
    ld  h,0
    jp  i_2
.i_3
    inc sp
    ret

The comparison is now 8-bit and no subroutine calls are used. However, sccz80 cannot put the index i into a register - it does not carry enough information to do that so it instead makes it a stack variable.

The same for zsdcc:

zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node200000 test.c

_main:
    ld  bc,_data+0
    ld  e,0x00
l_main_00103:
    ld  a, e
    sub a,0x0a
    ret NC
    ld  l,e
    ld  h,0x00
    add hl, bc
    ld  (hl),0x00
    inc e
    jr  l_main_00103

Comparisons are unsigned and 8-bit. The for loop variable is kept in register E.

What about if we walk the array instead of indexing it?

unsigned char data[10];

void main(void) { for (unsigned char *p = data; p != data+10; ++p) *p = 0; }

zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node200000 test.c

_main:
    ld  bc,_data
l_main_00103:
    ld  a, c
    sub a,+((_data+0x000a) & 0xFF)
    jr  NZ,l_main_00116
    ld  a, b
    sub a,+((_data+0x000a) / 256)
    jr  Z,l_main_00105
l_main_00116:
    xor a, a
    ld  (bc), a
    inc bc
    jr  l_main_00103
l_main_00105:
    ret

The pointer is held in BC, the end condition is a 16-bit comparison and the result is the main loop takes about the same amount of time.

Then the question is why isn't this done with a memset()?

#include <string.h>

unsigned char data[10];

void main(void) { memset(data, 0, 10); }

zcc +zx -vn -a -clib=sdcc_iy -SO3 --max-allocs-per-node200000 test.c

_main:
    ld  b,0x0a
    ld  hl,_data
l_main_00103:
    ld  (hl),0x00
    inc hl
    djnz    l_main_00103
    ret

For larger transfers this becomes an inlined ldir.

In general the C compilers cannot currently generate the Z80's CISC-style instructions (ldir, cpir, djnz, etc.) except in certain circumstances, as shown above, and they are not able to use the exx alternate register set at all. However, the substantial C library that comes with z88dk does make full use of the Z80 architecture, so anyone using the library will benefit from assembly-level performance (sdcc's own library is written in C, so it is not at the same performance level). Beginner C programmers are usually not using the library because they're not familiar with it, and that's on top of making performance mistakes because they don't understand how the C maps to the underlying processor.

The C compilers are not able to do everything, but they're not helpless either. To get the best code out, you have to understand the consequences of the kind of C code you write, not just throw something together.

Toby Speight
aralbrec
  • 16
    This is a fantastic answer; it's better than most of the other answers here, providing full and accurate information and demonstrating that the premise of the question is partially flawed as opposed to just asserting it. This is a great first contribution. – wizzwizz4 Mar 30 '18 at 19:17
  • 4
    Lovely answer! I'd like to add here that I made my variable global specifically to confirm to the recommendations on z88dk website: item 2 at https://www.z88dk.org/wiki/doku.php?id=optimization I am not using memset intentionally because there is no ready-made memset for every small loop that you write, so it is the generic behaviour on compiler on small loops that concerns me. – introspec Mar 31 '18 at 08:45
  • 9
    And by making it global, he makes sure the compiler cannot hold the for-loop index in a register inside the loop. Again this is purely a limitation of compilers that don't know how to optimize well. It's not volatile, and the compiler can prove the stores into data[] don't alias it (because it's also a global array, not a pointer, and the compiler knows that two globals don't overlap each other). So the compiler is allowed to sink the stores to the counter out of the loop and do one store of 10 after the loop. The "as-if" rule allows compile-time reordering of loads/stores. – Peter Cordes Mar 31 '18 at 19:54
  • 7
    But well spotted, that is a seriously bad way to write code that makes life difficult for compilers. It's disappointing (but not too surprising considering their age) that real Z80 compilers can't do that optimization, or turn simple array indexing into pointer increments. gcc could turn the loop into a memset call and/or inline known good memset code :P – Peter Cordes Mar 31 '18 at 19:56
  • Yes the optimizations are not at the same level as something like gcc but the point is they aren't absent either. The kind of code you write does influence the quality of code generated because of this. The construction of your code can provide hints to the compiler on how to best produce code, this includes less-used constructions like declaring {} blocks inside functions. These {} blocks indicate to the compiler when local variables outside the block may go unused and gives it permission to allocate to registers other things inside the {} block. – aralbrec Apr 01 '18 at 02:40
  • @introspec z88dk has been around for many years so unfortunately a lot of documentation is also dated and often outdated. For sccz80, putting i in static memory may lead to better code because it can't allocate something that lives long to a register. The stack variable it instead creates as local requires sequences like "ld hl,n; add hl,sp; ld l,(hl)" to access the value whereas in static memory it might be something like "ld hl,(_i); ld h,0" or "ld a,(_i)". The main issue was the var declared as char which is signed and leads to a 16-bit signed compare to end the loop due to promotion. – aralbrec Apr 01 '18 at 02:45
  • @aralbrec Precisely, I actually created local variable at first (as one does normally) but the stack manipulations were indeed horrendous. – introspec Apr 01 '18 at 10:35
  • 3
    @aralbrec However, if you think about this, you have better knowledge of the compiler, lots of tricks undocumented on the official site and the best you can do is still over 60 t-states per byte. If this doesn't illustrate my point, I don't know what does... – introspec Apr 01 '18 at 10:41
  • @introspec The only fault is that the compiler is not able to use djnz/ldir except in the memset case. The loop construction is otherwise not bad. The other thing missed is the compiler doesn't convert the array indexing to pointer, which was the second to last try above and is a bit faster. In a real situation you would have more code in the loop that may cause a human's register allocation to be different and the loop code to be more similar to the compiler's. I don't dispute that the compiler is not expert level but it can be intermediate level if the c code is written with care. – aralbrec Apr 01 '18 at 15:26
  • @aralbrec Actually accesses of local variables on the stack can be done pretty cleanly and easily using one of the index registers as a frame pointer. On entry to the function, you push the existing frame pointer - e.g. ix onto the stack, load the frame pointer with the stack pointer and then advance the stack pointer enough bytes for local variables. On exist, you load the stack pointer from the frame pointer and then pop the saved frame pointer. For a short function, the overhead may not be worth it, of course. – JeremyP Apr 04 '18 at 16:01
  • @JeremyP sdcc/zsdcc already does that but sccz80 chooses to use relative stack addressing instead hoping to access a few variables with pop. The problem is indexed addressing on the z80 requires an extra opcode byte which makes the code slower and larger. Hand written z80 programs rarely use the index registers in this way for this reason, instead preferring to order data structures so that elements can be accessed in sequence instead of randomly. sdcc/zsdcc tries to hold a working set of variables in registers with backing in ix memory with success dropping with function size. – aralbrec Apr 05 '18 at 16:55
  • 1
    By choosing indexed registers automatically you are already behind hand-written asm. In addition to the compiler trying to avoid using indexed addressing (with some limited success in small blocks) there is post-processing to try to eliminate the worst cases. In addition to that there are two other calling conventions (fastcall and callee) that can instruct the compiler to pass parameters in registers or via a better stack method to functions implemented in asm (that would be the library and user asm code). In this last case indexed addressing is avoided entirely for function params. – aralbrec Apr 05 '18 at 16:59
  • sdcc/zsdcc also allows a "preserves_registers" attribute to be attached to asm functions that tells it what registers are unchanged by the called function. This allows the compiler to keep state in those unused registers around the function call and this improves to generated code too. The library functions in z88dk are all written using this attribute and user asm code can be too; this is another reason to prefer library code if possible. – aralbrec Apr 05 '18 at 17:04
  • @aralbrec: I find it curious that compilers on the Z80 didn't have an option to specify that functions wouldn't be used recursively, and have the linker overlay automatic objects as was done on compilers for the 8051 and PIC. Something like i++; where i is a static variable would be 7 bytes and 38 cycles using ld hl,(addr) / inc hl / ld (addr),hl, but 13 bytes and 82 cycles using ld l,(iy+disp) / ld h,(iy+disp+1) / inc hl / ld (iy+disp),l / ld (iy+disp+1),h. A major savings with every access, as well as the elimination of the function prologue/epilogue that was otherwise needed. – supercat Jun 20 '18 at 15:52
  • @supercat The compiler can beat that by keeping values in registers so automatically doing it for all such variables can do a disservice to performance and code size. – aralbrec Jun 21 '18 at 19:06
  • @aralbrec: A Z80 compiler that doesn't want to use undocumented opcodes might benefit from using BC to hold one 16-bit register-qualified variable, but otherwise HL would need to be kept clear for addresses and DE for temporary 16-bit operands, so all other variables would need to be kept in memory. – supercat Jun 21 '18 at 19:56
  • @supercat Contrary to much of the discussion here, code generation for the z80 is not terrible. It's better than a novice and likely better than an intermediate programmer who has difficulty writing big programs. You do not see a lot of references to (ix+n) inside loops as long as code is not overly complicated. You can have a look at some of the code the c compiler is producing on this site which has the code for a real spectrum game: https://bitbucket.org/CmGonzalez/gandalf/src/master/ The *.c.lis files contain the asm generated by the compiler for the corresponding c file. – aralbrec Jun 22 '18 at 03:29
  • Like the answer a lot, and have seen worse assembly code that the last example... – tofro Jul 16 '18 at 10:57
  • @aralbrec: Your example suggests that your compiler can optimize one value to BC, but automatic objects are accessed with the sequence mov hl,offset / add hl,sl followed by an access to the object at (HL). The (IX+n) indexing is a little less inefficient than that, but on your compiler it looks like good performance requires avoiding using more than one automatic-duration object. – supercat Jun 05 '21 at 16:37
66

If you try translating C into Z80, you'll see that Z80 index registers and stack don't behave quite as you expect. So, let us begin with

Arrays

Suppose you have a standard C construction

int c[10];
for (int i=0; i<10; i++)
    c[i]=0;

Your compiler is pretty much required to use a 16-bit value for i. So, you have &c somewhere, maybe even in your index register, so let us have IX=&c. However, the operations with index registers only allow constant offsets, which are single signed bytes. So, you do not have a command to read from (IX + 16-bit value in a register). Thus, you would end up using things like

ld ix,c_addr            ; the array address
ld de,(i_addr)          ; the counter value
add ix,de
ld a,0
ld (ix+0),a             ; 14+20+15+7+19 = 75t (per byte)

Most compilers will output code that is pretty close to what I wrote. Actually, experienced Z80 programmers know - IX and IY are hopeless for most operations with memory - they are far too slow and awkward. A good compiler writer would probably make his/her compiler do something like

ld hl,c_addr            ; the array address
ld de,(i_addr)          ; the counter value
add hl,de
ld a,0
ld (hl),a               ; 10+20+11+7+7 = 55t (per byte)

which is about 25% faster without breaking a sweat. Nevertheless, this is far from great Z80 code, even though I made my i variable static to make my life - and the compiler's - easier.

A good Z80 programmer would simply write the equivalent loop as

         ld hl,c_addr
         ld b,10
         xor a
loop:    ld (hl),a
         inc hl
         djnz loop

The actual full loop would take (7+6+13)*10-5 = 255 t-states, i.e. about 25.5 t-states per byte. And this is really not optimized code; this is the kind of code one writes where optimization does not matter. One can do partial unrolling, or make sure that array c does not cross a 256-byte boundary and replace INC HL by INC L. The fastest filling is actually done using the stack. In other words, the Z80 does not fit the C paradigm.

Of course, one can write a similar loop in C (using a pointer instead of an array, and a countdown loop instead of counting up), which would then increase the chances of it being translated into decent Z80 code. However, that would not be you writing regular C code; that would be you working around the limitations of C when it is meant to be translated into Z80.

Let me give you another example.

Local variables.

Raffzahn is correct when he says that one does not have to use the stack for local variables. But there must be a stack of some kind if you want recursive functions. So let us try to do it the PC way, via the stack. How do you implement a call to something like

int inc(int x) {
  return x+1;
}

Suppose even that current value for x is in one of your registers, say HL. So, you'd have something like

push hl
call addr_inc
...

How do we actually recover the address (and value) of x? It is stored at SP+2. However, we have to be careful with SP, because we want to return back to the calling program, so maybe we do something like

addr_inc:   ld hl,2
            add hl,sp
            ld e,(hl)
            inc hl
            ld d,(hl)               ; 10+11+7+6+7 = 41t

Now we have x in DE. You can see how much work this was.

So, when people complain about C compilers for Z80, they do not mean it would not be possible to do. It is something else entirely. In any kind of programming, there are patterns, some are good, some are not so good. My point is, a lot of things that C does are simply bad patterns from the point of view of Z80 coding. One simply does not do things on Z80 that C pretty much requires you to be fluent at.

introspec
  • 5
    IDK about Z80 but if the compiler uses 16-bit for such i values then it's a garbage compiler. Most modern compilers for 8-bit microcontrollers know to optimize for those cases when you don't take i's address – phuclv Mar 29 '18 at 04:30
  • 1
    It is easier on modern mirocontrollers (which one do you have in mind btw?), because they tend to have more general purpose registers. On Z80 there is a lot of specialization for almost every register, so one simply have to exploit those, but explaining such heuristics to the compiler is very, very hard. – introspec Mar 29 '18 at 05:27
  • Nothing really hinders a compiler writer using your proposed constructs - They just don't/didn't do it because it would make the compiler way more complicated. – tofro Mar 29 '18 at 06:47
  • 1
    The 8051 has only 1 register which is the accumulator (not counting SFR). PIC also has 1, along with a register file. Among the common 8-bit MCUs only AVR has 32 registers – phuclv Mar 29 '18 at 08:43
  • 20
    Re, "Your compiler is pretty much required to use 16-bit value for i." Simply not true. Any modern compiler would be smart enough to know that the values of i in your example all fall in the range 0..9, and any modern compiler would be smart enough to allocate whatever register was the most appropriate to hold those values and use them as array indices. The only question is, whether any compiler exists with that much smarts, and the ability to target the Z80. – Solomon Slow Mar 29 '18 at 14:13
  • 1
    @jameslarge - right. Also worth noting that modern compilers are perfectly able to analyze a program to discover whether a function actually is called recursively, and optimize it differently accordingly. In fact, it would even be simple enough to partition functions into sets that to do not need to be reentrant with each other (assuming a single thread) and therefore can share the same statically allocated memory for their internal variables. I'm not sure if any do because that's not a common enough requirement today that it would be worth anyone doing it, but it wouldn't be hard. – Jules Mar 29 '18 at 17:14
  • 2
    @Jules Some C compilers I know for the Z80 have a #pragma rec and #pragma norec that would instruct the compiler to either hand over parameters on the stack or in registers (the first some, at least) – tofro Mar 29 '18 at 18:45
  • 8
    I find it curious that no C compilers I know of for the Z80 or 6502 have an option to handle local variables the way PIC and 8051 compilers do--by statically allocating them so that variables that may be used simultaneously get different addresses, but those whose lifetimes don't overlap can be overlaid. The logic isn't hard, and it could greatly improve the efficiency of generated code for functions that don't need to support recursion or re-entrancy. – supercat Mar 29 '18 at 18:46
  • Of course. Although the true fast option is ld sp,data+10 : ld de,0 : push de : push de : push de : push de : push de – introspec Mar 29 '18 at 23:12
  • 9
    Compilers already know how to turn array-indexing into pointer-increments, and do so to save a register, and to reduce the size of the instruction on x86 (where an index takes an extra byte). Also other advantages, like not breaking micro-fusion on Sandybridge-family or being able to use the port7 AGU on Haswell for stores. It's entirely reasonable to expect a compiler to make a loop like your inc hl / djnz loop for this case where the trip-count is a compile-time constant. Somewhat reasonable otherwise. – Peter Cordes Mar 30 '18 at 03:51
  • I added an example of how modern Z80 compilers do this kind of loops as a separate answer below, just to illustrate my point. – introspec Mar 30 '18 at 07:04
  • @introspec: A compiler could only use SP in that fashion if it knew there was no possibility of an untimely interrupt or NMI. Unless a compiler has an "I promise no interrupts will occur here" directive, I see no way a compiler could possibly know that. – supercat Mar 31 '18 at 23:37
  • @supercat True, but a lot of great Z80 code is using stack in similar ways. To me this is simply one more reason why I do not believe I'll ever see great Z80 code from a C compiler. – introspec Apr 01 '18 at 10:31
  • @supercat, in fact, there are even assembly-based techniques to use stack similarly to how I used it and simultaneously allow interrupts to happen. I'd require convention on the register use which is not all that impossible for the compiler to implement, esp. since compilers do not tend to be all that good at using registers anyway. – introspec Apr 01 '18 at 12:53
  • @introspec: In many systems a compiler won't have control over everything that's running. Some particular CP/M machines, for example, might rely upon an interrupt routine stored in ROM to scan the keyboard. An ordinary CP/M compiler would have no way of knowing about such things unless it targeted a very particular machine. – supercat Apr 01 '18 at 14:27
  • I think this only reinforces your point but if int is 16 bits, your code needs to multiply the value in de by 2 (or shift it left 1 bit) before adding it to hl. If the array was an array of structs whose size is not a power of 2, well, you have a full on multiplication to do. – JeremyP Apr 04 '18 at 13:16
  • 7
    @phuclv Most modern compilers for 8-bit microcontrollers know to optimize for those cases when you don't take i's address -- modern 8 bit microcontrollers typically have somewhere between 32 and 128 general purpose registers. The Z80 has 6(ish), and 2 of those basically have to be reserved for use as a pointer for almost all nontrivial code. This gives compilers for those architectures a lot more scope to optimize. – Jules Jun 19 '18 at 21:40
  • 1
    @Jules 8051, AVR and PIC all have only 1 register as accumulator. Of course the latters have a register file but still not as good as a register block like normal processors. Anyway how what you said is relevant here? The compiler simply needs to check if the variable range fits in a char and then it wastes one less register and has much less work to do – phuclv Jun 20 '18 at 00:57
  • 1
    You're right about IX and IY - they are better suited as frame pointers or for accessing structure members (in both cases needing only fixed offsets). I suspect that's why the indexing registers are provided. – Toby Speight Jun 20 '18 at 12:48
  • 2
    @TobySpeight: I suspect IX and IY may have been architected at a time when the Z80 was expected to have an 8-bit ALU that wouldn't take 5 cycles to perform an effective address calculation. For the design to be efficient with a 4-bit ALU, it needs to start performing the effective address calculation before fetching the primary opcode. That might have been accomplished if e.g. DD mapped DE and HL to IX and IY, and FD fetched a follow-on displacement byte, added +/- 64 to either IX or IY (selected using a bit of that byte) while fetching the next instruction, and then used that... – supercat Jun 20 '18 at 15:14
  • 2
    ...in place of any address [either SP or HL] used in that instruction. On the Z80, trying to load DE with (IY+n)--a common operation if IY is a frame pointer--would take six bytes of code (hardly great) but 38 cycles (gaaak!). Having FD fetch the displacement before the next opcode byte would have saved 8 cycles off that, and an ability to modify an instruction that loads or stores 16 bits (e.g. a push or pop) would have cut the sequence to three bytes and 18 cycles--a better than 50% savings, making IY much more useful as a frame pointer. – supercat Jun 20 '18 at 15:23
  • 1
    Another detail which makes me suspect that the architects of the Z80 expected it to include a better ALU is the use of 8-bit signed displacements. With a faster ALU, processing an indexed addressing more or relative jump with an 8-bit displacement could have had a speed advantage vs using a 16-bit displacement, but the ALU was slow enough to essentially negate any such advantage. – supercat Jun 20 '18 at 15:40
  • 1
  • @supercat - I seem to remember the Avocet compiler did overlaid variable allocations, but maybe I'm mistaken. – Jeremy Nov 29 '19 at 13:29
  • 1
    @Jeremy: I don't know about many compilers from that era, but I think the C Standard did a disservice to 8-bit platforms by regarding recursion as a "required", rather than highly-recommended, feature. Under the One Program Rule, an implementation that correctly processes one suitably-contrived program that uses recursion but would behave identically if recursive calls were ignored would not have to handle any other programs that use recursion, so there's no real requirement that recursion be handled usefully. If the Standard had indicated that quality implementations... – supercat Nov 29 '19 at 19:02
  • ...should support recursion when practical, that would have allowed 8-bit implementations to generate much more efficient code. If Avocet used overlaying for variables, good for them. Any idea what tools from that era would be freely distributable today? – supercat Nov 29 '19 at 19:04
  • The transformation you described - converting the loop index to a pointer - is stock-in-trade of modern code generators, and it has little to do with the Z80 specifically, and everything to do with the repertoire of transformations the code generator can use when optimizing the code. Such transformations are usually general and will be used in any situation where they produce better code, on any target. – Kuba hasn't forgotten Monica Mar 25 '20 at 02:30
  • For a fast memory clear, I always used to set the first byte of the area to zero, set HL to the start, DE to the start plus 1, BC to count -1 then use LDIR – Nick Craig-Wood Sep 25 '20 at 13:18
  • @NickCraig-Wood, yes I often do the same. It is a good size-optimized solution that is faster than probably any loop structurally similar to the for-loop we used as the starting point. – introspec Sep 25 '20 at 16:10
  • @Kubahasn'tforgottenMonica: Ironically, on some platforms gcc and clang are prone to convert code that uses a loop index into code that uses a marching pointer, even if the loop index code would have been more efficient. – supercat Jun 07 '21 at 19:13
59

The main reason for "historic" CPUs' (non-)suitability for C programs is their lack of capability to combine more than one register into an address without going through the ALU.

Most more modern CPUs have base + index + offset register addressing modes for accessing complex data structures like arrays and structs, and use dedicated address-calculation units for the various addressing modes. The Z80 instead has to go painstakingly through its 4-bit ALU to add an offset and an index to a base register like HL.
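
As a sketch of what that means in practice (a hand-written illustration, not the output of any particular compiler): even with the 16-bit index already in a register pair, forming the address of an `int`-sized array element costs several trips through the ALU:

```
; Sketch: HL = address of data[i], with the 16-bit index i in DE.
; A CPU with base+index(+scale) addressing does this in one instruction.
    ld   hl, data      ; HL = base address of the array  (10 T-states)
    add  hl, de        ; HL = base + i                   (11 T-states)
    add  hl, de        ; HL = base + 2*i, int scaling    (11 T-states)
    ld   e, (hl)       ; fetch low byte of data[i]
    inc  hl
    ld   d, (hl)       ; fetch high byte of data[i]
```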

Another reason is the lack of true multipurpose registers - you simply cannot do everything with every register on the Z80. Its raw register count is somewhat impressive, but the alternate register set is probably too awkward for a compiler to exploit, so the choice of registers actually available to a compiler is limited. This applies even more to the 6502, which has even fewer registers.

Yet another downside: you can't get a decently modern C compiler for the Z80 - clang and GCC with their aggressive optimizers don't bother with such old CPUs, and hobbyist efforts are just not that sophisticated. Even if you could, GCC and clang concentrate on optimizing for code locality, something a CPU without a cache cannot even benefit from, but which really boosts a modern CPU.

I personally don't think (even non-optimal) compilers are useless for old CPUs - there is always a lot of stuff in a program that isn't fun to write anyhow and is just tedious in assembler (and, after all, the only reason we would still do this is fun, isn't it?). So I tend to write the boring, non-time-critical parts of a program in C, and the "fun" parts in assembly. The best of both worlds.

tofro
  • 34,832
  • 4
  • 89
  • 170
  • 6
    I did just that in a Z80 (Spectrum and others) game I wrote. The core gameplay was in assembler, stuff like the leaderboard and help logic was in C. – Rich Mar 29 '18 at 02:54
  • 1
    "Most more modern CPUs can use base + index + offset register addressing" - I don't think so. That doesn't apply to modern architectures like ARM, MIPS, PowerPC, Sparc... – phuclv Mar 29 '18 at 04:26
  • 3
    @LưuVĩnhPhúc What do you consider LDRLS x,[r1,r0,LSL #2] then (ARM)? – tofro Mar 29 '18 at 05:14
  • 4
  • 5
    ARM is not really a RISC ISA. It's somewhat RISCy, or shares some of their features, like fixed-width instructions (except Thumb2...), but an ISA with an instruction that does anywhere from 1 to 16 loads or stores depending on bits in a bit-field in the instruction is not a RISC. (I'm talking about ARM's push {r4, r5, r6, ..., lr} aka STMDB and corresponding pop instruction. The load/store-multiple instructions are microcoded because they're too complex and do a variable amount of work.) – Peter Cordes Mar 30 '18 at 03:22
  • 4
    Being a load-store architecture with fixed-width instructions is necessary but not sufficient to really be fully RISC. Not every non-x86 architecture is RISC. ARM definitely doesn't fall neatly into the CISC category either, but it's not RISC. DarkShikari (x264 lead developer for several years / asm expert) argues this pretty well: https://www.reddit.com/r/programming/comments/8j25z/what_are_the_disadvantages_to_an_arm_chip/c09fx2h/, saying "ARM was RISC... a long, long time ago." (but the ARM ISA has evolved and grown). – Peter Cordes Mar 30 '18 at 03:28
  • 9
    @PeterCordes Pretty nice example of a dogma (RISC) colliding with reality - and reality wins ... except, of course, with the dogma's priests. – Raffzahn Mar 30 '18 at 12:35
  • @Raffzahn: exactly. ARM is not RISC, but it has the good kind of complexity, which is not too hard for CPU designers to implement and which lets you get a lot done with fewer instructions. They dropped some of it for AArch64, though; e.g. no longer spending 4 bits per insn to predicate every instruction on flags. And no longer exposing the program counter as one of the general-purpose registers. And replacing store/load-multiple with store/load pair (stp / ldp). So they kept the feature of having instructions that write two integer registers; x86 CPUs take 2 uops for those (e.g. mul) – Peter Cordes Mar 30 '18 at 16:42
  • 2
    Being RISC or not is beside the point. Lưu Vĩnh Phúc was asserting that ARM (which xe called out specifically) cannot use base+index+offset register addressing, contradicting where the answer said that "modern CPUs" (well, "most more modern CPUs") could. – JdeBP Mar 31 '18 at 09:40
  • ... and that was wrong anyways. See my direct answer in the comment. – tofro Nov 04 '22 at 10:40
43

Simple answers one easily gets to this question are The Z80 Sucks and C Sucks - depending on which side someone is on. While both are, of course, untrue (*1), there are real issues. The major arguments, on both sides, are that

  • C is at core tied to a PDP-11(ish) CPU architecture and the Z80 isn't one.

  • The Z80 is a rather special CPU, created with a focus on maximizing capabilities, not beauty.

  • C is a language with no, or at best a very minimal, runtime (*2).

All these points are linked. As the question mentions, C implies a simple and rather symmetric pointer model, which originates in what the PDP-11 offered. This includes the direct conversion of pointers to memory addresses, which in turn allowed skipping the creation of a more sophisticated data model and allowed pointers to realize functionality that would otherwise be handled by some language runtime.

Now the Z80 (like its predecessor, the 8080) is quite able to perform everything needed. Due to its (inherited) structure with a single memory pointer, it does, however, need to replace a single (PDP-11-ish) C operation with several machine instructions. So far, not a real issue - except that when an assembly programmer looks at the result, he immediately sees Z80-specific ways to improve it, like holding two pointers and exchanging HL/DE when needed. That's hard to 'understand' for a C compiler, as it is based on semantics - the knowledge of 'why' something is done - not just being told 'how' it's done.
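
For instance, a hand-written Z80 copy loop (a sketch, not compiler output) can keep both pointers in registers and flip between them with the 4-T-state EX DE,HL - exactly the kind of trick a compiler working instruction by instruction rarely finds:

```
; Sketch: copy B bytes, source pointer in HL, destination in DE.
copy:   ld   a, (hl)    ; A = *src
        ex   de, hl     ; swap roles: HL = dst, DE = src  (4 T-states)
        ld   (hl), a    ; *dst = A
        inc  hl         ; dst++
        ex   de, hl     ; swap back: HL = src, DE = dst   (4 T-states)
        inc  hl         ; src++
        djnz copy       ; repeat B times
```

(On a real Z80 one would of course reach for LDI/LDIR here; the point is only the register-exchange idiom.)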

It is not strictly a C problem,

but an issue with all high-level languages. They compile best to a simple, symmetric CPU model with a set of equal resources, offering exactly the operations the abstraction layer needs. The higher the language's abstraction, the better the underlying 'CPU' level can perform. That's why the UCSD p-code system performed so well across many platforms: the offering of its virtual CPU was exactly what a compiler wants. Despite being an interpreter at its core, performance was, on many machines, comparable to native code generated from the same language source. The reason for this platform optimization lies within the interpreter. Here, each rather abstract function gets performed by optimized routines. A string move might have the same invocation (due to the p-code) across all platforms, but its implementation is CPU-specific, using every advantage the specific CPU offers - like the mentioned trick on the 6502 of working with 8-bit index registers and only increasing the 16-bit memory base pointer every 256th iteration. Operating at a greater level of abstraction in a language allows the compiler and/or runtime to employ greater optimization than fixing low-level detail within the source code.
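
The page-wise trick just mentioned (an 8-bit index in the inner loop, with the base pointer bumped only once per 256 bytes) can be expressed in C terms. This is a hedged sketch of the idea, not code from any actual UCSD runtime; `clear_pagewise` is an invented name:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a page-wise clear: the inner loop needs only an 8-bit index
   (cheap with the 6502's indexed addressing), and the 16-bit base pointer
   is adjusted just once per 256-byte page. Invented illustration, not
   code from any real runtime. */
static void clear_pagewise(unsigned char *base, size_t len) {
    while (len >= 256) {              /* whole pages first */
        unsigned char i = 0;
        do {
            base[i] = 0;              /* 8-bit indexed store */
        } while (++i != 0);           /* 256 iterations: i wraps 255 -> 0 */
        base += 256;                  /* one 16-bit pointer bump per page */
        len  -= 256;
    }
    for (size_t i = 0; i < len; i++)  /* leftover partial page */
        base[i] = 0;
}
```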

C, in turn, exaggerates this by being tied to very specific low-level operations and using them all over, in every application source, mostly without an intermediate runtime layer. In this respect C is far less a high-level language than others, and far more prone to CPU-specific issues.

Learning from History

Looking back (*3), the last 30 years show two developments that bridged the gap between less-than-'simple' CPUs and too-simple languages. The 8086 family is not only an important, but eventually the best, example of changes on the CPU side, as it was not a simple CPU to begin with. Sure, compared to the Z80 it is much more powerful and symmetric - but still not as simple as C assumes it to be.

Over time, the x86 not only got instruction-set additions such as scaling factors to move array-index calculations into microcode, but the whole CPU was redesigned so that instruction sequences are analyzed, reordered and reformed to make C-like operations perform better. Bottom line: the 8086 became more PDP-11-ish. One way to close the gap.

At the same time, C Standard development worked hard to define a common set of data types and functions on them that can now be used by the compiler to get a glimpse of the why instead of the how. These source statements (may) no longer be directly translated into function calls, but can be used by the compiler to generate different, more specialized, target-optimized code. In the end, a way to make C a bit more high-level than originally intended.

What's the Lesson for Z80 Users?

Well, one might be not using C at all :) (*4)

Another, more practical, way is to go the same path that standard C is doing: Use more task-specific high-level functions and optimize them (in assembly) for the Z80.

The last would be to optimize existing C compilers for the Z80 to generate code structures that embrace the CPU - for example, with different ways of parameter passing depending on how functions are used, and so on.


BTW: The 6502's short call stack is often cited here, but it bears no relation to C. C doesn't require the use of the return stack for parameters; they can just as well go on a separate parameter stack. In fact, strictly speaking, C doesn't require a stack at all.

C does require some way of bookkeeping for nested calls, some way of passing parameters (of undefined length), and a way to handle local variables. How this is done is up to the compiler (or its creator). Using a hardware stack is one (simple) way, but not necessarily the best on a given CPU.
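
To illustrate the separate-parameter-stack point, here is a toy C model of the scheme (roughly what cc65 does on the 6502, as the question notes). All names here - `pstack`, `psp`, `push16`, `pop16`, `add_callee` - are invented for this sketch:

```c
#include <assert.h>

/* Toy model of a software parameter stack, kept separate from the CPU's
   hardware return stack. Invented illustration, not any compiler's ABI. */
static unsigned char pstack[64];   /* parameter stack memory              */
static unsigned char psp = 64;     /* parameter stack pointer, grows down */

static void push16(unsigned v) {   /* caller pushes a 16-bit argument */
    pstack[--psp] = (unsigned char)(v >> 8);
    pstack[--psp] = (unsigned char)(v & 0xFF);
}

static unsigned pop16(void) {      /* callee pops a 16-bit argument */
    unsigned lo = pstack[psp++];
    unsigned hi = pstack[psp++];
    return (hi << 8) | lo;
}

/* A "callee" that fetches its two arguments from the software stack;
   the hardware stack would hold only the return address. */
static unsigned add_callee(void) {
    unsigned b = pop16();
    unsigned a = pop16();
    return a + b;
}
```

A caller would do `push16(2); push16(3);` and then call `add_callee()`, which leaves the parameter stack balanced again.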


*1 - As a 6502 and Assembly guy I do feel deep down they are not false :))

*2 - No, the C library isn't a runtime that is part of the language: it is a collection of standard functions, itself (almost) completely written in C, and compiled and linked in at build time.

*3 - Looking back is rather rare in IT, but this is Retrocomputing - we not only indulge in nostalgia but also try to learn from history, don't we?

*4 - A serious choice could be Ada. Due to its declarative nature, code generation can be much better optimized for individual CPUs. After all, one of the main goals of Ada's development was the ability to produce good code not only for mainframes but also for little bastards like the 8048. There were several dedicated Z80 compilers during the 1980s; the most prominent may be RR Software's Janus/Ada 83. While no longer mentioned, there was also a Z80 version.

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • 2
    Also you can optimize your C code in non-intuitive ways to take advantage of your target architecture. Anyway, C and C++ are slowly adding ways for programmers to signal their intentions to the compiler, such as the range-based for loop in C++11 and the uint_fast8_t datatype in C99. – snips-n-snails Mar 28 '18 at 20:05
  • 8
    okay but not ADA, Ada. It's a noun, not initials. – Jean-François Fabre Mar 29 '18 at 08:02
  • 2
    I have used numerous C compilers that required a certain amount of run-time support. An example would be a compiler that emits a call to memcpy(...) when you write a = b; and the type of a and b is some struct data type. Another example would be a compiler that emits a call to a software floating-point library when you write a+b. – Solomon Slow Mar 29 '18 at 14:06
  • 5
    In what sense is Ada 'declarative'? It's an imperative programming language. – Max Barraclough Mar 29 '18 at 14:55
  • @jameslarge - most C compilers I've encountered are able to make do without their library. Which ones have you worked with that can't? – Jules Mar 29 '18 at 17:23
  • @MaxBarraclough Well, what about defining the classic traffic light as a sequence of distinct states (R,RY,G,Y), assigning these 4 states to three bits and spreading them out over non-consecutive port bits or ports. Not exactly imperative, right? Let's be honest, these seemingly clear distinctions may have had some truth in the Fortran vs. Prolog times, but language development, with Ada as a forerunner, left that cuddly state behind a long time ago. – Raffzahn Mar 29 '18 at 17:37
  • 5
    The Z80 sucks a bit and C sucks a bit, but contemporary C compilers sucked a lot. Yesterday I tried compiling a simple C program with Hisoft C on a Spectrum +3. What a pain! And the code sucked. A much better compiler could be developed, but it would take a lot more effort (and be less enjoyable) than just continuing to code in assembler. – Bruce Abbott Mar 29 '18 at 19:33
  • @BruceAbbott :)) Looks like there's an open niche for a good (or at least acceptable) 8-bit compiler. Let's start another failed project ... or should we rather invest that time in creating a usable assembler? After all, most of those suck as well. – Raffzahn Mar 29 '18 at 19:38
  • @Jules, Depends what you mean by "make do." I don't remember the names or version numbers of C compilers that I used twenty and thirty years ago (this is "retrocomputing," right?) but of course you could always compile your code. Whether or not you could link it would depend on whether you used any language features that needed run-time support. I remember using the stock compiler on a 68020 box running some Unix variant, sometime in the late 1980s, to cross-compile for a bare-metal 68000. I had to write about six small assembly language routines, in order to be able to link my C code. – Solomon Slow Mar 29 '18 at 19:53
  • 3
    I think another way of stating the point in your answer is that writing efficient code for the Z80 requires taking registers into account when choosing what order to do things in, so you (or the compiler) don't have to swap HL / DE or spill/reload things as often. Even given a smart compiler, it might not always be able to prove enough things to reorder / transform operations, so writing code that compiled efficiently would require thinking about how it would compile in more detail than for a more orthogonal compiler target. i.e. mentally design your program in asm, then write C. – Peter Cordes Mar 30 '18 at 03:41
  • 1
    But that's not the kind of C that usually gets written, and even then there's probably be things that were hard to express in C in a way that let the compiler do what you want in asm. (The only reason to jump through hoops like this instead of just writing asm in the first place is to have C you can compile for other platforms.) And this idea is predicated on the existence of a good compiler that spots all the Z80 tricks the C source allows, which I assume doesn't exist. – Peter Cordes Mar 30 '18 at 03:43
  • Ada's main advantage in optimizing is that it discourages aliasing (there's no "get me a pointer to this" operator), and the strict rules mean the compiler knows much more about the code than a C compiler does. That's not an advantage that can't be overcome by a C compiler-writer willing to put in more work, but it is more work. – T.E.D. Mar 30 '18 at 13:42
  • @T.E.D. One can write good, clean and error-free code in any language. I'd even say Assembly is the best one to do so. It's all up to the programmer not to jump over thoughts. – Raffzahn Mar 30 '18 at 14:02
  • 3
    @Raffzahn - Quite true. Its also true that a good craftsman can build a quality house with nothing but hand tools. – T.E.D. Mar 30 '18 at 15:46
  • @T.E.D. Not sure how that's related, but yeah :)) – Raffzahn Mar 30 '18 at 15:51
  • A lot of the answer is predicated on footnote #2, and footnote #2 is just plain wrong. Several standard library functions, including string and memory operations, are regularly implemented by compilers as intrinsics. To state that the C standard library is a library of callable functions is incorrect. They are not required to be functions in an object code library, and in the cases relevant to the argument propounded in this answer, are regularly not. Furthermore, they most definitely are CPU-specific, and regularly written in assembly language targetting the features of the specific CPU. – JdeBP Mar 31 '18 at 09:52
  • @JdeBP You might want to read it as a whole before ranting out of context. The footnote you're citing refers to the way C is made. It's a core feature of C to reduce the need for CPU-specific compilers and libraries as much as possible. The answer well recognizes that many tweaks like the ones you mentioned have been added to some compilers. But the question is about Z80 and C, not about what has been done on a PC. So if you've got the knowledge, why not go ahead and make that true for the Z80, instead of ranting about theoretical issues? – Raffzahn Mar 31 '18 at 10:56
  • I already read it, thank you, and it is as I stated. You've just got this quite wrong, and making personal comments just demonstrates a lack of any good counterargument. It is not a core feature of C to reduce the need for these libraries. On the contrary, it is a feature of the C standard library to make available stuff that is not best implemented, or even possible to implement, in the C language, and its string and memory operations are even held up quite often as examples of this. Your argument based upon the idea that say memmove is not encapsulating CPU-specific stuff is very flawed. – JdeBP Mar 31 '18 at 11:31
  • 1
    @JdeBP Serious? Then why were things like memmove originally implemented in C? Before trying to rewrite history in retrospect, learning about it is a good idea. – Raffzahn Mar 31 '18 at 12:44
  • 3
    @JdeBP I think you are arguing with today's compiler technology and philosophy against stuff from 20, 30 years ago. The use of compiler intrinsics for standard features like memcpy et al., for example, only started seriously about 10-15 years ago. So for a present-day gcc or clang you are absolutely right; for a HISOFT C compiler in 1985, quite not so. – tofro Mar 31 '18 at 12:55
  • The first two bullet points are false (C is not tied to PDP-11 -like architectures and the Z80 is not really special). The third bullet point is an advantage for small processors. A minimal C runtime consists of a stack and that's it. – JeremyP Apr 04 '18 at 13:27
  • 1
    @JeremyP: A minimal conforming C implementation on most platforms would also need to contain code to initialize all objects of static duration. – supercat Jun 20 '18 at 15:41
  • @supercat I wouldn't normally consider the program's own prologue code as part of the runtime. – JeremyP Jun 21 '18 at 12:42
  • @JeremyP: I would regard a C program as consisting of user written functions, ordinarily-callable library functions, and "the runtime", and "main()" as being a user-written function just like any other. As such, the runtime would consist of code to perform all static-object initialization that occurs before main(), as well as functions to do things like 32/32 division if the program happens to need them (which a lot of programs wouldn't). – supercat Sep 17 '18 at 15:36
  • 1
    @supercat I'm going to modify my previous statement slightly: if a user writes int a = 5; at global scope, that is part of the programmer's code as much as any function definition. If they just write int a; again that is part of the programmer's code - making use of implied semantics. However, other parts of the prologue e.g. setting up the stack, adding the command line arguments to it, are part of the runtime. – JeremyP Sep 19 '18 at 13:11
  • @JeremyP: On many implementation that load programs into RAM from an executable file or other medium on startup, the mechanisms that loads programs will initialize all static objects with values stored in that file before executing any of the instructions contained in it. Some others allocate const-qualified static objects in ROM, so that their value is effectively "hard-wired" into the machine, and set up another area of ROM to hold a bit pattern that can be copied to the "initialized static objects" area of RAM to initialize all static objects as a group. – supercat Sep 19 '18 at 19:11
  • @JeremyP: While I've used some implementations where "int a=5;" would cause code equivalent to a "mov word [_a],5" to be executed at startup, they're really quite rare. In most cases, static initialization will either be handled by a program loader (in which case it's definitely not part of a user program, since it's not even in the same executable!) or a startup routine equivalent to memcpy(__init_static_RAM, __init_static_ROM, __init_static_size);, is hard-wired.other than constants which get filled in by the linker. – supercat Sep 19 '18 at 19:16
  • @supercat Initialised variables will be compiled into a segment in the executable just like the machine instructions in the functions you define. If you are claiming that something loaded by the loader is not program code, you must exclude all the compiled functions as well. – JeremyP Sep 21 '18 at 15:04
  • 2
    @JeremyP: Many "small processor" systems have the entire "executable" stored in read-only memory, using a "loader" which is disconnected from the rest of the system before deployment. The table of initial values for non-const static objects is part of the program, but its contents will need to be copied to RAM before the user program starts. I would call the code that performs such a copying operation part of the runtime which is bundled with the implementation, since the only aspect which can vary between programs are the ranges of addresses to copy, which programmers generally won't know. – supercat Sep 21 '18 at 15:21
  • 1
    You forgot the third simple answer: Compilers suck (-: – hippietrail Oct 04 '19 at 09:36
  • Apologies for the late comment, but why 6502, a (not-so) RISC-ish retro 8-bit cpu, is considered less suitable than Z80 for HLL like C? REF: https://www.xtof.info/coding-c-8-bit-6502-cpu.html – Schezuk Apr 14 '21 at 03:32
  • @Schezuk Nice writeup, except it seems as if the author hasn't really thought it through fully - like citing 'limits' that apply to all 8-bit CPUs as 6502-specific. But yeah, everyone is entitled to an opinion. – Raffzahn Apr 14 '21 at 05:16
  • @Schezuk The main problem is that the hardware stack is only 256 bytes in size. The other main problem is that the 6502 only has one general purpose register and it's only 8 bits wide. Raffzahn was correct to say that C doesn't require a stack, but the only two effective means of passing parameters and arguments that are in general use are using the stack or the registers. – JeremyP Jun 05 '21 at 21:46
  • @Schezuk Raffzahn is just wrong. The only way to get a variable onto the stack is to load it into the accumulator and then put it on to the stack byte by byte. You could use zero page to create a 16 bit stack and also in lieu of registers, but it's not as fast as having multiple registers in your CPU. That's pretty much the top and bottom of it. – JeremyP Jun 05 '21 at 21:57
  • 1
    @JeremyP I wouldn't focus so hard on common implementations. After all, the issue is about passing a parameter (list). This isn't restricted to any of the methods you list. It could be done with whatever one can think of - like simply some memory shared between caller and callee, right? They just have to agree on where to find it. And in quite a lot of cases the parameter list is rather static, so why build it anew every time? Maybe let the compiler do it? After all, all this was used long before stacks were a common resource. Keep in mind, implementation is free and new every day. – Raffzahn Jun 05 '21 at 23:09
  • @Raffzahn The point is that all the other methods are slow in comparison to the two common methods. With C, you can't just designate a shared memory area because functions can be called recursively. At some point you need a structure that looks somewhat like a stack. If you've got a hardware stack, it's faster. Hardware stacks, by the way, have been around since at least 1961. – JeremyP Jun 06 '21 at 08:38
  • @JeremyP Not really. The stack with push/pop of registers is just a convenient one-size-fits-all solution - with all the unpleasant side effects thereof. Recursion is not the standard case. The point is simply that it doesn't need a single structure with register-based push/pop operations. Dynamic creation via push is inefficient on nearly all CPUs, whether supported by hardware or not, and it's only needed in a minority of cases. BTW: the stack was invented in 1957, while the most influential CPU line of all time, created in 1965, didn't have one - and iterations thereof still work great today. – Raffzahn Jun 06 '21 at 14:32
  • 1
    @JeremyP: The Standard does not require that an implementation support any particular level of function nesting, nor does it say anything about what happens if an implementation's function-nesting limit is exceeded. Many useful C compilers for platforms like the 8051 and PIC simply disallow recursion, and are consequently far more efficient than they could be if recursion had to be supported. – supercat Jun 07 '21 at 19:15
19

Well, I personally find it annoying reading so many comments here about what modern compilers supposedly can and cannot easily do. It is terrible what wishful thinking does to your brain. OK, let me show why people who still remember how to code Z80 hate C compilers. This is the trivial C code that I was hoping to compile:

int i,data[10];
main() {
  for (i=0; i<10; i++)
    data[i]=0;
}

This is the Z88DK output using zcc -O3 -a trivial.c:

._main
    ld  hl,0    ;const              ; i=0
    ld  (_i),hl
    jp  i_5
.i_3
    ld  hl,(_i)                     ; i++
    inc hl
    ld  (_i),hl
    dec hl
.i_5
    ld  hl,(_i)                     ; if i>=10 GOTO i_4
    ld  de,10   ;const
    ex  de,hl
    call    l_lt
    jp  nc,i_4

    ld  hl,_data                    ; HL = data + i
    push    hl
    ld  hl,(_i)
    add hl,hl
    pop de
    add hl,de

    ld  de,0    ;const              ; (HL) = DE
    ex  de,hl
    call    l_pint
    jp  i_3
.i_4
    ret

I am not counting t-states, and I am not including the code for the case when i and data[10] are declared as char, because I do not have a goal of embarrassing the compiler authors.

OK, maybe SDCC can do better? At least it can deal with char data type in a sane way. So we create

char i,data[10];
main() {
  for (i=0; i<10; i++)
    data[i]=0;
}

and SDCC compiles it using sdcc -mz80 --opt-code-speed into

;trivial.c:21: for (i=0; i<10; i++)
    ld  hl,#_i + 0
    ld  (hl), #0x00
    ld  bc,#_data+0
00102$:
;trivial.c:22: data[i]=0;
    ld  hl,(_i)         ; 16T
    ld  h,#0x00         ;  7T
    add hl,bc           ; 11T
    ld  (hl),#0x00      ; 10T
;trivial.c:21: for (i=0; i<10; i++)
    ld  iy,#_i          ; 14T
    inc 0 (iy)          ; 23T
    ld  a,0 (iy)        ; 19T
    sub a, #0x0a        ;  7T
    jr  C,00102$        ; 12T when taken

So, the addition of a char to a pointer is done in 16 bits, the index registers are used for some unknown reason, but otherwise this at least begins to look like an assembly program. So, ignoring the preamble and just counting t-states per iteration of the main loop from 00102$:

16+7+11+10 + 14+23+19+7+12 = 119 t-states per byte

As a comparison, this is what relatively inefficient assembly code may look like (I wrote it to match very closely what my C for-loop implies, so that a compiler would at least have a chance of getting it right):

         ld hl,data_addr
         ld a,0
loop:    ld (hl),0
         inc hl
         inc a
         cp 10
         jr nz,loop    ; 10+6+4+7+12 = 39t

If the counter is allowed to run in the opposite direction, a similar loop in my other answer to this question does the job in 25.5 t-states per byte. The fastest Z80 code for memory filling can average below 10 t-states per byte, but this is not an exercise in memory filling - it is a simple test of what some trivially simple code tends to be compiled into.
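
For what it's worth, the same loop can be restated in C with a marching pointer and a down-counter - the shape that corresponds most directly to a tight Z80 loop (roughly ld (hl),0 / inc hl / djnz), if a compiler were willing to take the hint. A sketch only, with no claim about what any particular compiler actually emits:

```c
#include <assert.h>

/* The same 10-byte clear, restated with a marching pointer and a
   down-counter instead of an indexed store and an up-counter. */
static void clear10(unsigned char *p) {
    unsigned char n = 10;
    do {
        *p++ = 0;           /* store, then advance the pointer */
    } while (--n != 0);     /* count down to zero */
}
```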

So, this is my brutally honest answer to your question why people like myself say that C compilers for Z80 produce poor code: BECAUSE THEY DO.

Toby Speight
  • 1,611
  • 14
  • 31
introspec
  • 4,172
  • 1
  • 19
  • 29
  • 1
    Just to finish off the thought, presumably if you were writing it yourself you'd store a zero byte then LDIR the rest? Without being explicit, it's not likely to be clear to everyone why 119 is a bad number. – Tommy Mar 29 '18 at 20:12
    Just check out my other answer here, where I wrote a really pedestrian loop that works at 25.5 t-states per byte, i.e. almost 5 times faster - and I am not even counting the code sizes (maybe I should). I did not use LDIR because it is much further from the C for-loop semantically (i.e. no hope of getting it from compilers any time soon). – introspec Mar 29 '18 at 20:15
  • 1
    Actually, not too worth getting involved in whether a C compiler should use LDIR here, because I think the answer is likely to be: it should, but you should use memset or some other overly-specific take on the example when the point is clear as is. But I just meant: to the casual reader, coming along and reading this answer, you assert that the generated code is awful — and I'm not disputing that — but it might be more convincing if you showed non-awful code for comparison. That's all. No dispute as to information and data stated. – Tommy Mar 29 '18 at 20:23
  • Got you. Will add a comparison code. – introspec Mar 29 '18 at 20:24
  • The addition of a char to a pointer has to either be done as 16 bits or done using the accumulator with conditional logic for a page crossing. The code there uses 3+2+1=6 bytes and 14+7+11=32 T-states to convert "i" into "data+i" in HL, which doesn't seem very good, but unless one reworks the code to use a marching pointer I don't see much way to improve it given the limitations of the Z80's instruction set; I think the code could be more efficient if "i" were a union of an "int" and a "char", and the array subscript used the "int". – supercat Mar 30 '18 at 01:32
  • 3
    Compiler writing has two parts: parsing and code generation. Nobody complains about the parsing, everyone complains about the code generation (CG). Basically, it is just straight CG - no optimization. CG is a dark art - you have to know the instruction set very well and how to optimize. You'd probably expect good CG from a large corporation but not from a one-man band: this is only a part time thing and they have day jobs. They also have to write the support for most of the common headers and supporting libraries. That is a task in itself. Then there is the linker. – cup Mar 30 '18 at 05:16
  • @cup, but this is precisely why I do not like the attitude of so many people pretty much taking these optimizations for granted, saying that they should be there. It was not easy even during the days of commercial use of Z80, it is probably next to impossible nowadays, when the number of (paying) users is close to zero. – introspec Mar 30 '18 at 06:59
  • 1
    @introspec: my comments on other answers saying what modern compilers (e.g. for x86) can do were making the same point that you are here. Efficient compilation would be possible given a smart optimizing compiler, so the terrible code-gen from real Z80 compilers is more a result of massive missed-optimizations, not of C being inherently impossible to compile efficiently (although C source with multiple pointers used at once would be a problem!) – Peter Cordes Mar 30 '18 at 15:55
  • 1
    e.g. a Z80 backend for modern gcc or LLVM could do a lot better cross-compiling from a powerful computer (if anyone put in the amount of development time it would take to find target-specific optimizations, too), vs. real historical Z80 compilers. Writing an optimizing compiler is a huge challenge / amount of work. My point was always that compilers could do whatever optimizations (and do for x86 / ARM / whatever), not that any good Z80 compilers exist or could be made easily. – Peter Cordes Mar 30 '18 at 15:58
  • @Peter Cordes, I think that we are saying the same thing, only you see it positively and I do not :) There is a big gap in my mind between "could do" and reality. Almost uncrossable. The answer by aralbrec shows what can be done if the modern compiler is used inventively, but even then the gap in performance is massive. – introspec Mar 31 '18 at 19:35
  • Yeah, I'm saying "possible in theory", but yes there's a huge gap filled with practical obstacles like getting any funding for the person-years of dev time it would take. Lots of the examples thrown around here could compile efficiently, but lots of existing C code with lots of indirection through pointers probably couldn't compile very well (without AI levels of whole-program transformation going way beyond what real compilers like gcc currently do for AVR (limited pointer registers but many total registers), and even then probably not strictly possible according to the as-if rule). – Peter Cordes Mar 31 '18 at 19:45
13

The answer to this question is bound to be opinion-based, and would ideally be written by a specialist who designed a Z80 C compiler. I will give it a try though.

I used the MSX-C compiler, made by ASCII together with Microsoft, back in the old 80s-90s days; the platform was MSX. I do not recall whether it used the stack to pass arguments, but it would be logical, given that the compiler can use IX and IY, assigning them to the stack pointer and addressing arguments byte-wise through (IX+n). I am fairly sure that Turbo C 2.0 for the PC XT/AT, which I used back in the 90s, did the same using the BP register.

One remarkable thing I recall from using MSX-C is that its output was not Z80 code, but 8080 code. Most probably the compiler was originally designed for the 8080 and then just ported to the Z80, and thus was not aware of the IX and IY registers.

Regarding the (IX+n) and (IY+n) instructions: n is a signed byte, so you can address -128 to +127 bytes from the base of the index register. Also, n must be a constant, so changing it at run time is only possible in RAM, by patching a byte of the executable code (self-modifying code), which is another level of optimization that most probably was not considered in those old days.
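The constant-displacement restriction means (IX+n) maps naturally onto C struct member access, where offsets are compile-time constants, but not onto array indexing with a runtime index. A hypothetical C illustration (not from the original answer; function and type names are made up):

```c
#include <stdint.h>

struct point { uint8_t x; uint8_t y; };  /* member offsets 0 and 1: constants */

/* A compiler can point IX at *p and read the members directly
 * as (IX+0) and (IX+1), since the displacements are fixed. */
uint8_t sum_fields(const struct point *p) {
    return (uint8_t)(p->x + p->y);
}

/* Here the displacement i is only known at run time, so (IX+n)
 * cannot be used; the address arr+i must be computed in HL instead. */
uint8_t get_elem(const uint8_t *arr, uint8_t i) {
    return arr[i];
}
```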

So what are the reasons C fits badly?

My personal opinion:

  • For old compiler software developed back in the day, compiler developers were (1) focusing on the reliability of the compiler's output and (2) the speed of compilation, while (3) keeping in mind that the register set is not big enough to allow much optimization.
  • New compiler software must either be developed by real enthusiasts who are also experts in compilers (which is, to my knowledge, a specialized field of computing), or there must be a commercial interest (though it is questionable whether that is possible these days).

So what are the reasons C fits badly?

In general I would like to see an example. MSX-C did its job in four steps (yes, four!).

  1. CF.COM parsed the C code, creating an intermediate output file;
  2. CG.COM was the "code generator", which generated an assembly language text file;
  3. M80.COM assembled that into a .REL object file, which was then
  4. linked by L80 with other object code (e.g. libraries).

There are pros and cons to this architecture, and there were probably also historical reasons for it. CF and CG are about 30-40KB each, so you cannot "merge" them into one executable, because it would simply not fit into RAM (not to mention the work area). M80 used human-readable assembly text files, so the programmer had an opportunity to look at the assembly code and get an idea of what the real executable would look like and what could be done to improve it, or to inject their own assembler routines at the linking stage.

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
Anonymous
  • 1,296
  • 8
  • 11
  • 5
    "another level of the optimization which most probably was not considered those old days" -- I think self-modifying code was much more likely to be considered back then than now, to be honest. I don't know of any compilers that did it for this purpose, but the virtual machine for Smalltalk (developed circa 1978-1980 IIRC) definitely used it for optimizing the need to use indirect calls to object methods. But then that was aimed at a minicomputer-type processor that was somewhat more capable than the Z80. – Jules Mar 28 '18 at 16:15
  • 2
    In those days a lot of programming was focused on ROMs rather than execution in RAM. You are right, this technique was considered and used a lot; I am not sure about compilers though, as at minimum there should be some compiler flag telling the compiler that the application is going to run from ROM. – Anonymous Mar 28 '18 at 18:16
  • old compilers such in optimization, for most architectures. That's why people often had to hand-optimizing C code using "weird" techniques – phuclv Mar 29 '18 at 04:39
  • *suck in optimization. For example Turbo C. Modern compilers have more optimizing capability, even automatically multithreading with openmp or autovectorization – phuclv Mar 29 '18 at 08:59
  • ASCII is a character encoding. Do you mean ANSI C? – Rosie F Mar 29 '18 at 10:39
  • 1
    @RosieF no, I meant ASCII-C http://msx.hansotten.com/software/msx-c-manual/ made by Japanese company called ASCII https://en.wikipedia.org/wiki/ASCII_Corporation together with Microsoft back in 80s. Probably you are right, and I must change to MSX-C. Will edit the answer. – Anonymous Mar 29 '18 at 11:10
  • Putting multiple compilation steps into separate executables makes perfect sense given limited RAM. Modern portable compilers like gcc and clang still have those separate steps, e.g. front-end that parses C (or fortran, or Go, or Rust) and turns that into an internal representation of program logic (LLVM-IR or gcc's GIMPLE). Then (the step missing in Z80 compilers :P) run some optimization passes on that GIMPLE, before handing to target-specific code-generation functions that transform it into asm for whatever target platform (x86, AArch64, PowerPC), then link. – Peter Cordes Mar 30 '18 at 16:15
  • gcc really does write out text assembly language as part of normal compilation, and pipe that into a separate assembler program. clang and most other compilers have an assembler "built in", and only emit text asm when you ask for it with a command line option. But anyway, in modern gcc only steps 1 and 2 are merged into one executable, and are still separate steps within that program. The gcc command itself is a front-end that runs the compiler / assembler / linker as needed depending on input files given. Use gcc -v -O2 foo.c to see the steps. – Peter Cordes Mar 30 '18 at 16:18
  • 1
    The four steps were not unusual - it made it possible to do. It was not until Turbo Pascal and C that the single step all-in-memory compiler was demonstrated possible.and one of the reasons that the Turbo products became very popular. – Thorbjørn Ravn Andersen Dec 23 '18 at 11:50
13

While the Z80 is definitely an 8-bit processor rather than a 16-bit one, its instruction set makes some operations easier with 16-bit values than with 8-bit values. For example, something like a=b+c+d;, with all variables being 16-bit types with static duration, could be realized as:

    ld  hl,(_b)
    ld  de,(_c)
    add hl,de
    ld  de,(_d)
    add hl,de
    ld (_a),hl

but trying to do it as 8 bits would require a different approach:

    ld  a,(_b)
    ld  hl,_c
    add a,(hl)
    ld  hl,_d
    add a,(hl)
    ld  (_a),a

It's possible to generate efficient code if all operations use 8-bit math or if all use 16-bit math, but 8-bit and 16-bit operations require totally different approaches, and trying to combine them gets awkward (e.g. if b and c were 16-bit values, but d was an 8-bit one, the most efficient way to add d would be to load it and the following byte into DE, then clear D, and then add DE to HL). If a compiler wants to try to handle 8-bit math efficiently, it will have to use code generation logic that's very different from what's needed for 16-bit math, and a lot of compiler writers aren't going to want to massively increase the size of their code generator for that.
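The awkward mixed-width case described in the paragraph above can be written out as a small C fragment (a hypothetical illustration; the variable names follow the answer's example):

```c
#include <stdint.h>

uint16_t a, b, c;   /* 16-bit operands with static duration */
uint8_t  d;         /* the odd one out: 8-bit */

/* On the Z80, d must be widened to 16 bits before the addition
 * (e.g. loaded into E with D cleared), so mixing widths forfeits
 * most of the savings the 8-bit operand might have offered. */
void mixed_add(void) {
    a = b + c + d;  /* C promotes d to int before the adds anyway */
}
```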

supercat
  • 35,993
  • 3
  • 63
  • 159
  • That's better, yet addresses only 2 of the 3 items in my comment: mov is not part of the usual Z80 ASM syntax. The second paragraph of code still does not make sense. Only one Z80 ADD operation cannot add b+c+d. – Stéphane Gourichon Sep 17 '18 at 15:04
  • Your code seem to assume _a _b _c _d are fixed address variables. I would do ld a,(_b) ; ld b,a ; ld a,(_c) ; add b ; ld b, a ; ld a,(_d) ; add b ; ld (_a), b. Is that what you meant? – Stéphane Gourichon Sep 17 '18 at 15:08
  • This answer is interesting because it highlights what is observable in the last column of the table at http://z80-heaven.wikidot.com/instructions-set:ld : immediate 8-bit value at (NN) can only be loaded to A, while immediate 16-bit value at (NN) can be loaded to BC, DE, HL (or even SP, IX or IY). – Stéphane Gourichon Sep 17 '18 at 15:13
  • @StéphaneGourichon: Some processors (e.g. 6502) seem very good at allowing direct addresses to be used for almost anything, some others (including the ancient CDP1802 and modern ARM) allow them to be used for basically nothing, and some (e.g. Z80) allow their use in some instructions, but not others, somewhat arbitrarily. I've sometimes wondered whether it might have been practical for the Z80 to have used one of the opcode-extension prefixes as an address modifier that would substitute a direct address for (HL) in the following instruction without disrupting the value in HL. – supercat Sep 17 '18 at 15:24
  • @StéphaneGourichon: I think the number of opcodes with an ED prefix is small enough that, if they'd been placed differently, an ED prefix might have been able to serve such a role. The sequence ld hl,_c / add a,(hl) wouldn't have been any smaller than add a,(_c) [since the latter would require a prefix byte] but in many cases could improve the efficiency of surrounding code by allowing a useful value to be kept in HL. – supercat Sep 17 '18 at 15:27
  • @StéphaneGourichon: I think used ld twice for add in the second example, but it should be fixed now. Your approach using ld a,(direct) and ld (direct),a would work as I assume you meant it (ending with ld (_a),a) but it's seven instructions totaling 15 bytes, while the approach loading the address of _c and _d into HL and then using add a,(hl) would be six instructions totaling 14--saving a byte and four cycles, but at the expense of trashing HL. – supercat Sep 17 '18 at 15:32
  • 1
    @StéphaneGourichon: Incidentally, after writing the answer above, I discovered that while the Z80 has some 8-bit and even 16-bit internal data paths, its primary ALU is only 4 bits. An instruction like INC HL uses a 16-bit limited-purpose ALU which takes two cycles to perform an operation, but INC HL takes six cycles because that ALU gets used twice during each instruction fetch (once to increment PC, and once to increment R), thus requiring that the two cycles actually performing the operation get added to that. – supercat Sep 17 '18 at 20:12
  • 1
    @StéphaneGourichon: Something like INC A actually requires using the four-bit ALU twice, but it's faster than INC HL because both operations can be done at the same time as the 16-bit ALU is being used to increment PC and R. – supercat Sep 17 '18 at 20:13
11

The Motorola 6809 is probably the only legacy CPU of the 80's which is well suited to a C compiler, thanks to several features that were advanced for the time:

  • orthogonal instruction set
  • rich addressing modes
  • hardware multiplier, to quickly compute addresses
  • position-independent code

This kind of CPU (and the improved 6309) can be found in some home computers (Vectrex, Tandy CoCo, Thomson, ...) and a lot of embedded systems.

Emmanuel
  • 111
  • 2
  • 4
    Indeed I remember magazine articles at the time saying the 6809 was designed for C, although I have no idea how authoritative those articles were. However, the question is about the Z80. – Chenmunka Mar 29 '18 at 15:28
  • 1
    @Chenmunka that would be a strange argument, as C wasn't any important language back then. Even less a reason to make a CPU fit it. But yes, the 6809 was (much like the 8086) especially designed with high level languages producing linkable modularized code in mind. – Raffzahn Mar 29 '18 at 20:19
  • 2
    This could answer the question with a little re-wording. These are features that the 6809 had that made it well suited to C, but what features does the Z80 not have that makes it not well suited? – wizzwizz4 Mar 31 '18 at 08:10
  • You said "80's" but you appear to have meant "tail end of the 1970's". Plenty of CPUs tolerably suitable for C code came out during the decade of the 1980's, targeting all of PCs, workstations, and special purposes. Many of their descendants continue to run such code today. – Chris Stratton Mar 31 '18 at 19:41
  • 1
    @ChrisStratton: Perhaps he meant the one 8-bit CPU of that era. Microchip has added some features to some of their 8-bit line in an effort to make them compiler-friendly, though IMHO they made some significant missteps in their design. – supercat Mar 31 '18 at 23:41
  • The 68000 was widely used in the '80s: Sun, Atari, Commodore, Apple... – Gaius Apr 02 '18 at 12:45
  • 1
    I would disagree there. The 6801/3/6303/68HC11 is well suited for C code generation (unlike the 6800) as they added TSX and ABX instructions as well as PSHX/PULX to fix the gaps a compiler needed. A 6303 at 2MHz can beat the crap out of a Z80 at 8MHz with C code because you can do things like '16 bit add an offset from index to accumulator' in 2 or 3 cycles and get the stack pointer into the index register in one. The 8085 also has a superb instruction set for C but Intel chose not to document the 8085 extensions presumably due to the 8086 coming out. – Alan Cox Dec 06 '19 at 14:49
  • @AlanCox: I've done C programming on the HC11, and there was a substantial difference in performance between functions whose locally-declared objects were all static, versus those which used automatic-duration objects. The HC11 may have been less bad than the 6800, but the 6809 could access the first 16 bytes in a stack frame using stack-relative addressing using only one more cycle and zero more bytes than would be required for zero-page direct addressing. – supercat Jun 07 '21 at 19:36
  • @supercat With gcc I don't see much difference. It's one instruction (TSX or TSY) to then be able to index the top 256 bytes of the stack via an index register. On the 6303 TSX is even pipelined nicely. So the cost on a 6803 is TSX LDD 5,X versus LDD _xyz and both are 3 bytes. A good 630x/680x/68HC11 compiler also uses direct page accesses for register variable equivalents and hot locals. Where the 6809 wins is the fact you don't have to consume a register even temporarily for some stack accesses and having more registers as well as increment/decrement side effects and crucially also lea – Alan Cox Jun 08 '21 at 23:10
  • @AlanCox: On the Introl compiler I used for the HC11, automatic variables were indexed using Y, leaving X available for general-purpose indexing. This added an extra prefix byte (and associated instruction fetch cycle) on every access to an automatic object. Using direct page for hot locals improves efficiency enormously, and I defined macros for compiler-specific qualifiers that did just that, but such an approach can only be used with functions that will not be invoked recursively. – supercat Jun 08 '21 at 23:19
  • @AlanCox: Further, while I don't remember the precise details, the 68HC11 required a function prologue that was a bit bulky, but could be omitted if a function didn't use any automatic objects. If I recall, the 6809 can simply use LEA with the stack pointer as a destination, but the 68HC11 had no such ability. – supercat Jun 08 '21 at 23:21
  • @supercat So the introil compiler was not a good 68hc11 compiler but an ok one. 680x function prologue depends a lot on the space. For small amounts or where you are assigning values the cost is the same (LDX PSHX), where you don't it's a bit messier as there is no SBX only ABX (so cleanup is nice). Most functions have only a small amount of local data so the generated code is often a sequence of PSHX which is far less pretty. – Alan Cox Jun 08 '21 at 23:28
  • @AlanCox: The Introl compiler always generated correct code. I've used probably a dozen C implementations, and found at least one real code generation or library bug in most of them including compilers form CCS, HiTech, Borland, Apple (MPW), TI, gcc, and clang. The Introl compiler, however, never generated erroneous code. As far as I'm concerned, that puts Introl well ahead of competing vendors. The TI problem seemed to have been an incompatibility with Windows NT handled file buffering between streams, since the assembly language file would get mangled, but it's code generation... – supercat Jun 09 '21 at 02:16
  • ...seemed sound and a later version was compatible with Windows NT. The gotcha I hit with Borland's Turbo C 2.0 was its mishandling of e.g. fprintf("%1.1f",99.99, which it would erroneously output as 00.0 rather than 100.0 [a variant of the infamous "Windows 3.11-3.1" bug]. MPW would use a 16-bit signed subtract to allocate stack frames between 32768 and 65535 bytes in size. CCS was just plain buggy and HiTech occasionally failed to set banking bits properly. – supercat Jun 09 '21 at 02:19
  • @AlanCox: I wish there were a compiler that chip vendors could supply that placed higher priority on correctness than gcc and clang seem to. I really like the Keil compiler I use for Cortex-M0 and M3, but non-evaluation licenses are rather pricey. I can't remember any bona fide bugs I found with that implementation, beyond the fact that code which attempts to use VLAs without making provision for the heap, or stdout without making provision for it, will build successfully but crash on startup (and such issues are more annoyances than bugs). – supercat Jun 09 '21 at 02:22
10

A lot of the existing answers feel more like showing off with code golf and an only partially-justified objection to the quality of C compilers based on historically-bad implementations. However, a third-party Z80 backend now exists for LLVM and clang, so it is instructive to see what a state-of-the-art compiler and optimiser, and best-efforts hobbyist codegen look like.

The TL;DR of my response is "yeah, it's not optimal thanks to the lack of registers, lack of fancy indexing modes, and non-orthogonal instruction set, but 'poor code' is also subjective and it's possible to have something which is good enough and certainly better than the output of shoddy 1980s compilers."

Anyway, here's my test file:

int data[10];

void zerodata() { for (int i=0; i<10; i++) data[i]=0; }

This is similar to, but not the same as, some of the examples given in other answers. In particular, i is a local variable. If it were a global, it would take extra space in the data section, and extra code would have to be added to store its final value after the loop completes. So we see the first reason why C can produce worse results than assembly: people who are experts at writing tight assembly language may not be so hot at C and accidentally leave performance on the table.
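For comparison, the global-counter variant warned about above would look like this (a hypothetical rewrite of the test file, not code from the original answer):

```c
int data[10];
int i;  /* global: lives in the data section, and its final
           value must be stored back to memory after the loop */

void zerodata_global(void) {
    for (i = 0; i < 10; i++)
        data[i] = 0;
}
```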

If I compile that with -Os, this is what comes out:

    ld  hl, _data
    xor a
    ld  de, 20
    push    de
    push    hl
    call    _memset
    pop hl
    pop hl
    ret

Modern compilers do very sophisticated loop analysis and can readily detect memory copies and initialisations. In this case, the function has been turned into a call to memset(data, 0, 20). This is a good optimisation, since memset is inevitably hand-tuned assembler.

So we already see one way in which (this implementation of) C on the Z80 is less performant than it could be: it pushes parameters onto the stack rather than passing them in registers. This function does not take parameters, so you do not see that code here, but unpacking parameters bloats a function's preamble in a way that register passing would not.

Disabling use of memset with a function attribute gets us this:

    ld  de, 0
    push de
    pop iy
LBB0_2:
    push    iy
    pop bc
    ld  hl, _data
    add hl, bc
    ld  (hl), e
    inc hl
    ld  (hl), d
    ld  bc, 2
    add iy, bc
    push    iy
    pop hl
    ld  bc, 20
    or  a
    sbc hl, bc
    jp  nz, LBB0_2
    ret

Now it's doing an actual loop and we're seeing how it's handling the lack of orthogonality in the instruction set and the assumptions in LLVM. IY is being used for i, but actually contains 2*i to avoid a multiplication/shift inside the loop, and DE contains the 16-bit constant 0. On each iteration of the loop it computes _data + IY, stores DE at that address. Then it adds 2 to IY. Finally it computes IY - 20 and loops if this is not yet zero. What can we conclude here? I would certainly call this "poor code".

I'll not post the asm, but the same function compiled for ARM actually computes the one-past-end address of data and then does a decrement loop on i because ARM actually has a base-minus-shifted-index addressing mode to store zero into &data[10] - 2*i in a single instruction. It uses r0 through r2 for temporaries.

On the Z80, LLVM tries a different approach due to the lack of that addressing mode. I don't think it has tried terribly hard, but we can see that it's struggling due to the non-orthogonality of the Z80 instruction set and the lack of indexed addressing modes. Finally, it's essentially treating the Z80 as a 16-bit CPU, because it's not aware that 8-bit operations are much smaller and faster.
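As hinted in the comments on an earlier answer, one way to help such a compiler is to rewrite the loop with a marching pointer, so no data + 2*i address computation is needed per iteration. This is a hand-optimization sketch, not code from the original answer:

```c
int data[10];

/* Marching pointer: a compiler can keep p in HL and walk it with
 * INC HL between stores, instead of recomputing data + 2*i on
 * every iteration as in the generated code above. */
void zerodata_ptr(void) {
    for (int *p = data; p < data + 10; p++)
        *p = 0;
}
```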

For fun, we can use -O3. This enables loop unrolling which is something we usually want to avoid on memory-constrained systems such as the Z80, but since other answers concentrate on cycle counting, performance is clearly desired over space saving. Here's what emerges:

    ld  hl, 0
    ld  (_data), hl
    ld  (_data+2), hl
    ld  (_data+4), hl
    ld  (_data+6), hl
    ld  (_data+8), hl
    ld  (_data+10), hl
    ld  (_data+12), hl
    ld  (_data+14), hl
    ld  (_data+16), hl
    ld  (_data+18), hl
    ret

16 cycles per "iteration". That'll do just fine.

pndc
  • 11,222
  • 3
  • 41
  • 64
  • 2
    Very nice new answer to an older question. – davidbak Nov 03 '22 at 16:02
  • 2
    Now that I reread your answer I do have one gripe: "the output of shoddy 1980s compilers". Shoddy? "1. Made of or containing inferior material. 2. Of poor quality or craft. 3. Rundown; shabby." Hardly fair. They were pretty good for the time (if not actually the best that could have been done). For one thing, try running clang+llvm self-hosted on a Z80 machine ... – davidbak Nov 03 '22 at 20:57
  • 2
    @davidbak: Another thing to consider is that even if a compiler's performance was sufficiently poor that one would have to write half of one's code in assembly language to yield tolerable performance, that would still often be a significant improvement over having to write 100% of one's code in assembly language, especially given that the portions of the code where assembly language would offer the most benefit would often, quite conveniently, be the ones that were easiest to write in optimal assembly language. – supercat Nov 04 '22 at 16:45
4

A little bit off-topic, but still:

Even after something like 40 years, I like very much writing assembly for my homebrew Z80/Z180 systems. On the other hand, I just did a minor update for my BASIC compiler, and once again found it very frustrating.

I think the reason is simple: after all, the Z80 is an 8-bit machine. So, even with simple ANSI BASIC (all variables being global), it's a real pain in the neck, as the Z80 just doesn't match an integer data type of 16 bits (or more). An example: a generic bitwise logical operation (on a 16-bit integer) takes 6 instructions and 6 bytes. As for a signed comparison, calling an RTL routine is likely the only feasible solution.

So, in order to reduce the size of the compiled code (and improve performance), the compiler tries to optimize. For example: for an OR with a constant whose MSB or LSB is zero, you can omit some instructions. OK, but coding all those minor optimizations bloats the compiler and makes it really, really messy. And yes, most of the bugs in the compiler have been related to the optimizations (the actual optimization mostly requires just a few things to do, but checking whether or not the optimization is valid in the specific context often requires quite a lot of not-so-simple checking).
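The special case mentioned above can be illustrated in C (a hypothetical example, not from the original answer; the point is that the constant's high byte is zero):

```c
#include <stdint.h>

uint16_t flags;

/* 0x000F has a zero high byte, so a compiler can OR only the low
 * byte of flags and leave the high byte untouched, instead of
 * emitting the full 6-instruction generic 16-bit OR sequence. */
void set_low_nibble(void) {
    flags |= 0x000F;
}
```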

With C, or any modern language, you would need to allocate space for variables on the stack. I wouldn't even dream of trying that, as the amount of required code would be huge and performance very poor.

Interestingly enough, the BASIC I'm using also has an interpreter, which actually is quite OK; the Z80 has just about enough registers to implement a reasonably efficient virtual machine for BASIC. And as a single bytecode can specify e.g. a (16-bit) OR operation, only relatively large programs end up being shorter in compiled form.

On the other hand, with C etc. I might try to write a compiler producing bytecodes.

OldTimer
  • 99
  • 1
  • 1
    If one were willing to settle for a dialect of C that didn't support recursion, a linker could statically place automatic-duration objects in a manner that would allow functions that aren't simultaneously live to share storage. Most compilers for processors that would be less capable of using stack-based variables than the Z80, such as the PIC or 8051, routinely do that, and I think it's a shame Z80 and 6502 compilers didn't provide an option to do likewise. – supercat Jun 07 '21 at 15:54
  • 1
    A lot of the time, the key to getting good performance is to minimize the number of "register spills" or "register half-spills" in a loop, and on a CPU like the Z80 being able to the 8-bit halves of 16-bit register pairs individually can greatly assist with this. If optimal programming could get a loop down to one register spill, compiler-generated code that uses two spills more than optimal would be much slower. If, however, a loop would need a minimum of ten register spills, compiler-generated code that uses two more than optimal would still be about as fast. – supercat Nov 04 '22 at 16:49