8

Contemporary with the old 8-bit CPUs there existed powerful but unsung programmable DMA controllers, like the Zilog Z8410 and the Intel 8257, that could copy data from one memory location to another much faster than the CPU could.

Presumably the benefit of using them increases with the size of the block of data to be copied; if you just need to copy one byte, it is surely faster to do it directly.

What's the breakeven point? How many bytes do you need to be copying before it's faster to use the programmable DMA chip? To be specific, suppose we are talking about a common 8-bit computer system with a 4 MHz Z80, and the Z8410 DMA controller.

rwallace

1 Answer

8

Each DMA transfer must be programmed first, so you need to set some I/O registers of the DMA chip, like:

BYTE DMA mode
WORD source address
WORD destination address
WORD block size

These are usually sent by 8-bit I/O, requiring at least one instruction per 8-bit value, but usually more, depending on how the DMA chip is interconnected to the computer and its interface.

Now, if the duration of this programming code plus the DMA transfer is longer than the CPU transfer, you have found your threshold. On the Z80 you can use LDIR, where you also need to set up source, destination and block size, so the programming part is not that different. But if the DMA chip sits behind some interface like an i8255, it requires much more code to program ... and your question starts to be a valid one.

To ease this up, DMA chips usually provide a command queue, which means you can program several DMA transfers ahead ...

You are still missing one very important point which makes DMA invaluable for some applications: during the DMA transfer, on some platforms, the CPU keeps running and can do work that would not be possible otherwise (but you need to take contention into account, if present).


Anyway, to answer your question you would need to specify the architecture more closely. Let us consider a 3.547 MHz Z80 CPU (ZX128+ with DATAGEAR, which is Z8410-based):

opc      T0 T1 MC1   MC2   MC3   MC4   MC5   MC6   MC7   mnemonic

EDB0     21 16 M1R 4 M1R 4 MRD 3 MWR 5 NON 5 ... 0 ... 0 LDIR

so the theoretical transfer rate on CPU (without contention) is

3 547 000 / 21 = 168904 Byte/s = 164.9 KByte/s

It is a shame Zilog encoded the instruction under the 0xED prefix; without it, LDIR would be 4T faster (17T per byte), leading to 3 547 000 / 17 = 208647 Byte/s.

The 3.547 MHz Z8410 DMA chip has a measured transfer rate of ~865.6 KByte/s (886 350 Byte/s), which is ~5.25 times faster.
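These rates and the speedup ratio are easy to check; a quick Python sketch (the DMA rate of 886 350 Byte/s is the measured figure quoted above):

```python
CPU_CLK = 3_547_000           # ZX Spectrum 128 CPU clock, Hz
LDIR_T = 21                   # T-states LDIR spends per byte copied

cpu_rate = CPU_CLK / LDIR_T   # theoretical LDIR rate, Byte/s
dma_rate = 886_350            # measured Z8410 rate, Byte/s

print(f"CPU : {cpu_rate / 1024:.1f} KByte/s")      # ~164.9 KByte/s
print(f"DMA is {dma_rate / cpu_rate:.2f}x faster") # ~5.25x
```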

So just compare the programming code on the CPU and on the DMA: the setup duration difference must be shorter than the time saved by the DMA transfer.

This is Z80 code programming the DMA to do the same as LDIR, taken from Your Spectrum #04 January 1998:

      org 45000
      ld hl,50000 ; source address
      ld (source),hl
      ld hl,6911  ; block size
      ld (len),hl
      ld hl,16384 ; destination address
      ld (dest),hl
      ld hl,dma
      ld b,length ; size of DMA command stream
      ld c,11     ; DATAGEAR port
      otir
      ret
dma:    defb #C3,#C7,#CB,#7D
source: defw 0    ; source address
len:    defw 0    ; length of data to be transferred
        defb #14,#10,#C0,#AD
dest:   defw 0    ; destination address
        defb #92,#CF,#B3,#87
length  equ $-dma

As you can see, you can hardcode the parameters into an 18-byte defb array:

dma:ld hl,cmd ; 10T address of stored  command stream
    ld b,len  ;  7T size of stored command stream
    ld c,11   ;  7T DATAGEAR port
    otir      ;373T = 17*21 + 16
    ret       ; 10T
cmd:defb #C3,#C7,#CB,#7D
    defw source_address,block_size
    defb #14,#10,#C0,#AD
    defw destination_address
    defb #92,#CF,#B3,#87
len equ $-dma

Here is the same using the Z80 only:

cpu:ld hl,50000 ; 10T source address
    ld bc,6911  ; 10T block size
    ld de,16384 ; 10T destination address
    ldir
    ret         ; 10T

Now just compare the timings of the code without the transfer itself:

DMA: +407T 
CPU: - 40T
----------
diff: 367T

dt = 367T / 3547000 Hz = ~103.47 µs

Now we need to compute the transfer size for which the DMA's time advantage equals this setup overhead dt:

x / cpu_rate = x / dma_rate + dt
x * dma_rate = x * cpu_rate + dt * cpu_rate * dma_rate
x * dma_rate - x * cpu_rate = dt * cpu_rate * dma_rate
x * (dma_rate - cpu_rate) = dt * cpu_rate * dma_rate
x  = dt * cpu_rate * dma_rate / (dma_rate - cpu_rate)
x = 0.00010347 * 168904 * 886350 / (886350 - 168904)
x = ~21.59 Byte

So the theoretical transfer block size threshold is 22 Bytes. Btw the ld b,N and ld c,N can be combined into ld bc,NN, which is 4T shorter and would change the numbers a bit ... Also, ldir can be unrolled into n copies of ldi (16T each), which leads to cpu_rate = 3547000 / 16 = 221687.5 Byte/s, but that is impractical for larger blocks (unless looped).
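The break-even derivation above can be reproduced numerically; a small Python sketch using the 407T and 40T setup costs counted from the listings:

```python
CPU_CLK = 3_547_000
cpu_rate = CPU_CLK / 21        # LDIR transfer rate, Byte/s
dma_rate = 886_350             # measured Z8410 rate, Byte/s

overhead_T = 407 - 40          # DMA setup minus CPU setup, T-states
dt = overhead_T / CPU_CLK      # setup penalty in seconds (~103.47 us)

# solve x/cpu_rate = x/dma_rate + dt for x
x = dt * cpu_rate * dma_rate / (dma_rate - cpu_rate)
print(f"break-even at {x:.2f} bytes")  # ~21.59 -> threshold 22 bytes
```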

[Edit1] more optimized code

DMA:

dma:ld hl,cmd ; 10T address of stored  command stream
    ld c,11   ;  7T DATAGEAR port
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    outi      ; 16T
    ret       ; 10T
cmd:defb #C3,#C7,#CB,#7D
    defw source_address,block_size
    defb #14,#10,#C0,#AD
    defw destination_address
    defb #92,#CF,#B3,#87

CPU:

cpu:ld hl,50000 ; 10T source address
    ld de,16384 ; 10T destination address

    ldi         ; 16T
    ldi         ; 16T
    ldi         ; 16T
    ...
    ldi         ; 16T
    ldi         ; 16T
    ldi         ; 16T

    ret         ; 10T

Timing comparison:

DMA: +315T 
CPU: - 30T
----------
diff: 285T

dt = 285T / 3547000 Hz = ~80.35 µs

x = dt * cpu_rate * dma_rate / (dma_rate - cpu_rate)
x = 0.00008035 * 221687.5 * 886350 / (886350 - 221687.5)
x = ~23.75

So the theoretical threshold for optimized transfers is 24 Bytes.
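The same computation with the unrolled-code numbers (315T vs 30T setup, 16T-per-byte ldi), again as a Python sketch:

```python
CPU_CLK = 3_547_000
cpu_rate = CPU_CLK / 16        # unrolled ldi rate, Byte/s (221687.5)
dma_rate = 886_350             # measured Z8410 rate, Byte/s

dt = (315 - 30) / CPU_CLK      # 285T setup difference, in seconds
x = dt * cpu_rate * dma_rate / (dma_rate - cpu_rate)
print(f"break-even at {x:.2f} bytes")  # ~23.75 -> threshold 24 bytes
```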

Spektre
  • Note that the number of I/O wait states and the number of MREQ wait states can be entirely different, depending on what else besides the DMA chip lives in the system. – tofro Jan 12 '18 at 13:09
  • @tofro Yep, that is what contention is all about. However, in ZX machines the CPU clock is usually put on hold instead of inserting wait states; on top of that, the Z8410 usually BUSRQs the CPU, and additional HW can do anything too .... – Spektre Jan 12 '18 at 13:58
  • Great writeup Spektre. In addition I want to stress the basic fact that all commands (addresses etc.) for the DMA have to go twice over the bus, into the CPU and then into the DMA, while the same command data has to be transferred only once, into the CPU, when done internally. That is a mechanic true for all CPU/DMA combinations, not just the Z80. – Raffzahn Jan 12 '18 at 15:28
  • Performance-wise the OTIR might be unrolled, thus pushing the DMA threshold down to ~20 bytes. At that point it will also be important to see the overall impact. In a real application such a function will be parameterized, which in itself may waste many more instructions in shovelling parameters around before doing any real work. So outside the narrow scope of the core function, the threshold is eventually visibly beyond 30 bytes. (Interesting side note: on a /370 with a simple string move instruction, the average string is just 7 bytes.) – Raffzahn Jan 12 '18 at 15:35
  • @Raffzahn Yep, there are usually push/pops around and some sync timing wasting more time ... On the other hand, DMA commands might get optimized (like setting all the stuff just for the first transfer, while the others could be organized to take advantage of what is already set). 30 bytes looks reasonable; for example, ZX MultiTech is using 32-byte transfers (in a heavily contended region) to multiply attribute resolution by 8. So 32 must be above this threshold. – Spektre Jan 12 '18 at 15:58
  • @Spektre Well, as said, it depends a lot on the environment. As your calculations have shown, in the right place it already pays off early. It may even pay off earlier, when the above routine is part of a program where the needed addresses are already loaded in registers by prior calculation. That of course is next to impossible to describe in a general way. Oh, and then there is also a lower threshold even for using LDI(R): when loading and storing the bytes in question, either as bytes or words, directly is faster than setting up the registers for LDI(R). Roughly this is up to 5-6 bytes. – Raffzahn Jan 12 '18 at 16:07
  • You compare DMA to a CPU using LDIR, which is the most convenient, but not the fastest method. Agreed, you must use something, though. So, for a very clever CPU-based method, the threshold would be higher, possibly even up to twice as high. – tofro Jan 13 '18 at 10:42
  • @tofro I'm not sure what you're trying to tell me. Why does it have to be twice as high, and what is a cleverly designed CPU method in that sense? – Raffzahn Jan 14 '18 at 09:04
  • @Raffzahn the "twice" is purely hypothetical. That would be my threshold for actually investing in DMA. – tofro Jan 14 '18 at 09:24
  • @tofro I added Edit1 with numbers for optimized code (still without contention). The threshold did not change much (22 -> 24). I am not aware of any CPU transfer method faster than unrolled ldi. Of course, taking contention into account might change things a lot, as it is unclear to me whether contention affects DMA at the same rate as CPU transfers (as the DMA spends less time per transfer, it possibly hits a contention block less often, favoring the DMA transfer rate, but that is pure speculation on my side). The DMA rate was measured on a real machine and probably contains contention. – Spektre Jan 14 '18 at 09:52
  • @tofro The CPU rate is without it, so the real CPU rate would be slower, lowering the threshold a bit (but still not by much, I think). – Spektre Jan 14 '18 at 09:54
  • Word of warning: the sequence given is wrong. You don't need the C7 CB to start with, because C3 does that. The C0 AD should be 80 AD, or you enable a half-cocked DMA mid-program. Just verified this on real silicon. – Alan Cox Sep 14 '19 at 22:00
  • @AlanCox I do not have much experience with Zilog's DMA chip; I only wrote an emulator of it from the datasheet (and that was years ago, and it still does not work 100% as it should, as I do not have very good docs, nor experience with or access to the real IC); that is why I used a published sequence instead of my own. However, if you're right, the threshold value would change slightly in favor of the DMA, as the config would be 2 bytes shorter ... – Spektre Sep 15 '19 at 06:43