
Since my other question has been answered and has given me a way to do high-precision timing, I've been experimenting with it a bit. The first thing I did was write a simple benchmark using the FRAMES variable as a time reference. When doing that, I noticed something odd. I wondered if there was any overhead to the REM statement, which there was (albeit very little). However, I noticed the overhead was smaller if there was no text after the REM. The line would execute in 0.04 or 0.05 PAL frames if the line was simply REM, but would take 0.06 PAL frames if the line was REM ANYTHING GOES HERE. It seems not to matter whether the text is one character or a hundred characters: it always takes slightly longer if there is any text at all. This was completely reproducible all the dozens of times I tried it. Note that I am currently using an emulator; however, it is cycle-accurate and so should behave identically to a real Sinclair ZX Spectrum running 128 BASIC. The benchmarking code is below:

  10 LET T1=PEEK 23672
  20 LET T1=PEEK 23673*256+T1
  30 FOR I=0 TO 100
  40 GO SUB 130
  50 NEXT I
  60 LET T2=PEEK 23672
  70 LET T2=PEEK 23673*256+T2
  80 LET TD=T2-T1-67
  90 IF TD<0 THEN LET TD=0
 100 PRINT "SECONDS","FRAMES"
 110 PRINT TD/5000,TD/100
 120 STOP
 130 REM LINE GOES HERE
 140 RETURN
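For reference, the arithmetic this program performs can be restated in Python. This is a sketch of the calculation only: FRAMES is the three-byte counter at 23672–23674 that advances once per PAL frame (50 per second), and the program reads only its low 16 bits. The modulo wraparound handling below is an addition of the sketch; the BASIC version instead clamps negative results to zero in line 90.

```python
# Sketch of the timing arithmetic in the BASIC program above.
# FRAMES (23672-23674) counts PAL frames at 50 per second; the
# program reads only the low 16 bits (addresses 23672 and 23673).

FRAMES_PER_SECOND = 50  # PAL

def elapsed_frames(t1, t2):
    """Delta between two 16-bit FRAMES readings.
    Wraps every 65536 frames (about 21.8 minutes)."""
    return (t2 - t1) % 65536

def per_iteration_seconds(td, iterations=100):
    """Line 110 prints TD/5000: TD/iterations frames at 50 frames/s."""
    return td / iterations / FRAMES_PER_SECOND
```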

What could be causing this behavior? Why would the presence of text in a REM statement affect the time it takes to execute in any way? I can understand why the overhead is not zero, but not this. My suspicion is that it is related to tokenization. Perhaps the interpreter reads the token and the rest of the text separately, regardless of whether the text is to be interpreted. I can't verify this.

Why does REM have less overhead than REM FOO in 128 BASIC?


EDIT: I tested this more with different ROMs and BASIC versions, and I've concluded that there must have been a problem with my test. I thought I had already tried this code and reproduced the results, but when I tried again and actually wrote down the results, it turned out there is no change in the number of frames that elapse when interpreting REM with or without accompanying text. It seems the only reason my previous code gave the results it did was the time it took GO SUB to jump to its target, or something along those lines. I used this code:

  10 LET T1=PEEK 23672
  20 LET T1=PEEK 23673*256+T1
  30 FOR I=0 TO 1000
  40 REM THIS IS A COMMENT
  50 NEXT I
  60 LET T2=PEEK 23672
  70 LET T2=PEEK 23673*256+T2
  80 PRINT T2-T1

I ran this on 48K, as well as on 128K with both its 48 BASIC and 128 BASIC interpreters, replacing line 40 with a bare REM and also removing it altogether. The results of 1000 loops of this test, in PAL frames elapsed, are recorded below. Clearly, something was wrong with my previous methodology.

                         +REM +text   +REM −text   −REM −text
  48 BASIC, ZX48         241          241          221
  48 BASIC, ZX128        238          238          219
  128 BASIC, ZX128       372          372          311
  48 BASIC, Pentagon     227          228          209
  128 BASIC, Pentagon    338–343      338–343      286–288

The results are telling. When not using GO SUB, there is no overhead incurred by adding any text to REM (but still some for interpreting the command itself, albeit not much). 48 BASIC is the fastest, though slightly slower on 48K than on 128K running 48 BASIC. 128 BASIC was by far the slowest.
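For a sense of scale, the per-statement cost of the REM itself can be estimated from the table: the difference between the +REM and −REM columns, spread over 1000 loops. A quick sketch using the table's frame counts (one PAL frame is 20 ms):

```python
# Per-statement cost of a REM (empty or not), derived from the
# table above: (frames with REM - frames without) over 1000 loops.

MS_PER_FRAME = 1000 / 50  # PAL: 20 ms per frame

results = {  # interpreter: (frames with REM, frames without REM)
    "48 BASIC, ZX48":   (241, 221),
    "48 BASIC, ZX128":  (238, 219),
    "128 BASIC, ZX128": (372, 311),
}

for name, (with_rem, without) in results.items():
    cost_ms = (with_rem - without) * MS_PER_FRAME / 1000
    print(f"{name}: ~{cost_ms:.2f} ms per REM statement")
```

This puts the cost of interpreting a REM at roughly 0.4 ms in 48 BASIC and about three times that in 128 BASIC.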

I will do more tests to try and find out what caused the previous behavior that I misattributed to REM.

forest
  • You might want to restructure your test program a bit and have the test routine right at the start. With like 1 GOTO 1000, 10 test program from here until 998, and 999 RETURN. This minimizes GOSUB overhead in any BASIC interpreter. After all, you want to measure the target, not the GOSUB. | This works due to the way a GOSUB searches for its target. The great part is that this structure will never have a negative impact, no matter what BASIC is used. With most BASICs it will greatly reduce the search time, as only one line (#1) has to be skipped to find the subroutine start. – Raffzahn Feb 25 '18 at 11:10
  • (Not enough space to describe the details here, but you may as well open a question about the way GOSUB works in different BASIC implementations :)) – Raffzahn Feb 25 '18 at 11:15
  • Question, does it make any difference being in 48 or 128 BASIC? – Rui F Ribeiro Feb 25 '18 at 15:35
  • I haven't tested 48 BASIC yet, only 128. – forest Feb 26 '18 at 01:54
  • @Raffzahn that's what the -67 part is. Testing showed 67 frames worth of overhead for 100 iterations of an empty loop (i.e. jumping straight to the RETURN). – forest Feb 26 '18 at 01:59
  • That won't work, as the time it takes to find a line depends on the number of lines to be searched. For a reliable test environment you want to keep this constant, rather than relying on an assumption. – Raffzahn Feb 26 '18 at 03:40
  • @Raffzahn Well this is for a single line. I'll recompute it if I start adding a lot of lines before the RETURN. – forest Feb 26 '18 at 04:02
  • You mean tinkering around and assuming that results are comparable is preferable to a robust test setup? I wouldn't dare to call that engineering... – Raffzahn Feb 26 '18 at 04:12
  • @Raffzahn With PAUSE especially, it's possible to determine at least an approximate amount of overhead. Currently I don't have any other ways to robustly test performance from the BASIC interpreter. When the math is adjusted so a hundred iterations over a single blank line reports 0 frames passed, then it's safe to say that (almost) any other lines can have their delays measured with at least a rough accuracy. It's certainly not a robust test setup, but it's all I have when working with the constraints of the BASIC interpreter. – forest Feb 26 '18 at 04:29
  • I also tested to make sure more lines between the target of GO SUB and RETURN do not affect the benchmark by putting in a number of sample lines and testing to make sure the delay increased in a linear fashion (so e.g. the RETURN statement doesn't take significantly longer if it has to jump up a couple lines farther). – forest Feb 26 '18 at 04:31
  • Of course you have. The basic issue here is whether you want to wiggle with parameters every time something gets changed, hoping to get a comparable result afterwards, or rather go for an engineering approach, identifying the relevant pieces and then eliminating them, so the timing-relevant parts of the setup don't change when the program gets changed either way. After all, eliminated elements don't need to be approximated in the first place, do they? - Also, a RETURN will always take the same time, as its target is taken from the return stack, not searched for. – Raffzahn Feb 26 '18 at 05:19
  • @Raffzahn How would I go about performing a more accurate benchmark within the constraints I have? All I could see was 1) measuring the overhead of the loop itself and 2) checking that benchmarking multiple lines causes a linear increase in the time taken, in effect ensuring that the overhead of the loop is independent of the lines being benchmarked. If that is not the case then of course I am doing it wrong if I try to benchmark anything larger than one line. I am not familiar enough with the Z80 ISA or the Spectrum ROM to directly insert more precise benchmarking code into the interpreter. – forest Feb 26 '18 at 05:27
  • @Raffzahn You were right. When I removed GO SUB as a variable to the equation, the behavior changed (though I still don't know why it had that behavior in the past). See my edit. – forest Feb 27 '18 at 02:49
  • @forest When designing a test setup it's never about more accurate, it's always about less noisy. Never trust people who give you numbers to the 3rd or more decimal. That's pseudo-accuracy. When looking at a setup, always try to eliminate other influences. Like you did when removing the GO SUB. Or, if there are other reasons (like an extendable structure), minimize the influence of any noise. That was my approach. Bottom line, no one needs 5 valid digits. All you need is up to three, but solid ones. – Raffzahn Feb 27 '18 at 14:44

2 Answers


I just dug out a copy of The Complete Spectrum ROM Disassembly in order to see what's actually going on. The answer is a bit more complex than you'd expect.

The main interpreter loop is implemented as a three stage process:

  • First, the line starting from the current execution point is scanned for either : or a newline. The first byte, which contains the token for the statement, is looked up in a table of statements, which controls the next stages.
  • Then, parameter parsing takes place, which is selected by a list of parameter classes
  • Once parameters are parsed, the execution routine for the command is jumped to.
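Schematically, the three stages can be sketched as follows. This is an illustration of the dispatch structure described above, not the ROM logic; the tokens, parameter-class lists, and handlers are invented for the example:

```python
# Schematic sketch of the three-stage statement loop described above.
# Tokens, parameter classes and handlers are invented for illustration.

def do_rem(args):
    # REM's execution routine does nothing with its operands.
    return None

def do_print(args):
    print(*args)

# Statement table: token -> (parameter class list, execution routine)
STATEMENTS = {
    "REM":   (["rest-of-line"], do_rem),
    "PRINT": (["expr-list"],    do_print),
}

def execute_statement(stmt):
    # Stage 1: the statement has already been delimited by scanning
    # for ':' or end-of-line; its first token selects a table entry.
    token, _, rest = stmt.partition(" ")
    classes, handler = STATEMENTS[token]
    # Stage 2: parameters are parsed according to the class list
    # (trivially here: everything after the token is one operand).
    args = [rest] if rest else []
    # Stage 3: jump to the command's execution routine.
    return handler(args)
```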

In the case of REM, parameter class 5 is used, which is for a list of operands of varying kinds and therefore defers processing to the command implementation itself.

The REM statement's implementation is, as you'd expect, to do nothing: it pops its parameter off the stack and then falls directly through into the code to work out what the next action to take is.

This analysis (which is basically a confirmation of @RichF's original assumption) suggests that you should see a linear increase in execution time with the length of the statement, not a jump depending on whether there's any text at all.

But: the overhead for each statement executed is much higher than the overhead for processing the line to find the end of the statement. I therefore suspect that the actual result is being lost by some secondary effect that isn't actually related to the efficiency of the translation. Perhaps a memory access issue (maybe something gets moved in or out of contended memory due to the different sizes of the program), or maybe the efficiency of the implementation of GOSUB might be the issue; I didn't really research that.

Some suggestions for further experiments:

  • Try a program with 10 REM statements before the RETURN rather than just one, which should reduce the overhead of the GOSUB in relation to the REM
  • Try embedding this test in a larger program in order to see if the length of the program itself is a factor
Jules
  • The first suggestion is a great idea; the second is not, as a larger program (i.e. one where the test function is all the way at the end) just increases overhead again, as the GOSUB will take longer to find the target line. – Raffzahn Feb 25 '18 at 10:52
  • Jules, I'm glad you had the capability and time to investigate this down to the source. Not sure how I feel about may be being right when I thought I was wrong. – RichF Feb 25 '18 at 13:18
  • Thanks for the detailed analysis. In the end, does this mean that the reason for the delay is still unknown? I suppose I will have to do more testing. Anyway the overhead of the GO SUB should be non-existent in the final analysis as I take that into account by subtracting 67 frames from the result in TD. – forest Feb 26 '18 at 01:58
  • As GOSUB appears to be implemented by a linear scan across the entire program, which will always include some access to contended memory, I wouldn't put too much faith on your 67 frames number: the length of time spent in the scans to find the right line could vary substantially depending on precise timing (i.e. whether it occurs during a screen redraw or in the blanking interval) and therefore might be changed depending on what your test statement is doing. I think @Raffzahn's suggestion of moving the statements to test to the start of the program is good; that should reduce the variance. – Jules Feb 26 '18 at 08:55
  • "I therefore suspect that the actual result is being lost by some secondary effect that isn't actually related to the efficiency of the translation." You were exactly right. It was GO SUB mucking with the test results. There is no noticeable overhead when it is removed and the REM is put in the very middle of the loop. – forest Feb 27 '18 at 02:51

forest, I suggest you select Jules' answer instead of mine. It appears my first assumption (under "original answer" below) was correct after all, and he explains why. Let's see if math can shed light on your timing results.

  1,000,000 usec/second
  ÷ 50 frames/second (PAL)
  = 20,000 usec/frame
  × 0.045 frames (empty REM line)
  = 900 usec

  20,000 usec/frame
  × 0.06 frames (REM line with additional text)
  = 1,200 usec

  1,200 usec − 900 usec = 300 usec (observed difference)

Let's assume the CPU did execute a jump command. Using my z80 manual, I see that a JP pq (jump immediate) command on the Spectrum's 3.5 MHz CPU would take approximately 3.33 usec. Even with extra overhead such as the interpreter moving pointers around, that is a heckuva lot less than 300 usec.

Changing gears, assume that the interpreter will skim thru extra text after REM, which is what Jules and tofro seem to agree is actually happening. One way to do this is with the z80 CPIR (compare, increment, repeat) command. For your CPU this takes 5.33 usec + 2.33 usec per extra byte, plus the likely significant overhead of setting up registers and doing "interpreter stuff" such as calculating where the next line of BASIC is. The point of this paragraph is to show that varying quantities of text after the REM will not take very long to skip thru. The overhead in setting up and handling the skip can often be longer than the skip itself.
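Using the figures quoted above (these are this answer's estimates from a Z80 manual, not measured values), the raw scan cost for a comment of any plausible length stays well below the observed 300 usec gap:

```python
# Estimated time to skim N extra bytes after REM, using the figures
# quoted above: 5.33 us base + 2.33 us per extra byte. These numbers
# come from the answer's manual reading, not from measurement.

def skip_time_us(extra_bytes, base_us=5.33, per_byte_us=2.33):
    return base_us + per_byte_us * extra_bytes
```

Even a 100-byte comment costs only around 238 usec of scanning on these figures, so the fixed setup overhead, not the text length, would dominate.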

I don't know the actual granularity of your timing method, but my guess is that the CPU can do a lot of stuff within each grain. Raffzahn has suggested you rerun the test with multiple REM lines, which will likely help you get better timing resolution.


// wrongly corrected answer

One possibility is that the REM is normally implemented as some sort of jump. It would then have the same time overhead whether you are jumping past 1 character or 80. However, if no characters followed REM on the line, then the jump would not be necessary, instead immediately and naturally flowing to the next line.


// original answer; probably in the ballpark

I think it likely that the interpreter has not been fully optimized. It is simply not smart enough to stop parsing when it reaches the REM text/token. Think of it this way. How is the interpreter going to find the RETURN text/token? Very likely it just parses character by character until it gets to it. If the REM line has text, that text still needs to be parsed, or maybe skimmed thru, to get to RETURN.

A BASIC interpreter could be built using linked lists to each line, but why bother? BASIC optimizers and compilers would get rid of the REM lines in their entirety.
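The two skipping strategies discussed here and in the comments (a byte-by-byte scan versus using the length field that, per TonyM's comment below, the Spectrum stores after each line number) can be sketched as follows. The assumed layout per line is: 2-byte line number, 2-byte length (low byte first, counting the body plus its terminator), body bytes, 0x0D terminator; the code is illustrative rather than ROM-accurate:

```python
# Two ways an interpreter could skip the rest of a REM line.
# Assumed layout per line: 2-byte line number, 2-byte length
# (little-endian, counting body plus the 0x0D), body, 0x0D.

def next_line_by_length(mem, line_start):
    """Constant-time skip using the stored length field."""
    length = mem[line_start + 2] | (mem[line_start + 3] << 8)
    return line_start + 4 + length

def next_line_by_scan(mem, line_start):
    """Byte-by-byte scan for the 0x0D terminator. Unsafe if the
    REM body holds machine code containing a stray 0x0D byte."""
    p = line_start + 4
    while mem[p] != 0x0D:
        p += 1
    return p + 1
```

For a line whose body contains no stray 0x0D both routines agree, but only the length-field version takes the same time regardless of how much text follows the REM.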

RichF
  • This explains why REM has a non-zero overhead, and I already understand that. What I don't understand is why the presence or absence of text makes a difference. – forest Feb 25 '18 at 02:14
  • @forest what do you expect to happen once the interpreter interprets the REM? It can't magically know that the next text of significance is the RETURN token, and where it is. It has to parse thru, character or token by character or token until it gets there. If your REM line contains additional text, it will be parsed or skimmed. ("Skimming" might be implemented where the interpreter knows to stop actual parsing, but search for the next line.) – RichF Feb 25 '18 at 02:19
  • I would think that the parser would fetch the whole line at once (the size of the line must not matter, as one character or a hundred don't change the outcome), then when it sees the REM, look up the action to take in its ROM, see that it's a no-op, and go on to the next line. If it had to "skim" the entire line, then I would expect that REM 1 would be faster than REM <300 CHARS>. The fact that the overhead increase is not linear (as I would expect if fetching a line were slow) and seems only to occur when any text is added to the statement is what makes it so confusing. – forest Feb 25 '18 at 02:21
  • Okay, I can imagine an interpreter that would treat REM as an actual executable statement meaning "jump to next line". However, the editor would then have to be constantly aware that edits to that line would change the jump location. If I were writing an interpreter, I wouldn't complicate things like that. Simply parse, from start to end. If the programmer doesn't like that, he can use a compiled language, a BASIC optimizer, or a BASIC compiler. I don't know for sure, but I'm betting both of the latter existed at the time, available from third parties. – RichF Feb 25 '18 at 02:27
  • But again, if it parsed from start to end, why would the increase in overhead occur when any text is added? Why would a 1 byte line take 0.04 PAL frames, a 2 byte line take 0.06 PAL frames, and a 200 byte line also take 0.06 PAL frames? It's like there's some sort of barrier between the REM and the rest of the text that slows it down. – forest Feb 25 '18 at 02:29
  • Oh, sorry. I completely missed the "any number of extra characters takes the same fixed time" part. My bad. One answer is that the REM is usually implemented as a jump. However, if 0 characters follow the REM, an actual jump need not be used; the program will immediately and naturally flow to the next line. – RichF Feb 25 '18 at 02:38
  • "Jump" is maybe not the right word. The interpreter simply needs to advance to the next line after a REM token has been encountered - An empty REM simply has to advance the read pointer across the end-of-line marker, a string after the REM causes an additional read of the string length and an add of that length to the interpreter pointer - this simply takes a bit longer. – tofro Feb 25 '18 at 13:31
  • @tofro agree; I meant "jump" in a loose way, and not a literal z80 JP pq instruction. You seem to be agreeing with Jules -- no jump of any sort is occurring, and the entire line after the REM is skimmed as I had first thought. – RichF Feb 25 '18 at 13:54
  • The REM statement's 'parameter' can hold bytes of any value, which is why it can be used to store machine code programs. So you can't CPIR/etc through it, you could bump into another 0Dh that isn't the end of line. The interpreter uses the line length, which is stored after the line number at the start of every BASIC line, if I remember correctly. This is why skipping to the next line takes a fixed time. Much better for BASIC than parsing every character in the REM. – TonyM Feb 27 '18 at 14:11
  • An interpreter's implementation could keep the pointer to the next line (which is stored at the line beginning, at least for MS BASIC implementations) and advance to the next "line" very fast. Useful not only for REM, but also in case of an IF-THEN where the condition evaluates to false. Even though this would be cheap to implement, MS BASIC derivations typically omit such optimizations (similarly for a GOTO to the current line, where the start pointer of the current line could be kept) ... Does anyone know an interpreter with such optimizations? – Johann Klasek May 02 '19 at 18:41
  • @JohannKlasek Interesting point. I do not know the answer to your question, and the system does not raise this page to the top of the Active list because of new comments. Since I wrote the answer to which you commented, I was notified, but no one else will be. Thus very few people will see your comment. To get an answer, ask a new question. If you think it is relevant you can provide a link to this question in your new one. It wouldn't surprise me if there are already relevant BASIC interpreter optimization questions, so you might try searching first. – RichF May 02 '19 at 22:24