15

I came across a post that states that Unix uses null-terminated strings, ASCIZ, because it was a feature of the PDP-7. This triggered my reading on the CIS instructions in the PDP-11, but these were pascal-style strings with the "n" register holding the length.

I looked over the PDP-7 and -4 material and I can't find anything like the CIS instructions. This is not surprising given the era. But I can't find anything that seems to suggest ASCIZ was directly supported. There's the various I/O instructions, but they all load three chars into/out of a register and I don't see anything that means a length couldn't be used. The example code doesn't show anything interesting either.

So was null-terminated strings on the PDP-7/4 something in the hardware or simply a programming style? I guess if you load multiple characters at once you need to set any leftover /3 to null?

chicks
  • 397
  • 1
  • 4
  • 15
Maury Markowitz
  • 19,803
  • 1
  • 47
  • 138
  • 3
    Didn't the PDP-4 have SZA - Skip if AC=0? Likewise the PDP-8. That's something to prefer zero terminated strings - at least as long as every character is handled in AC. Same way as any architecture that sets flags according to the last load does prefer zero terminated strings by saving the need for a compare instruction. – Raffzahn Jul 17 '22 at 02:48
  • 3
    I'd guess 'programming style'. TOPS-10 frequently used ASCIZ strings (where it wasn't using SIXBIT) , and as far as I know, there was no particular hardware support beyond testing whether an accumulator was zero. To my mind, once you've decided not to use counted strings, there's only two sensible choices for terminators - all zeroes or all ones, with a slight bias to the former, as in @Raffzahn's comment. – dave Jul 17 '22 at 03:31
  • 2
    Null-termination is definitely a programming style and not a hardware thing. I've no question about it. But I only have a limited view from living that period of time, too. So it's not definitive. But I never saw a hardware situation that predated a software one. – jonk Jul 17 '22 at 09:46
  • 2
    To @another-dave's comment about termination: another perfectly reasonable alternative, iff you know your string is at least one character long, and sticks to 7-bit ASCII, is to set the high bit of the last character. That's how BASIC tokens are terminated in the ZX Spectrum ROM, for example. – Omar and Lorraine Jul 17 '22 at 13:29
  • 3
    @OmarL - Sure, but that requires storing your 7-bit character in 8 bits. Not the case for TOPS-10: the usual choice, hardware supported, was to store five characters in each 36-bit word. In the PDP-7, as per the OP's question, there would be three 6-bit characters per 18-bit word., so no flag bit available there. – dave Jul 17 '22 at 13:57

2 Answers2

10

You are correct. There is no special support on the PDP-1/4/7 for ASCII null-terminated strings. Or strings of any kind, really. It's up to the programmer to decide how they want to represent strings. There are multiple ways to store text on these machines. You might find this other question "How did the PDP-8 handle strings?" relevant. (The PDP-8 has a similar if further-simplified architecture to the PDP-7.)

It was common to store 6-bit characters (often a subset of ASCII without lowercase or control characters since that can be converted with a simple bitmask and arithmetic) packed multiple to a word. This is more compact for storing the string, but requires more complicated routines to extract the nth character. For string processing, it's often easiest to handle one character at a time. In that case, one just tests if the accumulator is zero. (For storage of large amounts of text, this usually is too wasteful, with only 6/7/8 bits in each word being actually used.)

I have written most of a forever-unfinished toy Forth for the PDP-8. I chose to use ASCII (since the IO devices use ASCII), and I chose null-terminated strings for string operations (since I believe in repeating the errors of history). Error messages and other long strings are stored in packed ASCII. So "HELLO" is stored as three words: HE LL O[NULL]. When the program goes to print this, it runs a routine for the 0th character of the string at this pointer, and the 1st, etc. It then handles those characters as plain ASCII, and will exit the printing routine when the accumulator is 0. There's also a routine to translate a NULL-terminated string into its packed representation. This sort of packing/unpacking was a common design pattern on word-oriented machines, whether using Pascal or C-style or some other form of string representation. (The original Pascal compiler for the CDC 6000 packed ten characters per 60-bit word.)

RETRAC
  • 13,656
  • 3
  • 42
  • 65
10

a post that states that Unix uses null-terminated strings, ASCIZ, because it was a feature of the PDP-7.

Yes, ASCIZ strings are especially supported by the PDP-7 due existence of the SZA instruction which delivers a zero cost test for zero (pun intended), thus outperforms any other way string termination by memory footprint as well as execution time.

I looked over the PDP-7 and -4 material and I can't find anything like the CIS instructions. This is not surprising given the era. But I can't find anything that seems to suggest ASCIZ was directly supported.

It doesn't need some high level CIS type to benefit from certain data types. Preference may come from way more mundane items, like havng a test for zero implied or at low cost.

There's the various I/O instructions, but they all load three chars into/out of a register

Don't forget all character I/O, most notably teleprinter and paper tape. Like TLS or RSA. They operate on single characters from or to the bits of AC.

Equally important, string handling is more often about touching characters within a string than the string at whole. As soon as a string is handled character by character character based conditions - like end of string - become an inherent advantage.

Last but not least, strings are not blocks.

I don't see anything that means a length couldn't be used.

It would be a bad machine if it would't allow the use of length terminated strings. As so often it's not about what can or can not be done, but what is more efficient or at least less cumbersome.

So was null-terminated strings on the PDP-7/4 something in the hardware or simply a programming style?

Well, any Turing Complete computer can do stings any way imaginable. Assuming the PFP-7 is one, might mean it comes down to programming style, after all, the machine does what it's told. Still, since these are real computers in the real world with constrains like memory size or speed, some methods may perform better on specific computers than others. In case of the PDP-7 it's Zero Termination (ASCIZ) due the fact that

  • its memory access is slow
  • it has only one general purpose register (AC)
  • character handling has to go thru this register
  • it got any SZA instruction

SZA(Skip if Zero Accumulator) tests AC against a value of zero and branches (skips). Since AC has to be loaded with each character to be handled, SZA can be used to test termination. Unlike a compare (SAD x), SZA does not need to access memory for a value to compare against, as zero is an implied. It's simply faster than any other method.


To get a fair comparision, let's look at the some ways to handle strings:

Length Terminated Strings

Length terminated strings carry (usually in front) a length field holding the number of addressable units (characters or words) used to store the string. Nowadays often called Pascal-Strings.

Length termination got important advantages:

  • Blocks and strings are the same.
  • Its content does not need to be processed
  • It can contain any possible symbol (code value).

This is payed for with some requirements:

  • A length must be present
  • Iteration (Handling) requires
    • A pointer to access the actual element and either
      • a counter to keep track or remaining characters, or
      • an end pointer (address)
    • Comparison against counter or end pointer

These requirements are no issue when it comes to processors with a lot of registers and/or fast memory access. Handling 3 values during iteration (position, termination and actual character) gets cumbersome on a CPU with only a single versatile register, or slow memory access like the PDP-4/7 series. With it's single AC, the PDP-4 needs to hold at least pointer/counter in memory, resulting in slow execution due frequent memory access for

  • Indirect access
  • Pointer increment
  • Pointer load for comparison
  • Comparison

Character Terminated Strings

Character (or marker) termination (*1) in turn does not record the length of a string in an additional field, but reserves a special character value to mark the end of that string. This can be any value. Of course it's helpful to use one that does not occur often :)

A common example for character termination are DOS strings, terminated by a dollar sign ($).

Disadvantages of character termination are:

  • Not all characters can be used
  • Content needs to be processed
  • Character need to be loaded
  • Characters have to be compared against termination value

On the plus side, there is

  • No need for a counter
  • No need for an end pointer
  • Content usually has to be loaded anyway for processing

Using character terminated strings is advantages on a PDP-7 as no counter/end pointer is needed, saving several costly memory operations - whcih are replaced by a simple compare of the character in transit against the termination value.

While marker terminated is a generic idea, there are some common used sub classes: Zero terminated and Flag terminated.

Zero Terminated

Zero Termination is a basic application of character termination but using special properties of the value zero - for example that many CPUs either mark any loaded zero with a direct testable zero flag (prominet examples 6800 or 6502. Or have a branch instruction that issueing an implied test for zero.

And that's exactly the case for a PDP-7. SZS tests AC for zero and allows to branch accordingly without the need for any additional compare instruction, and especially without the need to access memory.

Flag Terminated

The use of Flag Termination reduce the usable code set further (*2) to reserve half of the character values for a flag marker - e.g. use only 7 bits of a byte for character values, while the 8th is used as flag. While working much like basic character terminated strings, it holds a few advantages:

  • No extra byte needed for termination
  • Some CPU may use an implied test

Especially the later is true for all CPU that set a condition flag according to the value of a character handles, allowing an implied test, much like with described with zero termination.

Flag termination was somewhat popular after the 8 bit byte became canonical, while the majority of character handling still used 7 bit code - thus enabling the 8th bit as flag wthout much cost.


Bottom Line: ASCIZ is preferred handling for all CPUs that offer an implied test for Zero - the PDP-7 being one of them.


Davisbak made a great point by citing Mr. Donald Knuth:

Special machine instructions that test against zero are why Knuth points out in various places that he prefers to loop downwards towards zero. (One such place is in the middle of his excellent article Structured Programming with go to statements (1974) - see first paragraph pg 268.) (Also check out the quotations he used at the beginning!)

Such tests may seem so obvious to today's readers, that they have a hard tim to praise the advantage machines have in this case over others who don't - which includes today's most prevalent architecture: x86 (*3).


*1 - Here is also a subclss of word terminated strings, but it follows essentially the same rules as character terminated

*2 - Or extended, depending on POV.

*3 - Or not, as x86 it does in fact have a free test for zero in form of the JCXZ instruction, except it's restricted to work with only one register (CX), which at the same time is not as versatile as the main accumulator (AX).

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • A minor drawback of Length Terminated Strings is that you have a maximum string length, depending on the size of the length field you chose. – John Dallman Jul 17 '22 at 20:53
  • 1
    Special machine instructions that test against zero are why Knuth points out in various places that he prefers to loop downwards towards zero. (One such place is in the middle of his excellent article Structured Programming with go to statements (1974) - see first paragraph pg 268.) (Also check out the quotations he used at the beginning!) – davidbak Jul 17 '22 at 21:08
  • @JohnDallman - that really bit you in some of the original Pascal implementations which used a single byte for string length. The ones that went to two bytes for string length were a lot easier! In those days, considering machine resources, 65K byte strings were ... pretty darn large! And thus mostly sufficient. – davidbak Jul 17 '22 at 21:10
  • Speaking from the United States, that 8th bit in each byte of character data was totally free. No other use for it at all, than for flags for various purposes. Hard to understand anyone arguing otherwise, to be honest. And a lot of machines would have an easy test for a negative number too ... – davidbak Jul 17 '22 at 21:13
  • @davidbak Great Knuth pointer. Thanks. Now that part about the 8th bit is only true if we confine it to ASCII and machines with 8 bit bytes. It doesn't work on /370 using EBCDIC or on PDP-4/7 like machines with 6 bit bytes. In either case using an additional bit will increase character size. {Side note, back in the days 7 bit was quite fine in most countries. Not at least for keeping compatibility with 7 bit transmission} – Raffzahn Jul 17 '22 at 21:40
  • @davidbak Also, regarding the 'Special machine instructions that test against zero' is the whole point her and why ASCIZ is preferred by PDP-4 hardware. The whole issue seems just so common to today's readers, that they have a hard time imagine that this little difference is at least as relevant to design as some CIS, if not more. I'd like to stel that quotation if you don't mind. – Raffzahn Jul 17 '22 at 21:46
  • 2
    Things were so much simpler then: One machine instruction >= 1 cycles, and shorter instructions saved space (and were important if you had short relative jump instructions and could use them because your code was tiny). None of this overlapping execution, multiple ALUs, out-of-order, branch-prediction, microop caching, speculative stuff. Then, that instruction you saved by having a (short) combined test&jump over two instructions or one longer instruction paid off for sure! These days: not so much ... (except microcontrollers I suppose). – davidbak Jul 17 '22 at 22:49
  • "Length Terminated" (wouldn't it be "Length Prefixed" or "Counted"?) strings are also more suited to zero-copy parsing since you can take slices of them without needing to modify them. That and their use in int 10h BIOS APIs is why I implemented a version of them inspired by Rust slices in C for one of my hobby projects. – ssokolow Jul 18 '22 at 10:20
  • @ssokolow :)) Prefixed would imply that the length is always in front of a string, which is a separate feature for handling, but not needed for it's basic function. While counted may be fine as language specific vocabulary, it does Likewise, it does not state what is counted, or in which context. I tried to stay as abstract as possible to avoid being associated with either language or implementation, as this is about the underlying principle. It's about how to find the end (terminus) of a string, isn't it? – Raffzahn Jul 18 '22 at 10:41
  • 1
    Counting towards zero in the reason I liked the AOBJN instruction so much. Pdp-6 and pdp-10. – Walter Mitty Jul 18 '22 at 19:46
  • @Raffzahn To me, "Length Terminated" at least strongly implies that the length is placed at the end of the string, similar to a null terminator or a SCSI terminator. – ssokolow Jul 19 '22 at 05:25
  • @ssokolow this might be the fine difference between a terminator and terminated. Length terminated says simply that the size is defined by a length, not where it is stored. it's convenient to have it upfront, but it may reside at another place/structure. Just think how MS-BASIC does the job: Beside the name, the variable table holds a pointer to the string and a length of that string. The principle of length termination isn't about where the length is stored. – Raffzahn Jul 20 '22 at 14:38
  • @Raffzahn But, to me, "terminated" carries a connotation of happening "inline". You or some electricity run down a wire until you hit a termination block/device. You press Ctrl+C to terminate something mid-stream. A fixed-size or size-prefixed field has its size known before you begin traversing it. You don't use "terminated" when describing the physical size of a field on a print form. Yes, a for loop "terminates", but it feels wrong to work backwards from that to define the jargon for how that information is stored. – ssokolow Jul 20 '22 at 19:14
  • @ssokolow Well, 'Jargon' might be a good point here, especially every day use. Just, how to would you describe zero and length terminated using common vocabulary? Either handling is terminated by reaching it's target: a zero byte or a length. Also, you seem to stick to that 'size-prefixed' notation as part of length usage. As shown, length does not have to be at any specific position to make it useful for termination. In fact, fixed length is only a special case of length terminated, by having it's value known at compile time. All operation is still based on length termination. – Raffzahn Jul 21 '22 at 01:04
  • @ssokolow Thinking of it, maybe it helps to see it as terminal condition? It's about using a common (abstract) term to describe what it's about, in this case, providing a terminal condition when handling a string. For ASCIZ it's reaching a zero byte, for length it's having handled length number of characters. In either way it's about termination - much like the great example about FOR you added. Heck, it's all about looping and terminating the loop, isn't it? – Raffzahn Jul 21 '22 at 01:08
  • In common conversation, I've seen "counted strings", "length-prefixed strings", or "Pascal strings" for various types of "length terminated" strings and "null terminated" or "C strings" to refer to "zero terminated" strings. This is the first time I've seen someone using "length terminated"... probably because it feels strange to describe the data structure by terminology that's so focused on the algorithm used to traverse it that it feels misleading when thinking purely about the layout of the data structure. (Sort of like calling a binary tree a "root-terminated tree" after the traversals.) – ssokolow Jul 21 '22 at 13:00