76

Anyone who has dealt with strings at a low level (e.g., writing a parser in C) knows that doing so tends to involve frequent checks (either manually, or through isalpha(), isalnum(), etc.) of whether a character is an alphabetic character of either case...

(*c >= 'A' && *c <= 'Z') || (*c >= 'a' && *c <= 'z')

...or an alphanumeric character...

(*c >= '0' && *c <= '9') || (*c >= 'A' && *c <= 'Z') || (*c >= 'a' && *c <= 'z')

...because the above ranges are non-contiguous in the ASCII order of characters (and many subsequent character encodings such as UTF-8). If only these 3 ranges were adjacent to each other, a much more succinct...

*c >= '0' && *c <= 'z'

...would save countless keystrokes and once-precious clock cycles since the dawn of Unix time (or actually a number of years earlier, time_t being a signed integer).

I guess other concerns were considered more pressing at the time. Does anyone remember what they might have been?

Will
  • 77
    Have you seen EBCDIC? I think we're lucky that ASCII has the alphabet in one contiguous sequence! – Greg Hewgill Jun 26 '19 at 01:22
  • 19
    If you're writing in C, you should probably be using isalnum(). And if you're implementing isalnum() on all but tiny-memory systems, you'd do well to use a lookup table -- so the call just turns into something like return tbl[ch] & (ISALPHA|ISDIGIT). – dave Jun 26 '19 at 12:09 [a sketch of such a table follows this comment thread]
  • 8
    How would you factor localization into this? – Thorbjørn Ravn Andersen Jun 26 '19 at 15:05
  • 1
    @GregHewgill recently had to convert something from EBCDIC to ASCII at work (dd if=blah.ebcdic conv=ascii > blah.txt for the unfortunate like me) and noticed those funky things about it as well. I've asked a similar question hoping to get some good insight on it. – Captain Man Jun 26 '19 at 16:08
  • @ThorbjørnRavnAndersen - Have a table per locale, probably; not a problem for demand-loaded shared libraries. Or maybe lazy-initialize the table from more compact representation; the info must exist somewhere for supported locales. The table gets more complicated for multibyte encodings, of course. – dave Jun 27 '19 at 02:59
  • 1
    @another-dave and now you’ve made the original question irrelevant. – Thorbjørn Ravn Andersen Jun 27 '19 at 07:27
  • @captainMan and then you find out you need another variant of Ebcdic because you have Swedish characters :-/ – Thorbjørn Ravn Andersen Jun 27 '19 at 07:29
  • If your system does not offer isAlpha() or whatever, and if you are a beginner, it is a good exercise to write your own and put it in a handy library. – RedSonja Jun 27 '19 at 11:03
  • @ThorbjørnRavnAndersen - irrelevant? Not at all. It's asking about the historical reasons ASCII is the way it is. My observation is about how to write efficient code given that ASCII is the way it is, followed up by comments about extensions of that method to non-ASCII codes. The original question is still a good one. – dave Jun 27 '19 at 12:02
  • 1
    If you really want to learn more about this I recommend this book: https://www.amazon.com/Coded-Character-Sets-History-Development/dp/0201144603. It really goes into great detail, especially from people that tended to "look" at the bit patterns rather than just "deal" with them. – Chris Haas Jun 28 '19 at 17:19
  • The other answers have all explained the rationale for the way that ASCII is organized, but I would like to point out doing mathematical comparisons on code points to find digits, uppercase letters, and so on is a very bad idea in this day and age, because most text, even in traditionally ASCII-based environments such as Unix, is now UTF-8. Such things should be determined by lookups in Unicode databases, not by doing math. – Serentty Jul 23 '19 at 19:55
  • @Gareth 1) I don't see how the above examples are more "mathy" than using a code point as a table index is "math". 2) Those comparisons work exactly the same in UTF-8 -- that's pretty much the whole point of UTF-8's design. – Will Jul 24 '19 at 06:49
  • @Will These comparisons still work if the UTF-8 text contains only ASCII characters. They break down horribly if you run into even the most basic non-ASCII characters, such as the ones from Latin-1. That's why I say that it's a bad idea to be doing them in new code. – Serentty Jul 29 '19 at 00:18
  • @Will I understand how UTF-8 works. Perhaps I should have been more clear in what I meant. Yes, they still work on strings with non-ASCII characters, but only on the ASCII characters in that string. Checking a single bit to see if a letter is uppercase doesn't work in UTF-8 if that letter is Æ or Ø. And then there are letters in Unicode which have no case at all. That's why I say that these tricks should be avoided in new code: they only work on the first 128 code points of the more than 100,000 code points in Unicode. – Serentty Jul 30 '19 at 17:49
  • @Serentty: The majority of text that is processed by computers is machine-readable content in formats like HTML, JSON, CSV, program source text, etc., and in most such formats, the only kinds of "letters" that are relevant are the ASCII characters 0x41-0x5A and 0x61-0x7A. Proper conversion of human-readable text between uppercase and lowercase cannot be done in context-free fashion, but requires knowing what human language the text in question is supposed to represent. – supercat Jun 15 '22 at 21:27
  • @supercat Most programming languages do not restrict identifiers to ASCII ranges, so you can generally not assume that they will be in the ASCII range according to their specifications, even if they usually are. – Serentty Jul 15 '22 at 02:49
  • @Serentty: If a program's purpose is to generate programming-language output, or to process output that was produced by programs that are known to use ASCII, then it need not concern itself with anything other than ASCII. Ironically, use of non-ASCII makes things less human-readable than they otherwise would be. If one is shown two ASCII identifiers in different fonts, no special knowledge would be needed to recognize if they are the same or different. Could the same be said of e.g. ɸ, ϕ, and ϕ (the latter of which is an italicized version of one of the former)? – supercat Jul 15 '22 at 15:47
  • @Serentty: It used to be that language compilers would be completely agnostic to any meaning possessed by bytes that didn't represent characters in the source character set. If one wrote printf("£100"); a compiler would generate a string containing all of the bytes that appeared in the source file between the first 0x22 byte and the next one, provided only that none of them was a 0x0A, 0x0D, 0x5C, or 0x3F. If £ was one byte in the source character set, the string would be four bytes long. If £ was two bytes, the string would be five bytes, but the compiler wouldn't care that.... – supercat Jul 15 '22 at 17:02
  • ...two of those bytes were used to represent a single character £, any more than it would care if a source file contained à represented as the three byte sequence 0x60, 0x08, 0x61 [grave, backspace, small a] which would appear as à on a typical dot matrix or daisy wheel printer. A compiler could accept in identifiers any bytes that have no other assigned meaning without having to know or care about how those bytes might be subdivided into glyphs. – supercat Jul 15 '22 at 17:04
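To make the lookup-table approach from dave's comment above concrete, here is a minimal C sketch. The table name tbl and the ISALPHA/ISDIGIT flags are taken from the comment; everything else (my_isalnum, the initialization loop, and the use of ASCII-contiguous ranges to fill the table) is illustrative, not how any particular libc actually does it.

    /* Minimal sketch of a table-driven classifier; illustrative only.
     * A real libc fills its table from the target character set/locale
     * rather than from hard-coded ASCII ranges. */
    #include <limits.h>

    enum { ISDIGIT = 0x01, ISALPHA = 0x02 };

    static unsigned char tbl[UCHAR_MAX + 1];

    static void init_tbl(void)
    {
        for (int c = '0'; c <= '9'; c++) tbl[c] |= ISDIGIT;
        for (int c = 'A'; c <= 'Z'; c++) tbl[c] |= ISALPHA;  /* contiguous in ASCII */
        for (int c = 'a'; c <= 'z'; c++) tbl[c] |= ISALPHA;
    }

    static int my_isalnum(unsigned char ch)
    {
        /* One load and one mask, no matter how scattered the ranges are. */
        return tbl[ch] & (ISALPHA | ISDIGIT);
    }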

7 Answers

125

Why is ASCII this way?

First of all, there is no one best sorting order for everything. For example, should UPPER or lower case be first? Should numbers be before or after letters? Too many choices, and no way to please everyone. So they came up with specific pieces that "made sense":

  • Numerals

0x30–0x39 - Easy bit mask to get your integer value.

  • UPPER case letters

0x41–0x5A - Another easy bit mask to get your letter relative value. They could have started with 0x40, or put the space at the beginning. They ended up putting space at the beginning of all printable characters (0x20), which makes a lot of sense. So we ended up with @ at 0x40 - no specific logic to that particular character that I know of, but having something there and starting the letters at 0x41 makes sense to me for the times you need a placeholder of some sort to mark "right before the letters".

  • lower case letters

0x61–0x7a - Again a simple bit mask to get the letter relative value. Plus, if you want to turn UPPER into lower or vice versa, just flip one bit (see the sketch after this list).

  • Control codes

These could have been anywhere. But placing them at the beginning has the nice advantage that extensions to the character set - from 128 to 256 and beyond - can treat everything >= 0x20 as printable.

  • Everything else

Everything else got filled in - 0x21–0x2f, 0x3a–0x3f, 0x5b–0x5f, 0x7b–0x7f. "Matching" characters generally next to each other, like ( and ), or with one character separating them, like < = > and [ \ ]. Most of the more universally used characters are earlier in the character set. The last character - 0x7f (delete or rubout) is another special case because it has all 7 bits set - see Delete character for all the gory details.
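A few C one-liners illustrating the masks described in the list above; the helper names are mine, and each assumes its input is already known to be in the relevant range:

    /* Illustrative helpers for the ASCII layout above (names are mine). */
    int  digit_value(char c)    { return c & 0x0F; }  /* '0'..'9' -> 0..9                */
    int  letter_ordinal(char c) { return c & 0x1F; }  /* 'A'/'a' -> 1 ... 'Z'/'z' -> 26  */
    char flip_case(char c)      { return c ^ 0x20; }  /* 'A' <-> 'a': one bit difference */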

Why is C (and most other languages) this way?

A high-level language should be designed to be machine independent, or rather to make it possible to implement on different architectures with minimal changes. A language may be far more efficient on one architecture than another, but it should be possible to implement reasonably well on different architectures. A common example of this is endianness, but another example is character sets. Most character sets - yes, even EBCDIC - do some logical grouping of letters and numbers. In the case of EBCDIC, the letters are in sequence, but lower case is before UPPER case and each alphabet is split into 3 chunks. So isalpha(), isalnum() and similar functions play a vital role. If you use (*c >= '0' && *c <= '9') || (*c >= 'A' && *c <= 'Z') || (*c >= 'a' && *c <= 'z') on an ASCII system it will be correct, but on an EBCDIC system it will not be correct - it will have quite a few false positives. And while *c >= '0' && *c <= 'z' would have lots of false positives in ASCII, it will totally fail in EBCDIC.
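As a hedged sketch of the portability point above: in C, the classification is best left to the library, which knows the local character set. The function name is mine; the cast to unsigned char matters because the <ctype.h> functions require an argument representable as unsigned char (or EOF):

    #include <ctype.h>

    /* Works on ASCII and EBCDIC systems alike: the library, not the caller,
     * knows which code points are letters and digits. */
    int is_word_char(char c)
    {
        return isalnum((unsigned char)c);
    }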

Arguably, a "perfect text sorting character set" could be created that would fit your "ideal", but it would inevitably be less-than-ideal for some other use. Every character set is a compromise.

  • 1
    Numbers should of course come after letters - turning this around is one of the abominations of ASCII :)) Great answer, well written. It may be worth adding that (serious) systems use attribute sets to classify characters, as well as relationship lists, as not every lowercase char has a unique uppercase or vice versa (today in most cases based on the definitions provided with Unicode). Similarly, context- and language-specific sort orders are used instead of simple charsets (which are used for storage). – Raffzahn Jun 26 '19 at 06:36
  • 12
    @Raffzahn Of course numbers come before letters. With greater-than-10 radices it's a must; everywhere else it's just convention. – Agent_L Jun 26 '19 at 10:49
  • 4
    @Agent_L More than that, they should come immediately before the letters, making hex-conversion (and, in fact, all the way up to hexatridecimal conversion) much easier! – TripeHound Jun 26 '19 at 13:11
  • 9
    A minor addition: C, C++, Java, and probably a bunch more languages require the codes for '0'..'9' to be contiguous and increasing, so identifying the digits 0..9 with *c >= '0' && *c <= '9' always works in those languages. – Pete Becker Jun 26 '19 at 14:03
  • 3
    "Plus, if you want to turn UPPER into lower or vice versa, just flip one bit." Yup. I remember quite often adding and subtracting 32 to change case. – RonJohn Jun 26 '19 at 14:11
  • 7
    Amusingly, the [\] sequence has accidentally (mostly) saved people using regexes to filter out punctuation. I've seen people in Python use '['+string.punctuation+']' to construct a regex character class to find/replace punctuation, and it mostly works, because the backslash preceding ] escapes it, preventing it from being seen as the end of the character class (it just fails to find the backslash itself, which is a bug, but a much less noticeable one). – ShadowRanger Jun 26 '19 at 16:11
  • 8
    ... just flip one bit ... many of the older hardware terminals worked like that, just connect the A-Z keys to bits 0-4, and the shift key to bit 5. – Guntram Blohm Jun 26 '19 at 17:34
  • 4
    If you look at the layout of an ASR33 keyboard you can also see that the shifted numbers match the other punctuation characters, even to the extent that a shifted 0 is a space and ( ) are shifted 8 & 9. – PeterI Jun 26 '19 at 18:09
  • 2
    @TripeHound: One can place numbers immediately before letters, one can make the lower bits of the number codes match their values, or one can have a bit which is different between all numbers and all letters. Pick two. – supercat Jun 26 '19 at 18:24
  • @RonJohn Go and upercase an ü that way ;)) – Raffzahn Jun 26 '19 at 20:26
  • 2
    @Raffzahn you forget what ASCII means. – RonJohn Jun 26 '19 at 20:35
  • 1
    @RonJohn Nice try, just checked, the question isn't restricted to ASCII but ASCII and subsequent, so go again :) – Raffzahn Jun 26 '19 at 20:48
  • 1
    @ShadowRanger If you consider turning a severe bug into a harder to detect bug a save ... – Hagen von Eitzen Jun 26 '19 at 21:06
  • 2
    @Raffzahn those "subsequent encodings" don't count. 'Murica!!!!!!! – RonJohn Jun 26 '19 at 21:10
  • @RonJohn Murica ain't great anymore. Ask that guy with a guinea pig on his head. But KNOWLEDGE is! And to finish this, the nerdy answer would have been to flip the bit - as DIN 646 did place umlauts in sync with upper/lower case ASCII letters :)) (Yup, there's a reason I used ü and Ü, as it's only used that way in the German variant - other IRV variants are less symmetric :)) – Raffzahn Jun 26 '19 at 22:57
  • 4
    @RonJohn: More interesting is that you can XOR to flip without figuring out which it was first, like you'd need for + or -. Or c |= 0x20 to unconditionally make it lower case, letting you check for alphabetic ASCII with only 3 asm instructions (c |= 0x20; c -= 'a'; then c <= (unsigned)'z'-'a'). You couldn't do this if the range of upper-case characters spanned a % 0x20 alignment boundary, even if it was still +- 0x20 away from the lower-case range. See What is the idea behind ^= 32, that converts lowercase letters to upper and vice versa? – Peter Cordes Jun 27 '19 at 03:57 [a sketch of this check follows this comment thread]
  • 1
    @Abigail: Character code 0x7F isn't an action that means "delete" something, but instead is an indication that something has been deleted. If someone who is punching a tape means to type "CAT" but accidentally types "CAR" and realizes it immediately, they could push a plunger that physically reverses the tape one space (without generating any sort of character code), type a "rubout" character, and then type the "T". Some receiving devices would render the tape as "CAT" and others as "CA█T", some as "CA←T", and some as "CA_T", but in any case an operator could infer what was meant. – supercat Jun 27 '19 at 16:08
  • @Abigail: BTW, I don't think the assignment of back-arrow to 0x5F, nor its replacement with underscore, were coincidental. Both glyphs would be sensible representations of 0xFF. – supercat Jun 27 '19 at 16:09
  • @supercat Correct. But as an effective extension to interactive communication, it was repurposed on many systems to be an alternative to Backspace/Control-H. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 16:10
  • @manassehkatz: Altair BASIC would also accept 0x5F the same way, but that hardly means it's not a "printing character". – supercat Jun 27 '19 at 16:16
  • @Abigail - now I see the issue. 0x5F - no real question there - it is underscore (and formerly a different glyph) but always nominally printable. 0x7F - nominally NOT printable, but that is just one very special character. More important perhaps is the question of the next 32 characters - i.e., same as first 32 but with 8th bit set. Those really vary quite a bit by character set. For a typical terminal implementation, mapping those characters to the 7-bit control codes makes a LOT of sense. But once you get into the IBM PC era, things changed quite a bit... – manassehkatz-Moving 2 Codidact Jun 27 '19 at 16:25
  • Code page 437 puts printable characters everywhere. Notably, the first 32 (and also 0x7F) are miscellaneous odd 'n ends. But starting with 0x80 you get "real" characters for languages other than English. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 16:27
  • @Abigail: The standard may not require that implementations treat 0x7F as printable, but unless it forbids implementations from treating it as printable (and I don't think it does), I see no reason that implementations would be obligated to treat it specially. – supercat Jun 27 '19 at 16:35
  • 1
    @manassehkatz: Many systems have display hardware that will read a character data from the display storage (typically RAM), use some bits of that along with the number of the current scan within the text row to form an address (typically 11-12 bits), fetch a byte from there, and display the bits thereof sequentially. Such devices will be capable of displaying a power-of-two number of different characters. No reason to populate the area for codes 0-31, since the display hardware's going to show something. – supercat Jun 27 '19 at 21:56
  • 1
    Having the alpha codes where they are is what makes my .signature possible: main(i){putchar(340056100>>(i-1)*5&31|!!(i<6)<<6)&&main(++i);} – dgould Jun 28 '19 at 18:16
  • 1
    @TripeHound Does this redeem the ZX-81 character set ('0' is at 0x1C, 'A' follows immediately after '9')? – Vatine Mar 11 '21 at 07:46
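Peter Cordes's three-step check from the comment above, written out as a C function (the function name is mine). It works only because ASCII keeps each case in a single contiguous run that does not cross a 0x20-aligned boundary:

    /* Alphabetic test per the comment above: force lower case with |= 0x20,
     * then do a single unsigned range check against 'a'..'z'. */
    int is_ascii_alpha(unsigned char c)
    {
        c |= 0x20;                               /* 'A'..'Z' -> 'a'..'z' */
        return (unsigned)(c - 'a') <= 'z' - 'a'; /* wraps to a huge value for anything below 'a' */
    }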
62

According to ASA X3.4-1963 Appendix A, one of the design considerations was:

(7) Ease in the identification of classes of characters

Furthermore:

A4.4 The character set was structured to enable the easy identification of classes of graphics and controls.

And on page 8:

A6.3 To simplify the design of typewriter-like devices, it is desirable that there be only a common 1-bit difference between characters normally paired on keytops. This, together with the requirement for a contiguous alphabet, the collating requirements outlined above, and international considerations, resulted in the placement of the alphabet in the last two columns of the graphic subset. This left the second column of the graphic subset for the numerals.

There is a considerable amount of other information about the structure of ASCII in that appendix.

Greg Hewgill
  • 3
    Teletypes were typewriter-like devices, and they began using ASCII before computers did. – Walter Mitty Jun 26 '19 at 10:38
  • 4
    +1 This, as Wikipedia says: Locating the lowercase letters in sticks 6 and 7 caused the characters to differ in bit pattern from the upper case by a single bit, which simplified case-insensitive character matching and the construction of keyboards and printers. – rexkogitans Jun 26 '19 at 11:57
  • 7
    IOW, the X3.4 people knew what they were doing. – RonJohn Jun 26 '19 at 14:14
  • 3
    @RonJohn: Indeed. And it's so rare that we can answer a "Why (did|didn't) they do X?" question with such a succinct quote from a document written over 50 years ago. – Greg Hewgill Jun 26 '19 at 18:12
  • 5
    On the contrary, it's my experience that most of "Why did they design this old piece of crap like this??" (and I've seen a lot, not just on RC) have very good reasons easily discoverable on the Intarweb (though you might not know the right Google Fu). – RonJohn Jun 26 '19 at 18:22
  • It's also worth remembering that ASCII wasn't the first ITA character set spec. Not only was it designed originally for tty interoperability rather than computer programming -- early versions of ITA that predated ASCII also predated the invention of the electronic digital computer. – wrosecrans Jun 27 '19 at 03:43
35

man 7 ascii of Linux Programmer's Manual says,

Uppercase and lowercase characters differ by just one bit and the ASCII character 2 differs from the double quote by just one bit, too. That made it much easier to encode characters mechanically or with a non-microcontroller-based electronic keyboard and that pairing was found on old teletypes.

As supplement information, Eric S. Raymond authored Things Every Hacker Once Knew a few years ago. It has a section on the purposes of various designs in ASCII. Reading it is strongly recommended.

ASCII, the American Standard Code for Information Interchange, evolved in the early 1960s out of a family of character codes used on teletypes.

ASCII, unlike a lot of other early character encodings, is likely to live forever - because by design the low 127 code points of Unicode are ASCII. If you know what UTF-8 is (and you should) every ASCII file is correct UTF-8 as well.

The following table describes ASCII-1967, the version in use today. This is the 16x4 format given in most references.

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

However, this format - less used because the shape is inconvenient - probably does more to explain the encoding:

   0000000 NUL    0100000      1000000 @    1100000 `
   0000001 SOH    0100001 !    1000001 A    1100001 a
   0000010 STX    0100010 "    1000010 B    1100010 b
   0000011 ETX    0100011 #    1000011 C    1100011 c
   0000100 EOT    0100100 $    1000100 D    1100100 d
   0000101 ENQ    0100101 %    1000101 E    1100101 e
   0000110 ACK    0100110 &    1000110 F    1100110 f
   0000111 BEL    0100111 '    1000111 G    1100111 g
   0001000 BS     0101000 (    1001000 H    1101000 h
   0001001 HT     0101001 )    1001001 I    1101001 i
   0001010 LF     0101010 *    1001010 J    1101010 j
   0001011 VT     0101011 +    1001011 K    1101011 k
   0001100 FF     0101100 ,    1001100 L    1101100 l
   0001101 CR     0101101 -    1001101 M    1101101 m
   0001110 SO     0101110 .    1001110 N    1101110 n
   0001111 SI     0101111 /    1001111 O    1101111 o
   0010000 DLE    0110000 0    1010000 P    1110000 p
   0010001 DC1    0110001 1    1010001 Q    1110001 q
   0010010 DC2    0110010 2    1010010 R    1110010 r
   0010011 DC3    0110011 3    1010011 S    1110011 s
   0010100 DC4    0110100 4    1010100 T    1110100 t
   0010101 NAK    0110101 5    1010101 U    1110101 u
   0010110 SYN    0110110 6    1010110 V    1110110 v
   0010111 ETB    0110111 7    1010111 W    1110111 w
   0011000 CAN    0111000 8    1011000 X    1111000 x
   0011001 EM     0111001 9    1011001 Y    1111001 y
   0011010 SUB    0111010 :    1011010 Z    1111010 z
   0011011 ESC    0111011 ;    1011011 [    1111011 {
   0011100 FS     0111100 <    1011100 \    1111100 |
   0011101 GS     0111101 =    1011101 ]    1111101 }
   0011110 RS     0111110 >    1011110 ^    1111110 ~
   0011111 US     0111111 ?    1011111 _    1111111 DEL

Using the second table, it’s easier to understand a couple of things:

  • The Control modifier on your keyboard basically clears the top three bits of whatever character you type, leaving the bottom five and mapping it to the 0..31 range. So, for example, Ctrl-SPACE, Ctrl-@, and Ctrl-` all mean the same thing: NUL.

  • Very old keyboards used to do Shift just by toggling the 32 or 16 bit, depending on the key; this is why the relationship between small and capital letters in ASCII is so regular, and the relationship between numbers and symbols, and some pairs of symbols, is sort of regular if you squint at it. The ASR-33, which was an all-uppercase terminal, even let you generate some punctuation characters it didn’t have keys for by shifting the 16 bit; thus, for example, Shift-K (0x4B) became a [ (0x5B)

It used to be common knowledge that the original 1963 ASCII had been slightly different. It lacked tilde and vertical bar; 5E was an up-arrow rather than a caret, and 5F was a left arrow rather than underscore. Some early adopters (notably DEC) held to the 1963 version.

If you learned your chops after 1990 or so, the mysterious part of this is likely the control characters, code points 0-31. You probably know that C uses NUL as a string terminator. Others, notably LF = Line Feed and HT = Horizontal Tab, show up in plain text. But what about the rest?

Many of these are remnants from teletype protocols that have either been dead for a very long time or, if still live, are completely unknown in computing circles. A few had conventional meanings that were half-forgotten even before Internet times. A very few are still used in binary data protocols today.

Here’s a tour of the meanings these had in older computing, or retain today. If you feel an urge to send me more, remember that the emphasis here is on what was common knowledge back in the day. If I don’t know it now, we probably didn’t generally know it then.

Full text.
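The Control and Shift bit behavior described in the excerpt above can be written down as a couple of C expressions; these are illustrative models (the names are mine), not how any terminal hardware or firmware was actually built:

    /* Models of the bit behavior quoted above; real terminals did this in hardware. */
    unsigned char ctrl_key(unsigned char c)     { return c & 0x1F; } /* Ctrl clears the high bits: Ctrl-@ = NUL, Ctrl-H = BS    */
    unsigned char asr33_shift16(unsigned char c){ return c ^ 0x10; } /* toggling the 16 bit: Shift-K (0x4B) becomes '[' (0x5B)  */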

比尔盖子
  • 1
    A good supplement to Things Every Hacker Once Knew is Aivosto's Control characters in ASCII and Unicode which is too long to excerpt here as another answer, but goes into detail on the history and meaning of each control character. – ssokolow Mar 11 '21 at 01:22
  • I find it curious that the cited article suggests there's no relation between DC1/DC3 and the modern use of Xon/Xoff, when the practical effect of sending DC1/DC3 to an ASR33 was to start or stop automatic transmission of data from the tape. Since the only way data would be sent automatically was from the tape, this generalized to start or send the remote automatic transmission of data. – supercat Jul 18 '22 at 17:30
17

Hex chart of ASCII characters

This chart (showing the hexadecimal values of ASCII characters) outlines manassehkatz's answer graphically:

  • Numbers are at 0x30 + the value of the number
  • Capital letters are at 0x40 + the value of the letter (A=1, B=2 etc)
  • Lowercase letters are at 0x60 + the value of the letter.
Dragon
8

Old-style ASR-33 teletype machines (telex machines) only handled 7-bit codes. They only handled uppercase English-language characters, the ten digits, and some punctuation.

They printed with this little cylindrical print head with a limited number of characters available.
ASR-33 print head cylinder

Later, tons of terminals, both printing and screen-based, came on the market using the same code.

Lowercase English letters were more or less an afterthought. Various terminal products (not ASR-33s) started using the 8th bit. Making the uppercase and lowercase letters contiguous in the code space would have created a new code incompatible with the old ASR-33 code. So, instead, they made the lowercase letter codes by ORing in the 0x20 bit to the existing codes.

That's why the codes aren't contiguous: backward compatibility with teletypes.

I remember getting my first ADM-3a terminal with 8-bit support. It was awesome for the time. It made UNIX useful. I had to flip some dip switches on the PDP-11 serial port card to support it. Yeah, back then a serial port card was bigger than a Raspberry Pi is now, and drew far more power.

ADM-3a terminal

Back in the day of serial ports and modems a big hassle was setting them up right. You needed to know the number of bits in the character, the parity, and the number of stop bits of the other end of the connection. Here's the serial port config screen from Windows 3.1:
Windows 3.1 serial config

Toby Speight
O. Jones
  • 2
    Did they really start using the 8th bit? The end result was all in 7 bits, and in fact at the time (and for quite a while after), many communication protocols only supported 7 bits with the 8th bit as parity. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 14:49
  • They sure did start using it. See my edit. – O. Jones Jun 27 '19 at 15:15
  • 1
    My point wasn't about using 8 bits for serial communications in the microcomputer era - that is clearly the case and led to "Code Pages" and "Extended ASCII" and various other ways to use the upper 128 characters of 8-bit bytes used for the (nominally 7 bit) ASCII code. I am questioning whether 8th bit was ever used for lower case on an ASR 33 (or similar device in the 1960s). A Windows 95 port configuration was about 30 years after ASCII and the ASR 33. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 15:19
  • 2
    @manassehkatz, Totally agree--ASCII was and still is a 7-bit code space--but when you say, "the parity bit" you're mixing protocol levels: There is no parity bit in ASCII. If you're talking about a PC COM port, then the parity bit, when its enabled, exists only in the async-serial protocol that carries the 7-bit ASCII codes over the wire from one UART to another. – Solomon Slow Jun 27 '19 at 15:35
  • Quiz question: Why did they make the DEL character all-ones (0x7F)? – O. Jones Jun 27 '19 at 15:53
  • @SolomonSlow Correct - parity was not part of ASCII per se. But since ASCII was 7-bit, many systems used an 8th bit for parity. My point is that the answer refers to "started using the 8th bit" in the context of the ASR33 time frame (though ASR33 were actually used far into the microcomputer era, but by then lower-case ASCII was well established using only 7 bits) and that while the 8th bit may have been used by some system (not ASCII) for lower case, it was never used that way in ASCII (only 7 bits) or in an ASR33. But looks like answer has been edited a bit, which is good. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 15:55
  • The ADM-3A was a wonderful terminal. However, while one of its innovations was that it supported lower case ASCII (optional!), and it did support an 8-bit data format (as well as all variants of 7-bit + parity), the 8th bit was ignored when displaying characters. In other words, the 8th-bit was in "name only". It didn't actually do anything. And in fact, that was (largely) the norm until the IBM PC with its 8-bit extended ASCII character set. Calling it "8-bit support" is pushing it a bit (pun intended, of course) in my opinion. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 16:02
  • 1
    As far as DEL = 7F, see https://en.wikipedia.org/wiki/Delete_character (already in my answer above) - essentially a useful artifact of punched paper tape. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 16:04
  • @manassehkatz You're still mixing up protocol layers. The fact that a UART can send character codes with parity does not depend on the codes having only 7 significant bits. The 8250 UART chip that was used in the original IBM-PC could be configured to send bytes having 5, 6, 7, or 8 data bits, with or without parity. 8 data + parity was not a problem for the hardware. Newer UARTs can do the same, even if the 8 data + parity configuration is rarely used. The software that feeds characters to a UART and receives characters from it does not normally ever see the parity bits. – Solomon Slow Jun 27 '19 at 17:15
  • @SolomonSlow I know the difference of protocol vs. character set etc. My main point is that ASR33 and ADM-3A did not do anything with that 8th bit except as parity - i.e., despite the ADM-3A at least having the option of using the 8th bit as data, it never did anything with it as data. So statements that seem to mix in "lower case" and "8-bit support" on those terminals are not correct. There may have been different ideas of how to add lower-case support to ASR33 (and to what became ASCII) but they did not, AFAIK, have anything to do with an 8th bit. – manassehkatz-Moving 2 Codidact Jun 27 '19 at 17:33
  • @manassehkatz-Moving2Codidact: I'm pretty certain the ASR33 would interpret the eighth bit as an indication of whether to punch the eighth column on the tape. I don't know what role the eighth bit played, if any, in deciding whether to stop advancing the tape upon receipt of a DC4 (which would be 0001 0100 with even parity). – supercat Jul 18 '22 at 17:17
8

None of the existing answers consider the context of the development of ASCII. Remember, the first version of the standard was released in 1963.

  1. At the time, hexadecimal notation was neither popular nor standardized. Most systems used either binary-coded-decimal or octal, for which the digits 0-9 are sufficient.

    As described in this question, at least four other ways of representing the hexadecimal digits 10-15 were used. The choice of A-F -- which may seem obvious today -- was not introduced until the IBM System/360, which first came out in 1964, after the first version of the ASCII standard.

    Thus, hexadecimal notation was not even a consideration by the ASCII committee.

  2. The ASCII committee decided early on that the lowest four bits of a decimal digit's encoding must be the actual digit. In other words, 0 must encode as XXX0000, and 9 must encode as XXX1001.

    Criterion 6. The numerics should have bit patterns such that the four low-order bits shall be the binary coded decimal representation of numerics.

    Coded Character Sets, History and Development, PDF page 257, logical page 235.

    So far, your proposal is consistent with the standard. It would place 0 at XXX0000, 9 at XXX1001, A at XXX1010, and F at XXX1111.

  3. Another decision of the committee was that the alphabetic characters all had to be contiguous:

    Criterion 10. The alphabetics should have contiguous bit patterns.

    ibid

    Considering your proposal, A through F are already contiguous. G would have to continue in a new group of 16, (XXX+1)0000. V would be (XXX+1)1111. W would continue into a third group of 16, (XXX+2)0000. Z would be (XXX+2)0011.

  4. Where your proposal fails is the committee's requirement that all alphabetic characters must fit into 5 bits:

    Criterion 9: The alphabetics A through Z, and some code positions contiguous to the code position of Z, should be contained in a 5 bit subset.

    ibid

    At the time of the standard's creation, every bit was precious. Packing customer names into 5-bit fields instead of 6 could significantly shrink the size of a business database.

    As we previously noted, your proposal places A at XXX1010 and Z at (XXX+2)0011. Such an encoding overlaps a 5-bit boundary, and thus does not meet the committee's criteria (the sketch after this list checks these criteria against the layout ASCII actually adopted).
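The sketch below (names and structure mine) checks criteria 6, 9, and 10 against the ASCII layout that was actually adopted; the 1963 committee of course worked on paper, so this is only a modern illustration:

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* Criterion 6: the low four bits of each digit are its BCD value. */
        for (char c = '0'; c <= '9'; c++)
            assert((c & 0x0F) == c - '0');

        /* Criterion 10: the alphabet is contiguous. */
        assert('Z' - 'A' == 25 && 'z' - 'a' == 25);

        /* Criterion 9: A-Z fit in a single 5-bit subset -- the upper bits
         * are constant (0x40) and only the low five bits vary (1..26). */
        for (char c = 'A'; c <= 'Z'; c++)
            assert((c & ~0x1F) == 0x40 && (c & 0x1F) == c - 'A' + 1);

        puts("ASCII satisfies criteria 6, 9, and 10");
        return 0;
    }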

DrSheldon
-2

The argument you make in terms of C expression complexity can be recast as follows:

If anyone really wanted to classify the ASCII characters so fast that the machine code output from a competent compiler wasn't enough, they'd be needing an FPGA or an ASIC anyway, and at that point it's irrelevant what the code is, since the most efficient encoding of the code ranges in the FPGA fabric or gate arrays will be nothing like C and vastly more efficient in energy use per each character classified, etc.

IOW: If speed matters that much, you'll be dealing with FPGA or custom silicon, and then nobody cares what it looks like in C. Both FPGAs and silicon can deal with some logic patterns that are terribly inefficient in C, like content-addressable memory, etc.

And, in any case, the way most character classification is done, the contiguity wouldn't help much, since there's a large variety of character classes that are entirely application-dependent, e.g. a lexer or parser might use a character class that finds no use outside of programming language front-ends, and nobody would be arguing to sort ASCII so that lexing C or Python is "cheap".

Furthermore, I'd dare say that at this point there's more character classification done on Unicode code points - most likely those of the Chinese character set / kanji - than on ASCII, so there's that too.

  • This doesn't appear to address the question, which is about the origins of the ASCII coding, not about how to write character classifiers. – Toby Speight Mar 13 '21 at 15:40