60

In C, '' is used to denote a character, while "" is used to denote a string. Why was this syntax chosen?

I tried to research this using Wikipedia’s Timeline of Programming Languages along with Rosetta Code’s reference page for strings. It seems that C was the first widespread programming language to make this distinction: in popular languages before it, like Pascal, ALGOL, COBOL and FORTRAN, '' and "" were either interchangeable or only one of them was used.

I know that it might seem like an obvious choice to use '' for characters and "" for strings, but it actually isn’t. Before programming, these symbols were only used in punctuation, and there is no such rule or convention that '' should be used when quoting smaller things.

Since I found 'Why was `!` chosen for negation?' and 'Why was "C:" chosen for the first hard drive partition?' on this SE site, I figured that this is the right place to ask this.

user3840170
hb20007
  • 8
    Welcome to Retrocomputing! Yes, this is the right site for this question. (Indeed, I'm surprised it hasn't already been asked.) – DrSheldon Jun 25 '21 at 16:21
  • 1
    Are you only interested in answers for C (and its offspring) or should we infer from the title a wider non-C context? – dave Jun 25 '21 at 17:06
  • 3
    Algol 60 (as defined by the Revised Report) didn't have characters, only strings. The reference language used different symbols for opening and closing string quotes. Implementations were all over the map on this; one I was familiar with used underlined brackets. – dave Jun 25 '21 at 17:21
  • 2
    Pascal uses ' for both character constants and string literals. Algol 68 (essentially) treated strings of length 1 as either single characters or strings depending on context, a decision that Charles Lindsey (a key figure in Algol 68's development) called "clearly a mistake" (see the section on Coercion in https://dl.acm.org/doi/10.1145/234286.1057810) – texdr.aft Jun 25 '21 at 18:00
  • 3
    @another-dave I am interested in why this was chosen in C. If C copied this convention from another language, then I would like to know the reason it was implemented that way in the original language. – hb20007 Jun 25 '21 at 18:08
  • Re "Before programming, these symbols were only used in punctuation...", that's not so. Quotation marks - " - were used in typewriting for quoting things (and also in handwriting & IIRC typesetting), so the use in C & other languages is an obvious inheritance. The single quote or apostrophe was used in contractions, and to indicate possessives (AKA "apostrophe s"). The use for character literals (as per @DrSheldon's answer) seems an obvious choice given the limited character set available. – jamesqf Jun 26 '21 at 04:05
  • 4
    @jamesqf Not sure what you mean. Those uses ARE punctuation. – barbecue Jun 26 '21 at 04:15
  • @barbecue: No, punctuation (for English) is .,!?:; and maybe - and (). – jamesqf Jun 26 '21 at 16:19
  • 4
    @jamesqf where on earth did you get that idea? – barbecue Jun 26 '21 at 16:33
  • @jamesqf By definition, "quotation mark: each of a set of punctuation marks [...]" This page describes all of the English punctuation marks, quotation marks included. – Ouroborus Jun 26 '21 at 17:35
  • 1
    And as for "why do it that way round, and not " for characters, and ' for strings": A single quote suggests a single thing, while a double quote suggests two (or more) things, so it's easier to remember this way round. – dirkt Jun 27 '21 at 04:37
  • 2
    With typical English punctuation usage, double-quotes are used to indicate, well, quotations. If the quoted text includes a nested quotation, then typically single-quotes are used for the inner quotation. It therefore seems logical to me to use double-quotes for strings and single-quotes for their constituent parts. – jamesdlin Jun 27 '21 at 10:06
  • The use of quotation marks for string literals was already the "standard" convention across scores of languages and had been for almost a decade. Indeed this was one of the very few (nearly) universal language conventions at the time considering the veritable zoo of languages that were available and viable. – RBarryYoung Jun 27 '21 at 14:37
  • @barbecue: I got the idea from reading and writing English for the past several decades. Punctuation marks denote pauses in speech. (See the dictionary definition.) Quotation marks, and other special characters like - well, most everything on the top row of your keyboard - don't really have any counterparts in speech. (And FWIW, just finding a web page that supports your idea doesn't mean the idea is correct. See e.g. recent politics :-() – jamesqf Jun 27 '21 at 16:14
  • 6
    @jamesqf I've also been reading and writing English for the past several decades, and I have never seen such a restrictive definition of punctuation marks for modern English. The purpose of punctuation hasn't been just to identify pauses for centuries. The elocutionary definition you're using fell out of favor in the 17th century, when the syntactic school became prominent. Punctuation identifies not just pauses, but ways to clarify syntax. Since you are rejecting any sources from the web, I won't bother to provide links to Britannica, the OED, or other unreliable online sources. – barbecue Jun 27 '21 at 20:10
  • 2
    @jamesdlin, in American English, not 'typical' English. In British English it's (traditionally) the opposite: single quotes as the main, double quotes as the inner ones. From here, a curious observation: C (with its double quotes for strings) was invented by the Americans, while Pascal (with single quotes) by a European (not a Brit, but Europeans tend[ed] to learn British English). – Zeus Jun 28 '21 at 01:11
  • FORTRAN didn't officially add quoted strings, or the CHARACTER*length type, until 1977. At C's creation it used 'Hollerith constants' (and formats) like 17HTHIS IS SOME TEXT stored, if at all, in numeric variables or arrays -- preferably INTEGER because on some machines REAL (floating-point) would corrupt values that weren't really numbers. – dave_thompson_085 Jun 28 '21 at 11:59
  • 1
    Great first question! – Wayne Conrad Jun 29 '21 at 19:38

4 Answers

53

For type system reasons, and for compatibility with B.

B is a programming language that served as the immediate ancestor of C. The salient thing about B is that it had no type system: all values in B are machine words (corresponding to the C type int). In B, there were two ways to represent strings in source code: string literals⁰, which evaluated to a pointer to a block of memory holding the string, and multi-character literals, which packed multiple character codes directly into a single machine word. The latter were famously used in Kernighan’s original ‘Hello, world’ program:

main( ) {
 extrn a, b, c;
 putchar(a); putchar(b); putchar(c); putchar('!*n');
}

a 'hell'; b 'o, w'; c 'orld';

Since the two kinds of values behaved so differently, yet could not be distinguished at the type system level (because there was none), they had to use different syntaxes.

As C is an evolution of B, it simply inherited all this baggage and could not change it without breaking compatibility. Although some breaking changes to the syntax were made in C, there apparently wasn’t a compelling enough reason to make one here; the weak typing of C does, after all, maintain a certain kind of continuity with B.
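
To make the inherited distinction concrete, here is a minimal C sketch of my own (assuming an ordinary ASCII implementation; it is not taken from any of the B or C papers). Both constructs survive in C, and they remain very different things:

#include <stdio.h>

int main(void)
{
    int  packed = 'ab';   /* multi-character constant: still legal C, type int,
                             value implementation-defined (B's word-packing survives) */
    char *text  = "ab";   /* string literal: points to the bytes 'a', 'b', '\0' in memory */

    printf("%d\n", packed);              /* e.g. 24930 with GCC on an ASCII machine */
    printf("%c%c\n", text[0], text[1]);  /* prints: ab */
    return 0;
}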

The above, though, raises the question of why such a distinction was made in B. Since B was conceived as a simplified version of BCPL, one may think there might be some answers in materials about that language. But according to the manual, in BCPL character literals and string literals were not differentiated by delimiters, but by their length:

A string constant of length one has an Rvalue which is the bit pattern representation of the character; this is right justified and filled with zeros.

A string constant with length other than one is represented as a BCPL vector [i.e. array]; the length and the string characters are packed in successive words of the vector.

So the delimiter distinction between character literals and string literals was first made in B. As to why, and why this particular syntax was chosen, we probably have to resign ourselves to speculation, as neither Users' Reference to B nor A Tutorial Introduction to the Language B nor The Development of the C Language elaborates on that particular topic. My hypothesis would be:

  • Because B allowed multiple characters in its character literals, it could no longer rely on differentiating characters from strings by the length of the literal (and because again, B had no type system to transparently inter-convert between them), and thus a syntactic distinction was necessary.
  • Character literals, as conceptually more lightweight (not requiring additional storage), were assigned the glyph that was (visually) simpler and took fewer keystrokes to type. (I shamelessly stole this one from @Toby Speight.)

This explanation is mostly conjecture, but it seems we may have a hard time finding a better one.


⁰ Contemporaneous documentation used the term ‘constant’ instead of ‘literal’, since it was the only kind available back then anyway.

user3840170
  • Do you know why " was chosen for string literals and ' for multi-character literals? Also, how did this end up being " for strings and ' for characters in C? – hb20007 Jun 25 '21 at 18:27
  • @hb20007 This I can only speculate on, but Toby Speight’s hypothesis that (multi-)character literals came earlier than strings seems pretty plausible to me. I might have to look into BCPL to search for clues. – user3840170 Jun 25 '21 at 18:30
  • It sounds plausible but we don't know if it indeed came earlier. – hb20007 Jun 25 '21 at 18:33
  • @hb20007 The other question is easy, though. C is a direct extension/evolution of B, and the first C compiler was simply extended from a B compiler. The syntax wasn’t even copied (reimplemented from scratch), it was inherited. – user3840170 Jun 25 '21 at 18:58
  • 1
    BCPL seems to have only had strings, not character literals. Interestingly, sample BCPL code in Wikipedia uses " for string literals, while the BCPL manual at https://www.bell-labs.com/usr/dmr/www/bcpl.html uses ' in its description of the syntax. – user3840170 Jun 25 '21 at 18:59
  • https://www.lysator.liu.se/c/clive-on-bcpl.html writes that BCPL had character constants of the form 'a', and string constants of the form "abc". I think this is wrong though. – hb20007 Jun 25 '21 at 19:14
  • 1
    @hb20007 I dug into the BCPL manual more closely, and it says BCPL made no syntactic distinction between character literals and string literals: ‘A string constant of length one has an Rvalue which is the bit pattern representation of the character; […] A string constant with length other than one is represented as a BCPL vector [i.e. array]’. BCPL had types, so maybe contextual disambiguation was tenable there, but certainly not in the untyped B. – user3840170 Jun 25 '21 at 19:17
  • So it seems that this syntax was chosen by the creators of B. – hb20007 Jun 25 '21 at 19:24
  • 2
    On your footnote, the term "literal" is not anachronistic. There is a distinction between a literal and a constant. In the expression const int a = 'b'; (yes, that is legal in C) a is an int constant with value of 0x62 and 'b' is a char literal. – JeremyP Jun 26 '21 at 09:43
  • @JeremyP That is legal only as late as in C99, it wasn’t legal in B or BCPL. The point is, was this distinction made back then? It’s a bit like referring to VMM32 as a ‘hypervisor’: sure, technically it was (and the term even existed at the time, of which I am less sure with ‘literal’), but it wasn’t actually used back when VMM32 was actively used. – user3840170 Jun 26 '21 at 09:59
  • 1
    @user3840170 It's always been legal in C. Anyway, that's not the point. The point is that a literal and a constant are distinct concepts. – JeremyP Jun 26 '21 at 10:02
  • @JeremyP Literals and constants are distinct concepts in modern understanding, not back when those languages were designed. (In fact, the GCC manual still occasionally refers to literals as ‘constants’! I strongly suspect this is a historical holdover.) Using modern terminology for not so modern inventions is an anachronism. – user3840170 Jun 26 '21 at 10:20
  • 3
    C did not, in its early life, have 'constants' that were not 'literal', so you can excuse that fairly impoverished language if its compilers sometimes confused the two. But other languages made the obvious distinction between a thing that literally denoted itself, and a thing that unchangingly denoted some value that was not the same as the marks from which it was made. (Note that other languages had 'denotations' rather than 'literals', but the concept is the same). – dave Jun 26 '21 at 11:09
  • 1
    Re: Anachronistic term "literal". I cannot speak to the C documentation, however in the industry the term was already widespread by the early 70's when I encountered it. I would not consider it an anachronism. – RBarryYoung Jun 27 '21 at 14:41
  • @JeremyP+ const wasn't always in C; it was added in the first standards C89 (ANSI) and C90 (ISO+), but it wasn't in original C or K&R1, and a lot of code had been written before it was added, which is why we have a const 'leak' in /*nonconst*/ char * strstr (const char * haystack, const char * needle) (C++ can fix this with overloading). – dave_thompson_085 Jun 28 '21 at 11:57
  • 1
    Literals versus constants - see AB 25.3.1, March 1967. – dave Jun 29 '21 at 01:45
  • @another-dave Okay, I guess that ALGOL bulletin convinced me. – user3840170 Jun 29 '21 at 08:00
25

Not quite the same thing, but PDP-11 assemblers used 'X as a single-character value (i.e., a byte), and "XY as a two-character value (i.e., a word).

MOV  #"EH, BUFF     ; store the word packing the two characters "EH" at BUFF
MOVB #'?, BUFF+2    ; store the single byte '?' just after it

The single/double quote corresponds nicely to the number of characters involved.

Ritchie et al. would surely have been aware of this DEC convention. In fact, the same convention was carried over into the Unix assembler 'as', which according to its man page was derived from the DEC assembler PAL-11R.

References: (both from 1971)

DEC usage: section 4.3 in this PAL-11R programmer's manual

Unix usage: section as(I) in the UNIX programmer's manual

FWIW, in DEC syntax, strings were rather different: a specific pseudo-op was used to declare strings, with arbitrary delimiter pairs, though slashes were conventional:

.ASCII /EH?/        ; the string EH?, using the conventional slash delimiters
.ASCII ZEH?Z        ; the same string, using Z as the delimiter character
dave
6

Consider how other languages handled characters. Many languages represented characters as strings of length 1, rather than as a type of their own. This was inefficient in several ways:

  • The source code was more verbose. Compare

    IF MID$(Q$,1,1) = "A" THEN
    

    to

    if (q[0] == 'A')
    
  • Operations such as extracting one character or performing a comparison are more efficient with character types than with strings. The MID$ operation above allocates and copies yet another string. The = operation requires scanning through the two strings involved.

  • For compiled languages, character literals take up less space than string literals, both on disk and as a loaded program. String literals need to allocate space for the characters, the length or terminating character, and the address which references the string. Character literals can simply be an immediate operand of the instruction you were going to compile anyway.

  • There were also constructs like CHR$(65), whereas 'A' is both more efficient and easier to understand.

As the intent of C is to create highly efficient programs, it was necessary to have a character type separate from string types. In turn, this meant having a way to represent character literals separately from string literals.

Modern compilers probably could determine the data type from the surrounding context, but early compilers simply weren't that sophisticated.
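
As a rough C sketch of that difference (my own illustration; the two helper functions are hypothetical names, not from any particular library): the character comparison is a single integer compare, while doing the same job through strings drags in extra storage and a library call.

#include <string.h>

/* Character version: 'A' is just an integer constant, so this
   compiles down to comparing one byte against an immediate value. */
int starts_with_char(const char *q)
{
    return q[0] == 'A';
}

/* String version (the MID$ style): build a length-1 string and
   let strcmp walk both strings until it hits the terminators. */
int starts_with_string(const char *q)
{
    char first[2] = { q[0], '\0' };
    return strcmp(first, "A") == 0;
}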

DrSheldon
  • ISTR that Prime Fortran IV (which for a while was a popular system programming language on Primes) extended Fortran with strings, which were packed into arrays of INTEGER (or anything else, I think). – Rich Jun 28 '21 at 04:16
  • I do not see how the first bullet point is in any way necessary. Fortran uses ' and " interchangeably, characters are strings of length one, and still if (q(1:1) == 'A') works perfectly fine. Similarly for the second point. And the third point. You can have strings that come with their length information. Then you do not need any terminating characters. Terminating characters in C are a well-known source of buffer overflows and severe security failures. – Vladimir F Героям слава Jun 28 '21 at 17:01
  • 1
    @Rich Those are Hollerith constants; they were standard Fortran 66. If you take the example from the top answer: A=4Hhell B=4Ho, w C=4Horld. – Vladimir F Героям слава Jun 28 '21 at 17:07
3

In C, '' is used to denote a character, while "" is used to denote a string. Why was this syntax chosen?

The syntax difference is to describe two different constructs:

  • A character is a single value, used directly as a value, while
  • A string is an array of values, usually referred to through a pointer.

Most importantly, without making that distinction it would be impossible for a compiler to decide whether "A" refers to the character (value) A or defines a string containing the single character A (with a delimiting length or terminator).
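
A small C sketch of my own to make that concrete (nothing here is specific to any one compiler): the two notations denote genuinely different objects, so a compiler confronted with a lone "A" would have to guess which one is meant.

#include <stdio.h>

int main(void)
{
    printf("%zu\n", sizeof('A'));   /* size of an int: a character constant is a plain value */
    printf("%zu\n", sizeof("A"));   /* 2: the array {'A', '\0'}, a string plus its terminator */
    return 0;
}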

Important for compiler construction: having that distinction made up front, with the first symbol of the token, simplifies the parser. Much like writing 0x in front of a hex number, it saves effort in working out whether it's a number or something else. The parser does not have to read the whole token to see what it is about, but can go ahead according to what the leading symbol says.

Due to this necessity, two different quotes are needed.
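
For instance, a lexer along these lines (a hypothetical sketch, not taken from any real C compiler) can commit to a token kind after reading just the first character:

#include <ctype.h>

enum token_kind { TOK_CHAR_CONST, TOK_STRING, TOK_NUMBER, TOK_OTHER };

/* Decide the token kind from its first character alone,
   much as a leading 0x decides "hex number". */
enum token_kind classify(int first_char)
{
    switch (first_char) {
    case '\'': return TOK_CHAR_CONST;  /* scan ahead to the closing ' */
    case '"':  return TOK_STRING;      /* scan ahead to the closing " */
    default:   return isdigit(first_char) ? TOK_NUMBER : TOK_OTHER;
    }
}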

Now, why exactly these two were selected is hard to say, but it seems intuitive that the single quote marks the shorter item while the double quote covers the longer one. This is also roughly consistent with usage in English-language writing, where speech and other quotations appear primarily between double marks; that makes a lot of sense, since regular English text already contains plenty of single marks for contractions and abbreviations.

It seems that C was the first widespread programming language which implemented this

C inherited it from B, which introduced this differentiation as part of its simplification from BCPL. B was written by Ken Thompson and Dennis Ritchie, who later went on to create C.

Toby Speight
Raffzahn
  • 13
    Double quote marks for strings are consistent with usage in American English orthography, but not with British English. (But of course C was invented by Americans). – alephzero Jun 25 '21 at 14:32
  • @alephzero Yes, you're right, British English is way more creative. Still, in my experience as an outside observer, it does seem as if practical use in Britain has moved more and more toward US diction (in many more ways than just quotation mark handling). – Raffzahn Jun 25 '21 at 14:56
  • 8
    I like the consistency that has the narrower character ' for the shorter item. It could also be (I'm speculating here) that character literals were implemented before string literals, and on many keyboards ' is unshifted and " requires shift, so it's possible the simpler character was used first. – Toby Speight Jun 25 '21 at 15:18
  • @TobySpeight Sounds good to me. – Raffzahn Jun 25 '21 at 15:23
  • @user3840170 you might want to check the B Manual , especially section 4.1.5, which sounds much like a terminator delimited string - heck, he even calls it such :)) – Raffzahn Jun 25 '21 at 15:42
  • re "it would be impossible...": It's not exactly impossible - Algol68 uses "A" for a character and "A" for a string, being an array of characters. The conceptual cost of this is the 'rowing' coercion ('row' is Algol68-ish for 'array'). Maybe for C it's impossible. Not clear from the title whether C is all we're interested in here. – dave Jun 25 '21 at 17:01
  • @another-dave Indeed very possible: for a more mainstream example, Borland’s Object Pascal dialects do this, with both single-character literals and strings denoted with single quotes. It just wasn’t possible in B, and by extension, in C. – user3840170 Jun 25 '21 at 17:04
  • 1
    It may be relevant that the ASCII standard does not have a character known as 'single quote', it has an 'apostrophe' (which is of course the crux of the biscuit). Whether you consider this as a suitable quotation mark depends on its appearance on the equipment available to you. – dave Jun 25 '21 at 17:14
  • 2
    It's also interesting that there are two Unix-y conventions around this: one is paired-apostrophes as in C, which works best IMO for 'straight' apostrophes, and the other (often used in man pages?) pairs apostrophe with accent-grave, which looks horrible on any font I've ever used. – dave Jun 25 '21 at 17:16
  • 2
    @another-dave See this page for more about that quoting convention. It's also used in the m4 macro processor and in TeX. – texdr.aft Jun 25 '21 at 17:54
  • @another-dave I remember grave accent used to look acceptable as an opening single quote in xterm. I recall some X fonts had the glyphs for ` and ' designed specifically for that purpose. – user3840170 Jun 25 '21 at 19:22
  • 4
    In British English (or English, as the British call it), a quotation is enclosed in quotation marks, unless it’s nested. Single quotes are typically used when referring to something that might not be a valid linguistic construct otherwise, and single characters would certainly fit into this. So the choice to punctuate this way is entirely logical although I have no idea whether that was a driving force for the C language. – Frog Jun 25 '21 at 21:25
  • 1
    @texdr.aft m4 used to drive me crazy with its ridiculous use of "backquote" and "single quote" (as I thought of these characters) for quoting. It was only when I found out that a lot of popular character sets in the USA rendered them as symmetrical to each other that I understood why they did it. It's still wrong. – JeremyP Jun 26 '21 at 10:01
  • 2
    @JeremyP - even when symmetrical (i.e., resembling grave and acute), it was never self-evident to me which one was the opening mark and which one was the closing mark: should I write `foo´ or ´foo` ? – dave Jun 26 '21 at 14:03