19

As anyone who has been bitten by using base64 instead of base64url is quite well aware, the "original" base64 alphabet uses alphanumeric, +, = (both perfectly cromulent URL characters), and the dreaded /. I want to know how this came about, because it seems that using / in an encoding alphabet is extremely short-sighted.
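
To make the difference concrete, here is a quick check with Python's standard-library base64 module (just an illustrative sketch; the input bytes are arbitrary, picked so every 6-bit group lands in the last alphabet slot):

```python
import base64

blob = b"\xff\xff\xff"  # every 6-bit group is 0b111111 = 63, the final alphabet slot

print(base64.b64encode(blob))          # b'////'  (standard alphabet ends in +, /)
print(base64.urlsafe_b64encode(blob))  # b'____'  (URL-safe alphabet ends in -, _)
```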

I have been able to track the origin of this through:

We can however look to RFC821/RFC822, the famous Simple Mail Transfer Protocol (SMTP). There is no mention of base64 encoding, or any binary-to-text encoding for that matter. I'm not even sure the idea of attaching/sending binary data via email existed at the time, and it wasn't until 1992 that the first MIME email attachment was sent.

I estimate, given the 1993 RFC, that the decision of what would go into the base64 encoding occurred sometime in the mid-80s. Bourne shell and awk would have been a thing. ASCII and EBCDIC would be fairly established. I'm guessing it would have happened after uuencode was invented in 1980 (whose alphabet includes _), though it may have been made without knowledge of uuencode.

Regardless, all the folks working on this sort of early tech were surely familiar with UNIX to some degree, and the convention of using / for path separation. Further, the use of the / symbol as a separator of the form "something / something_else" goes way back, probably to the 18th century at least. I'm guessing this usage is probably what subconsciously influenced the choice of / as the UNIX path separator. Surely, one might consider the idea of base64blob1/base64blob2. All this put together makes / a really strange choice for the fledgling base64 encoding alphabet, even in a pre-URL age, especially since better alternatives like - and _ are right there. Heck, even ,.@!$&% are all better candidates than / in my book. / is literally the only character besides \0 disallowed in UNIX filenames.

I guess the motivation for this is that, in my mind (as someone who started writing code in the 90s), there is a continuum of characters from "most word-like" to "most code/delimiter-like" (i.e. more like control characters than actual identifiers):

  • a-zA-Z the exemplar of word-like
  • 0-9 still quite word-like, valid in identifiers
  • _ basically the "I need a space but it needs to be an identifier"
  • - word-like in some contexts
  • @#$%&+= - vaguely word-like but also code-like
  • !?:;., - definitely have that delimiter feel
  • ^*| - more code-like than word-like
  • ""''<>[](){} ok these are definitely delimiters/brackets
  • \/ and backtick - unabashedly separators

I wasn't there, but a lot of these "tropes" originate in the '60s and '70s, so I would imagine '80s developers would have had similar intuition. / just "feels wrong".

So how did this come about?

Edit: The ARPA Internet Text Messages grammar (RFC 822) gives special meaning to ()<>@,;:\.[], so it might make sense why those would be excluded, albeit they can be string-quoted. But that still leaves - and _, which are present in ASCII 1965 and EBCDIC. Is there some other early character encoding that lacks these that would sway the decision? Perhaps PETSCII?

Edit 2: Let us assume that the character set needed to be common between ASCII63, EBCDIC invariant, and PETSCII. If we take the intersection of these, subtract the SMTP special characters and the obvious alphanumerics, that leaves the candidates =-%/?*&+, of which we need 3. =+- makes a ton of sense and has some nice symmetry to it. ?%& also seem fairly reasonable (pre-URL). / and * feel like the least promising. Is there some other character set or protocol restriction out there which may have ruled out -?%&, necessitating /?
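
For what it's worth, the intersection argument can be sketched in a few lines of Python. The punctuation repertoires below are my own rough approximations of the three tables (and of the RFC 822 specials), so treat the result as illustrative rather than authoritative:

```python
# Approximate punctuation repertoires -- assumptions for illustration, not authoritative charts.
ebcdic_invariant = set("+<=>%&*\"'(),_-./:;?")
ascii_1963       = set("!\"#$%&'()*+,-./:;<=>?@[\\]")   # no lowercase, no underscore
petscii          = set("!\"#$%&'()*+,-./:;<=>?@[]")

rfc822_specials  = set("()<>@,;:\\\".[]")                # the RFC 822 "specials"

candidates = (ebcdic_invariant & ascii_1963 & petscii) - rfc822_specials
print("".join(sorted(candidates)))   # %&'*+-/=? with these tables
```

With these particular tables the apostrophe also survives, but the shape of the result matches the candidate set above.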

user3840170
DeusXMachina
uuencode seems to be from 1980. But it is not clear what the problem is. Maybe you could give an example of what it is you do not like. Evidently things work, to some extent at least. – Tomas By Apr 25 '22 at 17:05
  • 4
    RFCs 1113 & 989 also define base 64? – Tomas By Apr 25 '22 at 17:12
  • 4
    "A 64-character subset of International Alphabet IA5 is used, enabling 6 bits to be represented per printable character. (The proposed subset of characters is represented identically in IA5 and ASCII.)" – Tomas By Apr 25 '22 at 17:20
  • 2
    "in UNIX filenames" true, then again, while Unix had a huge role in the 1980s to make the internet what it is today, it's neither the only nor even one of the original systems the mal service, and subsequent BASE84, was created for. looking close next to any of the available non-alphanumeric characters will have some special meaning within each system - or will be disallowed (think colon for TOPS-10) – Raffzahn Apr 25 '22 at 19:08
  • As noted by @TomasBy RFC 989 as a precursor to 1421 defines the MIME variant of Base64, and this was in 1987. https://datatracker.ietf.org/doc/html/rfc989 – Joe Apr 26 '22 at 02:39
  • 2
    RFC821 is the wrong standard for the content of an SMTP message. RFC822 is the one you want - or would be if it hadn't been superseded by RFC2822 and the MIME standards. – JeremyP Apr 26 '22 at 07:56
  • 2
    I'm not sure / is disallowed in UNIX filenames. ext2/3/4 are all perfectly happy with it. Of course, naive userspace programs may have issues. – Omar and Lorraine Apr 26 '22 at 10:07
  • 3
    There is a standard for encoding "special" characters in URLs, basically using + for space, and %XX for everything that isn't a "nice" character. Using base64 in URLs is basically using the wrong tool for the task. – Guntram Blohm Apr 26 '22 at 10:10
@OmarL Are you sure about /? Many sources, for example Wikipedia and the VFS source, disagree with the statement. Look in namei.c for disallowing /. – doneal24 Apr 26 '22 at 12:49
@OmarL I think you may be right, / might be allowed in UNIX filenames at the API level, but since the shell uses it to render and parse path separators, it's effectively impossible to use. You can't even escape it like \/. Dunno if this has always been the case, however; older OSes may allow it, while newer ones forbid it to prevent footguns. – DeusXMachina Apr 26 '22 at 15:38
  • 3
    / seems like a perfectly valid character to be in base64 - considering that base64 was designed to encode text. Yes, there are computational scenarios where that character means something specific; but I suspect the logic behind implementing it was "Well, it's a key on a typewriter" – Andrew Corrigan Apr 26 '22 at 15:45
  • 2
"/ seems like a perfectly valid character" - The whole premise of the question is that, given the character sets of the time, there are a handful of "perfectly valid" characters, all with specific meanings in some contexts, but / is one of the least-appealing ones, given the alternatives. – DeusXMachina Apr 26 '22 at 16:01
  • 1
@DeusXMachina According to the VFS implementation, if a slash is found in traversing a path/file name then it is treated as either a path or a trailing symlink. The Linux kernel interface will not allow slashes. I have not checked BSD or UN*X source. My Lions' guide is at work. – doneal24 Apr 26 '22 at 16:01
  • @doneal24 - Nice find! Yeah, I can't rule out that it might have been valid at some point, but there's certainly a bevy of evidence that / as part of a file/dir name is bad vibes at best, and even if it wasn't explicitly disallowed, it probably caused all sorts of havoc if you ended up on a system with / in a name – DeusXMachina Apr 26 '22 at 16:07
  • 1
@DeusXMachina I can't prove it but I would say it was disallowed in UFS back in the 4.1/.2/.3 BSD systems. – doneal24 Apr 26 '22 at 16:11
  • 1
@Tomas By - the problem is, anytime someone uses base64 when they should have used base64url, it inevitably causes bugs. Someone either uses the wrong decoder, or tries to put it in a query param, or does not escape it correctly. It's a pain, yet the naming makes it seem like it is "the default". And that bugs me, enough to write a question on retrocomputing SE :p. – DeusXMachina Apr 26 '22 at 16:32
  • "=+- has some nice symmetry to it" - I'm leaning towards visual distinctiveness, then (as noted by somebody somewhere). =+/ seem to be the simplest and clearest of those non-alphanum chars. – Tomas By Apr 26 '22 at 20:37
  • @GuntramBlohm: One drawback of URL %XX (or MIME quoted-printable =XX) encoding is that, in the worst case, it can triple the length of a string, whereas Base64 only expands its input by 33%. – dan04 Apr 27 '22 at 00:00
  • 1
What really bothers me would be + and not /, as + has some meaning in regexps; thus, when searching for something in base64 with a regex, only that char needs to be escaped with a backslash (\\) ... which I would find annoying. But using fgrep instead would solve this easily – Olivier Dulac Apr 27 '22 at 16:24
@TomasBy - "I'm leaning towards visual distinctiveness" - base64 strings are even less meant for human consumption than URLs. But I can see that being a facet that influenced the original choice. – DeusXMachina Apr 27 '22 at 18:53
Bourne Shell and Awk are still a thing. Life would be much less sweet without them. – Kingsley Apr 28 '22 at 01:57
  • 1
    They picked 64 characters from the 33-126 range that would work. – Thorbjørn Ravn Andersen Dec 10 '23 at 13:01
  • 1
    This question seems to boil down to "why didn't the designers of a thing for one purpose take into account some entirely different purpose?". – dave Dec 10 '23 at 13:52
  • I like this question. It needs a good home. I was looking for essentially "why did they fill up the last two slots with '/' and '+'?" and this suffices. You know. "They." – MrBoJangles Jan 31 '24 at 20:58

2 Answers

30

I'm not aware of a (published) rationale for the choice of '+' and '/' as encoding characters, as well as '=' for padding / end-of-message, and I strongly suspect there isn't one.

Base64 was designed as a tool for encoding (8-bit) binary data so that it passes safely and reliably through systems that can handle only 7-bit-ASCII printable characters, and that may insert, delete or modify whitespace along the way[1][2].

Like the earlier uuencode, it does so by taking three 8-bit words, and chopping them into four 6-bit words, each of which is then assigned a printable, non-whitespace character. Unlike uuencode, base64 has the nice property that all the characters used exist in all variations of ASCII and EBCDIC in use at the time[3]. Having used upper case letters, lower case letters and digits, the designers were still two characters short, and made a choice. A different encoding scheme, xxencode, chose '+' and '-' (and a different arrangement of the other characters).
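
As a minimal sketch of that chopping step (my own illustrative code, standard alphabet only, with padding for short final groups left out):

```python
# Standard Base64 alphabet: A-Z, a-z, 0-9, then the two extra slots '+' and '/'
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_group(b0: int, b1: int, b2: int) -> str:
    """Pack three 8-bit values into a 24-bit word and emit four 6-bit symbols."""
    word = (b0 << 16) | (b1 << 8) | b2
    return "".join(ALPHABET[(word >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_group(0x4D, 0x61, 0x6E))  # 'TWFu' -- the bytes of "Man"
```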

It's worth bearing in mind that base64 encoding was never intended to be used in filenames, and URLs wouldn't exist for another seven years[4].

[1] e.g., by switching between CR LF, CR and LF for line breaks

[2] such as e-mail, at the time

[3] unlike, say, the underscore, which does not exist in the 1963 version of ASCII (Commodore's PETSCII derived from that, rather than from the 1967 version of ASCII everybody else was using in the 80s)

[4] counting from RFC 989 / Feb 1987 to RFC 1738 / Dec 1994

Michael Graf
  • 2
    This doesn't seem to actually answer the question though... maybe there was some reason for not using e.g. star instead. – Tomas By Apr 25 '22 at 18:35
  • 4
    @TomasBy — you're right, but (a) it was way too long for a comment, and, (b) I very strongly suspect that there is no answer, or at least not a documented one, and the choice was completely arbitrary. – Michael Graf Apr 25 '22 at 18:39
  • 6
@TomasBy Looks like a good answer as it not only considers the basic fact that Base64 is a scheme prior to URLs (so one would rather ask why URLs use '/'), but as well bases it on the requirement to be compatible with as many existing codes as possible - even the most basic ones. After all, the world neither was nor is ASCII only. 26+26+10 letters/numbers assumed, only 2(3) additional characters are needed. Here a look at the international variants of ITA2 (covered in almost all later codes) is mandatory, leaving 10 (well 6) characters, so selection between them is arbitrary. – Raffzahn Apr 25 '22 at 18:46
  • 2
@Raffzahn As far as why URLs use '/', that fits well with *nix paths. But confusion among non-geeks is unbelievable - I have heard many times on radio ads: "go to example.com backwards-slash something" - drives me crazy! – manassehkatz-Moving 2 Codidact Apr 25 '22 at 19:10
  • 1
@manassehkatz-Moving2Codidact I had a feeling I was aware why, so thanks for confirming :)) Point wasn't about learning that fact, but the original question's line of thought. – Raffzahn Apr 25 '22 at 19:13
  • @Raffzahn I don't think we know it was arbitrary. It could be something to do with the position those chars had in various char sets then in use, or some reason for reserving some chars for some other purpose. – Tomas By Apr 25 '22 at 19:52
  • 2
    @TomasBy You're right that we don't know, but looking at both ASCII and EBCDIC charts, there's nothing that makes these choices stand out as special. Using ':' and ';' right after the numerals would have been a more "logical" choice from an ASCII point of view, but then, one might as well have used uuencode, which is an almost-straightforward mapping into ASCII. – Michael Graf Apr 25 '22 at 20:12
  • 6
We shouldn't ignore the visual aspect. Sure, it's probably not the case that anyone's typing in a lot of base64, but it's still good to be able to compare characters "at a glance". From that point of view, ':' and ';', or '.' and ',' are not good choices. – dave Apr 25 '22 at 22:38
  • 2
    Re, "...[2] such as e-mail, at the time..." You can say that again. And again, as many times as you like. If I had to guess, I'd guess that SMTP and FTP were the only two serious contenders for "most important internet protocol" circa 1980, and SMTP was the one that suffered the 7-bit ASCII limitation. – Solomon Slow Apr 25 '22 at 23:08
  • @another-dave Well, if typing (or only reading) would have been a concern, then the second character block (lower case) would have been aligned to x'20' instead of x'1A', wouldn't it? (Or to x'01' and x'21'). So certainly readability wasn't a concern at all. – Raffzahn Apr 26 '22 at 05:56
  • 9
    Here's a theory. SMTP has headers in which : definitely and ; up to a point have special meanings. Same applies to HTTP of course. The character set may have been picked to make it easier to put base64 in a mail header. – JeremyP Apr 26 '22 at 07:59
  • 1
"URLs wouldn't exist for another seven years" - I allude to that in my question. I'm aware b64 predates URLs, but URLs are definitely influenced by UNIX paths, which predate both by years, and the usage of / for X/Y predates paths by decades. I think the most promising route is looking into the excluded charset of SMTP itself. – DeusXMachina Apr 26 '22 at 15:30
"unlike, say, the underscore, which does not exist in the 1963 version of ASCII" - sure, but -+= are there, and _ was there by ASCII-1965 – DeusXMachina Apr 26 '22 at 15:59
  • 4
@DeusXMachina: While underscore may have been defined in ASCII 1965, many devices continued to have ASCII-ish character sets that re-purpose some codes for alternative glyphs. A substantial fraction, if not a majority, of devices with small 5x7-matrix text displays use a display font that replaces backslash with a yen symbol and tilde with a right-facing arrow. The closed-captioning character set replaces asterisk and underscore, among other characters, with accented lowercase letters. – supercat Apr 26 '22 at 20:47
18

This answer is speculation but it's too long for a comment and I suspect any answer is likely to involve some speculation.

We can however look to RFC821, the famous Simple Mail Transfer Protocol (SMTP).

RFC821 defines the mechanism used to transmit SMTP messages across the Internet. The content of the messages is outside of its scope. The original definition of the content and structure of a message is in RFC822. However, both of these RFCs long predate base64.

All this put together makes / a really strange choice for the fledgling base64 encoding alphabet, even in a pre-URL age

Not really. After all, it was perfectly possible to encode a path in base64. The only problem arises when people try to read a base64 field as unencoded text and nobody would be that stupid, would they?

Anyway, in the era when base64 was created, the most likely use of it would be in the body of an email, or perhaps in a header of an email. Even today you are more likely to see it in the body of an HTTP message or in the headers than in the URL*. Several of the characters you listed, ,.@!$&%, have special meanings in the context of an SMTP header, as do two others suggested in another answer, i.e. : and ;. : is used to denote the end of the header key. ; is often used in headers to separate tokens. @, ! and . are used in email addresses. They probably chose the least-worst characters in the context of the time in which base64 was created and in the context for which it was designed.

*I would regard base64-encoded data in a resource path to be an abuse of the standard. Possibly there's a case for GET parameters to be base64-encoded, but there are better ways even for that. Why would you want to embed binary data in something that's supposed to be human readable?

JeremyP
  • 3
URLs aren't “supposed” to be human-readable. It's nice when they are, and for certain types of more or less static content this is a realistic goal, but for lots of applications it's futile and pointless to try making the URL human-understandable because there's way too much information. Even in URLs that contain a human-readable part, this is often only a comment to an opaque unique identifier. E.g. retrocomputing.stackexchange.com/questions/24394/bralum_wrodan+blurum-hata directs to this page just fine. – leftaroundabout Apr 26 '22 at 11:11
  • 8
    @leftaroundabout - Maybe now they aren't "supposed" to be, but I suspect that the original intent was that humans could make sense of them. For comparison, see internet hostnames: there's no requirement for being readable, but implicitly, they were. – dave Apr 26 '22 at 12:14
  • 2
"Several of the characters you listed ,.@!$&% have special meanings in the context of an SMTP header" - Yes, this is exactly the kind of detail I was looking for, thank you! – DeusXMachina Apr 26 '22 at 15:26
  • 1
Are you sure about !$&% having special meaning in SMTP headers? Referencing 3.3. LEXICAL TOKENS in 822, I see no mention of those. Also, that still allows for "-". – DeusXMachina Apr 26 '22 at 17:42
  • 8
    @DeusXMachina, "!" is used in bang path addresses. More generally, once you take the intersection of EBCDIC, ASCII-1967, ASCII-1963, and ISO/IEC 646, subtract out the ISO/IEC 646 combining characters, and subtract out the characters with special meaning in NNTP, UUCP, or email, you're left with a rather small set of characters: ()+*?=-/. – Mark Apr 26 '22 at 23:17