7

I read years ago in a pop science book that written and spoken language can be shown to have a high level of redundancy. The speculation was that this served to allow error correction, because language is encountered in noisy environments and understanding has a high survival value. The figure I remember is that if 35% of the information remains, an attentive listener or reader can still make out the message. This has become a cherished notion in my world view as an indicator that language and human endeavor contain complexity we are not consciously aware of.

I am looking for a knowledgeable confirmation of the idea that spoken or written language contains this redundancy. An impolite smack down due to unexamined confirmation bias is always welcome as well.

Steven Jeuris
timquinn
  • 1
    Our smack downs are always polite. ;) Interesting question. Do you recall the name of the popular text? – Chuck Sherrington Jan 29 '13 at 05:35
  • I think it was called Rules of the Game. It was about self-organizing systems. I have it buried in a box. Amazon has many books under that name; I can't remember the author's name and couldn't find it. – timquinn Jan 29 '13 at 06:27
  • It was "Laws of the Game : How the Principles of Nature Govern Chance" by Manfred Eigen and Ruthild Winkler. It is from 30 years ago, so the idea may have come from some other book I read at the time. – timquinn Jan 29 '13 at 10:05
  • @timquinn Could you please add that extra relevant information into your question if it turns out to be the book you are talking about, also providing a link to it? Thank you. – Steven Jeuris Jan 29 '13 at 10:36
  • This relates to ambiguity in language (since redundancy often reduces ambiguity). Zipf suggested that the speaker wishes for an ambiguous language where they can use one sound to mean everything, and leave difficulty of disambiguation for the listener. The listener, on the other hand, wishes for a totally unambiguous language, so the difficulty of picking the right words is on the speaker, and the listener doesn't need to spend energy on disambiguation. I would expect redundancy to have similar driving forces. Here is the dual of your question. – Artem Kaznatcheev Feb 01 '13 at 04:07
  • I can't comment on your other question so I will leave this here. I have never read this book, but Brian Eno used to talk about it a lot and it was quite well known in the good old modern era. It is Seven Types of Ambiguity from 1930. An exercise in attempting to classify metaphor. http://www.amazon.com/Seven-Types-Ambiguity-William-Empson/dp/081120037X/ref=sr_1_1?ie=UTF8&qid=1359699256&sr=8-1&keywords=7+types+of+ambiguity – timquinn Feb 01 '13 at 06:17
  • Zipf's scheme looks more like the setup for a thought experiment than a description of reality. Embedded redundancy used for error correction would be something that emerged over generations through unconscious experimentation by a whole continent of speakers working to understand each other in life-or-death situations. – timquinn Feb 01 '13 at 07:04
  • (I don't presume to tell you your field, just setting up my point) I suspect Zipf comes from a generation that imagined language being worked out by gray men sitting in libraries. I imagine a hunter shouting over screeching birds and water falling that his partner should look out for that big bear just behind him. He isn't worried about whose responsibility it is to provide the information or decode it. He just wants to save his friend's life and will continue shouting things until he gets noticed. – timquinn Feb 01 '13 at 07:05
  • He will remember what worked and start there next time, even if it is just to ask for the other drumstick. He won't know why what he did worked and he won't be too concerned about it, and it will become part of the larger language because it works. This is, apparently, why pronouns have gender, for example: to add another data point for the listener when trying to decipher a noisy signal. My curiosity was to find out what the field thought about this and to find some author names or key words to search. A lot of the time that is the hardest part for a non-initiate. – timquinn Feb 01 '13 at 07:08
  • Here is a curious thing for you, Steven Jeuris. When I went and looked up Eigen's book on Amazon I saw the cover that I remembered and instantly knew it was not that book, but another, that I had gotten this fact from. I can vaguely recall the cover of the book in question, but that is all. Seeing the cover of Laws of the Game I remember was enough to allow me to know it was that other book. Weird. Hence, I did not add it to the question. – timquinn Feb 01 '13 at 08:15

2 Answers

9

It seems that for written English, the figure is 50%.

From pages 27 to 28 of The Making of Cognitive Science: Essays in Honor of George A. Miller (Cambridge: Cambridge University Press, 1988):

Estimates of redundancy. Shannon (1948, 1951) had himself estimated the redundancy of printed English to be about 50 percent. He had used a technique in which a subject was given a passage of text and then required to guess the next letter until the correct response (i.e., that corresponding to the original text) was given. Redundancy was calculated from the distribution of the numbers of guesses required. Garner and Carson (1960) [...] also estimated the redundancy of printed English to be about 50 percent. Newman and Gerstman (1952) [...] estimated redundancy to be 52 percent.

Uses of redundancy. [...] Chapanis (1954) and Miller and Friedman (1957) both showed that when text was mutilated by deleting different percentages of letters, subjects were able to restore the missing letters with a high degree of accuracy. Such restoration is possible because of redundancy, so these experiments showed that redundancy was useful to humans.

[...]

In summary, printed English is redundant, and thus constrained, both in letter sequences within words and in sequences of words themselves. This redundancy is known to humans, who can use it to reconstruct mutilated text and to recognize and learn words and sequences of words that reflect varying degrees of this constraint. [...]

From page 1086 of A new kind of science by Stephen Wolfram (Wolfram Media, Inc., 2002):

[...] English text typically remains intelligible until about half its characters have been deleted, indicating that it has a redundancy of around 0.5. Most other languages have slightly higher redundancies, making documents in those languages slightly longer than their counterparts in English.
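The deletion experiments quoted above are easy to try informally. The following is only a minimal sketch, not the procedure used by Chapanis or by Miller and Friedman: the function name `mutilate`, the choice to mark hidden letters with an underscore, and the choice to leave whitespace intact are all assumptions made for illustration.

```python
import random


def mutilate(text: str, fraction: float, seed: int = 0, mark: str = "_") -> str:
    """Hide a given fraction of the non-whitespace characters in `text`.

    Hidden characters are replaced by `mark`, so the reader still sees
    where something is missing (an assumption; the original studies may
    have presented the mutilated text differently).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the same text is mutilated the same way
    # Positions of all characters that are eligible for deletion.
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    hidden = set(rng.sample(positions, round(fraction * len(positions))))
    return "".join(mark if i in hidden else ch for i, ch in enumerate(text))


sentence = "Printed English is redundant, and thus constrained."
for fraction in (0.25, 0.5, 0.75):
    print(f"{fraction:.0%} hidden: {mutilate(sentence, fraction)}")
```

Reading the output at increasing fractions gives a rough, subjective feel for the point at which a sentence stops being recoverable; it is not a measurement of redundancy in Shannon's sense.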

  • 1
    Holy wow, Shannon and Wolfram. Nice work. – timquinn Jan 29 '13 at 11:23
  • I read a little about this experiment of Shannon's. Do you think it would hold up to present day standards? I am sure that Wolfram is referring to Shannon. I wonder if there is any current science on the subject. – timquinn Jan 29 '13 at 11:27
  • 1
    It would also be interesting to see whether any work is done on the redundancy of spoken English. It probably better reflects realistic scenarios in which the speculation makes sense (noisy environments). – Steven Jeuris Jan 29 '13 at 12:20
  • Yes, I agree, Steven. It would be more relevant to my question if the research involved spoken English. The Shannon experiment sounds more like solving a crossword puzzle, a very conscious act. – timquinn Jan 29 '13 at 13:45
  • I am going to give Joel the green check. I am sure you have nailed the source of my initial exposure to this and I thank you for that. That it turns out to be Claude Shannon is very interesting, because I realized that the notion has become a sort of received wisdom that has been repeated by a lot of pop and legitimate science writers. His work was about written language, though, so it does not actually support my long-held belief in instantaneous error correction. I am going to formulate another question. – timquinn Jan 30 '13 at 06:07
  • @timquinn You partially got this answer because you didn't phrase your question that clearly. Where I interpreted the 'speculation' of the usefulness in noisy environments to be an important aspect, others did not. Remember, keep your questions concise, focused and clear. To better differentiate from your newly asked question, please update this one. – Steven Jeuris Jan 30 '13 at 09:30
  • Well, for now I did it for you ... Remember it the next time you ask a question. ;p It will get you clearer/narrowed down answers. – Steven Jeuris Jan 30 '13 at 09:36
0

The fact that you can tell when I use "it's" or "its" incorrectly is confirmation of redundancy. If there were no context telling you which it should be, you wouldn't know; and if there is context telling you which it should be, there is no need to write the apostrophe because you already know what the writer intended.

That much is fairly obvious. Data compression gives another insight. Standard text compression typically shrinks text by much more than 50%. If I compress your question with "gzip" (a standard compression utility), it goes down from 770 bytes to 440 bytes (57% of the original). This is fairly bad, but it is with zero pre-existing knowledge -- something humans have loads of when processing text.

If I prepend what is currently on Wikipedia's front page (some text about "Suillus salmonicolor", a fungus), compressing your question takes only an additional 378 bytes (49% of the original), probably because the compressor can replace common words like "and" with a very short code without having to define it first.
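For anyone who wants to repeat this kind of measurement, here is a rough sketch in Python. It is not the exact procedure used above: it uses the standard `zlib` module rather than the `gzip` command (so header sizes differ and the byte counts will not match), and it supplies the prior text as a DEFLATE preset dictionary instead of literally prepending it. The function name `compressed_size` and the sample strings are made up for illustration.

```python
import zlib


def compressed_size(text: str, prior: str = "") -> int:
    """Return the DEFLATE-compressed size of `text` in bytes.

    If `prior` is given, it is handed to the compressor as a preset
    dictionary, so `text` can refer back to it without paying for it.
    """
    data = text.encode("utf-8")
    if prior:
        compressor = zlib.compressobj(level=9, zdict=prior.encode("utf-8"))
    else:
        compressor = zlib.compressobj(level=9)
    return len(compressor.compress(data) + compressor.flush())


# Placeholder inputs; substitute the real question text and any prior text.
question = (
    "I read years ago in a pop science book that written and spoken "
    "language can be shown to have a high level of redundancy."
)
prior = (
    "Some unrelated English text that happens to share common words such "
    "as the, and, that, of, in, a and to with the text being compressed."
)

original = len(question.encode("utf-8"))
for label, size in (("no prior", compressed_size(question)),
                    ("with prior", compressed_size(question, prior))):
    print(f"{label}: {size} of {original} bytes ({size / original:.0%})")
```

The more the prior text resembles the text being compressed, the smaller the second figure should get, which is a crude stand-in for the background knowledge a human reader brings.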

It doesn't end there either. More redundancy could be removed: writing words in conjugated form is almost always unnecessary, as are words like "she" in the phrase "Sarah went to the store when she needed something". "Sarah went to the store when needed something" gets the message across; or at least "he" or "it" (both a character shorter) would suffice.

I have often wondered how much redundancy there is in languages, both in their basic form (sentences as an abstract concept) and in their representations (written characters or spoken sounds). I've come to terms with it because it must be there for error correction. I probably wouldn't catch half of what my aunt is saying at the Christmas dinner table if not for that.

A related experiment that ran in 2013 looked at how much of a sentence you can hide before it becomes unreadable: https://lucb1e.com/rp/js/read.html For me, more than half can be hidden with no problems at all, meaning the character shapes that we use are >50% redundant as well, at least when combined with the redundancy in words themselves (because an h of which the top is hidden looks like an n, but the word probably only makes sense with one of the two).

Luc