56 THE BELL SYSTEM TECHNICAL JOURNAL, JANUARY 1951 



first line is the original text and the numbers in the second line indicate the 

 guess at which the correct letter was obtained. 



(1)  T  H  E  R  E     I  S     N  O     R  E  V  E  R  S  E     O  N     A     M  O  T  O  R  C  Y  C  L  E     A
(2)  1  1  1  5  1  1  2  1  1  2  1  1 15  1 17  1  1  1  2  1  3  2  1  2  2  7  1  1  1  1  4  1  1  1  1  1  3  1

(1)  F  R  I  E  N  D     O  F     M  I  N  E     F  O  U  N  D     T  H  I  S     O  U  T
(2)  8  6  1  3  1  1  1  1  1  1  1  1  1  1  1  6  2  1  1  1  1  1  1  2  1  1  1  1  1  1

(1)  R  A  T  H  E  R     D  R  A  M  A  T  I  C  A  L  L  Y     T  H  E     O  T  H  E  R     D  A  Y
(2)  4  1  1  1  1  1  1 11  5  1  1  1  1  1  1  1  1  1  1  1  6  1  1  1  1  1  1  1  1  1  1  1  1  1   (9)



Out of 102 symbols the subject guessed right on the first guess 79 times, 

 on the second guess 8 times, on the third guess 3 times, on the fourth and fifth

 guesses 2 times each, and only eight times required more than five guesses. Results

 of this order are typical of prediction by a good subject with ordinary literary 

 English. Newspaper writing, scientific work and poetry generally lead to 

 somewhat poorer scores. 
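The tally quoted above can be checked directly against the sample. The following is a minimal sketch in Python: the three rows are the guess numbers transcribed from the (2) lines of the sample, and the names ROW_1, ROW_2, ROW_3 and tally are illustrative only and do not come from the paper.

```python
from collections import Counter

# Guess numbers transcribed from the three (2) rows of the sample,
# one entry per symbol of text (spaces included).
ROW_1 = "1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1 3 1"
ROW_2 = "8 6 1 3 1 1 1 1 1 1 1 1 1 1 1 6 2 1 1 1 1 1 1 2 1 1 1 1 1 1"
ROW_3 = "4 1 1 1 1 1 1 11 5 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1 1 1 1 1 1 1 1 1 1 1"

guesses = [int(tok) for tok in " ".join([ROW_1, ROW_2, ROW_3]).split()]


def tally(guess_numbers, cutoff=5):
    """Count how often each guess number 1..cutoff occurs, and how often
    more than `cutoff` guesses were needed (stored under the key "more")."""
    counts = Counter(guess_numbers)
    table = {g: counts.get(g, 0) for g in range(1, cutoff + 1)}
    table["more"] = sum(c for g, c in counts.items() if g > cutoff)
    return table


if __name__ == "__main__":
    print(len(guesses), tally(guesses))
```

Run on the transcribed rows, the tally should agree with the figures in the text: 102 symbols in all, with the counts for guesses one through five and a final bucket for more than five guesses.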



The reduced text in this case also contains the same information as the 

 original. Again utilizing the identical twin we ask him at each stage to guess 

 as many times as the number given in the reduced text and recover in this 

 way the original. To eliminate the human element here we must ask our 

 subject, for each possible N-gram of text, to guess the most probable next

 letter, the second most probable next letter, etc. This set of data can then 

 serve both for prediction and recovery. 
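The role of such a table in both directions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the experiments: the human subject is replaced by a hypothetical table of letter counts gathered from some training text, ties are broken alphabetically so that the ordering is fully determined, and the names build_ranker, reduce_text and recover_text are invented for the example.

```python
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 26 letters and the space


def build_ranker(training_text, n):
    """Stand-in for the subject (an assumption): for each context of up to n
    preceding letters, order the 27 symbols from most to least frequent in
    the training text. Ties are broken alphabetically."""
    counts = defaultdict(Counter)
    for i, ch in enumerate(training_text):
        counts[training_text[max(0, i - n):i]][ch] += 1

    def ranked(context):
        seen = counts.get(context, Counter())
        return sorted(ALPHABET, key=lambda s: (-seen[s], ALPHABET.index(s)))

    return ranked


def reduce_text(text, ranked, n):
    """Replace each letter by the number of the guess at which it is reached."""
    return [ranked(text[max(0, i - n):i]).index(ch) + 1
            for i, ch in enumerate(text)]


def recover_text(reduced, ranked, n):
    """The 'identical twin': replay the same ordering to rebuild the text."""
    out = ""
    for guess_number in reduced:
        context = out[max(0, len(out) - n):]
        out += ranked(context)[guess_number - 1]
    return out


if __name__ == "__main__":
    ranked = build_ranker("THERE IS NO REVERSE ON A MOTORCYCLE A FRIEND OF MINE", 2)
    reduced = reduce_text("THE OTHER DAY", ranked, 2)
    assert recover_text(reduced, ranked, 2) == "THE OTHER DAY"
```

Recovery works for any text over the 27-symbol alphabet, whether or not the table predicts it well, because the encoder and decoder consult the same fixed ordering at every step. The quality of the table affects only how often the symbol 1 appears in the reduced text.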



Just as before, the reduced text can be considered an encoded version of 

 the original. The original language, with an alphabet of 27 symbols, A, 

 B, • • • , Z, space, has been translated into a new language with the alphabet

 1, 2, • • • , 27. The translating has been such that the symbol 1 now has an 

 extremely high frequency. The symbols 2, 3, 4 have successively smaller 

 frequencies and the final symbols 20, 21, • • • ,27 occur very rarely. Thus the 

 translating has simplified to a considerable extent the nature of the statistical

 structure involved. The redundancy which originally appeared in complicated

 constraints among groups of letters has, by the translating process,

 been made explicit to a large extent in the very unequal probabilities of the 

 new symbols. It is this, as will appear later, which enables one to estimate 

 the entropy from these experiments. 
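One simple way to see how unequal the new symbol probabilities are is to compute the entropy of their single-symbol frequencies and compare it with the log2 27 bits per symbol of 27 equally likely symbols. The Python sketch below does this for the 102-symbol sample given earlier. It is only an illustration on a very small sample; how such a figure relates to the entropy estimates is left to the later discussion. The names first_order_entropy and sample_counts are invented for the example.

```python
import math


def first_order_entropy(counts):
    """Entropy, in bits per symbol, of the relative frequencies in `counts`
    (a mapping from symbol to number of occurrences)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


# Occurrences of each new symbol in the 102-symbol sample:
# 79 ones, 8 twos, 3 threes, 2 fours, 2 fives, and the eight larger
# guess numbers 6, 6, 6, 7, 8, 11, 15, 17.
sample_counts = {1: 79, 2: 8, 3: 3, 4: 2, 5: 2, 6: 3,
                 7: 1, 8: 1, 11: 1, 15: 1, 17: 1}

h_reduced = first_order_entropy(sample_counts)
h_uniform = math.log2(27)  # 27 equally likely symbols

if __name__ == "__main__":
    print(f"reduced text: {h_reduced:.2f} bits per symbol")
    print(f"uniform over 27 symbols: {h_uniform:.2f} bits per symbol")
```

The single-symbol figure ignores any dependence between successive symbols of the reduced text, so it should be read as a description of the frequency table alone.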



In order to determine how predictability depends on the number N of 

 preceding letters known to the subject, a more involved experiment was 

 carried out. One hundred samples of English text were selected at random

 from a book, each fifteen letters in length. The subject was required to guess 

 the text, letter by letter, for each sample as in the preceding experiment. 

 Thus one hundred samples were obtained in which the subject had available 

 0, 1, 2, 3, • • • , 14 preceding letters. To aid in prediction the subject made 

 such use as he wished of various statistical tables, letter, digram and trigram 



