54 THE BELL SYSTEM TECHNICAL JOURNAL, JANUARY 1951 



statistical influences, and consequently the N-grams within words are more
restricted than those which bridge words. The effect of this is that we have
obtained, in 2.62 bits per letter, an estimate which corresponds more nearly
to, say, $F_5$ or $F_6$.



A similar set of calculations was carried out including the space as an
additional letter, giving a 27-letter alphabet. The results of both 26- and
27-letter calculations are summarized below:



             $F_0$
26 letter    4.70
27 letter    4.76
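The zero-order entries can be checked directly, assuming (as is standard, though not restated on this page) that the zero-order value is the base-2 logarithm of the alphabet size, i.e. the entropy when all symbols are independent and equally likely. The function name below is illustrative only.

```python
import math

def zero_order_entropy(alphabet_size: int) -> float:
    """Bits per symbol if every symbol were independent and equally likely."""
    return math.log2(alphabet_size)

f0_26 = zero_order_entropy(26)  # letters only
f0_27 = zero_order_entropy(27)  # letters plus the space

print(f"{f0_26:.3f}")  # 4.700
print(f"{f0_27:.3f}")  # 4.755
```

Both values agree with the tabulated figures to within about 0.01 bit; the 27-letter value computes to roughly 4.755, which the table gives as 4.76.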



The estimate of 2.3 for $F_8$, alluded to above, was found by several methods,
one of which is the extrapolation of the 26-letter series above out to that
point. Since the space symbol is almost completely redundant when sequences
of one or more words are involved, the values of $F_N$ in the 27-letter
case will be $4.5/5.5$, or .818, of $F_N$ for the 26-letter alphabet when $N$ is
reasonably large.
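The scaling factor can be reproduced with a line of arithmetic. The sketch below assumes an average word length of 4.5 letters (the figure implied by the fraction), so that each word occupies 5.5 symbols once its space is counted. If the space carries almost no information, the bits per word are unchanged and are simply spread over one more symbol. The function name is hypothetical, and applying the factor to the 2.3 estimate is purely illustrative arithmetic, not a figure from the text.

```python
def space_scaling_factor(avg_word_letters: float) -> float:
    """Ratio of per-symbol entropy with the space counted to that without it.

    Assumes the space adds one symbol per word and essentially no information.
    """
    return avg_word_letters / (avg_word_letters + 1.0)

factor = space_scaling_factor(4.5)   # assumed average word length, in letters
print(f"{factor:.3f}")               # 0.818

# Illustration only: rescale a 26-letter estimate of 2.3 bits per letter.
print(f"{2.3 * factor:.2f}")         # 1.88
```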



3. Prediction of English 



The new method of estimating entropy exploits the fact that anyone
speaking a language possesses, implicitly, an enormous knowledge of the
statistics of the language. Familiarity with the words, idioms, clichés and
grammar enables him to fill in missing or incorrect letters in proof-reading,
or to complete an unfinished phrase in conversation. An experimental
demonstration of the extent to which English is predictable can be given as follows:
Select a short passage unfamiliar to the person who is to do the predicting.
He is then asked to guess the first letter in the passage. If the guess is correct
he is so informed, and proceeds to guess the second letter. If not, he is told
the correct first letter and proceeds to his next guess. This is continued
through the text. As the experiment progresses, the subject writes down the
correct text up to the current point for use in predicting future letters. The
result of a typical experiment of this type is given below. Spaces were
included as an additional letter, making a 27-letter alphabet. The first line is
the original text; the second line contains a dash for each letter correctly
guessed. In the case of incorrect guesses the correct letter is copied in the
second line.



(1) THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
(2) ----ROO------NOT-V-----I------SM----OBL----



(1) READING LAMP ON THE DESK SHED GLOW ON
(2) REA----------O------D----SHED-GLO--O--



(1) POLISHED WOOD BUT LESS ON THE SHABBY RED CARPET
(2) P-L-S-----O---BU--L-S--O------SH-----RE--C------
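The guessing procedure and its two-line notation can be sketched in a few lines of code. This is a minimal illustration, not the experimental method itself: the human subject is replaced by a hypothetical `toy_guesser` with three hard-coded rules, and the names `reduce_text` and `toy_guesser` are assumptions introduced here. The text is assumed to contain only capital letters and spaces, so a dash is never ambiguous.

```python
from typing import Callable

def reduce_text(text: str, guess: Callable[[str], str]) -> str:
    """Build the second line: '-' for a correct guess, else the true letter."""
    reduced = []
    for i, actual in enumerate(text):
        # The subject sees all of the correct text up to the current point.
        predicted = guess(text[:i])
        reduced.append("-" if predicted == actual else actual)
    return "".join(reduced)

def toy_guesser(known: str) -> str:
    """Hypothetical stand-in for the human subject (three crude rules)."""
    if known.endswith("TH"):
        return "E"                      # complete the common word THE
    if known == "" or known.endswith(" "):
        return "T"                      # guess T at the start of a word
    return " "                          # otherwise guess that the word ends

line1 = "THE ROOM"
line2 = reduce_text(line1, toy_guesser)
print(line1)                            # THE ROOM
print(line2)                            # -H--ROOM
print(line2.count("-") / len(line1))    # 0.375
```

A better predictor, such as a person who knows the language, leaves more dashes in the second line; the fraction of dashes is a direct measure of how predictable the passage was for that predictor.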



