Prediction and Entropy of Printed English 



By C. E. SHANNON 



(Manuscript Received Sept. 15, 1950)



A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.



1. Introduction 



In a previous paper^ the entropy and redundancy of a language have been defined. The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language. The redundancy, on the other hand, measures the amount of constraint imposed on a text in the language due to its statistical structure, e.g., in English the high frequency of the letter E, the strong tendency of H to follow T or of U to follow Q. It was estimated that when statistical effects extending over not more than eight letters are considered the entropy is roughly 2.3 bits per letter, the redundancy about 50 per cent.
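The relation between entropy and redundancy described above can be sketched in a few lines of Python (a modern illustration, not part of the original paper; the two-symbol distribution used below is purely hypothetical):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

def redundancy(h, alphabet_size):
    """Fraction of the maximum possible entropy, log2(alphabet_size),
    that the language's statistical structure leaves unused."""
    return 1 - h / log2(alphabet_size)

# A biased two-symbol source as a toy example: the more skewed the
# distribution, the lower the entropy and the higher the redundancy.
h = entropy([0.9, 0.1])   # about 0.47 bits per symbol
r = redundancy(h, 2)      # about 0.53, i.e. 53 per cent
```

For English text the same calculation would use the letter (and space) frequencies, with an alphabet size of 27; the paper's figure of about 50 per cent redundancy corresponds to an entropy near half of log2(27) ≈ 4.75 bits per letter.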



Since then a new method has been found for estimating these quantities, which is more sensitive and takes account of long range statistics, influences extending over phrases, sentences, etc. This method is based on a study of the predictability of English: how well can the next letter of a text be predicted when the preceding N letters are known? The results of some experiments in prediction will be given, and a theoretical analysis of some of the properties of ideal prediction. By combining the experimental and theoretical results it is possible to estimate upper and lower bounds for the entropy and redundancy. From this analysis it appears that, in ordinary literary English, the long range statistical effects (up to 100 letters) reduce the entropy to something of the order of one bit per letter, with a corresponding redundancy of roughly 75%. The redundancy may be still higher when structure extending over paragraphs, chapters, etc. is included. However, as the lengths involved are increased, the parameters in question become more


^ C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, v. 27, pp. 379-423, 623-656, July, October, 1948.






