


erratic and uncertain, and they depend more critically on the type of text involved.



2. Entropy Calculation from the Statistics of English 



One method of calculating the entropy H is by a series of approximations F_0, F_1, F_2, ..., which successively take more and more of the statistics of the language into account and approach H as a limit. F_N may be called the N-gram entropy; it measures the amount of information or entropy due to statistics extending over N adjacent letters of text. F_N is given by



F_N = -\sum_{i,j} p(b_i, j) \log_2 p_{b_i}(j)
    = -\sum_{i,j} p(b_i, j) \log_2 p(b_i, j) + \sum_i p(b_i) \log_2 p(b_i)        (1)



in which: b_i is a block of N-1 letters [(N-1)-gram];
j is an arbitrary letter following b_i;
p(b_i, j) is the probability of the N-gram b_i, j;
p_{b_i}(j) is the conditional probability of letter j after the block b_i, and is given by p(b_i, j)/p(b_i).



The equation (1) can be interpreted as measuring the average uncertainty (conditional entropy) of the next letter j when the preceding N-1 letters are known. As N is increased, F_N includes longer and longer range statistics and the entropy, H, is given by the limiting value of F_N as N → ∞:

H = \lim_{N \to \infty} F_N.        (2)
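The decomposition in (1), written as the entropy of the N-gram distribution minus the entropy of the (N-1)-gram distribution, translates directly into a small computation. The sketch below estimates both distributions from raw counts in a sample of text; the function names and the toy sample are illustrative, not from the paper.

    from collections import Counter
    from math import log2

    def block_entropy(text, n):
        """Entropy in bits of the distribution of n-letter blocks in text."""
        if n == 0:
            return 0.0                      # no preceding block: zero entropy
        blocks = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(blocks.values())
        return -sum((c / total) * log2(c / total) for c in blocks.values())

    def f_n(text, n):
        """N-gram entropy F_N of equation (1): conditional entropy of the next
        letter given the preceding n-1 letters, estimated from the sample."""
        return block_entropy(text, n) - block_entropy(text, n - 1)

    sample = "itwasthebestoftimesitwastheworstoftimes"   # toy sample, letters only
    print(f_n(sample, 1), f_n(sample, 2))                # rough F_1 and F_2 estimates

Estimates taken from a short passage scatter widely; the values quoted below rest on published frequency tables rather than a single sample.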



The N-gram entropies F_N for small values of N can be calculated from standard tables of letter, digram and trigram frequencies.² If spaces and punctuation are ignored we have a twenty-six letter alphabet and F_0 may be taken (by definition) to be log_2 26, or 4.7 bits per letter. F_1 involves letter frequencies and is given by



F_1 = -\sum_{i=1}^{26} p(i) \log_2 p(i) = 4.14 bits per letter.        (3)
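A quick check of (3) with a commonly quoted letter-frequency table (illustrative round figures, not necessarily the table behind the 4.14 value) lands close to the same number:

    from math import log2

    # Approximate relative frequencies (percent) of the 26 letters, taken from
    # commonly quoted English frequency tables; illustrative values only.
    letter_freq = {
        'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
        'q': 0.10, 'z': 0.07,
    }

    total = sum(letter_freq.values())
    f1 = -sum((f / total) * log2(f / total) for f in letter_freq.values())
    print(f"F_1 = {f1:.2f} bits per letter")   # about 4.18 with these figures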



The digram approximation F_2 gives the result

F_2 = -\sum_{i,j} p(i, j) \log_2 p_i(j)
    = -\sum_{i,j} p(i, j) \log_2 p(i, j) + \sum_i p(i) \log_2 p(i)        (4)
    = 7.70 - 4.14 = 3.56 bits per letter.
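Since the single-letter probabilities are the marginals of the digram table, equation (4) needs nothing beyond that table. A minimal sketch under that reading (names and the toy counts are illustrative; a published digram table would be needed to reproduce the 7.70 - 4.14 = 3.56 figure):

    from collections import defaultdict
    from math import log2

    def f2_from_digram_counts(digram_counts):
        """Equation (4): F_2 = digram entropy minus single-letter entropy, with
        the letter distribution taken as the marginal of the digram table."""
        total = sum(digram_counts.values())
        joint = {dg: c / total for dg, c in digram_counts.items()}

        marginal = defaultdict(float)              # p(i) = sum over j of p(i, j)
        for dg, p in joint.items():
            marginal[dg[0]] += p

        h_digram = -sum(p * log2(p) for p in joint.values())
        h_letter = -sum(p * log2(p) for p in marginal.values())
        return h_digram - h_letter

    sample = "theraininspainstaysmainlyintheplain"      # toy sample, letters only
    counts = defaultdict(int)
    for a, b in zip(sample, sample[1:]):
        counts[a + b] += 1
    print(f"F_2 estimate: {f2_from_digram_counts(counts):.2f} bits per letter")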



² Fletcher Pratt, "Secret and Urgent," Blue Ribbon Books, 1942.



