PREDICTION AND ENTROPY OF PRINTED ENGLISH 






formula (6) clearly cannot hold indefinitely, since the total probability $\sum p_n$
must be unity, while $\sum_{n=1}^{\infty} 0.1/n$ is infinite. If we assume (in the absence of any
better estimate) that the formula $p_n = 0.1/n$ holds out to the n at which the



Fig. 1. Relative frequency against rank for English words. [Log-log plot; horizontal axis: word order, 1 to 10,000; vertical axis: relative frequency, with ticks at 0.001, 0.0001, and 0.00001.]



total probability is unity, and that $p_n = 0$ for larger n, we find that the
critical n is the word of rank 8,727. The entropy is then:



$$-\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82 \text{ bits per word,} \qquad (7)$$



or 11.82/4.5 = 2.62 bits per letter, since the average word length in English
is 4.5 letters. One might be tempted to identify this value with $F_{4.5}$, but
actually the ordinate of the $F_N$ curve at $N = 4.5$ will be above this value.
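The truncation and the sum in (7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `truncated_zipf_entropy` and its parameters `k` and `mean_word_len` are hypothetical, and "the n at which the total probability is unity" is read here as the first rank at which the running sum of $0.1/n$ reaches or exceeds 1. Direct summation under that reading need not reproduce the printed figures exactly, so no expected output is quoted.

```python
import math


def truncated_zipf_entropy(k=0.1, mean_word_len=4.5):
    """Truncate p_n = k/n at the first rank where the cumulative
    probability reaches 1 (an assumed reading of the cutoff rule).

    Returns (cutoff_rank, bits_per_word, bits_per_letter).
    """
    total = 0.0    # running sum of p_n
    entropy = 0.0  # running value of -sum p_n * log2(p_n)
    n = 0
    while total < 1.0:
        n += 1
        p = k / n
        total += p
        entropy -= p * math.log2(p)
    # Per-letter figure: divide by the assumed average word length.
    return n, entropy, entropy / mean_word_len


cutoff, h_word, h_letter = truncated_zipf_entropy()
print(f"cutoff rank: {cutoff}")
print(f"entropy: {h_word:.2f} bits per word, {h_letter:.2f} bits per letter")
```

The last term pushes the total slightly past 1; renormalizing or trimming that term would change the result only marginally.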

The $F_N$ ordinate is higher because $F_4$ or $F_5$ involves groups of four or five letters regardless
of word division. A word is a cohesive group of letters with strong internal



