Protein Structure and Information Content 105 



and the other (I_c) upon the configurations of the polypeptide chain in the
native molecule. Treating sequence and configuration independently should
lead to overestimates of I, since the permissible configurations will depend
upon the sequence. However, care has been taken to reduce the interaction
of the two terms as much as possible, so that for the purposes of this paper no
significant discrepancies should occur.



Sequence. There are twenty amino acids which are most commonly incorporated
into proteins. Therefore the maximum value of I_s is 4.32 bits (log2 20)
per amino acid residue.* It would occur when the twenty amino acids occur
equiprobably. Values less than the maximum would occur due to any constraints
upon the amino acid sequence. Branson (1) calculated I_s of twenty-six
proteins for which the frequency of occurrence of the twenty amino acids had
been determined (disregarding possible sequential dependencies). He found
that those which formed part of a living structure of an organism had an I_s
which was greater than 0.70 of the maximum value. His analysis is shown by
the dots in Fig. 1. The X's show the result of a similar analysis on language
samples. The language study was based on ten paragraphs chosen from diverse
sources such as want ads, newspaper articles, textbooks, and magazines, and
differs from that usually used in analysis of language in that it is based on the
paragraph rather than on large continuous samples.† In this case, letters have
been treated like amino acids and paragraphs like proteins. Except for the
single value of 0.99, the values from proteins and paragraphs agree quite
well.
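The composition-only calculation described above can be sketched in a few lines. This is a minimal illustration, not Branson's procedure: it assumes the ordinary Shannon formula applied to observed symbol frequencies with sequential dependencies ignored, and the function names, the toy residue string, and the choice of 26 as the alphabet size for letters are all hypothetical.

```python
import math
from collections import Counter


def information_per_symbol(symbols):
    """Shannon information in bits per symbol, computed from the observed
    frequency of each symbol and ignoring sequential dependencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over the symbols that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def fraction_of_maximum(symbols, alphabet_size):
    """Ratio of the observed information to the equiprobable maximum,
    log2(alphabet_size)."""
    return information_per_symbol(symbols) / math.log2(alphabet_size)


# Maximum for twenty equiprobable amino acids: about 4.32 bits per residue.
max_bits_per_residue = math.log2(20)

# Hypothetical toy "protein" in one-letter residue codes (not real data).
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
protein_ratio = fraction_of_maximum(toy_protein, 20)

# A paragraph treated like a protein: letters play the role of residues.
# Using 26 as the alphabet size is an assumption made for this sketch.
toy_paragraph = "letters have been treated like amino acids"
letters = [ch for ch in toy_paragraph.lower() if ch.isalpha()]
paragraph_ratio = fraction_of_maximum(letters, 26)
```

Both ratios fall between 0 and 1, so a protein and a paragraph can be placed on the same scale regardless of the size of their alphabets.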



Similarities between the distribution of amino acid frequencies and letters
can be seen further in Fig. 2. There the ordinate indicates the number of
times that a particular normalized frequency occurs; the normalized frequency
is the number of times, n_i, that the ith symbol (either amino acid or letter)
occurs, divided by N/m, the expected number of times that each type of symbol
should occur if all m different kinds of symbols had equiprobable occurrence
in the sample of N symbols. As can be seen in Fig. 2, the distributions of the
normalized frequencies n_i/(N/m) for the letters (solid line) and the amino
acids (shaded area) are almost identical except for the higher incidence of
rarely-used letters in language. This small difference might not have occurred
if some of the rarer amino acids, for which assays are difficult, had been
included in the data.
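The normalized frequency n_i/(N/m) and the tally plotted as the ordinate of Fig. 2 can be sketched as follows. This is an illustration under stated assumptions: the function names and the bin width are hypothetical (the text does not state how the normalized frequencies were grouped), and symbols outside the chosen alphabet are simply dropped before counting.

```python
from collections import Counter


def normalized_frequencies(symbols, alphabet):
    """Return n_i / (N / m) for every symbol in the alphabet, where N is the
    number of symbols in the sample and m the number of symbol kinds."""
    sample = [s for s in symbols if s in alphabet]  # ignore foreign symbols
    n_total = len(sample)            # N
    m_kinds = len(alphabet)          # m
    expected = n_total / m_kinds     # N / m, the equiprobable expectation
    counts = Counter(sample)
    return {a: counts.get(a, 0) / expected for a in alphabet}


def tally_by_bin(values, bin_width=0.25):
    """Count how many normalized frequencies fall in each bin.
    The bin width is an assumption of this sketch."""
    return Counter(int(v // bin_width) for v in values)


# Toy sample over a four-symbol alphabet (hypothetical data).
freqs = normalized_frequencies("AABC", "ABCD")
bins = tally_by_bin(freqs.values(), bin_width=0.5)
```

A value of 1 means the symbol occurs exactly as often as equiprobability predicts; the normalized frequencies of a sample always sum to m, which is what allows samples over alphabets of different size to be compared on one axis.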



Constraints. The fact that the distribution of amino acids in non-structural
proteins deviates from equiprobability about the same as (or possibly a little less
than) the letters in written English indicates that the constraints producing such
unequal frequencies should be of the same order of magnitude as (or slightly
less than) those governing English texts. However, this tells nothing about the



* This value disregards any influence of residue 'complexions'. However, it is difficult to
see how factors other than the identity of the residues can be very important, when one
considers the freedom of rotation of the R-groups with respect to the polypeptide chain.



† It was felt that such a small-sample statistics study was preferable to one based upon large
samples (such as a determination of confidence intervals for I_s as a function of the paragraph
size), since by essentially duplicating the analyses applied to proteins, insight into the
limitations of that procedure could be gained.



