108 L. G. AUGENSTINE 



be at least forty amino acid residues or longer. Comparing the sequences 

 of different types of proteins indicated that (a) there is not a master sequence 

 operating among species, or (b) evolution, i.e. amino acid substitution, has 

 been so extensive as to make it undetectable, or (c) the master sequence is 

 200 residues or longer. The additional sequences (for hormones of sub-protein 

 size) cited by Yčas (7) show that short polypeptide sequences with only minor

 amino acid differences do occur in cells of different species. Thus, the occurrence 

 of repeating or a restricted number of amino acid sequences may be an

 explanation of the unequal amino acid frequencies observed.



This possible restriction provides a basis for estimating the minimum 

 value of $I_S$. A single, long, completely determined sequence would provide

 a situation of minimum information content for polypeptides selected from it.

 To select $N$ residues from a sequence of $S$ amino acids would require $< \log_2 S$

 bits to find $N$ and $< \log_2 (S - N)$ bits to determine the starting point; or

 by another selection procedure, $< \log_2 (S - 1)$ bits to find the starting point and

 roughly $\log_2 (S/2)$ bits to determine the end point. Either of these methods of

 selection gives an estimate of the minimum of $I_S$ which is of the order of

 $2 \log_2 S$ bits. This is a very low minimum since according to the best present estimate

 (which is obviously too low) $S \approx 200$ and thus $2 \log_2 S \approx 15$. Therefore,

 the minimum of $I_S/N$ is of the order of 0.1 bit/residue since $N > 100$ for proteins.

 Even if $S$ is found to be $10^6$, $(2 \log_2 S)/N$ will still only be ~ 0.4 bits/residue.

 Thus, the search (6) for long master sequences of amino acids is of considerable 

 interest with respect to information content considerations. 
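
The arithmetic in this paragraph can be checked with a short sketch. It assumes the order-of-magnitude rule of 2 log2 S bits per selected segment, takes N = 100 residues (the lower bound quoted for proteins), and uses S = 200 from the text. The second value, S = 10**6, is an assumption chosen because it reproduces the quoted figure of about 0.4 bits/residue. The function names are hypothetical and not taken from the paper.

```python
import math


def selection_bits(master_length: int) -> float:
    """Order-of-magnitude bits needed to pick a segment out of a master
    sequence of `master_length` residues: about 2 * log2(S)."""
    if master_length < 2:
        raise ValueError("master_length must be at least 2")
    return 2 * math.log2(master_length)


def bits_per_residue(master_length: int, segment_length: int) -> float:
    """Selection cost spread over the N residues of the selected segment."""
    if segment_length < 1:
        raise ValueError("segment_length must be positive")
    return selection_bits(master_length) / segment_length


if __name__ == "__main__":
    # N = 100 is the lower bound the text quotes for proteins.
    # S = 200 is the text's estimate; S = 10**6 is an assumed larger value.
    for s in (200, 10**6):
        print(f"S={s}: {selection_bits(s):.1f} bits, "
              f"{bits_per_residue(s, 100):.2f} bits/residue")
```

With these inputs the sketch gives about 15.3 bits (0.15 bits/residue) for S = 200 and about 39.9 bits (0.40 bits/residue) for S = 10**6, in line with the orders of magnitude stated above.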



Summarizing for $I_S$, we can say that for nonstructural proteins the potential

 information due to the amino acid sequence should be of the order of 0.85-0.95 

 of the possible maximum value. Although the constraints necessary to produce 

 such an effect should be of the same order of magnitude as those in printed 

 English, tests comparing language and the available proteins for which amino 

 acid composition or sequences are known indicate that the constraints operating 

 in the elaboration of proteins are probably different from those associated 

 with language. Further, it seems unlikely that the unequal frequency of amino

 acids in proteins is due to unequal availability of the amino acids in the cellular 

 pool. The possibility that polypeptide chains are segments selected from a 

 single or restricted number of repeating sequences may be an explanation of 

 the unequal frequencies, in which case $I_S$/residue would be close to zero.
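
One way to picture a figure such as "0.85-0.95 of the possible maximum" is as the Shannon entropy of the residue frequencies divided by its maximum for equally frequent residues. The sketch below is only an illustration under stated assumptions: a 20-letter amino acid alphabet, first-order frequencies only (no dependence between neighbouring residues), and a made-up skewed composition. The author's own calculation may differ, and the function names are hypothetical.

```python
import math


def entropy_bits(counts):
    """Shannon entropy, in bits per residue, of a list of residue counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def relative_information(counts, alphabet_size=20):
    """Entropy as a fraction of the maximum, log2(alphabet_size).
    alphabet_size=20 is an assumption (the common amino acids)."""
    return entropy_bits(counts) / math.log2(alphabet_size)


if __name__ == "__main__":
    uniform = [1] * 20            # all residues equally frequent
    skewed = list(range(1, 21))   # hypothetical unequal composition
    print(f"uniform: {relative_information(uniform):.2f}")
    print(f"skewed:  {relative_information(skewed):.2f}")
```

Equal frequencies give a fraction of 1.00, any unequal composition gives less, and a chain built from a single residue type gives zero. This covers only the frequency effect; the restriction to segments of a fixed master sequence discussed above is a separate, much stronger constraint.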



Configuration: With the present state of knowledge the factors affecting

 $I_C$ are much more difficult to assess. The number of states available to a

 polypeptide chain whose bonds retained all of the lability they had as uncombined

 amino acids would be essentially innumerable. In fact, about the only

 configurations ruled out would be those resulting in closure of the chain upon

 itself. However, the D- and L- forms do not both exist in nature and, as has been

 pointed out by Pauling, Corey and Branson (8), the α-C, N and O group

 in the backbone of the polypeptide chain is essentially the planar, resonance

 structure -C(=O)-N-. Other than these primary restrictions the polypeptide

 chain, in the absence of intramolecular or secondary bonding structures, is

 essentially a random structure.



