Protein Structure and Information Content 107 



q and u) no significant results were detected. Thus it can be concluded that 

 such analyses do not exclude intersymbol influences of the same type or order 

 of magnitude as those in language.* 



Gamow, Rich, and Ycas (5) have made a more exacting study of possible 

 inter-symbol influences affecting amino acids. They treated the known amino 

 acids as a series of dipeptides which they tallied into a 20 X 20 matrix similar to 

 the 26 X 26 digram matrices common in language analyses. The distribution for

 nonstructural proteins in such a 20 X 20 matrix followed quite closely a 

 Poisson distribution. This they state is compatible with the assumption that 

 the occurrence of a given amino acid does not affect the identity of its nearest 

 neighbor. Their comparable analysis for English language gave a distribution 

 which deviated from a Poisson. 
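
The dipeptide tally described above can be sketched as follows. This is a minimal illustration, not the procedure of Gamow, Rich, and Ycas: the one-letter amino acid codes, the function names, and the use of a single Poisson mean for every cell (the average cell density) are all assumptions made for the example, and the passage does not say how unequal amino acid frequencies were handled.

```python
import math
from collections import Counter

# One-letter codes for the 20 amino acids (modern notation; an assumption here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_table(sequences, alphabet=AMINO_ACIDS):
    """Tally adjacent pairs within each sequence into an n x n table.

    Pairs containing a symbol outside `alphabet` are ignored.
    Empty cells are kept so the table always has len(alphabet)**2 entries.
    """
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return {(a, b): counts.get((a, b), 0) for a in alphabet for b in alphabet}


def occupancy_histogram(table):
    """Return [cells with 0 pairs, cells with 1 pair, cells with 2 pairs, ...]."""
    hist = Counter(table.values())
    return [hist.get(k, 0) for k in range(max(hist) + 1)]


def poisson_expected(n_cells, mean, k_max):
    """Expected number of cells holding k pairs, k = 0..k_max, for a Poisson law."""
    return [
        n_cells * math.exp(-mean) * mean ** k / math.factorial(k)
        for k in range(k_max + 1)
    ]


def compare_with_poisson(sequences, alphabet=AMINO_ACIDS):
    """Pair the observed occupancy histogram with the Poisson expectation."""
    table = dipeptide_table(sequences, alphabet)
    observed = occupancy_histogram(table)
    mean = sum(table.values()) / len(table)  # average cell density
    expected = poisson_expected(len(table), mean, len(observed) - 1)
    return observed, expected
```

Close agreement between the two returned lists is what the passage calls following a Poisson distribution. A formal goodness-of-fit test would be a further step and is not shown.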



The Poisson distribution associated with the amino acid dipeptide analysis 

 is not too significant since the sample of experimentally determined sequences 

 is not necessarily a reliable representation of the bulk of amino acid sequences 

 in nature. As Gamow, Rich, and Ycas point out, their available sample is

 strongly affected by the composition of ACTH, lysozyme and insulin for which 

 the complete sequences have been determined and the shorter sequences from 

 other proteins are biased due to differential bond labilities within the protein 

 which give rise preferentially to certain amino acids occurring as terminal 

 peptides in the sequences isolated. 



It was felt that a possible explanation of the difference noted between 

 digram analysis of letters and amino acids was that amino acids were also 

 grouped into word-like structures but that the average number of symbols 

 per 'word' was different than that found in English. Therefore, separate 

 digram analyses were performed on English words having two to five letters, 

 six to nine letters and those having ten or more. All the samples were selected 

 so that the average cell density in the 26 x 26 matrix was 0.44, the same as 

 that of Gamow, Rich, and Ycas, and these also all showed significant deviations 

 from a Poisson distribution. 
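
A rough sketch of the word-length stratification follows. An average cell density of 0.44 in a 26 X 26 matrix corresponds to about 0.44 x 676, or roughly 297 digrams per sample. The way the sample is capped here (taking digrams in order until the limit is reached), the restriction to digrams inside a word, and all function names are assumptions for illustration; the passage does not describe how the samples were actually drawn.

```python
from collections import Counter

CLASSES = ("2-5", "6-9", "10+")


def length_class(word):
    """Assign a word to one of the three length classes used in the text."""
    n = len(word)
    if 2 <= n <= 5:
        return "2-5"
    if 6 <= n <= 9:
        return "6-9"
    if n >= 10:
        return "10+"
    return None  # one-letter words contain no digram


def digrams_by_class(words, target_density=0.44, n_cells=26 * 26):
    """Tally within-word digrams per length class, capped at the target density."""
    limit = round(target_density * n_cells)  # 297 for a 26 x 26 matrix
    tables = {c: Counter() for c in CLASSES}
    totals = {c: 0 for c in CLASSES}
    for word in words:
        w = word.lower()
        cls = length_class(w)
        if cls is None or not (w.isascii() and w.isalpha()):
            continue  # skip one-letter words and anything not purely a-z
        for pair in zip(w, w[1:]):
            if totals[cls] >= limit:
                break  # this class already has its full sample
            tables[cls][pair] += 1
            totals[cls] += 1
    return tables
```

Each of the three resulting tables can then be compared with a Poisson expectation in the same way as the amino acid matrix.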



Morowitz (6) and some of the Biophysics group at Yale have been investigating

 the possibility that a polypeptide chain is a segment selected from either

 a single or a small number of repeating sequences which are invariant for a 

 given chromosomal complement. The particular segments chosen and the 

 unique fashion in which they are combined and folded would then account 

 for the highly specific properties of the individual proteins. The possibility

 also exists that there was an initial long, or at least restricted, set of sequences 

 from which present day polypeptide sequences have evolved in a manner similar 

 to that by which organisms have evolved. Gamow, Rich, and Ycas (5) have

 pointed out the most striking evidence for a "phylogenetically common ancestral 

 sequence" in their comparison of the A and B chains of insulin, where the 

 same amino acids occur in equivalent positions in both chains four times. 
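
A comparison of two chains position by position can be sketched as below, extended to every relative displacement of one chain against the other. The function names and the toy strings in the usage note are hypothetical; no actual insulin sequence or alignment is reproduced here, and the displacement convention is an assumption of the example.

```python
def matches_at_offset(a, b, offset):
    """Count positions i where a[i] equals b[i + offset].

    `offset` is the displacement of chain `a` along chain `b`;
    positions that fall outside `b` are ignored.
    """
    return sum(
        1
        for i, residue in enumerate(a)
        if 0 <= i + offset < len(b) and b[i + offset] == residue
    )


def superpositions(a, b):
    """Score every displacement that leaves at least one overlapping position.

    Returns (offset, matches) pairs, best score first.
    """
    scored = [
        (offset, matches_at_offset(a, b, offset))
        for offset in range(-(len(a) - 1), len(b))
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `superpositions("ABCD", "XBCX")[0]` gives `(0, 2)`: with no displacement, two positions carry the same symbol. Whether a given number of matches is more than chance would allow depends on chain length and composition, which this sketch does not assess.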



The known sequences containing five amino acids or more (from Table I, 

 ref. 5) were examined for repeating or matching sequences. (This was done by 

 superposing the sequences in all possible permutations.) These data indicate 

 that for proteins from a given species any single repeating sequence must 



* See the discussion by Dr Platt at the end of this paper. 



