18 



Statistical Relations in the Amino Acid Order of Escherichia coli Protein 1



HAROLD J. MOROWITZ 



Biophysics Department, Yale University 



Since proteins consist largely of chains of amino acids, a formal analogy is
suggested between protein structure and written language, which consists of
chains of letters. The analogy breaks down when we come to consider function.
For the function (meaning) of language is completely determined by the
sequence of letters, whereas the function of proteins depends on the secondary
structure (coiling) as well as on the sequence. (In poetry, secondary
structure is extremely important.)



One characteristic of both proteins and language is the nonrandom frequency
of occurrence of letters. Thus if we examine a long passage written in English
we get a rank frequency distribution represented by figure 1, which also shows
the rank frequency distribution of amino acids in Escherichia coli protein.
Further examination of language reveals certain high-frequency pairs, triplets,
and higher groupings of letters [1, 2].
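The two kinds of count described above can be illustrated with a minimal sketch. It assumes the text (or a protein written in one-letter symbols) is available as a plain string. The function names `rank_frequency` and `pair_enrichment` and the sample string are hypothetical and chosen only for illustration. The independence baseline used for pairs is one simple choice of expectation, not a method taken from this paper.

```python
from collections import Counter


def rank_frequency(sequence):
    """Return (rank, symbol, relative frequency) tuples, most frequent first."""
    counts = Counter(sequence)
    total = sum(counts.values())
    # Sort by descending count; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, sym, n / total)
            for rank, (sym, n) in enumerate(ranked, start=1)]


def pair_enrichment(sequence):
    """Ratio of observed adjacent-pair counts to the count expected if
    neighbouring symbols were independent (assumed baseline).

    Expects a string; a ratio well above 1 marks an over-represented pair,
    well below 1 an under-represented one. Pairs that never occur are omitted.
    """
    singles = Counter(sequence)
    pairs = Counter(zip(sequence, sequence[1:]))
    n = len(sequence)
    n_pairs = n - 1
    ratios = {}
    for (a, b), observed in pairs.items():
        expected = (singles[a] / n) * (singles[b] / n) * n_pairs
        ratios[a + b] = observed / expected
    return ratios


if __name__ == "__main__":
    sample = "thequickbrownfoxjumpsoverthelazydogthethe"  # hypothetical input
    for rank, symbol, freq in rank_frequency(sample)[:5]:
        print(rank, symbol, round(freq, 3))
    top = sorted(pair_enrichment(sample).items(), key=lambda kv: -kv[1])[:5]
    print(top)
```

For short sequences the ratios are dominated by sampling noise, so a sketch of this kind is informative only when the sequence is long compared with the number of possible pairs.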



A further feature shows up on inspection of amino acid composition of protein.
The nonrandom distribution of amino acids that is apparent in over-all
collections of proteins arises from a similar nonrandomness in individual
proteins. There are notable exceptions to these relations, however,
particularly in structural proteins like silk and collagen.



A further question suggests itself. In proteins, are there pairs, triplets, and
higher amino acid sequences that occur with unexpectedly high or low
frequencies? That is, are there laws, similar to the laws of language, governing
the ordering of amino acids in peptide chains? These statistical relations could
be the result of thermodynamic stability, of evolutionary selection, or of the



1 This research aided by a grant from the United States Public Health Service. 






