THE PROTEIN TEXT 



Martynas Ycas 



Department of Microbiology, State University of New York 

 Upstate Medical Center, Syracuse, New York 



And strange to tell, among that Earthen Lot 

 Some could articulate, while others not: 



And suddenly one more impatient cried — 

 'Who is the Potter, pray, and who the Pot?'



The Book of Pots 



Abstract — The sequence of residues in proteins, regarded as a text written in a twenty-symbol

 alphabet, is examined. The following tentative conclusions are drawn: 



1. Twenty amino acids are distinguished by the protein-forming mechanism.

 Supernumerary amino acids arise from the regular twenty by secondary modification of

 protein-bound residues.



2. Each residue in the protein has a separate genetic representation. 



3. There is no intersymbol correlation between adjacent residues. 



4. Natural selection is not the only factor determining the frequency of occurrence of the 

 various kinds of residues. It is suggested that the method of encoding protein sequence 

 information in nucleic acid imposes differences in frequency of occurrence on the different 

 kinds of residues. 



5. Peptide chains are not multiples of some fixed number of residues. 



The encoding and transfer of genetic (DNA) information to RNA and protein is discussed, 

 as well as the problem of the independent reproduction of RNA viruses. While the data set 

 certain limits on the possible ways of encoding and transferring information, they are not 

 sufficient for a unique solution of these problems. 



Ribonucleic acid of Tobacco Mosaic Virus (TMV) has been shown to determine

 the sequence of amino acid residues in the protein of the virus (1, 2, 3).

 It seems logical therefore to believe that the sequence of other proteins is also 

 determined by RNA.* 



Since RNA is essentially a linear sequence of four kinds of nucleotides, 

 while proteins are linear sequences of about twenty kinds of amino acid residues, 

 the RNA molecule can be regarded as a text, written in a four-symbol alphabet, 

 which encodes another text, the protein, written with about twenty symbols. 
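
The arithmetic behind this two-alphabet picture can be sketched as follows. The sketch assumes that each residue is specified by a fixed-length, non-overlapping group of nucleotides, which is only one of the possible encoding schemes and is not asserted by the text; the function name min_codeword_length is hypothetical and introduced here for illustration only.

```python
def min_codeword_length(source_symbols: int, target_symbols: int) -> int:
    """Smallest n such that source_symbols ** n >= target_symbols.

    Assumption (not from the text): codewords have a fixed length and
    do not overlap, so n source symbols can distinguish at most
    source_symbols ** n target symbols.
    """
    if source_symbols < 2 or target_symbols < 1:
        raise ValueError("need at least 2 source symbols and 1 target symbol")
    n = 1
    while source_symbols ** n < target_symbols:
        n += 1
    return n


# Number of distinct nucleotide groups of length 1, 2 and 3.
for n in range(1, 4):
    print(n, 4 ** n)  # prints: 1 4, then 2 16, then 3 64

# Four nucleotides, about twenty amino acid residues.
print(min_codeword_length(4, 20))  # prints 3
```

Under this assumption, pairs of nucleotides (16 combinations) fall short of twenty symbols, while triplets (64 combinations) are more than sufficient. Overlapping or variable-length schemes would change this count.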



* The following abbreviations will be employed. RNA — ribonucleic acid; DNA — deoxyribonucleic

 acid; Ad — adenylic acid; Gu — guanylic acid; Cy — cytidylic acid; Ur — uridylic

 acid; ala — alanine; arg — arginine; asp — aspartic acid; aspn — asparagine; asx — aspartic acid

 or asparagine; cys — cysteine; glu — glutamic acid; glun — glutamine; glx — glutamic acid or 

 glutamine; gly — glycine; his — histidine; ileu — isoleucine; leu — leucine; lys — lysine; met — 

 methionine; phe — phenylalanine; pro — proline; ser — serine; thr — threonine; try — tryptophan;

 tyr — tyrosine; val — valine; Hlys — hydroxylysine; Hpro — hydroxyproline; serP —

 phosphoserine. Peptides are written with the amino group to the left, the symbols being 

 connected by a dash ( — ). The sign (*) signifies a terminal residue. Sequences considered 

 uncertain are in parentheses ( ). Symbols in parentheses separated by commas, as in (ala, gly),

 mean that the sequence is not known.






