ERWIN CHARGAFF



by us to preparations from calf thymus [14] and Escherichia coli protoplasts [15].



THE PROBLEM OF STATISTICAL SEQUENCE ANALYSIS 



The possibility that no two nucleic acid molecules within the same
nucleus are entirely identical offers "a prospect that would seem to condemn
us to forced statistics for life" [16]. It is this more than anything else that
has made work on the nucleotide sequence in nucleic acids appear so
unattractive. Who would, after all, undertake to read a book that has been
passed through a grinder? Nevertheless, being rather modest in what I
expected to gain from a perusal of the nucleic acid text, I have never been
able to share these apprehensions completely. I knew that a great deal can
be learned about an unknown language through a study of its phonemes,
their frequency, distribution density, and allophonic relationships.



If the total deoxyribonucleic acid of a given species represents a text,
it is made up of "words" (the individual molecules) that are composed
of a singularly meagre alphabet: four or five letters. But the words so
spelled out are 10 000-letter words, each of which could occur in a
fantastically great number of positional isomers: between 10^[…] and 10^[…],
according to how many restrictions on neighbours are admitted [3]. The
situation facing us in examining a nucleic acid preparation comprising a
large number of isomers or homologues would, then, be comparable to
one in which all the words in a dictionary are lined up end to end in a
continuous, and essentially arrhythmic and aperiodic, sequence.
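
To give a sense of scale, the following minimal sketch counts the sequences of 10 000 letters that can be written with an alphabet of four or five letters when no restrictions on neighbours are imposed. It covers the unrestricted case only and does not attempt to reproduce the figures cited from [3]; the function name and constant are illustrative assumptions.

```python
import math

WORD_LENGTH = 10_000  # letters per "word", as stated in the text


def log10_sequence_count(alphabet_size: int, length: int) -> float:
    """Return log10 of alphabet_size ** length (unrestricted sequences)."""
    return length * math.log10(alphabet_size)


for letters in (4, 5):
    exponent = log10_sequence_count(letters, WORD_LENGTH)
    # Print the order of magnitude as a power of ten.
    print(f"{letters} letters: about 10^{exponent:.1f} unrestricted sequences")
```

Without restrictions the count is on the order of 10^6020 for four letters and 10^6990 for five; any restriction on neighbours can only lower these values.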



It is quite clear that the first attempt at unravelling such a clutter will
have to be based on statistics and that it must limit itself to the description
of tendencies or trends of arrangement. To give an example: running
together the thirteen words making up the first sentence of King Lear I
obtain a monster word of fifty-seven letters of which twenty-one are
vowels. On this word a number of determinations can be made: (a) the
ratio of consonants to vowels; (b) the nature of the individual consonants
and vowels; (c) the relative frequency of each constituent. If I have a
way of removing the vowels without disturbing the rest of the
arrangement, I shall isolate six solitary consonants, eight pairs of consonants,
three bunches of triple consonants and one cluster of five consonants in a
row. Each of these units, I would conclude, was originally flanked on
both sides by vowels. Other words would yield other combinations, with
the unambiguity of distinction increasing with the length of the consonant
clusters. In very long words composed of only two vowels and two or
three consonants, unique clusters can be expected only very rarely; but
the relative frequency of the various combinations of consonants (runs of
1, 2, 3, etc.) will be a means of unique differentiation, even though it will
not yet make it possible to reconstruct the entire text.
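
The King Lear example can be retraced with a short sketch. Two assumptions are made that the text does not state: the sentence is taken as it appears in standard editions of the play, and "y" is counted as a vowel, which is what yields twenty-one vowels. The function names are illustrative.

```python
import re
from collections import Counter

# Assumption: opening sentence of King Lear as given in standard editions.
SENTENCE = "I thought the King had more affected the Duke of Albany than Cornwall."
# Assumption: "y" is treated as a vowel, which gives the count of twenty-one.
VOWELS = "aeiouy"


def monster_word(sentence: str) -> str:
    """Run the words together: keep letters only, in lower case."""
    return "".join(ch for ch in sentence.lower() if ch.isalpha())


def consonant_runs(word: str) -> list[str]:
    """Remove the vowels and return the consonant clusters left behind."""
    return [run for run in re.split(f"[{VOWELS}]+", word) if run]


def run_length_counts(word: str) -> Counter:
    """Count how many consonant clusters there are of each length."""
    return Counter(len(run) for run in consonant_runs(word))


word = monster_word(SENTENCE)
n_vowels = sum(ch in VOWELS for ch in word)
print(f"letters: {len(word)}, vowels: {n_vowels}")
for length, count in sorted(run_length_counts(word).items()):
    print(f"runs of {length}: {count}")
```

Under these assumptions the sketch gives fifty-seven letters, twenty-one vowels, and six single consonants, eight pairs, three triplets and one run of five. Applying `run_length_counts` to other texts yields the kind of run-frequency profile described above, without recovering the order of the runs.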



