Technologies for Genome-sequencing Projects 
George M. Church, Ph.D. — Assistant Investigator 
Dr. Church is also Assistant Professor of Genetics at Harvard Medical School. He received his B.A. degree in 
zoology and chemistry from Duke University and his Ph.D. degree in biochemistry and molecular biology 
from Harvard University. Before moving to Harvard Medical School, Dr. Church was a scientist at Biogen 
Research Corporation and a Life Sciences Research Foundation Fellow in the Department of Anatomy 
at the University of California, San Francisco. 
The study of the linear sequence of bases in
genomic DNA and messenger RNA is steadily 
gaining recognition, in part as a result of the in- 
creasing ease and advantage of using shared com- 
puter databases to find connections among dis- 
tant concepts and distant biological systems. For 
example, connections have been found between 
human oncogenes and yeast transcription factors, 
between differentiation antigens and bacterial 
chaperone proteins, and between developmental 
regulatory genes and bacterial DNA-binding 
proteins. 
Unfortunately, these database searches are fre- 
quently unsuccessful, because not all classes of 
genetic elements are represented. The complete 
sequence of a few small genomes should rectify 
this. Sequencing projects have begun for genomes
of various bacteria (Mycoplasma, Mycobacteria,
Escherichia, and Thermococcus), a
yeast (Saccharomyces), a plant (Arabidopsis),
and a worm (Caenorhabditis), chosen for their
well-studied genetics, their small genome sizes, 
and their representation of all major branches of 
the evolutionary tree. The genome closest to 
completion is that of Escherichia, with about 38 
percent of its 4.7 million base pairs already in the 
database, through the effort of 2,000 researchers. 
To improve the accuracy and efficiency of 
these projects, we have developed new sequenc- 
ing technologies. One, called multiplex se- 
quencing, is a way of keeping a large set of DNA 
fragments as a precise mixture throughout most 
of the steps of sequencing. Because each mixture
can be handled with the same effort as a single
sample in previous methods, throughput rises with
the number of fragments in the mixture.
The mix is deciphered by tagging each fragment
at the beginning with a unique bit of DNA and
then, at the end, hybridizing complementary bits
of DNA to the sequencing reactions, which have
been spread out by size and immobilized on large
membranes. This method also improves accuracy,
since the mixtures contain
internal standards of known sequence that help 
in the computer analysis of the film data. 
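The deciphering step can be illustrated with a toy model. The sketch below is not the laboratory protocol or the authors' analysis software; it only assumes that every band on the shared membrane carries the tag of the fragment it came from, so that probing with one tag makes a single fragment's ladder visible. The names Band, run_mixture, and probe, the tag labels, and the example sequences are all hypothetical.

```python
from typing import NamedTuple


class Band(NamedTuple):
    tag: str     # hypothetical identifier of the unique DNA tag on the fragment
    length: int  # position of the band after separation by size
    base: str    # base read at that position (one of A, C, G, T)


def run_mixture(fragments: dict[str, str]) -> list[Band]:
    """Pool all tagged fragments and return the bands on one shared membrane."""
    membrane = []
    for tag, sequence in fragments.items():
        for position, base in enumerate(sequence, start=1):
            membrane.append(Band(tag, position, base))
    return membrane


def probe(membrane: list[Band], tag: str) -> str:
    """Read one fragment: only bands carrying the probed tag are visible."""
    visible = sorted((b for b in membrane if b.tag == tag),
                     key=lambda b: b.length)
    return "".join(b.base for b in visible)


# Three made-up fragments processed together as one mixture.
fragments = {"tag01": "GATTACA", "tag02": "CCGTA", "tag03": "TTAGGC"}
membrane = run_mixture(fragments)

# One probing per tag recovers each fragment from the same membrane.
decoded = {tag: probe(membrane, tag) for tag in fragments}
```

In this model the mixture is handled once (one call to run_mixture), while the number of sequences recovered equals the number of probings, which is the sense in which probings per membrane measure the gain in efficiency.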
The number of probings obtainable per mem- 
brane represents the increased efficiency factor 
of this method. This number exceeds 50 now 
(the higher the better) and is likely to increase. 
We have designed and tested devices to facilitate 
most of the steps in multiplex sequencing, in- 
cluding DNA preparation, sequencing reactions, 
gel loading, hybridization, film exposure, and 
film reading. All of these devices have been ap- 
plied to collect over 1 million bases of raw data 
and are undergoing further development. Multi- 
plexing has also allowed chemiluminescent de- 
tection to replace the radioactivity normally used 
in DNA sequencing, reducing exposure times 10-fold.
To fill in specific gaps in the sequences, we
have devised multiplex oligonucleotide synthe- 
sis for use in multiplex DNA sequence walking 
strategies. 
Toward the goal of modeling cell structure and 
gene expression, we have searched for abundant 
cellular proteins that have nonetheless eluded 
the extensive biochemical and genetic studies of
Escherichia coli. This has been done by systemati- 
cally correlating amino-terminal protein se- 
quence data obtained from two-dimensional 
gel spots with the DNA sequence and two- 
dimensional gel databases. Of 300 sequences de- 
termined so far, over 50 are candidates for such 
major novel proteins. 
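One way to picture the correlation of amino-terminal protein sequence with DNA sequence is a six-frame translation search. The sketch below is only an illustration under stated assumptions: it uses the standard genetic code, an exact string match, and a made-up DNA string and peptide. The function names translate, reverse_complement, and find_peptide are hypothetical and do not describe the software actually used.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, frame: int) -> str:
    """Translate one reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_peptide(peptide: str, dna: str) -> list[tuple[str, int, int]]:
    """Return (strand, frame, nucleotide offset on that strand) per match."""
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq, frame)
            start = protein.find(peptide)
            while start != -1:
                hits.append((strand, frame, frame + 3 * start))
                start = protein.find(peptide, start + 1)
    return hits


# Made-up example: a short open reading frame with flanking bases.
dna = "CC" + "ATGAAACGTATTAGC" + "TAAGG"
hits = find_peptide("KRIS", dna)
```

A peptide with no hit in any frame of the available DNA sequence would, in this simplified picture, be a candidate for a protein whose gene is not yet in the database. Real data would also call for tolerance of sequencing errors and of amino-terminal processing, which this exact-match sketch ignores.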
We have extended our methods for detecting 
in vivo molecular interactions by analyzing the 
protection of individual DNA bases from enzymatic
methylation. DNA-protein interactions involved
in cAMP and pyrimidine feedback regulation
have been studied in this way.
In the future, with new sequencing technolo- 
gies such as automated multiplex sequencing, 
with examples of most of the basic genetic mod- 
ules, and with an eye to sequence elements con- 
served among species, the analysis and modeling 
by investigators worldwide of human sequences 
and genetics should become more manageable. 
