Technologies for Genome-sequencing Projects 
George M. Church, Ph.D. — Assistant Investigator 
Dr. Church is also Assistant Professor of Genetics at Harvard Medical School. He received his B.A. degree in 
zoology and chemistry from Duke University and his Ph.D. degree in biochemistry and molecular biology 
from Harvard University. Before moving to Harvard Medical School, Dr. Church was a scientist at Biogen 
Research Corporation and a Life Sciences Research Fellow at the University of California, San Francisco. 
The study of the linear sequence of bases in
genomic DNA and messenger RNA is steadily 
gaining momentum as a result of the increasing 
ease and advantage of using shared computer databases
to find connections among distant concepts
and distant biological systems. For example,
connections have been found between
human oncogenes and yeast transcription factors, 
between differentiation antigens and bacterial 
chaperone proteins, and between developmental 
regulatory genes and bacterial DNA-binding 
proteins. 
Unfortunately, these database searches are fre- 
quently unsuccessful, because not all classes of 
genetic elements are represented. The complete 
sequence of a few small genomes should rectify 
this. Sequencing projects have begun for ge- 
nomes of various bacteria, yeasts, a plant, and a 
worm chosen for their well-studied genetics and 
their small genome sizes. Their genus names are 
Mycoplasma, Mycobacterium, Escherichia,
Thermococcus, Arabidopsis, and Caenorhabditis,
and they represent all major branches of the evo- 
lutionary tree. The genome closest to completion 
is that of Escherichia, with 30 percent of its 4.7 
million base pairs already in the database, 
through the efforts of around 2,000 researchers. 
To improve the accuracy and efficiency of 
these projects, we have developed new sequenc- 
ing technologies. One, called multiplex se- 
quencing, is a way of keeping a large set of DNA 
fragments as a precise mixture throughout most 
of the sequencing steps. Because each mixture 
can be sequenced with the same effort as a single 
sample in previous methods, more fragments can 
be handled. 
The mix is deciphered by strategically tagging 
the fragments at the beginning with unique bits 
of DNA and then, at the end, hybridizing to the 
sequencing reactions complementary bits of DNA 
that have been spread out by size and immobi- 
lized on large membranes. This method also im- 
proves the accuracy, since the mixtures contain 
internal standards of known sequence that help 
in the computer analysis of the film data. 
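
The tagging and probing steps described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the laboratory protocol: the tag names, the fragment sequences, and the functions `pool_fragments` and `probe` are all hypothetical, and a real membrane carries band positions on film rather than clean base calls.

```python
# Toy model of multiplex sequencing. All names and sequences are invented
# for illustration; none of them come from the original text.

def pool_fragments(tagged):
    """Simulate one sequencing run on a mixture of tagged fragments.

    Every fragment contributes a ladder of (length, base, tag) bands.
    Sorting by length stands in for size separation on the gel.
    """
    membrane = []
    for tag, seq in tagged.items():
        for length, base in enumerate(seq, start=1):
            membrane.append((length, base, tag))
    membrane.sort(key=lambda band: band[0])
    return membrane


def probe(membrane, tag):
    """Simulate one probing: only bands carrying `tag` are detected.

    Reading the detected bands in size order recovers that fragment's sequence.
    """
    return "".join(base for _, base, band_tag in membrane if band_tag == tag)


# Three fragments share one membrane; "standard" plays the role of an
# internal standard of known sequence.
tagged = {
    "tag_A": "GATTACA",
    "tag_B": "CCGTTAGC",
    "standard": "ACGTACGT",
}
membrane = pool_fragments(tagged)
reads = {tag: probe(membrane, tag) for tag in tagged}
```

In this sketch the gel is run once and the membrane is probed once per tag, so the work of the run is shared by every fragment in the mixture. Comparing `reads["standard"]` with the known sequence gives a simple check on each membrane.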
The number of probings obtainable per membrane
is the factor by which this method increases
efficiency. This number now exceeds 50
(the higher the better) and is likely to increase.
We have designed and tested devices to facilitate 
most of the steps in multiplex sequencing, in- 
cluding DNA preparation, sequencing reactions, 
gel loading, hybridization, film exposure, and 
film reading. All of these procedures have been 
applied to collect over 1 million bases of raw data 
and are undergoing further development. Chemi- 
luminescent detection of the multiplex sequence 
images is showing promise as an effective replace- 
ment for the radioactivity normally used. To fill 
in the inevitable last gaps in the sequences, we 
are exploring several approaches, including di- 
rect genomic sequencing (without cloning) from 
the edges of the gaps and isolation of gap-span- 
ning clones by clone hybridization. 
To make the extensive DNA sequences even 
more useful to biological searches, encoded pro- 
teins must be found and collected into families 
based on their interactions or their similarity to distant relatives.
For example, we have found matches for about 
half of the genes required by Escherichia coli for 
vitamin biosynthesis and have gathered these 
into families of proteins involved in membrane 
transport, heme and corrin ring methylation, 
amine group transfer, and so on. 
As another example, we have searched for the 
class of the most abundant cellular proteins, 
which have nonetheless eluded the extensive biochemical
and genetic studies of E. coli. This has
been done by systematically correlating amino- 
terminal protein sequence data obtained from 
two-dimensional gel spots with the DNA sequence
and two-dimensional gel databases. Of
130 unique sequences determined so far, over 40
are candidates for such major novel proteins. 
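
One way to picture the correlation of amino-terminal protein sequence with DNA sequence is a six-frame translation search. The following is a minimal sketch, assuming exact matching and the standard genetic code; the function names, the example DNA string, and the peptide are hypothetical, and the actual analysis also drew on two-dimensional gel databases, which are not modeled here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def find_peptide(dna, peptide):
    """Search all six reading frames for an exact match to `peptide`.

    Returns a list of (strand, frame, residue_offset) tuples.
    """
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide, pos + 1)
    return hits


# Hypothetical example: a short open reading frame offset by two bases.
dna = "CCATGAAACGTCTGTAA"
hits = find_peptide(dna, "MKRL")
```

A peptide with no hit in the sequenced DNA may lie in an unsequenced region, while a hit in an unannotated open reading frame is a candidate for a novel protein. Real amino-terminal data contain ambiguous residues, so a practical search would need to tolerate mismatches.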
We are also extending our methods for detect- 
ing in vivo molecular interactions by analyzing 
protection of individual bases from chemical and 
enzymatic methylation. This in vivo footprinting 
method has been extended to a cloning-based as- 
say for DNA-protein interactions. 
In the future, with new sequencing technolo- 
