February 21, 2000
Computational techniques analyze data from the Human Genome Project
By Tim Stephens
A new discipline has emerged at the intersection of computer science and biotechnology,
bringing the power of advanced computational techniques to bear on complex problems
in molecular biology. Called bioinformatics or computational biology, this new field
is providing essential tools for scientists on the leading edge of research in genetics
and other fundamental areas of biology.
Gene-sequencing efforts such as the Human Genome Project, combined with new techniques
for studying the activity of genes in living cells, are generating enormous amounts
of raw data. These data are accumulating at a rapidly accelerating pace in a variety
of public computer databases, such as those maintained by the National Center for
Biotechnology Information at the National Institutes of Health.
"The driving force behind bioinformatics is the availability of these large
databases and the need to come up with sophisticated computer models for extracting
useful information from them," said David Haussler, professor of computer science
at the University of California, Santa Cruz.
Haussler discussed the use of computational techniques to analyze genetic data in
a talk Saturday (February 19) at the annual meeting of the American Association for
the Advancement of Science in Washington, D.C.
Haussler, who directs UCSC's Center for Biomolecular Science and Engineering, recently joined
the Human Genome Project's bioinformatics team. Bioinformatics is playing an increasingly
important role in the project, an international effort to identify and understand
all of the roughly 100,000 human genes.
"Computer analysis will be an integral part of identifying genes and understanding
their functions," Haussler said.
The set of genetic instructions for making an organism--its genome--is contained
in long, threadlike DNA molecules neatly packaged into chromosomes within the nucleus
of every cell. The sequence of chemical units in the DNA is a kind of code that specifies
the structures of protein molecules, which carry out most of the functions of living cells.
The complete DNA sequence of the human genome, if compiled in books, would fill 200
volumes the size of the Manhattan telephone book. Human Genome Project scientists
are close to having a rough draft of this sequence, but that will only be a first
step. Buried within the genome sequence are the genes--DNA sequences that encode
specific proteins--which ultimately determine all the inherited characteristics of
Locating genes within genomic DNA sequences is one of the first tasks for which scientists
have turned to bioinformatics. Less than 10 percent of the human genome is thought
to comprise protein-coding gene sequences. Interspersed with the genes are control
sequences, which regulate gene activity, and other "noncoding regions"
whose functions are obscure.
Haussler and his coworkers at UC Santa Cruz have developed some of the most effective
computational techniques for finding genes in DNA sequences. They introduced a now
widely used statistical method called hidden Markov modeling to attack this problem.
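
As a rough illustration of the idea, and not a description of Genie or of the UCSC group's actual models, the sketch below labels each base in a DNA string as "coding" or "noncoding" using a two-state hidden Markov model and the Viterbi algorithm. The state names, all of the probabilities, and the function name viterbi are hypothetical choices made for this example; real gene finders use far richer models.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, invented for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
# Assumption: "coding" regions are GC-rich and "noncoding" regions AT-rich.
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a DNA string."""
    if not seq:
        return []
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Pick the best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path
```

Calling viterbi("ATATATATGCGCGCGC") returns one label per base. Working in log-probabilities avoids numerical underflow on long sequences, which matters for genome-scale input.
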
To analyze the rough draft of the human genome sequence, Haussler is working closely
with researchers at the Massachusetts Institute of Technology's Whitehead Institute.
The Whitehead Institute is one of five major sequencing sites involved in the Human Genome Project.
Working with the rough draft, however, will be a monumentally difficult task, Haussler
said. "The problem is that the rough draft does not provide a continuous DNA
sequence across each chromosome--many regions of the genome are covered only by small
pieces," he said.
The first task Haussler and the Whitehead group are tackling is to line up all of
the segments of the human genome sequenced so far in their proper order and orientation
along the chromosomes. The next step will be to locate genes within the genome sequence.
This will be done in collaboration with Neomorphic, a Berkeley-based genomics company,
using a computer program called Genie.
Genie was initially developed by Haussler's group and researchers at the Lawrence
Berkeley National Laboratory (LBNL). It was exclusively licensed and further developed
by Neomorphic, which was founded by a group of scientists from LBNL, UC Berkeley,
and UCSC. Genie was recently used to identify genes in the genome of the fruit fly,
Drosophila melanogaster, which was sequenced last year. Neomorphic is now developing
a new version of Genie optimized for the rough draft of the human genome sequence.
Research on the genetics of organisms such as Drosophila, yeast, and the roundworm
Caenorhabditis elegans has helped lay the groundwork for studying the much more complex
genome of humans. Many human genes are closely related to genes found in these simpler
organisms, which are widely used as model systems for research in genetics and molecular
biology. Studies of these model organisms have already yielded many valuable insights
into gene functions, normal gene regulation, genetic diseases, and evolutionary processes.
According to Haussler, the role of bioinformatics in this type of research is steadily
increasing as the experimental methods become more sophisticated and complex. DNA
microarrays or "gene chips," for example, provide valuable information
about gene expression--when, where, and to what extent specific genes are active.
This information is critical to understanding a gene's biological function. But gene
chips, like genomic-sequencing technology, produce enormous amounts of data that
can only be analyzed and understood using sophisticated computational approaches.
"There is a lot of information pertaining to gene function that is becoming
available as a result of large-scale experiments using gene chips and other methods,
which generate massive datasets relating to the functions of thousands of genes," Haussler said.
To analyze these complex datasets, Haussler is pioneering the use of a new statistical
method based on the theory of support vector machines (SVMs). SVMs are able to handle
high-dimensional datasets in which each data point has many features or attributes.
"It's hard to visualize because we live in a three-dimensional world, and we're
talking about analyzing datasets in ten thousand or more dimensions. But we're finding
SVMs extremely useful for gene chip data," Haussler said.
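
To make the idea concrete, here is a minimal sketch of a linear support vector machine trained by subgradient descent on the regularized hinge loss. It is the simplest SVM variant and is not necessarily what Haussler's group uses; the article does not say which kernel or training procedure they chose. The expression profiles, the labels, and the names train_linear_svm and predict are all hypothetical, and the toy data has four features per gene where real gene chip data would have thousands.

```python
# Hypothetical expression profiles: one row per gene, one column per
# experiment. Labels mark membership (+1) or not (-1) in some functional
# class. All values are invented for illustration.
PROFILES = [
    [2.0, 1.5, 0.1, 0.0],
    [1.8, 2.2, 0.0, 0.2],
    [0.1, 0.0, 2.0, 1.7],
    [0.0, 0.3, 1.6, 2.1],
]
LABELS = [1, 1, -1, -1]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def train_linear_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Fit weights w and bias b by minimizing the regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (dot(w, x) + b)
            # The L2 regularization term shrinks the weights on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # The point is inside the margin or misclassified:
                # move the separating hyperplane toward classifying it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if dot(w, x) + b >= 0 else -1
```

After w, b = train_linear_svm(PROFILES, LABELS), a call such as predict(w, b, [2.1, 1.9, 0.1, 0.1]) classifies a new profile. Nothing in the code depends on the number of features, which is why the same approach extends to datasets with thousands of dimensions.
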
Genomic sequencing and gene chips represent what Haussler calls "high-throughput
genomic technologies," powerful new techniques for understanding molecular biology.
The use of these techniques is increasing, and all of them present significant computational
challenges. One of Haussler's goals is to develop new statistical and algorithmic
methods for integrating these diverse types of genomic data.
For the moment, analyzing the rough draft of the human genome sequence is the
focus of Haussler's efforts. But in the long run, he foresees a happy and prosperous
future for the marriage of computer science and molecular biology. The application
of human genomics to areas such as drug discovery and clinical diagnostics, for example,
will undoubtedly require new computational methodologies, he said.
"Our vision for bioinformatics spans a broad spectrum, from basic molecular
biology all the way up to clinical diagnostics," Haussler said.
Additional information about the research program is available on Haussler's