
UCSC Currents online

July 3, 2000

Working draft of human genome sequence now publicly available on UCSC web site

By Tim Stephens

UCSC researchers who performed the computer analysis to assemble a working draft of the human genome sequence have now posted their results on a UCSC web site. Biomedical researchers throughout the world can now search the working draft for particular genes or DNA sequences of interest to them.

A group led by David Haussler, professor of computer science and director of the Center for Biomolecular Science and Engineering, created a powerful new computer program to assemble the working draft from the sequence data obtained by the international Human Genome Project. The policy of the public consortium of scientists working on the Human Genome Project has been to release sequence information to the world as soon as possible. Until the working draft was assembled, however, the sequence data were available only in many small pieces.

"This is the first publicly available view of the human genome sequence tentatively placed in the order and orientation in which we think it occurs along the human chromosomes," Haussler said.

Haussler noted that after a 10-year effort involving many laboratories and more than 1,000 scientists, the human genome can now be downloaded from the UCSC web site in about an hour and a half by anyone with a DSL-speed Internet connection. The genome sequence is essentially a long string of As, Ts, Cs, and Gs, representing the chemical units of DNA, called bases. The human genome consists of approximately 3.1 billion bases arrayed along the length of the chromosomes.

Scientists involved in the Human Genome Project have sequenced about 85 percent of the human genome and continue to generate new sequence data at a rapid pace. Haussler's group will rerun their computer analysis every few weeks, incorporating new data so that biomedical researchers will have immediate access to the most up-to-date assembly.

Although the current working draft still contains some gaps and uncertainties, it is already extremely useful for most biomedical research purposes, said Alan Zahler, an associate professor of biology at UCSC. In many cases, researchers have identified part of a gene or have other clues to the gene's sequence. They can now use that information to search the genome for the rest of the sequence associated with the gene of interest.

In addition, researchers who know the sequence of one gene can search the genome for similar sequences to find related genes. Many genes are members of multigene families with similar and sometimes overlapping functions, Zahler said.
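To make the idea of searching for a known or similar sequence concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the search software behind the UCSC site: the function names, the toy "genome" string, and the mismatch threshold are all hypothetical, and the brute-force scan shown here would be far too slow for 3.1 billion bases, where real tools rely on indexing and alignment heuristics. The sketch checks both strands, since a gene may lie on either strand of the DNA.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read in its own direction."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))


def find_similar(genome, query, max_mismatches=1):
    """Scan both strands for windows that differ from query in at most
    max_mismatches positions (substitutions only; no insertions or deletions).

    Returns a list of (position, strand, mismatches) tuples, where position
    is the 0-based start of the window in the forward-strand string.
    """
    hits = []
    for strand, probe in (("+", query), ("-", reverse_complement(query))):
        for start in range(len(genome) - len(probe) + 1):
            window = genome[start:start + len(probe)]
            mismatches = sum(a != b for a, b in zip(window, probe))
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical toy data: one exact copy of the query and one near copy.
    toy_genome = "TTGATTACAGGGATTTCACC"
    for hit in find_similar(toy_genome, "GATTACA", max_mismatches=1):
        print(hit)
```

With `max_mismatches=0` the function behaves like an exact search for a known gene fragment; raising the threshold loosely mimics the hunt for related members of a gene family.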

"Identification of all of the members of a gene family will give us a sense for how many genes with a certain role are present in the genome," Zahler said. "Before, it took long periods of experimentation to find out whether a gene in humans was a member of a larger family of genes."

As a test of the working draft's completeness, Human Genome Project scientists searched it for known genes associated with human genetic diseases and found that 95 percent of those diseases had identifiable genes in the draft.

"The chances of finding a particular disease gene in the working draft are apparently quite good," Haussler said. "Technically, the working draft only covers about 85 percent of the genome, but in practice it appears to cover 95 percent of the disease genes."

The working draft is also an exciting beginning for scientists interested in understanding gene structure and organization in humans, Zahler noted.

"For the first time, we will be able to look at tens of thousands of genes at once and start to search for common themes in areas such as how classes of genes are turned on and off, how the information in genes is processed into a form that encodes proteins, and how that processing is regulated," he said.

Jim Kent, a graduate student working with Zahler, designed and wrote most of the software used to assemble the working draft, which was compiled from hundreds of thousands of fragments of various sizes.

"Imagine you have five copies of War and Peace and one of Crime and Punishment, you put them through a paper shredder, and then try to paste together a single copy of War and Peace from the shreds," Kent said. "That job would be a lot like assembling the human genome, except that the genome runs to about a million pages."

Haussler said Kent's accomplishments will have a very real impact on science and medicine. "He has shown enormous talent and creativity in tackling this fundamental problem," Haussler said.
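The shredded-book problem Kent describes can be illustrated with a toy example. The Python sketch below is not Kent's program, whose design is not described here; it is a simple greedy overlap-merge, a textbook approach swapped in for illustration, and the function names and the `min_len` parameter are hypothetical. It assumes error-free fragments that all read in the same direction, and it ignores repeated sequences and stray fragments from other sources, all of which make the real problem far harder.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Returns the list of remaining pieces; more than one piece means
    gaps remain between them.
    """
    # Drop exact duplicates (keeping order), then fragments contained in another.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join: the remaining pieces stay separate
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    # Hypothetical fragments cut from one short sequence, in scrambled order.
    pieces = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
    print(greedy_assemble(pieces))
```

When no overlaps remain, the function returns several separate pieces rather than one, which loosely corresponds to the gaps that remain in a working draft.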

In addition to the UCSC web site, the working draft will be available on sites maintained by the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI). Both NCBI and EBI are major contributors to the computational analysis of the human genome data. Haussler has already sent them the current working draft and will continue to send them updated versions as new sequence data are incorporated into the analysis.
