March 7, 2005
Genome centers combine forces to validate
a gene set for biomedical research
By Branwyn Wagman
The advent of online databases to access the human genome has
been a boon to biomedical research, and the usefulness of this
information has just moved to a new level.
Mark Diekhans is the lead researcher on the project from
the UCSC Genome Bioinformatics Group.
Researchers at UCSC, the European Bioinformatics Institute
(EBI), the National Center for Biotechnology Information (NCBI),
and the Wellcome Trust Sanger Institute (WTSI) in Great Britain
have released the results of a project to identify a core set
of genes that can be located in the human genome and have been
validated as coding for proteins.
After more than a year of work, the collaboration has released
a set of 14,795 genes that can be reliably said to code for
a protein. This gene set, called the Consensus Coding Sequence
(CCDS) set, was posted March 2 on the three major public human
genome browsers: the UCSC Genome Browser, the Ensembl Browser
at EBI and WTSI, and the NCBI web site.
The CCDS set is built by consensus among the collaborating
members at UCSC, NCBI, EBI, and WTSI. UCSC's involvement in
this international collaboration is led by David Haussler, professor
of biomolecular engineering and a Howard Hughes Medical Institute
investigator.
"Now that biomedical science has an internationally accepted
human genome reference sequence to work from, it's time to identify
a corresponding reference set of human genes from that genome,"
Haussler said.
The CCDS project addresses the fact that the genes listed in
human genome databases often are not entirely validated, and
the same gene may have different names in different databases.
Since the data characterizing the genes come from a variety
of sources, researchers are not always certain that a listed
gene is real and its stated function is accurate.
The CCDS genes have been given unique identifier and version
numbers to help locate them on genome maps. Each of the genome
browser sites will receive regular updates as the collaboration
continues to refine its knowledge of the protein-coding genes.
Until the Human Genome Project succeeded in sequencing and
assembling the entire human genome, researchers could sequence
the DNA in a gene, but had no way to accurately determine its
location in the genome. Once the genome was sequenced, researchers
began to note which parts of the genome contained known genes,
a process known as genome annotation.
Haussler's group at UCSC pioneered the use of a mathematical
approach known as hidden Markov models as a way to find genes
in DNA sequences using automated computer programs. The technique
is now widely used for this purpose, but Haussler said the problem
of finding all the genes in the DNA sequence of the human genome
has proven to be "much more difficult than we ever imagined."
"It will take the coordinated efforts of experimentalists
and computational biologists many more years to complete this
task," he said.
The CCDS set is calculated following coordinated whole genome
annotation updates carried out by NCBI and Ensembl. Annotation
updates represent genes that are defined by a mixture of manual
curation, carried out by the WTSI Havana team and the NCBI RefSeq
group, and automated computational processing performed by groups
at Ensembl and NCBI.
"Resolving inconsistencies between gene structures generated
by complementary methods of manual curation and automatic annotation
is a major step towards providing stable and accurate annotation
that can be relied on by researchers," said Tim Hubbard,
head of human genome analysis at WTSI.
According to Mark Diekhans, the lead researcher on this project
from the UCSC Genome Bioinformatics Group, inconsistencies arise
because different centers have used different methods to identify
where genes reside. "The names and locations do not always
agree, especially in cases where the gene's function isn't well
understood," Diekhans said.
As a result, the huge gene databases contain genes that appear
to be duplicates or are quite similar, either in their DNA sequences
or in their expression in the living organism. Having the entire
human genome assembled and viewable with online browsers presents
an opportunity for researchers to find each gene sequence in
its correct location on the chromosomes and determine, for example,
if one gene has been given two different names, or if they are
in fact separate genes.
The UCSC contribution to the CCDS project has been mostly quality
control, Diekhans said. "We compared the gene sets postulated
by NCBI and Sanger to find the intersections between them. Then
we applied various bioinformatics approaches to find where the
sequences in this intersecting set might not actually be protein-coding,"
he said.
One approach the UCSC group used was to compare the intersecting
set to a list of likely pseudogenes, elements in the DNA sequence
that resemble genes but do not code for functional proteins.
These pseudogenes were predicted by software developed at UCSC
by graduate student Robert Baertsch. Sequences that appeared
to be pseudogenes were removed from consideration for the CCDS
set, and any removed sequence will likely undergo more study
before it is finally accepted as a gene or ruled out.
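The two quality-control steps Diekhans describes, intersecting the gene sets and then setting aside likely pseudogenes, can be pictured with a short sketch. This is only an illustration, not the software the UCSC group used: the `CodingRegion` record, the `consensus_candidates` function, and the rule that two annotations "agree" only when their coding exon coordinates match exactly are all assumptions made for the example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class CodingRegion(NamedTuple):
    """Hypothetical record for one annotated protein-coding region."""
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # (start, end) of each coding exon


def consensus_candidates(
    set_a: Iterable[CodingRegion],
    set_b: Iterable[CodingRegion],
    predicted_pseudogenes: Iterable[CodingRegion],
) -> List[CodingRegion]:
    """Return regions present in both sets and not flagged as pseudogenes.

    Assumption: two annotations agree only if chromosome, strand, and
    every coding exon boundary are identical.
    """
    shared = set(set_a) & set(set_b)               # the intersecting set
    kept = shared - set(predicted_pseudogenes)     # when in doubt, leave it out
    return sorted(kept)


# Made-up coordinates, for illustration only.
a = CodingRegion("chr1", "+", ((100, 200), (300, 400)))
b = CodingRegion("chr2", "-", ((50, 150),))
c = CodingRegion("chr1", "+", ((100, 200), (300, 410)))  # differs from `a` by one boundary

candidates = consensus_candidates([a, b], [a, b, c], [b])
```

In this toy example only `a` survives: `c` appears in one set alone, and `b` is removed because it is on the list of predicted pseudogenes.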
The UCSC team also compared the intersecting set with analogous
locations on the genomes of other organisms--chimpanzee, chicken,
dog, mouse, rat, and rhesus monkey. These comparisons between
species aid in gene validation. "When gene segments are
conserved across multiple species, it indicates that they are
likely to be real and not pseudogenes," Diekhans said.
The UCSC Genome Browser makes comparative genomics much simpler,
because it allows side-by-side comparisons of analogous genome
segments from various species.
"Comparing human genes to the genes of related species
will be the key to finalizing the human gene set," Haussler
said. "All of biology is the result of evolution. Genes
cannot be fully apprehended outside of their evolutionary context."
The collaborating groups used a conservative process in establishing
the CCDS set. "We were going for high quality and high
confidence," Diekhans said. "When in doubt about a
gene, we left it out of our set. This makes the CCDS a valuable
reference set for disease research."
In addition to Diekhans and Haussler, the UCSC team includes
Adam Siepel, Robert Baertsch, Fan Hsu, Chuck Sugnet, and the
entire UCSC Genome Browser team, led by Jim Kent. Lead researchers
at collaborating institutions include David Lipman, Jim Ostell,
and Kim Pruitt at NCBI; Hubbard, Richard Durbin, Steve Searle,
and Jennifer Ashurst at WTSI; and Ewan Birney at EBI.
"UCSC is very proud to be playing a role in this collaboration
with such outstanding collaborators as NCBI, EBI, and the Wellcome
Trust Sanger Institute," Haussler said.