4 November 2010. Since the completion of the human genome project 10 years ago, scientists have hoped that knowledge of human genetics would provide new clues as to why some people get diseases that others do not. Genomewide association studies (GWAS) seek to answer this question by looking for associations between diseases and single nucleotide polymorphisms (SNPs) in the DNA. GWAS have indeed turned up hundreds of susceptibility variants, and yet they don’t begin to account for the full inherited risk of many disorders. To help crack this problem, three years ago an international consortium called the 1000 Genomes Project came together with the goal of assembling a more thorough catalog of human genetic variation. Now the group has released the results of its pilot project in the October 28 Nature. To date, the effort has more than doubled the number of known genetic variants, including not just SNPs but structural changes in the DNA as well, said lead author Richard Durbin of the Wellcome Trust Sanger Institute in Cambridge, U.K. Even as the project gears up to full speed, the early data are already being used by other scientists for genetic research. In a related paper in the October 29 Science, researchers led by Evan Eichler at the University of Washington in Seattle used an analytic method developed in their lab to mine the 1000 Genomes data. Eichler and colleagues pinned down the copy number of genes in repetitive regions of the genome that have been inaccessible to traditional genetic methods. These locales hold particular significance because they are prone to genetic changes, and therefore may play critical roles in disease and in evolution. Intriguingly, Eichler and colleagues found that numerous genes in these duplicated regions are involved in brain development. Together, these new genetic approaches hold promise for opening up novel genetic variants for study and perhaps for finding new links to disease.
“These publications are landmark events in our appreciation of our own biology,” wrote John Hardy at University College London, U.K., in an e-mail to ARF. “[They] mark an enormous resource for biomedical research.” (See full comment below.)
The original Human Genome Project only scratched the surface of the tremendous genetic diversity in people. On average, about one base in every 1,000 of a person’s DNA varies compared to any other person’s DNA, and also to the reference human genome, said David Altshuler of Massachusetts General Hospital in Boston, a coauthor on the Nature paper, during a joint Science/Nature press teleconference. That means that each person carries about three million SNPs, out of three billion bases of total DNA. Ten years ago when the reference genome was completed, Altshuler said, about 5 percent of these genetic polymorphisms were in the public database. Later genetic initiatives such as HapMap brought the number close to 50 percent (see ARF related news story). Advances in high-throughput sequencing have now made it economically viable to sequence the entire genomes of thousands of people, and this should reveal almost all genetic variants. By whole-genome sequencing of a few hundred people from several distinct ethnic groups, the pilot phase of the 1000 Genomes Project has already catalogued about 95 percent of human genetic polymorphisms, Altshuler said. He said the scientists expect to bring that number up to 98 percent or higher in the full project, which will sequence the genomes of 2,500 people across multiple populations. The remaining 1 or 2 percent of variation in each person will never be in a catalog, Altshuler said, because it probably represents mutations that are unique to that individual or family group.
In the pilot project, the scientists compared three different approaches to genomewide sequencing in order to determine the best strategy. In one method, the researchers produced high-coverage sequence from two mother-father-child trios; in another, they performed low-coverage sequencing of 179 people from four populations; and in the last, they sequenced only the exons (the parts of genes that are actually made into proteins) of 697 people from seven populations. The researchers concluded that low-coverage sequencing finds genetic variation most efficiently, especially when complemented with deep, targeted sequencing of exons to find rare variants in the regions of highest interest. Durbin said they will take these two methods forward into the full-scale project. The group identified about 15 million single nucleotide polymorphisms, more than twice the number known before, Durbin said. They also found one million short insertions and deletions, and 20,000 structural variants, most of which had been unknown. Ominously for everyone, perhaps, the pilot project also showed that each person carries 250 to 300 defective genes, as well as another 50 to 100 variants that have been implicated in disease risk. It is not yet clear if having a single copy of a non-functional gene has health consequences, Altshuler said; only experiments will answer this question.
Altshuler said he expects the 1000 Genomes data, all of which are publicly available, to empower genomewide association studies that look for relationships between genetic polymorphisms and disease. Currently, GWAS can only look at fairly common genetic variants, those present in the population at frequencies of 5 to 10 percent or more, Altshuler said, but the new data will enable scientists to examine much rarer genetic variants through a technique known as “imputation.” As explained in an accompanying News & Views article by Rasmus Nielsen of the University of California in Berkeley, researchers doing GWAS will not need to collect as much sequence data, because they can use the whole-genome sequences in the 1000 Genomes database to infer the identity of missing bases in the GWAS data. In other words, since nearby genetic variants are often co-inherited, if you find one, you can infer that the other is also present. “Skeptics may find this notion—using the data from some individuals to ‘invent’ data for others—alarming,” Nielsen wrote. But if done correctly, imputation “...can significantly increase the statistical power of GWAS.” Imputation could allow GWAS to examine low-frequency genetic variants without raising the already high costs of such studies.
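The idea behind imputation can be illustrated with a toy sketch. The code below is not the method used by the 1000 Genomes Project or by any GWAS software, which rely on statistical models of co-inheritance; it simply assumes a small made-up reference panel of haplotypes and fills each untyped site by majority vote among the panel haplotypes that agree with the sample at every typed site. The names `REFERENCE_PANEL` and `impute`, and all of the sequences, are hypothetical.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype, with one
# character per variant site. The data are invented for illustration.
REFERENCE_PANEL = [
    "ACGTA",
    "ACGTA",
    "ATGCA",
    "GTGCC",
]


def impute(sample, panel=REFERENCE_PANEL, missing="?"):
    """Fill untyped sites in `sample` by majority vote among the panel
    haplotypes that agree with the sample at every typed site."""
    typed = [i for i, base in enumerate(sample) if base != missing]
    matches = [h for h in panel if all(h[i] == sample[i] for i in typed)]
    if not matches:
        # Nothing in the panel resembles this sample, so leave the gaps.
        return sample
    filled = []
    for i, base in enumerate(sample):
        if base != missing:
            filled.append(base)
        else:
            votes = Counter(h[i] for h in matches)
            filled.append(votes.most_common(1)[0][0])
    return "".join(filled)


if __name__ == "__main__":
    # A sample typed at sites 0, 2 and 4 only.
    print(impute("A?G?A"))
```

In this toy panel, a sample typed only at the first two sites as `GT` matches a single haplotype, so the remaining three sites are copied from it. A sample that matches no panel haplotype is returned unchanged, which mirrors the point that imputation only works for variants that are represented in the reference data.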
But what about rare genetic mutations present only in a particular family group? These variants will not be found in a genetic catalog derived from the general population, Altshuler said. They greatly interest researchers, however, because they are often responsible for inherited diseases. Altshuler said that the data from the 1000 Genomes Project will be equally useful for uncovering these rare familial variants. Researchers will first sequence the genomes of the affected family to find all the variants in their DNA. The vast majority of those variants will also be in the common human catalog, and can be eliminated from consideration. This will leave only a handful of candidates that might be the rare mutations of interest, Altshuler said. The 1000 Genomes data have already been used in this way by researchers at the University of Washington, Altshuler said, who were able to rapidly home in on a disease gene in an affected family. He added, “Already the earliest data from the 1000 Genomes Project have been used in multiple published papers.”
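The filtering step Altshuler describes amounts to a set subtraction, sketched below under stated assumptions. The variant tuples, the `catalog` contents, and the function name `rare_candidates` are invented for illustration, and the extra requirement that a candidate be shared by every affected relative is an assumption added here, not something stated in the article.

```python
# Variants are written as (chromosome, position, reference base, alternate base).
# All values below are invented for illustration.
catalog = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 5000, "C", "T"),
    ("chr7", 250, "G", "A"),
}


def rare_candidates(affected, known_variants):
    """Return variants carried by every affected relative (an assumption of
    this sketch) that do not appear in the catalog of known variants."""
    if not affected:
        return set()
    shared = set.intersection(*(set(person) for person in affected))
    return shared - set(known_variants)


if __name__ == "__main__":
    novel = ("chr4", 777, "T", "C")
    parent = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T"), novel}
    child = {("chr1", 1000, "A", "G"), ("chr7", 250, "G", "A"), novel}
    print(rare_candidates([parent, child], catalog))
```

Here both relatives carry three variants each, but only the invented `chr4` variant survives the subtraction, which is the "handful of candidates" idea in miniature.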
Altshuler also noted a caution. “We want to be careful not to suggest that this framework project is in itself medical research, because it is simply a foundational tool. But we do believe that in the long run, this is a very valuable approach to learn new things about the basis of disease.”
The Science paper is one example of how scientists will use the 1000 Genomes data. First authors Peter Sudmant and Jacob Kitzman in Eichler’s group at the University of Washington used a computational algorithm called “mrFAST” (see Alkan et al., 2009) to reanalyze the genetic data. They looked specifically for regions of the genome that have been repeatedly duplicated. Most algorithms try to map each sequence read to a single, unique location in the genome, Eichler explained, but mrFAST maps DNA sequences to all of their possible locations. This gives a good estimate of how many times that piece of DNA appears along the chromosomes, or in other words, of the copy number of that piece.
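A rough sketch of the two ideas in this paragraph follows. It is not mrFAST, which tolerates mismatches and works at genome scale; it only assumes exact string matching on a toy reference, and it assumes that copy number can be approximated by comparing a region's average read depth with the depth of regions known to be present in two copies. The function names and the sequences are hypothetical.

```python
def all_positions(read, reference):
    """Return every start position where `read` occurs exactly in
    `reference`, rather than stopping at the first hit."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def copy_number(region_depth, two_copy_depth):
    """Scale a region's average read depth by the depth seen in regions
    assumed to be present in two copies (one per chromosome)."""
    if two_copy_depth <= 0:
        raise ValueError("two_copy_depth must be positive")
    return round(2 * region_depth / two_copy_depth)


if __name__ == "__main__":
    # Toy reference in which the segment GATTACA appears twice.
    reference = "TTGATTACACCGATTACAGG"
    print(all_positions("GATTACA", reference))
    # A region covered three times as deeply as two-copy sequence.
    print(copy_number(region_depth=90, two_copy_depth=30))
```

The first function reports both placements of the repeated segment instead of discarding the read as ambiguous; the second turns a threefold excess in depth into an estimate of six copies.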
Copy number variants are a particularly intriguing form of genetic variation, because they are associated with disease risk, especially in neurology (see ARF related news story and Itsara et al., 2009). Most copy number variants are found in duplicated regions of the human genome, and have been difficult to study with traditional methods because the long, repetitive sequences there make sequence data hard to analyze. Eichler estimates that about 15 percent of the genome, including about 1,000 genes, lies in these regions and has been inaccessible to genetic research until now. Researchers at the University of Toronto have developed their own algorithm for finding copy number variants (see Medvedev et al., 2010).
One reason why gene copy number may cause disease, Eichler said, is that duplicated sequences promote rearrangements of the sequence around them. “They create a bad neighborhood in the genome, where pieces of DNA can be gained and lost at a higher frequency. We find these areas rearranged in children with autism and intellectual disability more often than in normal children.” Eichler pointed out that these unstable, highly duplicated regions of the genome are loaded with genes that are important in the brain. The instability of these regions may explain why copy numbers vary greatly among people and differ between population groups. “You can think of these almost as accordions of the genome, expanding and contracting in terms of their copy number,” Eichler said.
One intriguing finding of the paper was that many genes with high copy number are multiplied specifically in humans, but not in our nearest relatives, the great apes. Among these genes, Eichler said, “we find a particularly tantalizing set of genes that are important in terms of neural development and neuronal migration.” The list includes dopamine receptors, genes involved in setting up the right and left hemispheres of the brain, and genes involved in neuronal arborization and social cognition. It is tempting to speculate that these gene duplications may have played a role in the evolution of the human brain, Eichler said, but this question can only be answered with further research.
The new information on copy number has unexpectedly revealed some limitations in the reference human genome sequence. More than 100 areas that are listed as diploid in the reference sequence are actually present in multiple copies in most people, the analysis found. Eichler said his group is working with the Genome Reference Consortium to find these extra gene copies and integrate the information into the genome data, in a project funded by the National Human Genome Research Institute.
These next-generation ventures have opened up a new class of genes for genetic research, Eichler said. His group plans to do association studies on high copy number genes to see if they correlate with disease risk, as well as to investigate other properties of these unstudied genes such as expression levels, methylation status, and other chromatin changes. “The veil has been lifted on a whole new level of genetic diversity,” Eichler said. “I believe that this is the era in which we’re going to make huge inroads in understanding the genetic basis of human disease.”—Madolyn Bowman Rogers.
1000 Genomes Project Consortium, Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, Gibbs RA, Hurles ME, McVean GA, Collaborators (50). A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73.
Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, 1000 Genomes Project, Eichler EE. Diversity of Human Copy Number Variation and Multicopy Genes. Science. 2010 Oct 29;330:641-6.
Nielsen R. In search of rare human variants. Nature. 2010 Oct 28;467(7319):1050-1.