19 Aug 2016

The human genome consists of more than 3 billion nucleotides, but only 1 to 2 percent actually code for proteins. That protein-coding subset, called the exome, holds the vast majority of mutations that cause severe disease, and screening people’s exomes is far cheaper than sequencing whole genomes. In the August 18 Nature, scientists led by Daniel MacArthur of Massachusetts General Hospital in Boston summarize findings from the largest compilation of exome data yet—the Exome Aggregation Consortium (ExAC). It pulls together sequences from 60,706 donors from 14 studies worldwide. ExAC, which has been running for more than three years, aims to assemble a massive, publicly available database that captures variation in the general human population. “We need databases of normal variation to tell us which changes that we find in a patient are also seen in healthy people,” MacArthur said at a Nature press briefing. “That can be extremely important in identifying which of those genetic changes is actually causal for that patient’s disease.”

Currently available public databases give a somewhat limited view of genetic variation in humans. The 1000 Genomes Project (1000G) offers shallow whole-genome sequences (meaning each nucleotide was sequenced only a few times), as well as deeper exome data, for 2,504 people from all over the world (1000 Genomes Consortium, 2015). The Exome Sequencing Project pools data on the protein-coding regions from 6,515 people, more than twice as many, though it is limited mostly to those of European-American and African-American ancestry (Fu et al., 2013). ExAC expands on that sample size by an order of magnitude and includes more diversity, adding people of Latino, South Asian, and East Asian descent. As this public resource has grown, researchers have already used it more than 5 million times, mostly to look up genetic variants they find in patients to see how common they are across the general population. Those that are very rare in the general population could be good candidates for pathogenicity, while common variants are less likely to cause disease. “Virtually all clinical diagnostic labs now use the ExAC resource as their standard reference database for diagnosis of rare disease,” said MacArthur. As its sample size grows, it will become even more valuable, he said. Researchers are also turning to exome analysis to look for rare variants that cause Alzheimer’s and Parkinson’s diseases (Bras et al., 2016; Apr 2015 news; Jun 2016 news).

“This may be the deepest dive into the well of human genetic variation so far,” wrote Jay Shendure, University of Washington, Seattle, in an accompanying News and Views editorial. “There is little doubt that ExAC will both refine and accelerate Mendelian-gene discovery and clinical genetics.”

To develop this database, first author Monkol Lek and colleagues assembled nearly a petabyte of raw sequencing data (that’s 1,000 terabytes or, as MacArthur puts it, 4,000 laptops’ worth of data). They used software developed at the Broad Institute of MIT and Harvard to process it all and found 10 million sites in the exome that vary. They considered 7.5 million of them high quality because of their extensive sequencing depth. This equates to one variant every eight bases. Half were so rare that they occurred only once in the dataset, and nearly three-quarters had never been seen before.

Variation across exomes was not uniform. More than 3,000 genes carried fewer variants than would be expected by chance, implying that these genes are less tolerant to variation and therefore highly important for human biology. Just a quarter of these 3,230 genes have been linked to disease. “This helps us to identify genes that are likely to be involved in human diseases,” said MacArthur.
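The idea of a gene carrying "fewer variants than would be expected by chance" can be illustrated with a short Python sketch. This is a deliberately simplified stand-in, not the statistic the ExAC authors used: it assumes variant counts are roughly Poisson-distributed, and the gene names, counts, and the cutoff of 3.0 are all made up for illustration.

```python
import math


def depletion_z(observed: int, expected: float) -> float:
    """Signed z-score; positive means fewer variants than expected.

    Assumes variant counts are roughly Poisson, so the standard
    deviation is sqrt(expected). This is a simplification.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (expected - observed) / math.sqrt(expected)


# Made-up (observed, expected) counts for hypothetical genes.
genes = {
    "GENE_X": (10, 100.0),  # far fewer variants than expected
    "GENE_Y": (98, 100.0),  # about as many as expected
}

scores = {name: depletion_z(obs, exp) for name, (obs, exp) in genes.items()}

# Arbitrary illustrative cutoff for calling a gene "depleted".
constrained = [name for name, z in scores.items() if z > 3.0]
print(constrained)
```

In this toy example only GENE_X stands out as depleted. A real analysis would need a careful model of the expected count for each gene, which is the hard part and is not shown here.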

How well different genes tolerate genetic variation was the subject of a companion paper appearing August 17 in Nature Genetics. First author Douglas Ruderfer, working with Menachem Fromer and Shaun Purcell at the Icahn School of Medicine at Mount Sinai, New York, used the ExAC database to look more closely at copy number variation (CNV), that is, deletions or duplications of coding sequence. Genes previously reported to be evolutionarily conserved and less tolerant of single-nucleotide changes were likewise found to have fewer CNVs, which hints that these genes are highly important. The authors found that genes highly expressed in the brain were particularly intolerant of CNVs. By contrast, genes highly expressed in the liver, pancreas, and duodenum carried more CNVs, suggesting they are more flexible when it comes to gene dosage.

At the same time, ExAC is forcing scientists to rethink some variants that were previously implicated in rare diseases. More than 100 of these variants occurred in at least 1 percent of the general population, making it implausible that the variants and rare disease are linked. This illustrates how ExAC can help correct errors that have crept into genetic databases, said MacArthur. For example, geneticists have struggled to determine the pathogenicity of some variations in genes that cause AD and other neurodegenerative diseases (see Jul 2012 Alzforum webinar).
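The frequency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` class, variant names, and allele counts are hypothetical, allele frequency is used as a rough proxy for how common a variant is, and the 1 percent cutoff simply mirrors the figure quoted above. None of it reflects the ExAC team's actual pipeline or any real database interface.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical record of one variant in a reference panel."""
    name: str
    allele_count: int   # copies of the variant allele observed
    allele_number: int  # total chromosomes successfully genotyped

    @property
    def allele_frequency(self) -> float:
        # Guard against sites with no genotyped chromosomes.
        if self.allele_number == 0:
            return 0.0
        return self.allele_count / self.allele_number


def too_common_for_rare_disease(variant: Variant, cutoff: float = 0.01) -> bool:
    """Return True if the variant's frequency meets or exceeds the cutoff."""
    return variant.allele_frequency >= cutoff


# Made-up example data, not real ExAC entries.
# 121,412 chromosomes corresponds to 60,706 diploid individuals.
panel = [
    Variant("variant_A", allele_count=1, allele_number=121_412),
    Variant("variant_B", allele_count=2_500, allele_number=121_412),
    Variant("variant_C", allele_count=0, allele_number=0),
]

flagged = [v.name for v in panel if too_common_for_rare_disease(v)]
print(flagged)
```

Here only variant_B, at roughly 2 percent frequency, would be flagged as implausibly common for a rare disease. A variant absent from the panel, like variant_C, is not cleared by this check; it is simply uninformative.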

Another companion paper, appearing August 17 in Genetics in Medicine, looks at this issue for various forms of cardiomyopathy. First authors Roddy Walsh and Kate Thompson, working with senior authors Stuart Cook, Imperial College London, and Hugh Watkins, University of Oxford, U.K., found that 7 to 14 percent of variants previously suggested to cause cardiomyopathy are common in healthy controls, implying they do not cause disease either. Most of the implicated genes came to be associated with disease through candidate gene studies. By contrast, genes pulled from family linkage studies, such as sarcomere genes for hypertrophic cardiomyopathy, were validated in ExAC.

How ExAC will be useful in neurodegenerative disease research is less clear-cut, MacArthur conceded. “Genetically complex, late-onset diseases are extremely difficult to study genetically,” he told Alzforum. However, even for those types of disease, large sets of control individuals can be used to study which variants are more common in the disease vs. the general population, he said. “ExAC individuals are not guaranteed to be free of Alzheimer’s, but they should be broadly representative of the general population, so they do serve as a reasonable control set for these types of large-scale analyses.”

Another snag for research on neurodegeneration is the lack of phenotypic information for individuals in ExAC, wrote Rita Guerreiro, University College London, to Alzforum. “Often researchers end up in a situation where a variant of interest in a novel gene is also present in one or two samples in ExAC,” she said. Those ExAC variants could occur in young, asymptomatic people destined to develop AD, she said. For Alzheimer’s disease and other late-onset neurodegenerative disorders, an ideal reference data set of genetic variability would include only healthy individuals of old age with no changes in their brains, wrote Guerreiro, who was not involved with this work. Together with John Hardy at UCL, she is building a database of healthy exomes from older adults that will help researchers differentiate benign variants from those that cause AD or other dementias. Nevertheless, Guerreiro wrote that ExAC is a “truly unparalleled resource” that has spurred genetic research.

Rudolph Tanzi, Massachusetts General Hospital, Charlestown, agreed. “Access to this new database should be transformative for the field,” he wrote in an email. He noted that the significant number of variants that appear in single individuals are still tentative and will need to be validated.—Gwyneth Dickey Zakaib

References

Paper Citations

1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015 Sep 30;526(7571):68-74. PubMed.
Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ, Altshuler D, Shendure J, Nickerson DA, Bamshad MJ, NHLBI Exome Sequencing Project, Akey JM. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013 Jan 10;493(7431):216-20. Epub 2012 Nov 28 PubMed.
Bras J, Djaldetti R, Alves AM, Mead S, Darwent L, Lleo A, Molinuevo JL, Blesa R, Singleton A, Hardy J, Clarimon J, Guerreiro R. Exome sequencing in a consanguineous family clinically diagnosed with early-onset Alzheimer's disease identifies a homozygous CTSF mutation. Neurobiol Aging. 2016 Oct;46:236.e1-6. Epub 2016 Jul 4 PubMed.

Flood of Exomes Brings Genetic Variation into Focus
