The human genome consists of more than 3 billion nucleotides, but only 1 to 2 percent actually code for proteins. That protein-coding subset, called the exome, holds the vast majority of mutations that cause severe disease, and screening people’s exomes is far cheaper than sequencing whole genomes. In the August 18 Nature, scientists led by Daniel MacArthur of Massachusetts General Hospital in Boston summarize findings from the largest compilation of exome data yet—the Exome Aggregation Consortium (ExAC). It pulls together sequences from 60,706 donors from 14 studies worldwide. ExAC, which has been running for more than three years, aims to assemble a massive, publicly available database that captures variation in the general human population. “We need databases of normal variation to tell us which changes that we find in a patient are also seen in healthy people,” MacArthur said at a Nature press briefing. “That can be extremely important in identifying which of those genetic changes is actually causal for that patient’s disease.”

Currently available public databases give a somewhat limited view of genetic variation in humans. The 1000 Genomes Project (1000G) offers shallow whole-genome sequences (meaning each nucleotide was sequenced only a few times) as well as deeper exome data, for 2,504 people from all over the world (1000 Genomes Consortium, 2015). The Exome Sequencing Project pools data on the protein-coding regions from 6,515 people, more than twice as many, though it is limited mostly to those of European-American and African-American ancestry (Fu et al., 2013). ExAC expands on that sample size by an order of magnitude and includes more diversity, adding people of Latino, South Asian, and East Asian descent. As this public resource has grown, researchers have already used it more than 5 million times, mostly to look up genetic variants they find in patients to see how common they are across the general population. Those that are very rare in the general population could be good candidates for causing disease, while common variants are less likely to do so. “Virtually all clinical diagnostic labs now use the ExAC resource as their standard reference database for diagnosis of rare disease,” said MacArthur. As its sample size grows, it will become even more valuable, he said. Researchers are also turning to exome analysis to look for rare variants that cause Alzheimer’s and Parkinson’s diseases (Bras et al., 2016; Apr 2015 news; Jun 2016 news).

“This may be the deepest dive into the well of human genetic variation so far,” wrote Jay Shendure, University of Washington, Seattle, in an accompanying News and Views editorial. “There is little doubt that ExAC will both refine and accelerate Mendelian-gene discovery and clinical genetics.”

To develop this database, first author Monkol Lek and colleagues assembled nearly a petabyte of raw sequencing data. That’s 1,000 terabytes or, as MacArthur puts it, 4,000 laptops’ worth of data. They used software developed at the Broad Institute of MIT and Harvard to process it all and found 10 million spots on the exome that vary. They considered 7.5 million of them high quality due to their extensive sequencing depth. This equates to one variant every eight bases. Half were so rare that they occurred only once in the dataset, and nearly three-quarters had never been seen before.
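The one-in-eight figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the exome makes up 2 percent of a 3-billion-base genome, the upper end of the range quoted at the top of this story. The actual size of the sequenced target is not stated here, so the constants and the function name are illustrative assumptions, not values taken from the paper.

```python
# Rough check of "one variant every eight bases".
GENOME_BASES = 3_000_000_000        # "more than 3 billion nucleotides"
EXOME_FRACTION = 0.02               # assumption: upper end of the 1 to 2 percent range
HIGH_QUALITY_VARIANTS = 7_500_000   # high-quality variant sites reported


def bases_per_variant(genome_bases: int, exome_fraction: float, variant_sites: int) -> float:
    """Average number of exome bases per variant site."""
    exome_bases = genome_bases * exome_fraction
    return exome_bases / variant_sites


spacing = bases_per_variant(GENOME_BASES, EXOME_FRACTION, HIGH_QUALITY_VARIANTS)
print(f"about one variant every {spacing:.0f} bases")
```

Under these assumptions the exome spans about 60 million bases, which is consistent with the spacing quoted above. A smaller exome fraction would make variants denser still.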

Variation across exomes was not uniform. In all, 3,230 genes carried fewer variants than would be expected by chance, implying that these genes are less tolerant of variation and therefore highly important for human biology. Just a quarter of them have been linked to disease. “This helps us to identify genes that are likely to be involved in human diseases,” said MacArthur.
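To illustrate what "fewer variants than expected" can mean in practice, here is a minimal sketch that compares observed and expected variant counts per gene. It is not the consortium's statistical model. The z-score formula, the cutoff, and the gene names and counts are all hypothetical, chosen only to show the idea of flagging depleted genes.

```python
import math


def depletion_z(observed: int, expected: float) -> float:
    """Signed z-score; positive values mean fewer variants than expected.

    Assumes counts are roughly Poisson, so the standard deviation
    is approximated by the square root of the expected count.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (expected - observed) / math.sqrt(expected)


# Hypothetical genes: (observed variants, expected variants).
GENE_COUNTS = {
    "GENE_A": (12, 100.0),
    "GENE_B": (95, 100.0),
    "GENE_C": (40, 49.0),
}

Z_CUTOFF = 3.0  # assumption: arbitrary illustrative threshold


def intolerant_genes(counts: dict, cutoff: float = Z_CUTOFF) -> list:
    """Return names of genes whose depletion z-score meets the cutoff."""
    return sorted(
        name for name, (obs, exp) in counts.items()
        if depletion_z(obs, exp) >= cutoff
    )


flagged = intolerant_genes(GENE_COUNTS)
print(flagged)
```

In this toy data only the first gene is strongly depleted. A real analysis would need a calibrated model of how many variants each gene is expected to carry, which this sketch simply takes as given.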

How well different genes handle genetic variation was the subject of a companion paper appearing August 17 in Nature Genetics. First author Douglas Ruderfer, working with Menachem Fromer and Shaun Purcell at the Icahn School of Medicine at Mount Sinai, New York, used the ExAC database to look more closely at copy number variation (CNV), meaning deletions or duplications of coding sequence. Genes previously reported to be evolutionarily conserved and less tolerant of single-nucleotide changes likewise had fewer CNVs, hinting that these genes are highly important. The authors found that genes highly expressed in the brain were particularly intolerant of CNVs. By contrast, genes highly expressed in the liver, pancreas, and duodenum carried more CNVs, suggesting they are more flexible when it comes to gene dosage.

At the same time, ExAC is forcing scientists to rethink some variants that were previously implicated in rare diseases. More than 100 of these variants occurred in at least 1 percent of the general population, making it implausible that they cause rare disease. This illustrates how ExAC can help correct errors that have crept into genetic databases, said MacArthur. For example, geneticists have struggled to determine the pathogenicity of some variations in genes that cause Alzheimer’s disease (AD) and other neurodegenerative diseases (see Jul 2012 Alzforum webinar).
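The 1 percent check described above amounts to a simple frequency filter. The sketch below assumes the cutoff is applied to allele frequency (allele count divided by the number of chromosomes genotyped), which is one common reading of "1 percent of the population". The record type, variant identifiers, and counts are hypothetical and are not drawn from ExAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCount:
    variant_id: str      # hypothetical identifier
    allele_count: int    # chromosomes carrying the variant
    allele_number: int   # chromosomes successfully genotyped

    @property
    def allele_frequency(self) -> float:
        # Guard against sites with no genotyped chromosomes.
        if self.allele_number == 0:
            return 0.0
        return self.allele_count / self.allele_number


COMMON_THRESHOLD = 0.01  # assumption: 1 percent cutoff applied to allele frequency


def too_common_for_rare_disease(variants, threshold=COMMON_THRESHOLD):
    """Return IDs of variants at or above the frequency threshold."""
    return [v.variant_id for v in variants if v.allele_frequency >= threshold]


# 60,706 people carry 121,412 chromosomes at an autosomal site.
reported_pathogenic = [
    VariantCount("var_1", 3, 121_412),
    VariantCount("var_2", 2_500, 121_412),
    VariantCount("var_3", 0, 0),
]

suspect = too_common_for_rare_disease(reported_pathogenic)
print(suspect)
```

A variant flagged this way is not proven benign. The filter only marks claims of rare-disease causality that deserve a second look.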

Another companion paper, appearing August 17 in Genetics in Medicine, looks at this issue for various forms of cardiomyopathy. First authors Roddy Walsh and Kate Thomson, working with senior authors Stuart Cook, Imperial College London, and Hugh Watkins, University of Oxford, U.K., respectively, found that 7 to 14 percent of variants previously suggested to cause cardiomyopathy are common in healthy controls, implying they do not cause disease either. Most of these genes came to be associated with disease through candidate gene studies. By contrast, genes pulled from family linkage studies, such as sarcomere genes for hypertrophic cardiomyopathy, were validated in ExAC.

How ExAC will be useful in neurodegenerative disease research is less clear-cut, MacArthur conceded. “Genetically complex, late-onset diseases are extremely difficult to study genetically,” he told Alzforum. However, even for those types of disease, large sets of control individuals can be used to study which variants are more common in the disease vs. the general population, he said. “ExAC individuals are not guaranteed to be free of Alzheimer’s, but they should be broadly representative of the general population, so they do serve as a reasonable control set for these types of large-scale analyses.”  

Another snag for research on neurodegeneration is the lack of phenotypic information for individuals in ExAC, wrote Rita Guerreiro, University College London, to Alzforum. “Often researchers end up in a situation where a variant of interest in a novel gene is also present in one or two samples in ExAC,” she said. Those ExAC variants could occur in young, asymptomatic people destined to develop AD, she said. For Alzheimer’s disease and other late-onset neurodegenerative disorders, an ideal reference data set of genetic variability would include only healthy individuals of old age with no changes in their brains, wrote Guerreiro, who was not involved with this work. Together with John Hardy at UCL, she is building a database of healthy exomes from older adults that will help researchers differentiate benign variants from those that cause AD or other dementias. Nevertheless, Guerreiro wrote that ExAC is a “truly unparalleled resource” that has spurred genetic research.

Rudolph Tanzi, Massachusetts General Hospital, Charlestown, agreed. “Access to this new database should be transformative for the field,” he wrote in an email. He noted that the significant number of variants that appear in single individuals are still tentative and will need to be validated.—Gwyneth Dickey Zakaib

References

News Citations

  1. The PLD3 Gene: Alzheimer's Risk Factor or False Alarm?
  2. A New Gene for Familial Parkinson’s?

Webinar Citations

  1. Weeding Mendel’s Garden: Can We Hoe Dubious Genetic Associations?

Paper Citations

  1. 1000 Genomes Consortium. A global reference for human genetic variation. Nature. 2015 Sep 30;526(7571):68-74.
  2. Fu et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013 Jan 10;493(7431):216-20. Epub 2012 Nov 28.
  3. Bras et al. Exome sequencing in a consanguineous family clinically diagnosed with early-onset Alzheimer's disease identifies a homozygous CTSF mutation. Neurobiol Aging. 2016 Oct;46:236.e1-6. Epub 2016 Jul 4.

External Citations

  1. ExAC

Primary Papers

  1. Lek et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285-291.
  2. Ruderfer et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat Genet. 2016 Aug 17.
  3. Walsh et al. Reassessment of Mendelian gene pathogenicity using 7,855 cardiomyopathy cases and 60,706 reference samples. Genet Med. 2017 Feb;19(2):192-203. Epub 2016 Aug 17.