A new machine-learning method accurately identifies regions of the human genome that have been duplicated or deleted – known as copy number variants – that are often associated with autism and other neurodevelopmental disorders.
The new method, developed by researchers at Penn State, integrates data from several algorithms that attempt to identify copy number variants from exome-sequencing data – high-throughput DNA sequencing of only the protein-coding regions of the human genome.
A paper describing the method, which could help clinicians provide more accurate diagnoses for genetic diseases, appears in the July issue of the journal Genome Research.
“Exome sequencing is fast becoming the gold standard for identifying genetic variations in clinical settings because it is faster and less expensive than other methods,” said Santhosh Girirajan, associate professor of biochemistry and molecular biology at Penn State and the lead author of the paper.
“However, current algorithms for identifying copy number variation from exome sequencing data suffer from very high false-positive rates – many of the variants they identify aren’t actually real.
With our new method, called ‘CN-Learn,’ around 90 percent of the copy number variants we report are real.”
The human genome generally contains two copies of every gene, one on each member of a chromosome pair.
When one cell divides into two, the genome is replicated so that each daughter cell gets a full complement of genes. Occasionally, however, errors occur during genome replication that, when present in a sperm or egg cell, can leave an individual with more or fewer than two copies of a gene.
To identify copy number variants from exome-sequencing data, researchers look at the relative amount of DNA sequences produced from each gene.
If there is only one copy of a gene present in an individual, they expect to see fewer sequencing reads than if there are two copies, and three copies of a gene would lead to more reads.
But it’s not quite that simple, because a number of other factors can influence how many sequencing reads are produced from each gene.
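The idealized read-depth reasoning above can be sketched in a few lines of Python. This is only an illustration that ignores the confounding factors just mentioned; the gene names, read counts, and function names are hypothetical and do not come from the paper.

```python
def estimate_copy_number(sample_reads, reference_reads, reference_copies=2):
    """Estimate copy number from read depth relative to a two-copy reference.

    Assumes read counts are already normalized and that depth scales
    linearly with copy number, which real exome data only approximates.
    """
    if reference_reads <= 0:
        raise ValueError("reference_reads must be positive")
    ratio = sample_reads / reference_reads
    return round(ratio * reference_copies)


def classify(copy_number):
    """Label a copy-number estimate relative to the usual two copies."""
    if copy_number < 2:
        return "deletion"
    if copy_number > 2:
        return "duplication"
    return "normal"


# Hypothetical per-gene counts: (reads in the sample, reads expected for two copies)
genes = {"GENE_A": (52, 100), "GENE_B": (98, 100), "GENE_C": (149, 100)}

calls = {
    name: classify(estimate_copy_number(sample, expected))
    for name, (sample, expected) in genes.items()
}
print(calls)
```

Roughly half the expected reads suggests one copy, and roughly one and a half times the expected reads suggests three.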
Researchers have therefore developed several algorithms to try to correctly identify copy number variants from exome-sequencing data.
Individually, however, these algorithms are not particularly reliable.
“Generally, the high number of false positives from copy-number-variant algorithms has been dealt with by using multiple algorithms and only counting the variants identified by all the methods – like a Venn diagram,” said Vijay Kumar Pounraja, a graduate student at Penn State and first author of the paper.
“This approach has multiple drawbacks and limitations, so we decided to develop a new machine-learning method instead.”
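The Venn-diagram strategy described above amounts to a set intersection. The sketch below uses four hypothetical callers and made-up gene-level calls; real pipelines compare genomic intervals rather than exact labels. The limitation it highlights, that a call missed by a single caller is discarded, is offered as one plausible example and is not a drawback the article itself spells out.

```python
# Hypothetical calls from four callers, each call written as (gene, event type).
calls_by_caller = {
    "caller_a": {("GENE_1", "del"), ("GENE_2", "dup"), ("GENE_3", "del")},
    "caller_b": {("GENE_1", "del"), ("GENE_2", "dup")},
    "caller_c": {("GENE_1", "del"), ("GENE_3", "del"), ("GENE_4", "dup")},
    "caller_d": {("GENE_1", "del"), ("GENE_2", "dup"), ("GENE_4", "dup")},
}

# Keep only the calls that every caller agrees on (the center of the Venn diagram).
consensus = set.intersection(*calls_by_caller.values())

# Everything reported by at least one caller, and what strict consensus throws away.
all_calls = set.union(*calls_by_caller.values())
discarded = all_calls - consensus

print("consensus:", sorted(consensus))
print("discarded:", sorted(discarded))
```

Here the GENE_2 duplication is reported by three of the four callers but is still dropped, because strict intersection cannot weigh partial agreement.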
CN-Learn integrates data from four different copy-number-variant algorithms, and uses a small set of biologically validated deletions and duplications to learn the signatures of these genomic events.
This learning process is facilitated by a machine-learning algorithm called Random Forest, which uses hundreds of decision trees to model the relationship between the genetic context of deletions and duplications and the likelihood they are validated.
CN-Learn then uses this model to predict deletions and duplications in other samples without validations.
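The idea of learning from validated calls and then scoring unvalidated ones can be illustrated with a toy ensemble. The sketch below is not CN-Learn's implementation: it replaces full decision trees with one-split "stumps" trained on bootstrap samples, a much-simplified stand-in for Random Forest, and the two features (number of supporting callers and variant size in kilobases) and all data values are assumptions made for illustration.

```python
import random


def fit_stump(rows, labels, feature):
    """Pick the threshold (and direction) on one feature with the fewest errors."""
    best = (len(rows) + 1, None, None)
    for threshold in sorted({row[feature] for row in rows}):
        # Errors if we predict "validated" whenever the value is >= threshold.
        errors = sum((row[feature] >= threshold) != (y == 1)
                     for row, y in zip(rows, labels))
        # Also consider the flipped rule, which makes the complementary errors.
        for err, high_means_valid in ((errors, True), (len(rows) - errors, False)):
            if err < best[0]:
                best = (err, threshold, high_means_valid)
    return feature, best[1], best[2]


def predict_stump(stump, row):
    feature, threshold, high_means_valid = stump
    return int((row[feature] >= threshold) == high_means_valid)


def fit_forest(rows, labels, n_trees=200, seed=0):
    """Train stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature per stump
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_proba(forest, row):
    """Fraction of stumps voting that the candidate call is real."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)


# Hypothetical training set: (callers supporting the call, size in kb) -> validated?
train_rows = [(4, 120), (3, 80), (4, 200), (3, 150), (4, 90),
              (1, 5), (1, 12), (2, 8), (1, 20), (2, 15)]
train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

forest = fit_forest(train_rows, train_labels)
strong_candidate = predict_proba(forest, (4, 250))
weak_candidate = predict_proba(forest, (1, 10))
print(strong_candidate, weak_candidate)
```

Unlike strict intersection, the ensemble returns a score between 0 and 1 for each candidate, so a call with partial support can still be ranked rather than discarded outright.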
“Decisions about a patient’s diagnosis and eventual treatment are made based on this information, so it’s incredibly important to get them right,” said Girirajan.
“Because of this, we’ve made CN-Learn and all of the necessary supporting programs available to download in one easy package.”
Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by social impairments, communication difficulties, and restricted and repetitive patterns of behavior.
ASD usually manifests in infants and children and presents a wide range of symptoms that vary from person to person. Currently, 1 in 59 children in the United States are affected, and prevalence rates are expected to increase drastically over the next decade.1
Models exploring the genetic basis of ASD typically focus on protein-coding genes; however, coding sequences account for only 1.5% of human DNA.
The remaining DNA consists of noncoding regions, which have been shown to play an important role in many genetic disorders.
For example, recessive mutations in the PTF1A gene enhancer can cause pancreatic agenesis,6 a common mutation in the RET enhancer increases risk for Hirschsprung disease,7 and mutations in topologically associating chromatin domains can cause limb malformation.8
Furthermore, a meta-analysis of over a thousand genetic association studies showed that most of the disease-associated single nucleotide variants identified by genome-wide association studies (GWAS) lie in noncoding regions.9
However, the contribution of noncoding variants to ASD remains unclear.
A recent analysis of whole genome sequences of 516 children with ASD and their unaffected family members concluded that individuals with ASD tend to have significantly more de novo mutations in noncoding regions.
The study evaluated two noncoding regions: untranslated regions (UTRs) of genes and conserved transcription factor binding sites that map to sites of DNase I hypersensitivity.10
However, a separate evaluation of the same dataset concluded that although individuals with ASD possessed a small excess of de novo mutations in noncoding regions, there were no significant results across over 50,000 regulatory classes after multiple testing correction.11
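To see why a small excess can fail to reach significance across so many classes, consider a Bonferroni correction, used here purely as an illustration; the correction procedure actually applied in the cited study is not described in this text. The significance level of 0.05 and the example p-values are assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


n_classes = 50_000  # approximate number of regulatory classes tested
threshold = bonferroni_threshold(0.05, n_classes)

# Hypothetical p-values: one nominally strong, one extremely strong.
nominal_hit = 1e-4
extreme_hit = 1e-8

print(f"per-test threshold: {threshold:.1e}")
print("nominal hit survives:", nominal_hit <= threshold)
print("extreme hit survives:", extreme_hit <= threshold)
```

With about 50,000 tests, each individual p-value must fall below roughly one in a million, so a result that looks convincing in isolation may not survive correction.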
As shown by these studies, population genetic analyses typically classify unaffected family members as controls.
However, we hypothesize that this choice of controls does not effectively isolate the variant signal in the genomes of ASD cohorts.
It is possible that family members possess a subclinical phenotype of ASD that arises from genomic features shared with their affected children. In addition, the diagnostic criteria for ASD were modified in 2013 with the release of the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders.
Most parents would have been evaluated using an earlier version of diagnostic criteria, making it possible that some would qualify for an ASD diagnosis by modern clinical standards.
To address this issue and to amplify the signal in noncoding regions, we introduce a separate outgroup of patients with progressive supranuclear palsy (PSP), a neurodegenerative condition that causes difficulty with movement and thought.14
We chose this group of control patients because there is no known etiological overlap or comorbidity between PSP and ASD, and PSP is generally not heritable.
There are some familial cases caused by a mutation in at least one copy of the gene MAPT on chromosome 17, but this is the only gene currently known to be linked with PSP.15
No patients in the control group exhibit symptoms of ASD. In this work, we use whole genome sequencing data from 2182 children with ASD and 379 PSP controls to investigate the role of noncoding variants in ASD susceptibility.
This study focuses on seven major noncoding regions: tissue-specific microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands.
Tissue-specific microRNAs play important roles in the regulation of mRNA expression and the development of neurons, and recent studies have implicated a total of 219 microRNAs in the development of ASD.16
Human accelerated regions, which consist of only 49 highly conserved segments in DNA, have been shown to regulate neural activity, with de novo copy number variations in these regions enriched in individuals with ASD.17
Hypersensitive sites are regulatory regions that are sensitive to cleavage by nucleases, and de novo mutations in these regions are significantly enriched in ASD probands.18
Transcription-factor binding sites are located in the noncoding regions of genes and assist in the regulation of transcription; variants in binding sites in MEGF10 and TCF4 have been associated with ASD and other intellectual disabilities.19,20
DNA repeat sequences and simple repeat sequences are sequences of repeating base pairs, distinguished by the length of the repeating pattern, that have been linked to neuronal differentiation and brain development.21
Finally, CpG islands, which are regions with high frequencies of cytosine-guanine (CpG) dinucleotides, can have higher rates of methylation in individuals with ASD.22
Journal information: Genome Research
Provided by Pennsylvania State University