Twenty years after the release of the human genome, the genetic “blueprint” of human life, an international research team, including the University of British Columbia’s Chris Overall, has now mapped the first draft sequence of the human proteome.
Their work was published Oct. 16 in Nature Communications and announced today by the Human Proteome Organization (HUPO).
“Today marks a significant milestone in our overall understanding of human life,” says Overall, a professor in the faculty of dentistry and a member of the Centre for Blood Research at UBC. “Whereas the human genome provides a complete ‘blueprint’ of human genes, the human proteome identifies the individual building blocks of life encoded by this blueprint: proteins. Proteins interact to shape everything from life-threatening diseases to cellular structure in our bodies.”
With 90 percent of the proteins in the human body now mapped, Overall says scientists have a deeper understanding of how individual proteins interact to influence human health, providing insights into disease prevention and individualized medicine.
Their work may have implications for scientists studying potential treatments for COVID-19.
“In COVID-19, for instance, there are two proteomes involved, that of the SARS-CoV-2 virus and that of the infected cells, both of which likely interact with, modify, and change the function of the other,” says Overall.
“Understanding this relationship can shed light on why some cells and individuals are more resilient to COVID-19 and others more vulnerable, providing essential functional information about the human body that genomics alone cannot answer.”
As many human diseases result from changes in the composition or functions of proteins, mapping the proteome strengthens the foundation for disease diagnosis, prediction of outcomes, treatment, and precision medicine.
“Humans share 99.9 percent of their DNA between individuals, yet deficiencies in the proteome ‘parts’ stemming from inherited genetic mutations can lead to genetic diseases, or defective or inadequate immune and cellular responses to environmental, nutritional and infection stressors,” says Overall.
“Knowing which proteins are key to protection from disease, and the deficiencies in expression or activity that are hallmarks of disease, can inform individualized medicine and the development of new therapies.”
The widespread application of whole-exome sequencing (WES) and whole-genome sequencing (WGS) requires interpretation of individual sequence variants. Recommendations exist for variants in genes already associated with a disease (1).
However, when searching for novel gene-to-disease associations or novel disease mechanisms, more advanced annotation and interpretation are needed (2).
Complementary methods to prioritize variants based on function or evolutionary properties such as sequence conservation, genetic effects and regulatory element annotations can serve to improve power and ultimately the success of disease studies (3).
Looking for sequences conserved across species is an efficient strategy, and many tools have been developed to search for evolutionarily conserved elements (4). The underlying assumption is that evolutionarily conserved regions tend to be less tolerant to mutations (5).
Moreover, in the coding region of the genome, domain-specific scoring systems based on sequence homology are widely used for variant evaluation, for example SIFT (5), PolyPhen2 (6) or CADD (3, 7).
Protein domains are conserved structural entities, and information about a protein domain may provide clues as to the structure and/or function of the protein studied. Furthermore, it may help to establish evolutionary relationships across protein families (8). Last, but not least, it is useful in the interpretation of mutation studies.
Being able to determine whether specific human genetic variants fall onto annotated conserved protein domains may help in three areas: variant mapping studies in rare diseases, the global evaluation of genomic variation in relation to protein architecture, and the comprehensive evaluation of the population-level proteome as derived from genetically diverse human populations.
Several tools for next-generation sequencing variant annotation, which determine whether specific human genetic variants fall onto annotated conserved protein domains, have recently been published.
For example, Protein Data Bank’s Map Genomic Position to Protein Sequence and 3D Structure (https://www.rcsb.org/pdb/chromosome.do) (9) provides users with multiple views and features. However, this tool handles only a single query (one variant at a time), and to get information about the protein domain, the user has to take additional steps via the UniProt hyperlink. Furthermore, the data cannot be downloaded or incorporated into the user's own pipeline.
Similarly, Ensembl’s Variant Effect Predictor (VEP) (http://grch37.ensembl.org/Homo_sapiens/Tools/VEP) (10) lets the user annotate a batch of variants with ‘protein domains’. However, using this tool requires uploading the VCF file to the Ensembl server, which is often not possible for security or data privacy reasons. Download is possible with the Ensembl VEP; however, to get the desired information, the user then needs to work with multiple tables.
Here we present a tool that can perform multiple queries (batch of 100 variants) in a simple way and give a quick, easily readable answer. Moreover, it is possible to download the entire dataset and integrate this annotation into a local pipeline.
The National Center for Biotechnology Information (NCBI) provides a comprehensive source of the human reference genome assembly (https://www.ncbi.nlm.nih.gov/refseq/) and is an essential resource for genomic, genetic and proteomic research (11).
For proteins annotated on NM_transcript accessions, the information includes the known protein domains and their sequences, architecture and cross-species conservation. Although the NCBI resource includes this information, it does so only at the protein level. Reference Sequence (RefSeq) therefore offers no connection to genomic coordinates, and its usefulness for variant annotation is limited.
We developed a resource that annotates variants in the coding part of the genome onto conserved protein domains in an easy-to-use way, which could help make variant interpretation more efficient. The resource thus provides this missing connection for large-scale genome/proteome studies and goes beyond variant prioritization.
Our data are freely accessible through a dedicated open web-based query tool (www.prot2hg.com) or as a downloadable dataset. Moreover, data were incorporated into the GENESIS platform for analysis and matchmaking of exome and genome data from rare diseases (tgp-foundation.org).
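As an illustration of how a downloaded domain-to-genome dataset might be used in a local pipeline, the following Python sketch checks whether variants fall inside annotated domain regions and splits a variant list into batches of 100. It is not the tool's actual code or file format: the `DomainRegion` fields, the 1-based inclusive coordinate convention, the function names and the example records (`GENE_A`, `DOMAIN_X` and so on) are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class DomainRegion(NamedTuple):
    """One domain region in genomic coordinates (hypothetical layout)."""
    chrom: str
    start: int   # assumed 1-based, inclusive
    end: int     # assumed 1-based, inclusive
    gene: str
    domain: str


def build_index(regions):
    """Group regions by chromosome and sort each group by start position."""
    index = defaultdict(list)
    for region in regions:
        index[region.chrom].append(region)
    for chrom in index:
        index[chrom].sort(key=lambda r: r.start)
    return index


def annotate(index, chrom, pos):
    """Return every domain region on `chrom` that contains position `pos`."""
    hits = []
    for region in index.get(chrom, []):
        if region.start > pos:
            break  # regions are sorted by start, so no later region can match
        if pos <= region.end:
            hits.append(region)
    return hits


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Invented example records, for illustration only.
EXAMPLE_REGIONS = [
    DomainRegion("1", 100, 200, "GENE_A", "DOMAIN_X"),
    DomainRegion("1", 150, 300, "GENE_A", "DOMAIN_Y"),
    DomainRegion("2", 50, 60, "GENE_B", "DOMAIN_Z"),
]

if __name__ == "__main__":
    idx = build_index(EXAMPLE_REGIONS)
    variants = [("1", 175), ("1", 250), ("3", 10)]
    for batch in batches(variants, size=100):
        for chrom, pos in batch:
            names = [r.domain for r in annotate(idx, chrom, pos)]
            print(chrom, pos, names)
```

A linear scan per chromosome keeps the sketch short and handles overlapping domains correctly; for genome-scale input an interval tree or a tabix-indexed file would likely be a better fit.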
Materials and methods
A protein domain is a segment within a protein with a known sequence and a previously described function.
To ultimately map each protein domain to a chromosomal location, a workflow with four key steps was implemented: obtain data from NCBI; process data to create a custom data structure; map protein domain locations to cDNA sequence; and map cDNA sequence location to hg19 genomic location (Figure 1).
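The last two steps of this workflow can be sketched in a few lines of Python. This is a minimal illustration of the coordinate arithmetic, not the authors' implementation: the function names are hypothetical, all coordinates are assumed to be 1-based and inclusive, and the exon list is assumed to contain only the coding portion of each exon (UTRs already trimmed).

```python
def aa_to_cds(aa_start, aa_end):
    """Convert an amino-acid range to a coding-sequence (CDS) base range.

    Residue n occupies CDS bases 3n-2 .. 3n (1-based, inclusive).
    """
    return (aa_start - 1) * 3 + 1, aa_end * 3


def cds_to_genomic(cds_start, cds_end, coding_exons, strand):
    """Project a CDS range onto genomic coordinates.

    coding_exons: list of (genomic_start, genomic_end) coding segments.
    strand: "+" or "-".
    Returns genomic segments sorted by position; a domain that spans an
    intron yields more than one segment.
    """
    # Walk the exons in transcript order (descending for the minus strand).
    ordered = sorted(coding_exons, reverse=(strand == "-"))
    segments = []
    offset = 0  # coding bases consumed before the current exon
    for g_start, g_end in ordered:
        length = g_end - g_start + 1
        # This exon covers CDS positions offset+1 .. offset+length.
        lo = max(cds_start, offset + 1)
        hi = min(cds_end, offset + length)
        if lo <= hi:
            if strand == "+":
                segments.append((g_start + (lo - offset - 1),
                                 g_start + (hi - offset - 1)))
            else:
                segments.append((g_end - (hi - offset - 1),
                                 g_end - (lo - offset - 1)))
        offset += length
    return sorted(segments)


if __name__ == "__main__":
    # Invented transcript: two coding exons of 30 and 60 bases.
    exons = [(1000, 1029), (2000, 2059)]
    cds_range = aa_to_cds(8, 15)  # domain covering residues 8-15
    print(cds_range)
    print(cds_to_genomic(*cds_range, exons, "+"))
    print(cds_to_genomic(*cds_range, exons, "-"))
```

With the invented two-exon transcript, residues 8 to 15 correspond to CDS bases 22 to 45. On the plus strand these split across the intron into (1021, 1029) and (2000, 2014); on the minus strand they fall entirely within the exon that is transcribed first, giving (2015, 2038).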
- 1. Richards S., Aziz N., Bale S. et al. (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med., 17, 405–424.
- 2. Boycott K.M., Lau L.P., Cutillo C.M. et al. (2019) International collaborative actions and transparency to understand, diagnose, and develop therapies for rare diseases. EMBO Mol. Med., 11, e10486.
- 3. Rentzsch P., Witten D., Cooper G.M. et al. (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res., 47, D886–D894.
- 4. Siepel A., Bejerano G., Pedersen J.S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15, 1034–1050.
- 5. Sim N.L., Kumar P., Hu J. et al. (2012) SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res., 40, W452–W457.
- 6. Adzhubei I.A., Schmidt S., Peshkin L. et al. (2010) A method and server for predicting damaging missense mutations. Nat. Methods, 7, 248–249.
- 7. Kircher M., Witten D.M., Jain P. et al. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet., 46, 310–315.
- 8. Sayers E. and Bryant S. (2002) Macromolecular Structure Databases. In: The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US). Available from: https://www.ncbi.nlm.nih.gov/books/NBK21095/
- 9. Prlic A., Kalro T., Bhattacharya R. et al. (2016) Integrating genomic information with protein sequence and 3D atomic level structure at the RCSB protein data bank. Bioinformatics, 32, 3833–3835.
- 10. McLaren W., Gil L., Hunt S.E. et al. (2016) The Ensembl Variant Effect Predictor. Genome Biol., 17, 122.
- 11. O’Leary N.A., Wright M.W., Brister J.R. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
More information: Subash Adhikari et al, A high-stringency blueprint of the human proteome, Nature Communications (2020). DOI: 10.1038/s41467-020-19045-9