A new SARS-CoV-2 mutation deletes 81 letters from the genome – what will be the consequences?


As the coronavirus pandemic has swept across the U.S., a worldwide scientific community has been tracking not only the number of daily COVID-19 cases but also the SARS-CoV-2 virus itself.

Efrem Lim leads a team at ASU that looks at how the virus may be spreading, mutating and adapting over time.

To trace the trail of the virus worldwide, Lim’s team is using a new technology called next-generation sequencing at ASU’s Genomics Facility, to rapidly read through all 30,000 chemical letters of the SARS-CoV-2 genetic code, called a genome.

Each sequence is deposited into a worldwide gene bank, run by a nonprofit scientific organization called GISAID.

To date, over 16,000 SARS-CoV-2 sequences have been deposited in GISAID’s EpiCoV™ database. The sequence data show that SARS-CoV-2 originated from a single source in Wuhan, China, while many of the first Arizona cases analyzed showed travel from Europe as the most likely source.

Now, using a pool of 382 nasal swab samples obtained from possible COVID-19 cases in Arizona, Lim’s team has identified a SARS-CoV-2 mutation that had never been found before, where 81 of the letters have vanished, permanently deleted from the genome.

The study was published in the online version of the Journal of Virology.

Lim says that as soon as he made the manuscript data available on the preprint server medRxiv, it attracted worldwide interest from the scientific community, including the World Health Organization.

“One of the reasons why this mutation is of interest is because it mirrors a large deletion that arose in the 2003 SARS outbreak,” said Lim, an assistant professor at ASU’s Biodesign Institute.

During the middle and late phases of the SARS epidemic, SARS-CoV accumulated mutations that attenuated the virus. Scientists believe that a weakened virus that causes less severe disease may have a selective advantage if it is able to spread efficiently through populations by people who are infected unknowingly.

Teasing apart what exactly this means is of profound interest to Lim and his colleagues. The ASU research team includes LaRinda A. Holland, Emily A. Kaelin, Rabia Maqsood, Bereket Estifanos, Lily I. Wu, Arvind Varsani, Rolf U. Halden, Brenda G. Hogue and Matthew Scotch.

The ASU virology team had been set up to perform research on seasonal flu viruses, but when the third case of COVID-19 was found in an Arizona individual on January 26, 2020, they knew they had all the technical and scientific prowess to rapidly pivot to examining the spread of SARS-CoV-2.

“This was the scientific opportunity of a lifetime for ASU to be able to contribute to understand how this virus is spreading in our community,” said Lim. “As a team, we knew we could make a significant difference.”

The SARS-CoV-2 viral genomes from all the positive cases differed from one another, meaning the infections were independent of each other. This indicates that the new cases were not linked to the first Arizona case in January, but were the result of recent travel from different locations.

In the case of the 81-base pair mutation, because it has never been found before in the GISAID database, it could also provide a clue into how the virus makes people sick.

It could also form a new starting point for other scientists to develop antiviral drugs or formulate new vaccines.

A genome deletion removes 27 protein building blocks, called amino acids, from the SARS-CoV-2 accessory protein ORF7a. The protein is very similar to the 2003 SARS-CoV immune antagonist ORF7a/X4. Image is credited to Efrem Lim, ASU Biodesign Institute.

SARS-CoV-2 makes accessory proteins that help it infect its human host, replicate and eventually spread from person to person.

The genome deletion removes 27 protein building blocks, called amino acids, from the SARS-CoV-2 accessory protein ORF7a. The protein is very similar to the 2003 SARS-CoV immune antagonist ORF7a/X4.
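The two numbers are linked by simple arithmetic: each amino acid is encoded by a codon of three nucleotides, so an 81-letter deletion that stays in the reading frame removes 81 / 3 = 27 amino acids. A minimal sketch of that calculation follows; the function name is illustrative and not taken from the study.

```python
CODON_LENGTH = 3  # nucleotides that encode one amino acid


def amino_acids_removed(deleted_nucleotides: int) -> int:
    """Return the number of amino acids removed by an in-frame deletion.

    Raises ValueError when the deletion length is not a multiple of three,
    because such a deletion would shift the reading frame instead.
    """
    if deleted_nucleotides % CODON_LENGTH != 0:
        raise ValueError("deletion is not in frame (length not divisible by 3)")
    return deleted_nucleotides // CODON_LENGTH


print(amino_acids_removed(81))  # 27
```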

The ASU team is now hard at work performing further experiments to understand the functional consequences of the viral mutation. The viral protein is thought to help SARS-CoV-2 evade human defenses, eventually killing the cell.

This frees up the virus to infect other cells in a cascading chain reaction that can quickly cause the virus to make copies of itself throughout the body, eventually causing the serious COVID-19 symptoms 8-14 days after the initial infection.

Lim points out that only 16,000 SARS-CoV-2 genomes have been sequenced to date, which is less than 0.5% of the strains circulating. There are currently more than 3.5 million confirmed COVID-19 cases worldwide.
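As a quick check of that percentage, the figures quoted above can be divided directly. The sketch below uses the rounded values from the text (16,000 genomes, 3.5 million confirmed cases), so the result is only approximate.

```python
sequenced_genomes = 16_000     # genomes sequenced to date, as quoted above
confirmed_cases = 3_500_000    # confirmed COVID-19 cases worldwide, as quoted above

# Share of confirmed cases for which a viral genome has been sequenced.
share_percent = sequenced_genomes / confirmed_cases * 100
print(f"{share_percent:.2f}%")  # 0.46%
```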

Lim’s group has teamed up with TGen, UA and Northern Arizona University to continue tracking different genetic strains of the new coronavirus. Together, the newly formed Arizona COVID-19 Genomics Union (ACGU) hopes to use big data analysis and genetic mapping to give Arizona health care providers and public policy makers an edge in fighting the growing pandemic.

Funding: The work was supported by NSF STC Award 1231306, NIH grants R01 LM013129 and R00 DK107923, the J.M. Kaplan Foundation’s One Water One Health (Arizona State University Foundation project 30009070), and ASU Core Facilities Seed Funding.

SARS-CoV-2, the novel coronavirus behind the COVID-19 pandemic, is acquiring new mutations in its genome. Although some mutations benefit the virus against the human immune response, a number of them may reduce its pathogenicity and virulence.

By analyzing more than 3000 high-coverage, complete genome sequences deposited in the GISAID database, here I report a unique 28881-28883:GGG>AAC trinucleotide-bloc mutation in the SARS-CoV-2 genome that results in two sub-strains, described here as SARS-CoV-2g (28881-28883:GGG genotype) and SARS-CoV-2a (28881-28883:AAC genotype). Computational analysis and literature review suggest that this bloc mutation would bring 203-204:RG (arginine-glycine) > KR (lysine-arginine) amino acid changes in the nucleocapsid (N) protein, affecting the SR (serine-arginine)-rich motif of the protein, a critical region for the transcription of viral RNA and replication of the virus.
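The link between the three-nucleotide change and the two amino acid changes can be sketched in a few lines. The example below rests on assumptions that do not come from this text: that the N gene starts at position 28274 (1-based) in the NCBI reference genome NC_045512.2, and that the reference bases at positions 28880-28885 are AGGGGA. Function and variable names are illustrative.

```python
# Assumption: 1-based start of the N gene in the NCBI reference NC_045512.2.
N_GENE_START = 28274
# Assumption: reference bases at positions 28880-28885 (codons 203 and 204 of N).
REF_WINDOW_START = 28880
REF_WINDOW = "AGGGGA"

# Only the four codons needed here, taken from the standard genetic code.
CODONS = {"AGG": "R", "GGA": "G", "AAA": "K", "CGA": "R"}


def codon_number(position: int, gene_start: int = N_GENE_START) -> int:
    """Map a 1-based genome position to a 1-based codon number in the gene."""
    return (position - gene_start) // 3 + 1


def apply_substitution(window: str, window_start: int,
                       first_pos: int, new_bases: str) -> str:
    """Replace bases in the window, starting at genome position first_pos."""
    offset = first_pos - window_start
    return window[:offset] + new_bases + window[offset + len(new_bases):]


def translate(seq: str) -> str:
    """Translate an in-frame nucleotide string using the small table above."""
    return "".join(CODONS[seq[i:i + 3]] for i in range(0, len(seq), 3))


mutant = apply_substitution(REF_WINDOW, REF_WINDOW_START, 28881, "AAC")
print([codon_number(p) for p in (28881, 28882, 28883)])  # [203, 203, 204]
print(translate(REF_WINDOW), "->", translate(mutant))    # RG -> KR
```

Because the three mutated positions straddle a codon boundary, the single bloc mutation changes two neighboring residues rather than one.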

Thus, 28881-28883:GGG>AAC bloc-mutation is expected to modulate the pathogenicity of the SARS-CoV-2. Remarkably, SARS-CoV-2g and SARS-CoV-2a strains can be linked with the heterogeneity of COVID-19 cases across different regions within and between countries by analyzing existing data.

Sequence analysis suggests that severely affected places, such as Milan, Lombardy, New York and Paris, have a predominant presence of SARS-CoV-2g strains, whereas less affected places like Abruzzo, Lyon and Valencia have a relatively higher presence of SARS-CoV-2a, an indication that the latter strain may contribute to the reduced number of COVID-19 cases. A similar relationship is observed when the Netherlands and Portugal are compared with Spain, France and Germany.

These analyses suggest that SARS-CoV-2 has already evolved into a less infective SARS-CoV-2a, affecting COVID-19 cases in different regions. The time a country or region needs to acquire SARS-CoV-2a strains may be indicative of the time it would need to overcome the peak of COVID-19 cases.

To confirm these assumptions, prompt retrospective and prospective epidemiological studies should be conducted in different countries to understand the course of pathogenicity of SARS-CoV-2a and SARS-CoV-2g. Potential drugs could be designed to target the 28881-28883 region of the N protein to modulate virus pathogenicity.

The genome organization of SARS-CoV-2 is similar to that of other coronaviruses 1. It has Open Reading Frames (ORFs) common to all betacoronaviruses (Figure-1), which include ORF1ab, responsible for most of the enzymatic proteins, the surface glycoproteins (S), the envelope proteins (E), the membrane proteins (M) and the nucleocapsid proteins (N). There are also several nonstructural proteins expressed mostly from ORF3a, ORF6a, ORF7a and ORF8a. The reference genome of SARS-CoV-2 also includes ORF10a, as shown in figure S1, Table-1.

Figure-1: The genome of the SARS-CoV-2 has a very similar architecture with other beta coronaviruses. Image collected from 1
Table-1: Size and span of the ORFs in SARS-CoV-2 according to the NCBI reference genome sequence

Notably, whole-genome sequencing of SARS-CoV-2 and deposition in public databases have been progressing at an unprecedented pace during this outbreak. Up until April 10, 2020, more than 3500 high-coverage, complete genome sequences of SARS-CoV-2 had been submitted to GISAID (Global Initiative on Sharing All Influenza Data), maintained by MPII (Max Planck Institute for Informatics).

After a careful analysis of the whole genome sequences in the GISAID database, this study has established that a unique trinucleotide-bloc mutation, 28881-28883:GGG>AAC, might have occurred recently, giving rise to a new subtype of SARS-CoV-2 with potential impacts on the course of the COVID-19 pandemic. This bloc mutation maps within the nucleocapsid (N) gene according to the SARS-CoV-2 reference genome. The N protein plays a critical role in assembling the coronavirus RNA genome and creating a shell around the enclosed nucleic acid. It also interacts with the viral membrane protein during viral assembly and assists in RNA synthesis, folding and virus budding. The protein also affects host cell responses to the viral infection, including cell cycle regulation and modulation of immune responses.

The 28881-28883:GGG>AAC mutation affects the SR (serine-arginine)-rich domain of the N protein. In SARS-CoV-1, the closest neighbor to SARS-CoV-2, it has previously been shown that an experimentally introduced deletion in the SSRSSSRSRGNSR region of the SR-rich motif significantly reduces the number of infectious virions 3. The 28881-28883:GGG>AAC mutation affects the location adjacent to that region, and so is expected to impact the pathogenicity of SARS-CoV-2 in a similar manner. This assumption is supported by an analysis combining sequence information from the GISAID database with COVID-19 case counts from live trackers for different regions around the globe. From this exercise, it has become evident that regions with low or moderate numbers of COVID-19 cases have a prevalence of the 28881-28883:AAC genotype (SARS-CoV-2a), whereas the highly affected regions predominantly have the 28881-28883:GGG genotype (SARS-CoV-2g).

The history of previous infections suggests the evolution of viruses with different pathogenicity acquired through mutations 4 5. Although hundreds of mutations have been reported in the SARS-CoV-2 genome to date, the trinucleotide bloc mutation reported and characterized in this study has unique features with a potential impact on the pathogenicity of the virus.

The results suggest that by monitoring the prevalence of the SARS-CoV-2a and SARS-CoV-2g strains, countries may track the course of the COVID-19 pandemic. Potential drugs could be designed to target the SR-rich motif of the N protein to curb the pathogenicity of SARS-CoV-2.
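For genomes that have already been sequenced, such monitoring could be sketched as below. This is a minimal illustration, not the method used in the study: it assumes each genome string is in reference coordinates (no insertions or deletions upstream of position 28883), and the function names and toy sequences are hypothetical.

```python
from collections import Counter

# 0-based slice covering the 1-based genome positions 28881-28883.
BLOC_SLICE = slice(28880, 28883)


def classify_strain(genome: str) -> str:
    """Label a genome by its trinucleotide at positions 28881-28883.

    Assumes the genome string is in reference coordinates.
    """
    trinucleotide = genome[BLOC_SLICE].upper()
    if trinucleotide == "GGG":
        return "SARS-CoV-2g"
    if trinucleotide == "AAC":
        return "SARS-CoV-2a"
    return "other"


def strain_prevalence(genomes: list[str]) -> dict[str, float]:
    """Return the fraction of genomes carrying each label."""
    counts = Counter(classify_strain(g) for g in genomes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Synthetic placeholder sequences, not real data: only the bloc is meaningful.
toy_g = "N" * 28880 + "GGG" + "N" * 100
toy_a = "N" * 28880 + "AAC" + "N" * 100
print(strain_prevalence([toy_g, toy_g, toy_g, toy_a]))
# {'SARS-CoV-2g': 0.75, 'SARS-CoV-2a': 0.25}
```

In practice the sequences would first need to be aligned to the reference genome, since raw assemblies can differ in length at the 5' end and would shift the slice.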

However, some assumptions need to be confirmed with more retrospective and prospective research. Special attention should be given to trace back the COVID-19 affected human samples from where the SARS-CoV-2 sequences were obtained and follow up with their clinical outcome.


Hundreds of mutations have been reported in SARS-CoV-2 so far, and the tally is increasing as more sequences are deposited in the public databases. It is often a challenge to make practical use of those sequence (and mutation) data.

This study reports for the first time the rise and probable impacts of two strains, SARS-CoV-2a and SARS-CoV-2g, from the original SARS-CoV-2 strain after analyzing available sequence and COVID-19 case data. The mutually exclusive nature of these two strains may work as an anchor to follow them both retrospectively and prospectively.

The uniqueness of the trinucleotide mutation (28881-28883:GGG>AAC) makes it a strong candidate for following the trend of the COVID-19 pandemic across regions. The molecular analysis presented in this paper provides grounds to assume that SARS-CoV-2a is linked with fewer cases of infection because of the mutated SR motif, which is important for viral replication.

However, this needs to be confirmed by

i) further laboratory experiments on the particular location in the SR motif and

ii) epidemiological research by matching the sequence data from different countries with their COVID-19 patients.

Factors that may contribute to the GGG>AAC conversion should also be investigated. Demography, nutritional status, geographical location and environmental factors may play a role in this conversion, as SARS-CoV-2g (GGG) strains empirically seem to be predominant in megacities.

This study could explain the COVID-19 case numbers in different countries from which reliable data were obtainable. However, an explanation for the difference in fatality remains elusive. In a comparison between Lombardy and Abruzzo, it appears that lethality is lower in SARS-CoV-2a-infected areas.

This remains true when different regions of the Netherlands are compared. However, when a country-wise comparison is made, the picture is not clear-cut. Notably, Germany (with a low prevalence of SARS-CoV-2a strains and more COVID-19 cases) has much lower fatality than the Netherlands or Brazil, both of which have a higher presence of SARS-CoV-2a. An obvious explanation is the difference in healthcare provision, age distribution and other local and policy factors between countries.

Nevertheless, based on the information on the two strains of SARS-CoV-2, fatality can be discussed from a molecular perspective too. Among the mutational differences between the two strains discussed above, it is particularly important to note that the ORF3a gene in the SARS-CoV-2a strain remains unmutated compared with SARS-CoV-2g, where in many cases either the 25563:G>T or the 26144:G>T mutation is present in a mutually exclusive manner.

It is already known that ORF3a plays a critical role in inducing an overreaction of inflammatory cytokines, which often leads to ‘cytokine storms’ 13, one of the most important causes of fatality from COVID-19.

The complete absence of the 25563:G>T and 26144:G>T mutations in SARS-CoV-2a indicates that this strain will express an active ORF3a protein, whereas more than 40% of SARS-CoV-2g strains might be mutated for this gene (~33% 25563:G>T and ~9% 26144:G>T) (Figure-3).

This implies that SARS-CoV-2a, although less infective because of the mutated N protein, might be more lethal than those SARS-CoV-2g strains that carry ORF3a mutations.

This explanation is supported by the sequence data from Germany, where 45% of strains (N=52) carry 25563:G>T and 6% (N=52) carry 26144:G>T. This extrapolation should be considered with caution, as there might be other attenuating mutations and confounding factors.

However, if 28881-28883:GGG>AAC is a decisive change that makes the SARS-CoV-2a less pathogenic compared to the SARS-CoV-2g, then 203-204:RG>KR positions of the N protein should be targeted to design drugs to affect the replication of the virus and thus reduce the pathogenicity of SARS-CoV-2 infection.

Mathematical models to predict the course of the COVID-19 pandemic should consider the impact of the 28881-28883:GGG>AAC mutation in the SARS-CoV-2 genome to better understand the course of the infection and guide nations’ preparedness.

For nations with no elaborate facilities for whole-genome sequencing, RT-PCR-based testing that targets the 28881-28883 region should be recommended. This will give diagnostic information on COVID-19 together with information on which of the two sub-strains, SARS-CoV-2a or SARS-CoV-2g, is present in an infected person. This will allow valuable information to be gathered about the prevalence of these two strains in those countries.

This work further recommends more active efforts to look into the genomes of SARS-CoV-2, with closer pan-national collaboration, to understand the transitions and distributions of the SARS-CoV-2a and SARS-CoV-2g strains for better understanding and management of COVID-19. Experimental and epidemiological research together with genome information will be key to making use of the analysis and assumptions presented in this paper.


  1. Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020).
  2. McBride, R., van Zyl, M. & Fielding, B. C. The coronavirus nucleocapsid is a multifunctional protein. Viruses 6, 2991–3018 (2014).
  3. Tylor, S. et al. The SR-rich motif in SARS-CoV nucleocapsid protein is important for virus replication. Can. J. Microbiol. 55, 254–260 (2009).
  4. Dyer, O. Two strains of the SARS virus sequenced. BMJ 326, 999 (2003).
  5. Marra, M. A. et al. The genome sequence of the SARS-associated coronavirus. Science 300, 1399–1404 (2003).
  6. Yin, C. Genotyping coronavirus SARS-CoV-2: methods and implications.
  7. Stefanelli, P. et al. Whole genome and phylogenetic analysis of two SARS-CoV-2 strains isolated in Italy in January and February 2020: additional clues on multiple introductions and further circulation in Europe. Eurosurveillance 25, 2000305 (2020).
  8. Menachery, V. D. et al. MERS-CoV accessory orfs play key role for infection and pathogenesis. MBio 8, (2017).
  9. Liu, C. I., Hsu, K. Y. & Ruaan, R. C. Hydrophobic contribution of amino acids in peptides measured by hydrophobic interaction chromatography. J. Phys. Chem. B 110, 9148–9154 (2006).
  10. Peng, T.-Y., Lee, K.-R. & Tarn, W.-Y. Phosphorylation of the arginine/serine dipeptide-rich motif of the severe acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization, translation inhibitory activity and cellular localization. FEBS J. 275, 4152–4163 (2008).
  11. Testing rate for COVID-19 select countries worldwide 2020 | Statista. Available at: https://www.statista.com/statistics/1104645/covid19-testing-rate-select-countries-worldwide/. (Accessed: 11th April 2020)
  12. Coronavirus cases worldwide 2020 | Statista. Available at: https://www.statista.com/statistics/1043366/novel-coronavirus-2019ncov-cases-worldwide-by-country/. (Accessed: 11th April 2020)
  13. Siu, K. et al. Severe acute respiratory syndrome Coronavirus ORF3a protein activates the NLRP3 inflammasome by promoting TRAF3‐dependent ubiquitination of ASC. FASEB J. 33, 8865–8877 (2019).
  14. Waterhouse, A. M., Procter, J. B., Martin, D. M. A., Clamp, M. & Barton, G. J. Jalview Version 2-A multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).
  15. Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539–539 (2011).

Arizona State University

