In the field of molecular epidemiology, the worldwide scientific community has been sleuthing to solve the riddle of the early history of SARS-CoV-2.
Since the first SARS-CoV-2 virus infection was detected in December 2019, tens of thousands of its genomes have been sequenced worldwide, revealing that the coronavirus is mutating, albeit slowly, at a rate of 25 mutations per genome per year.
But despite major efforts, no one to date has identified the first case of human transmission, or “patient zero,” in the COVID-19 pandemic.
Finding such a case is necessary to better understand how the virus may have first jumped from its animal host to infect humans, as well as how the SARS-CoV-2 genome has mutated over time and spread globally.
“The SARS-CoV-2 virus is carrying an RNA genome that has already infected more than 35 million people across the world,” said Sudhir Kumar, director of the Institute for Genomics and Evolutionary Medicine, Temple University. “We need to find this common ancestor, which we call the progenitor genome.”
This progenitor genome is the mother of all SARS-CoV-2 coronaviruses infecting people today.
In the absence of patient zero, Kumar and his Temple University research team now may have found the next best thing to aid the worldwide molecular epidemiology detective work.
“We set out to reconstruct the genome of the progenitor by using a big dataset of coronavirus genomes obtained from infected individuals,” said Sayaka Miura, a senior author of the study.
They found the “mother” of all SARS-CoV-2 genomes and its early offspring strains have subsequently mutated and spread to dominate the world pandemic. “We have now reconstructed the progenitor genome and mapped where and when the earliest mutations happened,” said Kumar, the corresponding author of a preprint study, which can be found on the bioRxiv server.
In doing so, their work has provided new insights into the early mutational history of SARS-CoV-2.
For example, their study reports that a mutation of the SARS-CoV-2 spike protein (D614G), often implicated in increased infectivity and spread, occurred after many other mutations, weeks after COVID-19 started.
“It is nearly always found alongside many other protein mutations, so its role in increased infectivity remains difficult to establish,” said Sergei Pond, a senior co-author of the study.
Besides their findings on SARS-CoV-2’s early history, Kumar’s group has developed mutational fingerprints to quickly recognize strains and sub-strains infecting an individual or colonizing a global region.
Order to a pandemic
To identify the progenitor genome, they used a mutational order analysis technique, which relies on a clonal analysis of mutant strains and the frequency with which pairs of mutations appear together in the SARS-CoV-2 genomes.
First, Kumar’s team sifted through data on almost 30,000 complete genomes of the SARS-CoV-2, the virus that causes COVID-19.
Altogether, they analyzed 29,681 SARS-CoV-2 genomes, each containing at least 28,000 bases of sequence data. These genomes were sampled between 24 December 2019 and 07 July 2020, representing 97 countries and regions worldwide.
Many previous attempts to analyze such large datasets were not successful because of “the focus on building an evolutionary tree of SARS-CoV-2,” says Kumar.
“This coronavirus evolves too slow, the number of genomes to analyze is too large, and the data quality of genomes is highly variable. I immediately saw parallels between the properties of these genetic data from coronavirus with the genetic data from the clonal spread of another nefarious disease, cancer.”
Kumar’s group has developed and investigated many techniques for analyzing genetic data from tumors in cancer patients.
They adapted and innovated those techniques and built a trail of mutations that automatically traces back to the progenitor. “Basically, the genome before the first mutation was that of the progenitor,” said Kumar.
“The mutation tracking approach is beautiful and predicts a phylogeny of ‘major strains’ of SARS-CoV-2. It is a great example of how big data coupled with biologically-informed data mining reveals important patterns.”
Kumar’s team uncovered a predicted sequence of the progenitor (mother) genome of all SARS-CoV-2 genomes (proCoV2).
In the proCoV2 genome, they identified 170 non-synonymous (mutations that cause an amino acid change in a protein) and 958 synonymous substitutions compared with the genome of a closely-related coronavirus, RaTG13, found in a Rhinolophus affinis bat.
While the intermediary animal from bats to humans is still unknown, this amounted to a 96.12% sequence similarity between proCoV2 and RaTG13 sequences.
Next, they identified 49 single nucleotide variants (SNVs) that occurred with a greater than 1% variant frequency from their dataset. These were further examined to look at their mutational patterns and global spread.
“The tree of mutations predicts a tree of strains,” said Kumar. “You can also do the tree of strains first, and predict the order of mutations. However, this way is greatly affected by the quality of sequences. When the mutation rate is low, it becomes hard to distinguish between error due to low quality and a real mutation. The approach we took is much more robust against sequencing errors because analysis of pairs of positions across genomes is more informative.”
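The pairwise idea Kumar describes can be illustrated with a toy sketch. This is not the team's actual pipeline, only a simplified illustration: keep variants above a frequency cutoff (the study used 1%), then call variant A earlier than variant B when nearly every genome carrying B also carries A and A is the more common of the two. The variant names, the toy genomes and the 95% co-occurrence cutoff are assumptions made for the example.

```python
from itertools import permutations

def variant_frequencies(genomes):
    """Fraction of genomes that carry each variant."""
    counts = {}
    for variants in genomes:
        for v in variants:
            counts[v] = counts.get(v, 0) + 1
    return {v: c / len(genomes) for v, c in counts.items()}

def inferred_order(genomes, min_freq=0.01, min_cooccur=0.95):
    """Return (earlier, later) pairs of variants.

    'later' is (almost) never seen without 'earlier', and 'earlier'
    is carried by more genomes. Both cutoffs are illustrative.
    """
    freqs = variant_frequencies(genomes)
    keep = sorted(v for v, f in freqs.items() if f > min_freq)
    carriers = {v: {i for i, g in enumerate(genomes) if v in g} for v in keep}
    pairs = []
    for a, b in permutations(keep, 2):
        # Share of b-carrying genomes that also carry a.
        share = len(carriers[a] & carriers[b]) / len(carriers[b])
        if share >= min_cooccur and len(carriers[a]) > len(carriers[b]):
            pairs.append((a, b))
    return pairs

# Toy genomes, each reduced to the set of variants it carries (hypothetical).
genomes = [
    set(),
    {"v1"},
    {"v1"},
    {"v1", "v2"},
    {"v1", "v2"},
    {"v1", "v2", "v3"},
]
order = inferred_order(genomes)
print(order)  # [('v1', 'v2'), ('v1', 'v3'), ('v2', 'v3')]
```

Because the ordering depends on how often pairs of variants occur together across many genomes, a single sequencing error in one genome shifts a frequency only slightly instead of adding a spurious branch.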
An earlier timeline emerges
When a comparison of the inferred proCoV2 sequence with the genomes in their collection revealed no full matches at the nucleotide level, Kumar’s research team knew the original timing of the start of the pandemic was off.
“This progenitor genome had a sequence different from what some folks are calling the reference sequence, which is what was observed first in China and deposited into the GISAID SARS-CoV-2 database,” said Kumar.
The closest match was to genomes sampled 12 days after the earliest sampled virus that became available on 24 December 2019.
Multiple matches were found in all sampled continents and detected as late as April 2020 in Europe. Overall, 120 of the genomes Kumar’s group analyzed contained only synonymous differences from proCoV2.
That is, all their proteins were identical to the corresponding proCoV2 proteins in the amino acid sequence. A majority (80 genomes) of these protein-level matches were from coronaviruses sampled in China and other Asian countries.
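The distinction between synonymous and non-synonymous differences can be made concrete with a short sketch. It translates codons with the standard genetic code and labels each changed codon by whether the encoded amino acid changes. The two nine-base sequences are toy inputs, not taken from the study; the GAT to GGT change is simply an example of an aspartate (D) to glycine (G) substitution.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate an in-frame coding sequence (DNA or RNA letters) to amino acids."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

def classify_substitutions(ref, alt):
    """Compare two equal-length, in-frame coding sequences codon by codon."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    changes = []
    for i in range(0, len(ref) - len(ref) % 3, 3):
        rc, ac = ref[i:i + 3].upper(), alt[i:i + 3].upper()
        if rc != ac:
            same_protein = translate(rc) == translate(ac)
            kind = "synonymous" if same_protein else "non-synonymous"
            changes.append((i // 3 + 1, rc, ac, kind))
    return changes

# Toy example: codon 1 changes silently (Q -> Q), codon 2 changes D -> G.
changes = classify_substitutions("CAGGATGTT", "CAAGGTGTT")
for codon_number, ref_codon, alt_codon, kind in changes:
    print(codon_number, ref_codon, alt_codon, kind)
```

A genome counts as a protein-level match in the sense used above when every difference it has from the reference falls in the "synonymous" class.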
These spatiotemporal patterns suggested that proCoV2 already possessed the full repertoire of protein sequences needed to infect, spread and persist in the global human population.
They found the proCoV2 virus and its initial descendants arose in China, based on the earliest mutations of proCoV2 and their locations. Furthermore, they also demonstrated that a population of strains with as many as six mutational differences from proCoV2 existed at the time of the first detection of COVID-19 cases in China.
With estimates of SARS-CoV-2 mutating 25 times per year, this meant that the virus must already have been infecting people several weeks before the December 2019 cases.
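The arithmetic behind that statement can be sketched as a back-of-the-envelope calculation. It assumes a strictly constant rate of 25 mutations per genome per year along a single lineage, which is a simplification: real mutations accumulate stochastically, so the result is only a rough scale. The function name is made up for the example.

```python
MUTATIONS_PER_YEAR = 25   # rate quoted in the article
DAYS_PER_YEAR = 365.25

def days_to_accumulate(n_mutations, rate_per_year=MUTATIONS_PER_YEAR):
    """Expected time for one lineage to pick up n_mutations at a constant rate."""
    return n_mutations / rate_per_year * DAYS_PER_YEAR

days = days_to_accumulate(6)   # "as many as six" differences from proCoV2
weeks = days / 7
print(f"{days:.0f} days (~{weeks:.1f} weeks)")  # 88 days (~12.5 weeks)
```

Six differences sit at the upper end of what was observed, so this is closer to an upper bound for the strains in question than to a point estimate; strains with fewer differences imply proportionally shorter times.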
Because there was strong evidence of many mutations predating those found in the reference genome, Kumar’s group had to devise a new nomenclature of mutational signatures to classify SARS-CoV-2, introducing a series of Greek letters to represent each one.
For example, they found that the emergence of μ and α SARS-CoV-2 genome variants came before the first reports of COVID-19. This strongly implies the existence of some sequence diversity in the ancestral SARS-CoV-2 populations.
All 17 of the genomes sampled from China in December 2019, including the designated SARS-CoV-2 reference genome, carry all three μ and three α variants. Interestingly, the six genomes containing μ variants but not α variants were sampled in China and the United States in January 2020. Therefore, the earliest sampled genomes (including the designated reference) were not the progenitor strains.
The analysis also predicts that the progenitor genome had offspring that were spreading worldwide during the earliest phases of COVID-19. The virus was ready to infect right from the start.
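The μ and α bookkeeping described above amounts to checking which signature variants a genome carries. The sketch below is a simplified illustration rather than the team's fingerprinting tool; the variant labels are placeholders (the real μ and α variants are specific genome positions) and the category names are invented for the example.

```python
# Placeholder labels for the three mu and three alpha variants (hypothetical).
MU = frozenset({"mu1", "mu2", "mu3"})
ALPHA = frozenset({"alpha1", "alpha2", "alpha3"})

def classify(variants):
    """Assign a genome to a coarse signature class from the variants it carries."""
    variants = set(variants)
    has_mu = MU <= variants        # carries all three mu variants
    has_alpha = ALPHA <= variants  # carries all three alpha variants
    if has_mu and has_alpha:
        return "mu+alpha"          # pattern reported for the December 2019 genomes
    if has_mu:
        return "mu only"           # pattern reported for six January 2020 genomes
    if variants.isdisjoint(MU | ALPHA):
        return "neither"           # no signature variant at all
    return "partial"               # incomplete signature, e.g. low-quality data

print(classify(MU | ALPHA))              # mu+alpha
print(classify({"mu1", "mu2", "mu3"}))   # mu only
```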
“The progenitor had all the ability it needed to spread,” said Sergei Pond. “There is little evidence of selection on lineages between bats and humans, although there is strong selection on coronaviruses in bats.”
Furthermore, they found confounding evidence that the D614G spike protein mutation was always accompanied by other mutations.
“Many people are interested in mutations in the spike protein because of its functional properties,” said Kumar.
“But what we are observing is that in addition to the spike protein, there were several additional changes within the genome that are always found along with the changes in the spike protein (D614G). We call these a beta group of mutations, and the spike mutation is one of them. Whatever we think the spike mutation is doing, it is best not to forget that other mutations may also be involved. Alternatively, these mutations may be simply hitchhiking together, we yet cannot tell.”
“What is also interesting is that the genome containing the spike protein mutation underwent many other mutations. And what we call epsilon mutations (there are 3 of these) occurred on the background of the spike mutation, and they change arginine residues in a very important protein, the nucleocapsid (N) protein.
The epsilon mutations are widespread in Europe, and they are always found with the spike protein mutation. So, epsilon mutations started a dominant offshoot in both Europe and Asia.”
A global spread
Altogether, they have identified seven major evolutionary lineages that arose after the pandemic began, some of which arose in Europe and North America after the genesis of the ancestral lineages in China.
“Asian strains founded the whole pandemic,” said Kumar. “But over time, it is the sub-strain containing the epsilon mutation, that may have occurred outside of China (first observed in the Middle East and Europe), is infecting Asia much more.”
Their mutation-based analyses also established that North American coronaviruses harbor very different genome signatures than those prevalent in Europe and Asia.
“This is a dynamic process,” said Kumar. “Clearly, there are very different pictures of spread that are painted by the emergence of new mutations, the three epsilons, gamma, and delta, which we found to occur after the spike protein change. We need to find out if any functional properties of these mutations have sped up the pandemic.”
Moving forward, they will continue to refine their results as new data becomes available.
“There are more than 100,000 SARS-CoV-2 genomes that have been sequenced now,” said Pond. Kumar says that “the power of this approach is that the more data you have, the more easily you can tell the precise frequency of individual mutations and mutation pairs.
These variants that are produced, the single nucleotide variants, or SNVs, their frequency, and history can be told very well with more data. Therefore, our analyses infer a credible root for the SARS-CoV-2 phylogeny.”
Their results are automatically updated online as new genomes are reported (the dataset now exceeds 50,000 samples) and can be found at http://igem.temple.edu/COVID-19.
“These findings and our intuitive mutational fingerprints of SARS-CoV-2 strains have overcome daunting challenges to develop a retrospective on how, when and why COVID-19 has emerged and spread, which is a prerequisite to creating remedies to overcome this pandemic through the efforts of science, technology, public policy and medicine,” said Kumar.
Until late 2019, only six coronaviruses were known to infect humans: HCoV-229E, HCoV-OC43, SARS-CoV (SARS-CoV-1), HCoV-NL63, CoV-HKU1, and MERS-CoV. A seventh, SARS-CoV-2, emerged in the winter of 2019 from Wuhan, China. SARS-CoV-2 is closely related to SARS-CoV-1, a virus that appeared from Guangdong province, China in late 2002.
The coronavirus spike (S) protein mediates receptor binding and fusion of the viral and cellular membrane. The S protein extends from the viral membrane and is uniformly arranged as trimers on the virion surface to give the appearance of a crown (corona in Latin). The coronavirus S protein is divided into two domains: S1 and S2.
The S1 domain mediates receptor binding, and the S2 mediates downstream membrane fusion [1,2]. The receptor for SARS-CoV-2 is angiotensin-converting enzyme 2 (ACE2) [3–7], a metalloprotease that also serves as the receptor for SARS-CoV-1 [8]. A small, independently folded subdomain of S1, described as the receptor-binding domain (RBD), directly binds ACE2 when the virus engages a target cell [9–12].
The S1/S2 junction of SARS-CoV-2 is processed by a furin-like proprotein convertase in the virus producer cell. In contrast, the S1/S2 junction of SARS-CoV-1 is processed by TMPRSS2 at the cell surface or by lysosomal cathepsins in the target cells [13–18]. Both S proteins are further processed in the target cell within the S2 domain at the S2’ site, an event that is also required for productive infection [19,20].
Recent analyses of the fine-scale sequence variation of SARS-CoV-2 isolates identified several genomic regions of increased genetic variation [21–30]. One of these variations encodes an S-protein mutation, D614G, in the carboxy (C)-terminal region of the S1 domain [21–23,26,30].
This region of the S1 domain directly associates with S2 (Fig. 1a). The frequency of this mutation, which places glycine at residue 614 (G614), was previously found to be increasing with alarming speed [21,22]. Our own analysis of the S-protein sequences available from GenBank showed a similar result: the G614 genotype was not detected in February (among 33 sequences) and was observed at low frequency in March (26%), but increased rapidly by April (65%) and May (70%) (Fig. 1b), indicating a transmission advantage over viruses with D614. Korber et al. noted that this change also correlated with increased viral loads in COVID-19 patients [22], but because this change is also associated with mutations in the viral nsp3 and RdRp proteins, the role of the S protein in these observations remained undefined.
To determine if the D614G mutation alters the properties of the S-protein in a way that could impact transmission or replication, we assessed its role in viral entry. Moloney murine leukemia virus (MLV)-based pseudoviruses (PVs), expressing green fluorescent protein (GFP) and pseudotyped with the S protein of SARS-CoV-2 (SARS2) carrying the D614 or G614 genotype (SD614 and SG614, respectively), were produced from transfected HEK293T cells as previously described [31].
An SD614 variant, in which the furin-cleavage motif between the S1 and S2 domains is ablated (SFKO), was also included for comparison. HEK293T cells transduced to express human ACE2 (hACE2-293T) or those transduced with vector alone (Mock-293T) were infected with the same particle numbers of the PVs pseudotyped with the SD614, SG614, or SFKO (PVD614, PVG614, or PVFKO, respectively), and infection level was assessed one day later.
We observed that PVG614 infected hACE2-293T cells with approximately 9-fold higher efficiency than did PVD614 (Fig. 1c,d). This enhanced infectivity of PVG614 is not an artifact of PV titer normalization, as their titers are very similar (Extended Data Fig. 1).
We next investigated the mechanism with which SG614 increased virus infectivity. Because S1 residue 614 is proximal to the S2 domain, we first compared the ratio between the S1 and S2 domains in the virion that might indicate altered release or shedding of the S1 domain after cleavage at the S1/S2 junction.
To do so, we used S-protein constructs bearing Flag tags at both their amino (N)- and C-termini. PVs pseudotyped with these double-Flag tagged forms of SD614, SG614, and SFKO were partially purified and concentrated by pelleting through a 20% sucrose layer [32] and evaluated for their infectivity.
The titers of PVs were similar among PVG614, PVD614, and PVFKO before and after purification (Fig. 2a). In addition, modification by Flag-tags or pelleting of PVs through a sucrose layer did not alter the relative infectivity between PVG614 and PVD614 (Fig. 2b).
We then determined their S1:S2 ratio by western blotting using the anti-Flag M2 antibody. As shown in Fig. 2c, the S1:S2 ratio is markedly greater in PVG614 compared to PVD614, indicating that glycine at residue 614 of SG614 stabilizes the interaction between the S1 and S2 domains, limiting S1 shedding. In addition, the total amount of the S protein in PVG614 is also much higher than that in PVD614, as indicated by a denser S2 band, even though the same number of pseudovirions was analyzed, as determined by quantitative PCR.
To independently confirm that a similar number of virions was analyzed, the lower part of the same membrane was blotted with an anti-p30 MLV gag antibody (Fig. 2c). Similar densities of p30 bands were observed from all PVs, indicating that the differences in S-protein incorporation observed between PVG614 and PVD614 were due to the mutation of residue 614, not to different amounts of PVs analyzed.
A similar experiment performed with independently produced PVs yielded a nearly identical result (Extended Data Fig. 2a). Densitometric analysis shows there is 4.7 times more S1+S2 band signal in PVG614 than in PVD614 (Fig. 2d). To more accurately estimate the S1:S2 ratio, we next compared different amounts of the same samples so that S2-band intensity in PVG614 and PVD614 was comparable (Fig. 2e).
Averages of several quantifications show that the S1:S2 ratio of PVG614 is 3.5 times higher than that of PVD614 (Fig. 2f). The M2 antibody used in this experiment binds the Flag tag located at both the N- and C-termini of a protein, but it binds the N-terminal Flag tag more efficiently [33].
Therefore, we directly visualized virion S-protein bands by silver staining (Extended Data Fig. 2b). Although the S2 bands are masked by a same-sized MLV protein, S1 bands are well separated. Again, the intensity of the S1 band of PVG614 is much stronger than that of PVD614, while p30 bands are comparable, a result consistent with those observed using the anti-Flag M2 antibody.
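The S1:S2 comparison above reduces to a ratio of band intensities followed by a ratio of ratios. The sketch below shows that arithmetic only; the intensity values are invented placeholders chosen so the result lands near the reported 3.5-fold difference, and they are not measurements from the study.

```python
def s1_s2_ratio(s1_intensity, s2_intensity):
    """Ratio of S1 to S2 band intensity from one lane (arbitrary units)."""
    if s2_intensity <= 0:
        raise ValueError("S2 intensity must be positive")
    return s1_intensity / s2_intensity

def fold_change(ratio_test, ratio_reference):
    """How many times larger the test ratio is than the reference ratio."""
    return ratio_test / ratio_reference

# Placeholder densitometry readings (hypothetical). Loading is assumed to have
# been adjusted so that the S2 bands of the two samples are comparable.
ratio_d614 = s1_s2_ratio(s1_intensity=20.0, s2_intensity=100.0)
ratio_g614 = s1_s2_ratio(s1_intensity=70.0, s2_intensity=100.0)

fold = fold_change(ratio_g614, ratio_d614)
print(f"S1:S2 ratio is {fold:.1f} times higher with G614")
```

Matching the S2 intensities first keeps both lanes in a similar signal range, so the ratio is not distorted by one lane being close to saturation.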
We next confirmed these findings using virus-like particles (VLPs) composed only of the native SARS-CoV-2 proteins, the nucleoprotein (N), membrane protein (M), envelope protein (E), and S protein [34]. VLPs were partially purified and analyzed in the same way as MLV PVs. The S protein bands were detected with the anti-Flag M2 antibody, and the N protein with pooled convalescent plasma derived from COVID-19 patients.
The S1:S2 ratio and total S protein on the virion were again much higher in the VLPs carrying SG614 (VLPG614) than in those carrying SD614 (VLPD614) (Fig. 3a). The S1:S2 ratio is 3.4-fold higher and the total S protein is nearly five-fold enriched in VLPG614 compared to VLPD614 (Fig. 3b,c). Thus, the D614G mutation enhances virus infection through two related mechanisms: it reduces S1 shedding and increases total S protein incorporated into the virion.
It has previously been speculated that the D614G mutation promotes an open configuration of the S protein that is more favorable to ACE2 association [5,22,23,35]. To explore this possibility, we investigated whether ACE2 binding by SG614 was more efficient than that by SD614. HEK293T cells transfected to express each S protein were assessed for their binding of hACE2 immunoadhesin, using hACE2-NN-Ig, whose enzymatic activity was abolished by mutation [31]. hACE2-NN-Ig bound SARS-CoV-1 RBD with an affinity equivalent to that of hACE2-Ig.
These S proteins are fused to a C-terminal, but not an N-terminal Flag tag, thus allowing for the measurement of total S protein expression in permeabilized cells by flow cytometry. Although total S protein expression was comparable, hACE2-NN-Ig binding to the cells expressing SG614 was substantially higher than its binding to cells expressing SD614 (Fig. 4a).
This observation has two possible explanations: G614 could increase hACE2 association by promoting greater exposure of the RBD, or it could increase the number of binding sites by limiting S1-domain shedding. To differentiate these possibilities, we appended the Myc-tag to the N-terminus of the S-protein that is Flag tagged at its C-terminus and repeated the study, this time detecting the S1 domain with an anti-Myc antibody.
As shown in Fig. 4b, the ratio of Myc-tag to Flag-tag is higher in cells expressing SG614 than in cells expressing SD614. However, the Myc-tag/Flag-tag ratio is similar to the hACE2-NN-Ig/Flag-tag ratio, indicating that increased hACE2 binding to the SG614-expressing cells did not result from increased affinity of SG614 spikes to hACE2 or greater access to the RBD.
Instead, these data show there is more S1 domain in the SG614-expressing cells, a result again consistent with the observation that the D614G mutation reduces S1 shedding. We then assessed whether differential amounts of the S protein could influence neutralization sensitivity of the virus. Fig. 4c shows that PVD614 and PVG614 are similarly susceptible to neutralizing antisera, indicating that antibody-mediated control of viruses carrying SD614 and SG614 would be similar.
It has also been speculated that the D614G mutation would promote, not limit, shedding of the S1 domain, based on the hypothetical loss of a hydrogen bond between D614 in S1 and T859 in S2 [22]. An alternative explanation, more consistent with the data presented here, is that Q613 forms a hydrogen bond with T859, and the greater backbone flexibility provided by introduction of glycine at an adjacent position 614 enables a more favorable orientation of Q613.
It is also possible that D614 can form an intra-domain salt bridge with R646, promoting a local S1 conformation unfavorable to its association with S2. In this model, replacing aspartic acid with glycine at the position 614 would prevent sampling of this unfavorable configuration.
The instability of SD614 may also account for the observed lower level of incorporation of the functional S protein into PVs and VLPs. Specifically, the S-protein trimers with the exposed S2 domains, as a result of S1 shedding, could destabilize the trans-Golgi network membrane, the site of processing of the S1/S2 boundary, and such disruption may impede S-protein incorporation into the virion.
In the case of VLPs, this disruption would presumably further interfere with appropriate M- and N-protein associations by altering the conformation and orientation of the S-protein membrane-proximal regions. Alternatively, these S-protein trimers with the exposed S2 domains may serve as poor substrates for downstream post-translational modifications including palmitoylation, and those lacking proper modifications might be unsuitable for virion incorporation.
An interesting question is why viruses carrying the more stable SG614 appear to be more transmissible without resulting in a major observable difference in disease severity [22,27]. It is possible that higher levels of functional S protein observed with SG614 increase the chance of host-to-host transmission, but that other factors limit the rate and efficiency of intra-host replication.
Alternatively, the loss of virion-associated S proteins observed with SD614 may be compensated by greater fusion efficiency with the destabilized S protein when the next target cell is adjacent in a host tissue. It is also possible that our ability to detect sequence changes at this early stage of the pandemic is simply greater than our ability to detect modest differences in pathogenesis.
The strong phenotypic difference we observe here between D614 and G614 suggests that more study on the impact of the D614G mutation on the course of disease is warranted.
Finally, our data raise interesting questions about the natural history of SARS-CoV-2 as it moved presumably from horseshoe bats to humans. At some point in this process, the virus acquired a furin-cleavage site, allowing its S1/S2 boundary to be cleaved in virus-producing cells.
In contrast, the S1/S2 boundary of SARS-CoV-1, and indeed of all SARS-like viruses isolated from bats, lacks this polybasic site and is cleaved by TMPRSS2 or endosomal cathepsins in the target cells [13–20]. Thus the greater stability we observe with SG614 would not be relevant to viruses lacking this site, but it appears to be strongly favored when a furin-cleavage site is present. Therefore, the D614G mutation may have emerged to compensate for this newly acquired furin site.
In summary, we show that an S protein mutation that results in more transmissible SARS-CoV-2 also limits shedding of the S1 domain and increases S-protein incorporation into the virion. Further studies will be necessary to determine the impact of this change on the nature and severity of COVID-19.
reference link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7310631/
More information: Sudhir Kumar et al. An evolutionary portrait of the progenitor SARS-CoV-2 and its dominant offshoots in COVID-19 pandemic, (2020). DOI: 10.1101/2020.09.24.311845