Using molecular dating tools and epidemiological simulations, researchers at University of California San Diego School of Medicine, with colleagues at the University of Arizona and Illumina, Inc., estimate that the SARS-CoV-2 virus was likely circulating undetected for at most two months before the first human cases of COVID-19 were described in Wuhan, China in late-December 2019.
Writing in the March 18, 2021 online issue of Science, they also note that their simulations suggest that the mutating virus dies out naturally more than two-thirds of the time without causing an epidemic.
“Our study was designed to answer the question of how long could SARS-CoV-2 have circulated in China before it was discovered,” said senior author Joel O. Wertheim, Ph.D., associate professor in the Division of Infectious Diseases and Global Public Health at UC San Diego School of Medicine.
“To answer this question, we combined three important pieces of information: a detailed understanding of how SARS-CoV-2 spread in Wuhan before the lockdown, the genetic diversity of the virus in China and reports of the earliest cases of COVID-19 in China. By combining these disparate lines of evidence, we were able to put an upper limit of mid-October 2019 for when SARS-CoV-2 started circulating in Hubei province.”
Cases of COVID-19 were first reported in late-December 2019 in Wuhan, located in the Hubei province of central China. The virus quickly spread beyond Hubei.
Chinese authorities cordoned off the region and implemented mitigation measures nationwide. By April 2020, local transmission of the virus was under control but, by then, COVID-19 was pandemic with more than 100 countries reporting cases.
SARS-CoV-2 is a zoonotic coronavirus, believed to have jumped from an unknown animal host to humans. Numerous efforts have been made to identify when the virus first began spreading among humans, based on investigations of early-diagnosed cases of COVID-19.
The first cluster of cases – and the earliest sequenced SARS-CoV-2 genomes – were associated with the Huanan Seafood Wholesale Market, but study authors say the market cluster is unlikely to have marked the beginning of the pandemic because the earliest documented COVID-19 cases had no connection to the market.
Regional newspaper reports suggest COVID-19 diagnoses in Hubei date back to at least November 17, 2019, indicating the virus was already actively circulating when Chinese authorities enacted public health measures.
In the new study, researchers used molecular clock evolutionary analyses to try to home in on when the first, or index, case of SARS-CoV-2 occurred.
“Molecular clock” is a term for a technique that uses the mutation rate of genes to deduce when two or more life forms diverged – in this case, when the common ancestor of all variants of SARS-CoV-2 existed, estimated in this study to be as early as mid-November 2019.
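As a rough illustration of the molecular clock idea, the Python sketch below converts the genetic distance between two sampled genomes into a time back to their common ancestor under a constant ("strict") clock. The function name, the distance and the substitution rate are illustrative assumptions, not values or methods taken from the study, whose analyses were considerably more elaborate.

```python
def years_to_common_ancestor(pairwise_distance, rate_per_site_per_year):
    """Strict-clock estimate of the time back to the common ancestor of two lineages.

    pairwise_distance: substitutions per site separating two sampled genomes.
    rate_per_site_per_year: assumed substitution rate along each lineage.
    Both lineages accumulate changes independently, hence the factor of 2.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return pairwise_distance / (2.0 * rate_per_site_per_year)


# Hypothetical inputs for illustration only (not taken from the study):
# two genomes separated by 0.0002 substitutions per site, and an assumed
# rate of 0.001 substitutions per site per year.
years = years_to_common_ancestor(0.0002, 0.001)
print(f"{years * 365:.1f} days before sampling")
```

A faster assumed rate shortens the estimate and a slower one lengthens it, which is why the choice of rate matters so much for dating a common ancestor.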
Molecular dating of the most recent common ancestor is often taken to be synonymous with the index case of an emerging disease. However, said co-author Michael Worobey, Ph.D., professor of ecology and evolutionary biology at University of Arizona: “The index case can conceivably predate the common ancestor – the actual first case of this outbreak may have occurred days, weeks or even many months before the estimated common ancestor. Determining the length of that ‘phylogenetic fuse’ was at the heart of our investigation.”
Based on this work, the researchers estimate that the median number of persons infected with SARS-CoV-2 in China was less than one until November 4, 2019. Thirteen days later, it was four individuals, and just nine on December 1, 2019. The first hospitalizations in Wuhan with a condition later identified as COVID-19 occurred in mid-December.
Study authors used a variety of analytical tools to model how the SARS-CoV-2 virus may have behaved during the initial outbreak and early days of the pandemic when it was largely an unknown entity and the scope of the public health threat not yet fully realized.
These tools included epidemic simulations based on the virus’s known biology, such as its transmissibility and other factors. In just 29.7 percent of these simulations was the virus able to create self-sustaining epidemics. In the other 70.3 percent, the virus infected relatively few persons before dying out. The average failed epidemic ended just eight days after the index case.
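The way an introduction can fizzle out even when a virus is transmissible can be sketched with a simple branching process, in which each case infects a random number of others. This is a deliberately simplified stand-in for the study's epidemic simulations, not a reproduction of them: the reproduction number (2.5), the dispersion value (0.3), the take-off threshold (500 cases) and all function names are assumptions chosen for illustration.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    if mean <= 0:
        return 0
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def offspring(rng, r0, dispersion):
    """Secondary infections caused by one case.

    Negative binomial with mean r0, drawn as a gamma-Poisson mixture; a small
    dispersion means most cases infect nobody while a few infect many.
    """
    if r0 <= 0:
        return 0
    individual_rate = rng.gammavariate(dispersion, r0 / dispersion)
    return poisson(rng, individual_rate)


def goes_extinct(rng, r0, dispersion, takeoff_size=500):
    """Simulate one introduction; True if the chain dies out before takeoff_size cases."""
    active, total = 1, 1  # start from a single index case
    while active > 0:
        new_cases = sum(offspring(rng, r0, dispersion) for _ in range(active))
        total += new_cases
        if total >= takeoff_size:
            return False  # treated as a self-sustaining epidemic
        active = new_cases
    return True


def extinction_fraction(r0, dispersion, runs=2000, seed=1):
    """Fraction of simulated introductions that die out on their own."""
    rng = random.Random(seed)
    extinct = sum(goes_extinct(rng, r0, dispersion) for _ in range(runs))
    return extinct / runs


# Assumed parameter values, for illustration only.
print(extinction_fraction(r0=2.5, dispersion=0.3))
```

In a model of this kind, uneven transmission (a small dispersion value) makes early extinction common even when the average case infects more than one other person.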
“Typically, scientists use the viral genetic diversity to get the timing of when a virus started to spread,” said Wertheim. “Our study added a crucial layer on top of this approach by modeling how long the virus could have circulated before giving rise to the observed genetic diversity.
“Our approach yielded some surprising results. We saw that over two-thirds of the epidemics we attempted to simulate went extinct. That means that if we could go back in time and repeat 2019 one hundred times, two out of three times, COVID-19 would have fizzled out on its own without igniting a pandemic. This finding supports the notion that humans are constantly being bombarded with zoonotic pathogens.”
Wertheim noted that even as SARS-CoV-2 was circulating in China in the fall of 2019, the researchers’ model suggests it was doing so at low levels until at least December of that year.
“Given that, it’s hard to reconcile these low levels of virus in China with claims of infections in Europe and the U.S. at the same time,” Wertheim said. “I am quite skeptical of claims of COVID-19 outside China at that time.”
The original strain of SARS-CoV-2 became epidemic, the authors write, because it was widely dispersed, which favors persistence, and because it thrived in urban areas where transmission was easier. In simulated epidemics involving less dense rural communities, epidemics went extinct 94.5 to 99.6 percent of the time.
The virus has since mutated multiple times, with a number of variants becoming more transmissible.
“Pandemic surveillance wasn’t prepared for a virus like SARS-CoV-2,” Wertheim said. “We were looking for the next SARS or MERS, something that killed people at a high rate, but in hindsight, we see how a highly transmissible virus with a modest mortality rate can also lay the world low.”
A New Human Coronavirus
The first reports of a novel pneumonia (COVID-19) in Wuhan city, Hubei province, China, occurred in late December 2019, although retrospective analyses have identified a patient with symptom onset as early as December 1st. Because the number of SARS-CoV-2 cases is growing rapidly and spreading globally, we will refrain from citing the number of confirmed infections.
However, it is likely that the true number of cases will be substantially greater than reported because very mild or asymptomatic infections will often be excluded from counts. Any under-reporting of case numbers obviously means that the case fatality rate (CFR) associated with COVID-19 in the worst-hit regions will be lower than that currently cited.
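The effect of under-reporting on the CFR is simple arithmetic, sketched below in Python. The counts are hypothetical and chosen only to show the direction of the effect; they are not estimates for COVID-19, and the function name is illustrative.

```python
def fatality_rate(deaths, cases):
    """Deaths divided by cases, as a fraction."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return deaths / cases


# Hypothetical counts, chosen only to show the direction of the effect.
deaths = 200
confirmed_cases = 10_000
undetected_cases = 30_000  # assumed mild or asymptomatic infections missed by surveillance

reported_cfr = fatality_rate(deaths, confirmed_cases)
adjusted = fatality_rate(deaths, confirmed_cases + undetected_cases)
print(f"reported CFR: {reported_cfr:.1%}, with undetected infections: {adjusted:.1%}")
```

Because the deaths stay the same while the denominator grows, every undetected infection added to the count pulls the fatality rate down.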
CFRs will also vary geographically, between age groups and temporally. Although these uncertainties will likely not be resolved without large-scale serological surveys, from current data it is clear that the CFR for COVID-19 is substantially higher than that of seasonal influenza but lower than that of two closely related coronaviruses that have similarly recently emerged in humans: SARS-CoV, responsible for the SARS outbreak of 2002–2003, and MERS-CoV, which since 2012 has been responsible for the ongoing outbreak of MERS largely centered on the Arabian peninsula.
However, it is also evident that SARS-CoV-2 is more infectious than both SARS-CoV and MERS-CoV and that individuals can transmit the virus when asymptomatic or presymptomatic, although how frequently remains uncertain.
An important early association was observed between the first reported cases of COVID-19 and the Huanan seafood and wildlife market in Wuhan city (which we both visited several years ago) where a variety of mammalian species were available for purchase at the time of the outbreak (Figure 1).
Given that SARS-CoV-2 undoubtedly has a zoonotic origin, the link to such a “wet” market should come as no surprise. However, as not all of the early cases were market associated, it is possible that the emergence story is more complicated than first suspected. Genome sequences of “environmental samples” – likely surfaces – from the market have now been obtained, and phylogenetic analysis reveals that they are very closely related to viruses sampled from the earliest Wuhan patients.
While this again suggests that the market played an important role in virus emergence, it is not clear whether the samples were derived from people who inadvertently deposited infectious material or from animals or animal matter present at that location. Unfortunately, the apparent lack of direct animal sampling in the market may mean that it will be difficult, perhaps even impossible, to accurately identify any animal reservoir at this location.
After clinical cases began to appear, our research team, along with a number of others, attempted to determine the genome sequence of the causative pathogen (Lu et al., 2020, Wu et al., 2020, Zhou et al., 2020, Zhu et al., 2020). We focused on a patient admitted to the Central Hospital of Wuhan on December 26, 2019, six days after the onset of symptoms (Wu et al., 2020).
This patient was experiencing fever, chest tightness, cough, pain, and weakness, along with lung abnormalities indicative of pneumonia that appear to be commonplace in COVID-19 (Huang et al., 2020). Fortunately, next-generation meta-transcriptomic sequencing enabled us to obtain a complete viral genome from this patient on January 5, 2020. Initial analysis revealed that the virus was closely related to SARS-like viruses (family Coronaviridae).
This result was immediately reported to the relevant authorities, and an annotated version of the genome sequence (strain Wuhan-Hu-1) was submitted to NCBI/GenBank on the same day. Although the GenBank sequence (GenBank: MN908947) was the first of SARS-CoV-2 available, it was subsequently corrected to ensure its accuracy.
With the help of Dr. Andrew Rambaut (University of Edinburgh), we released the genome sequence of the virus on the open access Virological website (http://virological.org/) early on January 11, 2020. Afterwards, the China CDC similarly released SARS-CoV-2 genome sequences (with associated epidemiological data) on the public access GISAID database (https://www.gisaid.org/).
At the time of writing, almost 200 SARS-CoV-2 genomes are publicly available, representing the genomic diversity of the virus in China and beyond and providing a freely accessible global resource. Importantly, the release of the SARS-CoV-2 genome sequence data facilitated the rapid development of diagnostic tests (Corman et al., 2020) and now an infectious clone (Thao et al., 2020). The race to develop an effective vaccine and antivirals is ongoing, with trials of the latter underway (Wang et al., 2020).
Comparisons between SARS-CoV-2 and Other Coronaviruses
The earliest genome sequence data made it clear that SARS-CoV-2 was a member of the genus Betacoronavirus and fell within a subgenus (Sarbecovirus) that includes SARS-CoV (MERS-CoV falls in a separate subgenus, Merbecovirus) (Lu et al., 2020, Wu et al., 2020, Zhou et al., 2020, Zhu et al., 2020).
Indeed, initial comparisons revealed that SARS-CoV-2 was approximately 79% similar to SARS-CoV at the nucleotide level. Of course, patterns of similarity vary greatly between genes, and SARS-CoV and SARS-CoV-2 exhibit only ∼72% nucleotide sequence similarity in the spike (S) protein, the key surface glycoprotein that interacts with host cell receptors.
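Similarity figures of this kind are typically percent identities computed over aligned sequences. The Python sketch below shows the calculation on toy fragments; the function name and the choice to skip gapped columns are assumptions for illustration, and the sequences are not real viral data.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns where both sequences carry the same base.

    Columns with a gap ('-') in either sequence are skipped, which is one
    common convention; other tools count gaps as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        matches += base_a == base_b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy 10-base fragments, not real viral sequence: 8 of 10 positions match.
print(percent_identity("ATGTTTGTTT", "ATGTTCGTAT"))
```

Running the same function over a single gene rather than a whole genome is how a per-gene figure, such as the lower value quoted for the spike gene, would be obtained.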
Given these close evolutionary relationships, it is unsurprising that the genome structure of SARS-CoV-2 resembles those of other betacoronaviruses, with the gene order 5′-replicase ORF1ab-S-envelope(E)-membrane(M)-N-3′. The long replicase ORF1ab gene of SARS-CoV-2 is over 21 kb in length and contains 16 predicted non-structural proteins and a number of downstream open reading frames (ORFs) likely of similar function to those of SARS-CoV.
Comparative genomic analysis has been greatly assisted by the availability of a related virus from a Rhinolophus affinis (i.e., horseshoe) bat sampled in Yunnan province, China, in 2013 (Zhou et al., 2020). This virus, denoted RaTG13, is ∼96% similar to SARS-CoV-2 at the nucleotide sequence level.
Despite this sequence similarity, SARS-CoV-2 and RaTG13 differ in a number of key genomic features, arguably the most important of which is that SARS-CoV-2 contains a polybasic (furin) cleavage site insertion (residues PRRA) at the junction of the S1 and S2 subunits of the S protein (Coutard et al., 2020).
This insertion, which may increase the infectivity of the virus, is not present in related betacoronaviruses, although similar polybasic insertions are present in other human coronaviruses, including HCoV-HKU1, as well as in highly pathogenic strains of avian influenza virus. In addition, the receptor binding domain (RBD) of SARS-CoV-2 and RaTG13 are only ∼85% similar and share just one of six critical amino acid residues.
Both sequence and structural comparisons suggest that the SARS-CoV-2 RBD is well suited for binding to the human ACE2 receptor that was also utilized by SARS-CoV (Wrapp et al., 2020). Importantly, an independent insertion(s) of the amino acids PAA at the S1/S2 cleavage site was recently observed in a virus (RmYN02) sampled in mid-2019 from another Rhinolophus bat in Yunnan province, indicating that these insertion events reflect a natural part of ongoing coronavirus evolution (Zhou et al., 2020). While RmYN02 is relatively divergent from SARS-CoV-2 in the S protein (∼72% sequence similarity), it is the closest relative (∼97% nucleotide sequence similarity) of the human virus in the long replicase gene.
Although SARS-CoV and MERS-CoV are both closely related to SARS-CoV-2 and have bat reservoirs, the biological differences between these viruses are striking. As noted above, SARS-CoV-2 is markedly more infectious, resulting in very different epidemiological dynamics to those of SARS-CoV and MERS-CoV.
In these latter two viruses, there was a relatively slow rise in case numbers, and MERS-CoV has never been able to fully adapt to human transmission: the majority of the cases are due to spillover from camels on the Arabian peninsula with only sporadic human-to-human transmission (Sabir et al., 2016). In contrast, the remarkable local and global spread of SARS-CoV-2 caught most by surprise. Determining the virological characteristics that underpin such transmissibility is clearly a priority.
The Zoonotic Origins of SARS-CoV-2
The emergence and rapid spread of COVID-19 signifies a perfect epidemiological storm: a respiratory pathogen of relatively high virulence, from a virus family that has an unusual knack for jumping species boundaries, that emerged in a major population center and travel hub shortly before the biggest travel period of the year, the Chinese Spring Festival. Indeed, it is no surprise that epidemiological modeling suggests that SARS-CoV-2 had already spread widely in China before the city of Wuhan was placed under strict quarantine (Chinazzi et al., 2020).
It was also no surprise that early genomic comparisons revealed that the most closely related viruses to SARS-CoV-2 came from bats (Zhou et al., 2020). Sampling in recent years has identified an impressive array of bat coronaviruses, including RaTG13 and RmYN02 (Hu et al., 2017, Yang et al., 2015). Hence, bats are undoubtedly important reservoir species for a diverse range of coronaviruses (Cui et al., 2019).
Despite this, the exact role played by bats in the zoonotic origin of SARS-CoV-2 is not established. In particular, the bat viruses most closely related to SARS-CoV-2 were sampled from animals in Yunnan province, over 1,500 km from Wuhan. There are relatively few bat coronaviruses from Hubei province, and those that have been sequenced are relatively distant to SARS-CoV-2 in phylogenetic trees (Lin et al., 2017).
The simple inference from this is that our sampling of bat viruses is strongly biased toward some geographical locations. This will need to be rectified in future studies. In addition, although sequence similarity values of 96%–97% make it sound like the available bat viruses are very closely related to SARS-CoV-2, in reality this likely represents more than 20 years of sequence evolution (although the underlying molecular clock may tick at an uncertain rate if there was strong adaptive evolution of the virus in humans).
It is therefore almost a certainty that more sampling will identify additional bat viruses that are even closer relatives of SARS-CoV-2. A key issue is whether these viruses, or those from any other animal species, contain the key RBD mutations and the same furin-like cleavage site insertion as found in SARS-CoV-2.
Although bats are likely the reservoir hosts for this virus, their general ecological separation from humans makes it probable that other mammalian species act as “intermediate” or “amplifying” hosts, within which SARS-CoV-2 was able to acquire some or all of the mutations needed for efficient human transmission.
In the case of SARS and MERS, civets and camels, respectively, played the role of intermediate hosts, although as MERS-CoV was likely present in camels for some decades before it emerged in humans during multiple cross-species events, these animals may be better thought of as true reservoir hosts (Sabir et al., 2016). To determine what these intermediate host species might be, it is imperative to perform a far wider sampling of animals from wet markets or that live close to human populations.
This is highlighted by the recent discovery of viruses closely related to SARS-CoV-2 in Malayan pangolins (Manis javanica) illegally imported into southern China (Guangdong and Guangxi provinces). The Guangdong pangolin viruses are particularly closely related to SARS-CoV-2 in the RBD, containing all six of the six key mutations thought to shape binding to the ACE2 receptor and exhibiting 97% amino acid sequence similarity (although they are more divergent from SARS-CoV-2 in the remainder of the genome).
Although pangolins are of great interest because of how frequently they are involved in illegal trafficking and their endangered status, that they carry a virus related to SARS-CoV-2 strongly suggests that a far greater diversity of related betacoronaviruses exists in a variety of mammalian species but has yet to be sampled.
While our past experience with coronaviruses suggests that evolution in animal hosts, both reservoirs and intermediates, is needed to explain the emergence of SARS-CoV-2 in humans, it cannot be excluded that the virus acquired some of its key mutations during a period of “cryptic” spread in humans prior to its first detection in December 2019.
Specifically, it is possible that the virus emerged earlier in human populations than envisaged (perhaps not even in Wuhan) but was not detected because asymptomatic infections, those with mild respiratory symptoms, and even sporadic cases of pneumonia were not visible to the standard systems used for surveillance and pathogen identification.
During this period of cryptic transmission, the virus could have gradually acquired the key mutations, perhaps including the RBD and furin cleavage site insertions, that enabled it to adapt fully to humans. It wasn’t until a cluster of pneumonia cases occurred that we were able to detect COVID-19 via the routine surveillance system. Obviously, retrospective serological or metagenomic studies of respiratory infection will go a long way to determining whether this scenario is correct, although such early cases may never be detected.
Another issue that has received considerable attention is whether SARS-CoV-2 is a recombinant virus, and whether such recombination might have facilitated its emergence (Lu et al., 2020, Wu et al., 2020). The complicating factor here is that sarbecoviruses, and coronaviruses more broadly, experience widespread recombination, so that distinguishing recombination that assisted virus emergence from “background” recombination events is not trivial.
Recombination is visible at multiple locations across the sarbecovirus genome, including in the S protein, and in bat viruses closely related to SARS-CoV-2. For example, there is some evidence for recombination among SARS-CoV-2, RaTG13, and the Guangdong pangolin CoVs (Lam et al., 2020), and the genome of RmYN02 has similarly been widely impacted by recombination (Zhou et al., 2020).
However, trying to determine the exact pattern and genomic ancestry of recombination events is difficult, particularly as many of the recombinant regions may be small and are likely to change as we sample more viruses related to SARS-CoV-2. To resolve these issues, it will again be necessary to perform a far wider sampling of viral diversity in animal populations.
Ongoing Genomic Evolution of SARS-CoV-2
As the COVID-19 epidemic has progressed, more viral genomes have been sequenced. As expected given their recent common ancestry, the earliest samples from Wuhan contained relatively little genetic diversity. While this can prevent detailed phylogenetic and phylogeographic inferences, it does show that the public health authorities in Wuhan did a remarkable job in detecting the first cluster of pneumonia cases.
However, this seemingly recent common ancestry does not exclude a pre-outbreak period of cryptic transmission in humans. Although accumulating genetic diversity means that it is now possible to detect distinct phylogenetic clusters of SARS-CoV-2 sequences, it is difficult to determine using genomic comparisons alone whether the virus is fixing phenotypically important mutations as it spreads through the global population, and any such claims require careful experimental verification.
Given the high mutation rates that characterize RNA viruses, it is obvious that many more mutations will appear in the viral genome and that these will help us to track the spread of SARS-CoV-2 (Grubaugh et al., 2019). However, as the epidemic grows, our sample size of sequences will likely be so small relative to the total number of cases that it will be very difficult, if not impossible, to detect individual transmission chains.
Caution must therefore always be exercised when attempting to infer exact transmission events. As an aside, although coronaviruses likely have lower mutation rates than other RNA viruses because of an inherent capacity for some proof-reading activity due to a 3′-to-5′ exoribonuclease (Minskaia et al., 2006), their long-term rates of nucleotide substitution (i.e., of molecular evolution) fall within the distribution of those seen in other RNA viruses (Holmes et al., 2016).
This suggests that lower mutation rates are to some extent compensated by high rates of virus replication within hosts. Although there is no evidence that this capacity to mutate (common to RNA viruses) will result in any radical changes in phenotype – such as in transmissibility and virulence – as these only rarely change at the scale of individual disease outbreaks (Grubaugh et al., 2020), it is obviously important to monitor any changes in phenotype as the virus spreads.
In all likelihood, any drop in the number of cases and/or CFR of COVID-19 will be due to rising immunity in the human population and epidemiological context rather than mutational changes in the virus.
More information: Jonathan Pekar et al, Timing the SARS-CoV-2 index case in Hubei province, Science (2021). DOI: 10.1126/science.abf8003