New York City’s first confirmed COVID-19 cases stemmed primarily from European and United States sources, according to the first molecular epidemiology study of SARS-CoV-2 from researchers at the Icahn School of Medicine at Mount Sinai, including CUNY SPH Ph.D. student Brianne Ciferri.
The study, which was published Friday in Science, is the first to trace the source of these cases and show that the SARS-CoV-2 outbreak in New York City arose mostly through untracked transmission between the United States and Europe, with limited evidence to support any direct introductions from China, where the virus originated, or other locations in Asia.
The researchers also documented early community spread of SARS-CoV-2 in New York City during that time.
New York City has become one of the major epicenters of SARS-CoV-2 infections in the U.S., with nearly 17,000 fatalities in the metropolitan area. Knowing when the virus came to New York and the route it took is critical for evaluating and designing containment strategies.
The research team sequenced the virus causing COVID-19 in patients seeking care at one of the hospitals of the Mount Sinai Health System. Phylogenetic analysis of 84 distinct SARS-CoV-2 genomes indicated multiple independent but isolated introductions, mainly from Europe and other parts of the United States.
Clusters of related viruses found in patients living in different neighborhoods suggested that community spread was already underway by March 18.
“Our study provides unexpected insights into the origin and diversity of this new viral pathogen,” Ciferri says. “We found clear evidence for multiple independent introductions into the larger metropolitan area from different origins in the world as well as the U.S.
“Additionally, we identified strain clusters in different neighborhoods across the city, suggesting that untracked community transmission was already underway prior to March 18.
“Our findings highlight the crucial need for early public health response in the event of a novel emerging pathogen. Hopefully, the evidence we have uncovered concerning the early spread and introduction into what became the national epicenter will serve as guidance for future public health efforts in early stages of pandemic response.”
The researchers found more than 700 mutations, of which almost two-thirds resulted in a change in the amino acid sequence of the protein. The rest were in the intergenic regions. Thirty-nine non-synonymous mutations had a prevalence above 0.06%, or appeared in at least 20 of the analyzed genomes.
These mutations were found in six genes: replicase polyprotein (ORF1ab), spike protein, membrane glycoprotein, nucleocapsid phosphoprotein, ORF3, and ORF8. The largest number of non-synonymous mutations was in the ORF1ab gene, which encodes 16 non-structural proteins.
Among these, NSP3, NSP12, and NSP2 carried the most mutations, with 117, 61, and 61, respectively. ORF1ab itself accounts for over half of the frequent mutations, with 22 mutations in the RNA-dependent RNA polymerase, helicase, proteinase, endoRNase, exonuclease, and transmembrane domains. Replication errors must be corrected rapidly and accurately, and both NSP2 and NSP3 are required for this to happen.
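The distinction above between mutations that change the amino acid sequence (non-synonymous) and those that do not (synonymous) comes down to a lookup in the genetic code. The sketch below is a minimal illustration, not the study's pipeline: the function name is hypothetical, the codon table is deliberately partial, and the example codons were chosen for illustration rather than taken from the paper.

```python
# Partial standard genetic code: only the codons needed for this illustration.
# A real analysis would use the complete 64-codon table.
CODON_TABLE = {
    "GAT": "D", "GAC": "D",  # aspartate
    "GGT": "G",              # glycine
    "TTT": "F", "TTC": "F",  # phenylalanine
}


def classify_substitution(ref_codon, alt_codon):
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'non-synonymous'. (Hypothetical helper for illustration.)"""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


# An aspartate-to-glycine change (the kind written as D...G) alters the protein.
print(classify_substitution("GAT", "GGT"))
# A phenylalanine codon swapped for another phenylalanine codon does not.
print(classify_substitution("TTT", "TTC"))
```

A label such as D614G follows the same logic: the reference amino acid, the position in the protein, and the amino acid that replaces it.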
There were ten hotspot mutations at hypervariable domains, found at a frequency of over 0.10. One especially frequent mutation was D614G, within the gene encoding the spike protein, found in 44% of genomes. Another major hotspot mutation was L84S in ORF8, found in 32%. Four of the hotspot mutations were in the ORF1ab gene, each present in 11% to 17% of the genomes.
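The hotspot definition above (a mutation found at a frequency of over 0.10) amounts to counting how many genomes carry each mutation and dividing by the total. The following sketch shows that arithmetic on made-up data. The genome IDs, the `GENE:MUTATION` labels, and the function names are assumptions for illustration, and the toy frequencies are not the study's figures.

```python
from collections import Counter


def mutation_frequencies(genomes):
    """Map each mutation to the fraction of genomes that carry it.

    `genomes` maps a genome ID to the set of mutations found in that genome.
    """
    counts = Counter(m for muts in genomes.values() for m in muts)
    total = len(genomes)
    return {m: c / total for m, c in counts.items()}


def hotspots(genomes, threshold=0.10):
    """Return mutations whose frequency exceeds `threshold`, sorted by name."""
    freqs = mutation_frequencies(genomes)
    return sorted(m for m, f in freqs.items() if f > threshold)


# Toy data set of 20 hypothetical genomes.
toy = {f"g{i}": set() for i in range(20)}
for i in range(9):
    toy[f"g{i}"].add("S:D614G")          # 9 of 20 genomes
for i in range(9, 15):
    toy[f"g{i}"].add("ORF8:L84S")        # 6 of 20 genomes
toy["g19"].add("ORF1ab:rare_example")    # 1 of 20 genomes, below threshold

print(hotspots(toy))
```

Only the two mutations carried by more than 10% of the toy genomes are reported; the rare one is filtered out.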
Mapping the geolocations
Only about 100 of the genomes analyzed were wild type, and most of those came from China. Mutant genomes, by contrast, came from around the world and accounted for almost 3,000 strains with varying genotypes.
The United States had the highest number of mutations, at 316. These included US-specific singleton mutations (each occurring only once in the population), which made up a quarter of all the mutations, while Chinese mutations accounted for half this number. Almost every American genome had one or more of seven mutations.
Singleton mutations arise when a single strain diverges from the original strain under environmental, host, and serial passage factors, because of copying errors introduced by the viral RNA-dependent RNA polymerase.
Among the 59 countries that contributed to mutant genomes, 26 had singleton mutations. Most of the genomes had multiple mutations.
Three of these mutations, G251V (in ORF3a), L84S (in ORF8), and S5932F (in ORF1ab), were found on every continent except Africa and Australia. Three others, F924F and L4715L (in ORF1ab) and D614G (in spike), as well as an intergenic variant, were present in all except Asian strains.
Common mutations were also observed between Algerian and European strains, and between European and Dutch genomes, which showed ten recurrent mutations. African and Australian genomes shared mutations at four positions, and shared two positions with Asian genomes.
The most significant variability was seen in Australia, New Zealand, and the US.
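The singleton counts discussed in this section rest on a simple definition: a mutation seen in exactly one genome of the whole data set. A minimal sketch of how such mutations could be tallied by country follows. The record layout, the mutation names, and the function name are hypothetical, and the toy data bears no relation to the study's counts.

```python
from collections import Counter, defaultdict


def singletons_by_country(records):
    """Group singleton mutations by the country of the genome carrying them.

    `records` is a list of (country, set_of_mutations) pairs, one per genome.
    A singleton is a mutation that occurs in exactly one genome overall.
    """
    counts = Counter(m for _, muts in records for m in muts)
    result = defaultdict(list)
    for country, muts in records:
        for m in sorted(muts):
            if counts[m] == 1:
                result[country].append(m)
    return dict(result)


# Four hypothetical genomes; "mutA" is shared, the others each occur once.
toy_records = [
    ("USA", {"mutA", "mutB"}),
    ("USA", {"mutA", "mutC"}),
    ("China", {"mutA", "mutD"}),
    ("Spain", {"mutA"}),
]

print(singletons_by_country(toy_records))
```

Countries whose genomes carry only shared mutations, like Spain in the toy data, simply do not appear in the result.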
Tracking mutations over time
The researchers saw mutations accumulate at a constant rate over time, although the most recently collected strains showed a small increase compared to the rest. More mutations appeared at the end of January and in early April, and the mutations with the highest frequency were first seen in late February.
Phylogenetic tracing
When the mutations were used to align the viral strains phylogenetically, three clades were distinguished, with several closely related strains being found in different countries. This relatedness can be used to infer how and when the virus was transferred between countries, as well as its routes of spread. The phylogenetic tree also shows that the virus reached the US by multiple routes multiple times, with the first introduced genome being similar to the strain that caused the second wave of cases in China.
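The article does not describe the tree-building method, and real phylogenetic inference is considerably more involved than anything shown here. As a much simpler stand-in, the sketch below measures how many mutations separate two genomes and reports the closest pair, which is the intuition behind spotting closely related strains in different countries. All names and data are hypothetical.

```python
from itertools import combinations


def mutation_distance(a, b):
    """Number of mutations present in one genome but not the other."""
    return len(a ^ b)


def closest_pairs(genomes):
    """Return (pairs, distance) for the genome pairs that differ the least.

    `genomes` maps a genome ID to its set of mutations. This is a crude
    similarity measure, not a substitute for phylogenetic tree inference.
    """
    dists = {
        (x, y): mutation_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }
    best = min(dists.values())
    return [pair for pair, d in dists.items() if d == best], best


# Hypothetical genomes labeled by country of collection.
toy_genomes = {
    "USA-1": {"m1", "m2"},
    "China-1": {"m1", "m2"},
    "Italy-1": {"m3", "m4"},
    "USA-2": {"m3", "m4", "m5"},
}

print(closest_pairs(toy_genomes))
```

In the toy data, one US genome is identical to a Chinese genome while another sits one mutation away from an Italian genome, mirroring the idea of multiple introductions by different routes.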
Selective pressure
The researchers found that the ORF1ab gene, with its high rate of mutations, was subject to selective pressure, as was the spike protein gene. In both cases, the analysis indicated purifying (negative) selection.
There were eight sites under negative selection pressure and three under positive selection pressure in the ORF1ab gene. In the spike gene, there were seven sites under negative and one under positive selective pressure.
Modeling shows a single negatively selected site on the receptor-binding domain, indicating a lack of strong selective pressure on this part of the genome.
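The article does not say which method was used to detect selection. One widely used summary statistic is the ratio of the non-synonymous substitution rate (dN) to the synonymous rate (dS): values well below 1 point to negative (purifying) selection, and values well above 1 to positive selection. The sketch below assumes that framing. The function name and the tolerance band around 1 are arbitrary choices for illustration, and real site-level analyses rely on statistical tests rather than a fixed cutoff.

```python
def classify_selection(dn, ds, tolerance=0.1):
    """Label a site or gene by its dN/dS ratio.

    `tolerance` defines an arbitrary band around 1.0 treated as neutral;
    it is an illustrative assumption, not a published threshold.
    """
    if ds == 0:
        return "undefined"  # the ratio cannot be computed without synonymous changes
    omega = dn / ds
    if omega < 1 - tolerance:
        return "negative (purifying)"
    if omega > 1 + tolerance:
        return "positive"
    return "neutral"


print(classify_selection(0.1, 0.5))  # ratio 0.2
print(classify_selection(0.6, 0.2))  # ratio 3.0
print(classify_selection(0.5, 0.5))  # ratio 1.0
```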
Analyzing genome variation within and between species
The researchers built a pan-genome from the almost 1,200 protein sets encoded in the 115 genomes publicly available on the NCBI website. Of these, 83 genomes belonged to SARS-CoV-2.
There were 94 clusters of proteins, of which ten were shared between SARS-CoV-2 and three other betacoronaviruses: SARS-CoV and two bat coronaviruses.
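Once proteins have been grouped into clusters, a pan-genome comparison reduces to set operations: the pan-genome is every cluster seen in any genome, and the shared core is the set of clusters found in all of them. The sketch below assumes the clustering has already been done. The cluster IDs, virus labels, and function names are hypothetical, and the toy sets do not reproduce the study's 94 clusters.

```python
def pan_genome(genome_clusters):
    """All protein clusters found in at least one genome."""
    sets = list(genome_clusters.values())
    return set().union(*sets)


def core_clusters(genome_clusters):
    """Protein clusters found in every genome."""
    sets = list(genome_clusters.values())
    return set.intersection(*sets) if sets else set()


# Hypothetical cluster assignments for three viruses.
toy_clusters = {
    "SARS-CoV-2": {"c1", "c2", "c3", "c4"},
    "SARS-CoV": {"c1", "c2", "c5"},
    "bat-CoV-A": {"c1", "c2", "c3"},
}

core = core_clusters(toy_clusters)
accessory = pan_genome(toy_clusters) - core
print(sorted(core))
print(sorted(accessory))
```

Clusters outside the core, such as those unique to one virus in the toy data, are the ones that distinguish a species or strain from its relatives.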
How are mutations important?
Mutations generate variation in the genome, allowing viruses to evade host defenses and antiviral drugs. SARS-CoV-2 is relatively slow to mutate, which may make it easier to develop effective vaccines.
Mutations in the endosome-associated-protein-like domain of the NSP2 protein may make the novel coronavirus more easily transmissible than earlier epidemic viruses from this virus family.
The frequency of recurrent and non-synonymous mutations in the non-structural proteins NSP12 to NSP15, which are essential for correcting viral replication errors and are potential targets, may make it difficult to develop vaccines based on these genes.
In most situations, genomic variation increases viral spread and the ability to cause disease, through the accumulation of mutations that increase the virulence of the virus. Spike mutations may present as changes in pathogenicity, with the V367F mutation, for instance, enhancing the affinity of the protein for the ACE2 receptor.
Moreover, studying genomic variation among strains allows the emergence of mutations across time and place to be visualized. The current findings, for instance, show that the distribution of single nucleotide polymorphisms (SNPs) is not random but is concentrated in those genes that are essential for the virus.
Co-occurring mutations are also common. The ‘founder mutation’ that arose in the US gave rise to multiple singleton mutations. On the other hand, many specific mutations are found in the strains circulating in Spain, Italy, and the US, accounting for the high rate of rapid spread and the severity of illness.
The Mac1 domain of NSP3, which contains a negatively selected site, is not essential for RNA replication but may be required for immune evasion. It could also be involved in viral replication in the presence of a host influence.
Changes at negatively selected sites are likely to impair viral functioning, which indicates the usefulness of these sites in drug or vaccine design, since they are more likely to be conserved and hence persist unchanged.
More information: Ana S. Gonzalez-Reiche et al. Introductions and early spread of SARS-CoV-2 in the New York City area, Science (2020). DOI: 10.1126/science.abc1917