DNA “echoes” of viruses that infected our ancestors millions of years ago could help the immune system to identify and kill cancer cells, according to new research from Crick scientists.
The new study, published in Genome Research, looked at “endogenous retroviruses,” fragments of DNA in the human genome that were left behind by viruses that infected our ancestors.
Over millions of years, our ancestors were infected with countless viruses and their DNA now makes up more of our genome than human genes. Approximately 8 percent of the human genome is made up of retroviral DNA, while known genes only make up 1-2 percent.
“This viral DNA typically lies dormant, as it is either non-functional or our bodies have evolved to suppress it,” explains Crick Group Leader Dr. George Kassiotis, who led the study. “However, when a cell becomes cancerous, some of these suppression mechanisms can fail and this ancient viral DNA can be reactivated. In this study, we looked for viral DNA that is reactivated by cancer and produces products that the immune system can see. The hope is that if we can train the immune system to spot these, we can selectively target cancer cells.”
Reawakening ancient DNA
Genes are pieces of DNA that contain instructions to produce proteins, which perform important functions in the cell or the body. These instructions are transcribed into RNA “messenger” molecules before the proteins are produced. However, this transcription process can be influenced by DNA outside the gene, including endogenous retroviruses.
To study the effects of endogenous retroviruses on transcription, the team looked at patient samples from 31 different cancer types using a technology called “RNASeq” that can read short, random fragments of RNA. However, as each “read” only delivers a small part of the sequence in an unknown order, it takes up to 50 million “reads” per sample to build a complete picture of transcriptional activity.
“Piecing together a full transcriptional profile is a monumental task,” says George. “It’s been likened to trying to read a magazine that’s been shredded into millions of pieces, when you don’t even know what the magazine was supposed to be about or what language it’s in. All you have is random fragments, so to piece them together you need to see where they overlap.”
The team used RNA sequencing data from 768 patient samples, with almost 40 billion reads to piece together. Even using sophisticated algorithms, a desktop computer would need to run constantly for 24 years to stitch this data together. To speed things up significantly, the researchers turned to the Crick’s specialist Scientific Computing team. Running the analysis on the in-house High Performance Computing cluster, they got results far quicker.
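The idea of piecing fragments together "where they overlap" can be illustrated with a toy sketch. This is not the pipeline used in the study; it is a naive greedy merge, and the function names, the `min_len` parameter and the example reads are all invented for illustration. Real de novo assemblers use far more scalable data structures, which is part of why the real task needs a computing cluster.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        # Compare every ordered pair: quadratic, fine for a toy example only.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


# Hypothetical reads, written in the DNA alphabet as sequencers report them.
reads = ["ATGGCT", "GCTAAC", "AACGTT"]
print(greedy_assemble(reads))
```

Because every pair of fragments is compared on every pass, the cost of this naive approach grows very quickly with the number of reads, which gives a feel for why tens of billions of reads cannot be handled on a desktop machine.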
Closing in on cancer
From the full transcriptional data, the team developed a catalog of over 130,000 different RNA transcripts produced by endogenous retroviruses, more than half of which had not been previously discovered. Of these, there were roughly 6,000 transcripts that were specifically found in cancer samples and not healthy tissue. Many of these were specific to the type of cancer, with most individual cancers expressing high levels of a few hundred transcripts.
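The filtering step described above, keeping transcripts seen in cancer samples but not in healthy tissue, might look something like the following sketch. The data layout, the `threshold` value, the label prefixes and the transcript names are all assumptions made for illustration; the article does not describe how the team defined "found in" a sample.

```python
def cancer_specific(expression: dict[str, dict[str, float]],
                    threshold: float = 1.0) -> dict[str, list[str]]:
    """Map each cancer-specific transcript to the cancer types expressing it.

    `expression[transcript][label]` is a hypothetical expression level, where
    `label` is either "cancer:<type>" or "healthy:<tissue>".
    """
    result = {}
    for transcript, levels in expression.items():
        # A transcript is disqualified if any healthy tissue expresses it.
        in_healthy = any(value >= threshold
                         for label, value in levels.items()
                         if label.startswith("healthy:"))
        cancers = sorted(label.split(":", 1)[1]
                         for label, value in levels.items()
                         if label.startswith("cancer:") and value >= threshold)
        if cancers and not in_healthy:
            result[transcript] = cancers
    return result


# Invented toy values, not data from the study.
toy = {
    "ERV_tx1": {"cancer:melanoma": 12.0, "healthy:skin": 0.0},
    "ERV_tx2": {"cancer:melanoma": 8.0, "healthy:skin": 5.0},
    "ERV_tx3": {"cancer:lung": 3.0, "cancer:melanoma": 0.2, "healthy:lung": 0.1},
}
print(cancer_specific(toy))
```

In this toy table, `ERV_tx2` is dropped because it is also expressed in healthy skin, while the other two are kept and tagged with the cancer type that expresses them.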
“We focused on melanoma-specific transcripts and applied an algorithm to predict which could code for material that is visible to the immune system,” explains George. “We found 14 candidate transcripts from 8 different regions of the genome that could produce unique cancer antigens. Together with the Proteomics team at the Crick and Nicola Ternette’s lab in Oxford, we inspected mass spectrometry data to see which of these antigens were present in real patient samples. This narrowed it down to nine unique peptides that could be visible to the immune system. We hope this approach could form the basis of future cancer therapies, if we can vaccinate the immune system to recognize and attack cancer cells presenting these peptides.”
George is one of the scientific co-founders of Ervaxx, a spin-out company from the Crick that aims to take this science into the clinic to help patients. Ervaxx is expanding on these foundational insights to create a pipeline of off-the-shelf, cancer-specific vaccines and other immunotherapies, including a lead product candidate focused on treating melanoma patients.
All cellular organisms have double-stranded DNA genomes. The origin of DNA and DNA replication mechanisms is thus a critical question for our understanding of early life evolution.
For some time, some molecular biologists believed that life originated with the appearance of the first DNA molecule!1 Watson and Crick even suggested that DNA was possibly replicated without proteins, wondering “whether a special enzyme would be required to carry out the polymerization or whether the existing single helical chain could act effectively as an enzyme”.2
Such an extreme conception was in line with the idea that DNA was the aperiodic crystal predicted by Schroedinger in his influential book “What is Life?”.3
Times have changed, and several decades of experimental work have convinced us that DNA synthesis and replication actually require a plethora of proteins.4 We are reasonably sure now that DNA and DNA replication mechanisms appeared late in early life history, and that DNA originated from RNA in an RNA/protein world.
The origin and evolution of DNA replication mechanisms thus occurred at a critical period of life evolution that encompasses the late RNA world and the emergence of the Last Universal Cellular Ancestor (LUCA) to the present three domains of life (Eukarya, Bacteria and Archaea).5–7
It is an exciting time to learn through comparative genomics and molecular biology about the details of modern mechanisms for precursor DNA synthesis and DNA replication, in order to trace their histories.
Origin of DNA
DNA can be considered as a modified form of RNA, since the “normal” ribose sugar in RNA is reduced into deoxyribose in DNA, whereas the “simple” base uracil is methylated into thymine.
In modern cells, the DNA precursors (the four deoxyribonucleotides, dNTPs) are produced by reduction of ribonucleoside di- or triphosphates by ribonucleotide reductases (fig. 1).
The synthesis of DNA building blocks from RNA precursors is a major argument in favor of RNA preceding DNA in evolution.
The direct prebiotic origin of DNA is theoretically plausible (from acetaldehyde and glyceraldehyde-5-phosphate) but highly unlikely, considering that evolution, as stated by F. Jacob, works like a tinkerer, not an engineer.8,9

The first step in the emergence of DNA was most likely the formation of U-DNA (DNA containing uracil), since ribonucleotide reductases produce dUTP (or dUDP) from UTP (or UDP) and not dTTP from TTP (the latter does not exist in the cell) (fig. 1). Some modern viruses indeed have a U-DNA genome,10 possibly reflecting this first transition step between the RNA and DNA worlds. The selection of the letter T probably occurred in a second step, dTTP being produced in modern cells by the modification of dUMP into dTMP by thymidylate synthases (followed by phosphorylation).11 Interestingly, the same kinase can phosphorylate both dUMP and dTMP.11 In modern cells, dUMP is produced from dUTP by dUTPases, or from dCMP by dCMP deaminases (fig. 1).11 This is another indication that T-DNA originated after U-DNA. In ancient U-DNA cells, dUMP might also have been produced by degradation of U-DNA (fig. 1).
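The conversions listed in this paragraph can be written down as a small directed graph, which makes the argument easy to check: every route to dTTP passes through a uracil-containing deoxyribonucleotide. The sketch below is only an illustration. The `STEPS` table encodes the reactions named in the text, the phosphorylation of dTMP up to dTTP is collapsed into a single step for simplicity, and the function name `route` is invented here.

```python
from collections import deque

# (substrate, product, enzyme or step), as named in the paragraph above.
STEPS = [
    ("UTP", "dUTP", "ribonucleotide reductase"),
    ("UDP", "dUDP", "ribonucleotide reductase"),
    ("dUTP", "dUMP", "dUTPase"),
    ("dCMP", "dUMP", "dCMP deaminase"),
    ("dUMP", "dTMP", "thymidylate synthase"),
    ("dTMP", "dTTP", "phosphorylation"),  # simplification: one combined step
]


def route(start: str, goal: str):
    """Breadth-first search for a chain of conversions from start to goal."""
    graph = {}
    for src, dst, how in STEPS:
        graph.setdefault(src, []).append((dst, how))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, how in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, how, nxt)]))
    return None  # no chain of listed reactions connects the two


for substrate, enzyme, product in route("UTP", "dTTP"):
    print(f"{substrate} --[{enzyme}]--> {product}")
```

Starting from UTP, the search returns the chain UTP, dUTP, dUMP, dTMP, dTTP, while no listed reaction leads back from dTTP to a ribonucleotide, mirroring the order of events proposed in the text.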
The origin of DNA also required the appearance of enzymes able to incorporate dNTPs using first RNA templates (reverse transcriptases) and later on DNA templates (DNA polymerases). In all living organisms (cells and viruses), all these enzymes work in the 5′ to 3′ direction. This directionality is dictated by the cellular metabolism that produces only dNTP 5′ triphosphates and no 3′ triphosphates. Indeed, both purine and pyrimidine biosyntheses are built up on ribose 5-monophosphate as a common precursor. The sense of DNA synthesis itself is therefore a relic of the RNA world metabolism. Modern DNA polymerases of the A and B families, reverse transcriptases, cellular RNA polymerases and viral replicative RNA polymerases are structurally related and thus probably homologous (for references, see a recent review on viral RNA-dependent RNA polymerases).12 This suggests that reverse transcriptases and DNA polymerases of the A and B families originated from an ancestral RNA polymerase that also has descendants among viral-like RNA replicases. However, there are several other DNA polymerase families (C, D, X, Y) whose origin is obscure (we will come back to this point below).
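The 5′ to 3′ directionality can be made concrete with a minimal sketch of reverse transcription. The function name and the example sequence are invented, and the pairing table assumes a modern T-containing product; an ancestral U-DNA product would pair A with U instead.

```python
# Base pairing between an RNA template and a modern (T-containing) DNA product.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna_5to3: str) -> str:
    """Return the complementary DNA strand, written 5' to 3'."""
    product = []
    # The template is read from its 3' end towards its 5' end...
    for base in reversed(rna_5to3):
        # ...while each new nucleotide is added to the 3' end of the product.
        product.append(RNA_TO_DNA[base])
    return "".join(product)


print(reverse_transcribe("AUGC"))
```

Reading the template backwards while appending to the growing strand is why the product is the reverse complement of the template: the two strands are antiparallel, and growth only ever happens at the 3′ end.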
If DNA actually appeared in the RNA world, it was a priori possible to imagine that formation of the four dNTPs from the four rNTPs was initially performed by ribozymes. Most scientists now reject this hypothesis, considering that the reduction of ribose cannot be accomplished by an RNA enzyme.9,13–19 Indeed, the removal of the 2′ oxygen of the ribose involves a complex reduction chemistry that requires the formation of stable radicals in ribonucleotide reductases. Such radicals would have destroyed the RNA backbone of a ribozyme by attacking the labile phosphodiester bond of RNA. Accordingly, DNA could only have originated after the invention of modern complex proteins, in an already elaborated protein/RNA world. This suggests that RNA polymerases were indeed available at that time to evolve into DNA polymerases (as well as kinases to phosphorylate dUMP).
Three classes of ribonucleotide reductases (I, II and III) have been discovered so far (for a review, see refs. 9, 16–19) (fig. 1). Although they correspond to three distinct protein families, with different cofactors and mechanisms of action, these mechanisms are articulated around a common theme (radical-based chemistry). In all cases, the critical step is the conversion of a cysteine residue into a catalytically essential thiyl radical in the active center.18 Recent structural and mechanistic analyses of several ribonucleotide reductases at atomic resolution have suggested that all ribonucleotide reductases originated from a common ancestral enzyme, favoring the idea that U-DNA was invented only once.17,18 It has been suggested that either class III (strictly anaerobic) or class II (anaerobic but oxygen tolerant) represents the ancestral form, and that new versions appeared in relation to different lifestyles by recruiting new mechanisms for radical activation (class III in strict anaerobes and class I in aerobes).9,18
The origin of U-DNA in a protein/RNA world logically implies that the second step in the synthesis of DNA precursors, the formation of the letter T, was catalyzed by an ancestral thymidylate synthase. For a long time, it was believed that modern thymidylate synthases were all homologues of the E. coli ThyA protein, indicating that the letter T was invented only once. However, comparative genomics has recently revealed that ThyA is absent in many archaeal and bacterial genomes, leading to the discovery of a new thymidylate synthase family (ThyX).19 ThyX and ThyA share neither sequence nor structural similarity with each other and have different mechanisms of action,19,20 indicating that thymidylate synthase activity was invented twice independently (fig. 1). T-DNA might have appeared either in two different U-DNA cells, or the invention of a second thymidylate synthase might have occurred in a cell already containing a T-DNA genome. The first possibility would indicate that T-DNA itself has been invented twice, thus suggesting a strong selection pressure for uracil modification. In the second case, one should imagine that the new enzyme (either ThyA or ThyX) brought a selective advantage over the previous one in the organism where it appeared first.
A major question is why DNA was selected to replace RNA. The traditional explanation is that DNA replaced RNA as genetic material because it is more stable and can be repaired more faithfully.4 Indeed, removal of the 2′ oxygen of the ribose in DNA has clearly stabilized the molecule, since this reactive oxygen can attack the phosphodiester bond (this explains why RNA is so prone to strand breakage). In addition, the replacement of uracil by thymine has made it possible to correct the deleterious effect of spontaneous cytosine deamination, since a misplaced uracil cannot be recognized in RNA, whereas it can be pinpointed as an alien base in DNA and efficiently removed by repair systems. Replacement of RNA by DNA as genetic material has thus opened the way to the formation of large genomes, a prerequisite for the evolution of modern cells.
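The deamination argument lends itself to a small illustration. In the sketch below, the function names and sequences are invented, and "repair" is reduced to simply spotting a uracil where the molecule's normal alphabet has none; real repair systems are of course enzymatic, not string checks.

```python
def deaminate(seq: str, position: int) -> str:
    """Simulate spontaneous deamination: cytosine at `position` becomes uracil."""
    if seq[position] != "C":
        raise ValueError("only a cytosine can be deaminated to uracil")
    return seq[:position] + "U" + seq[position + 1:]


def alien_uracils(seq: str, normal_bases: str) -> list[int]:
    """Positions where a uracil cannot be a legitimate base of this molecule."""
    if "U" in normal_bases:
        return []  # uracil is normal here, so damage is indistinguishable
    return [i for i, base in enumerate(seq) if base == "U"]


rna = deaminate("AUGCCA", 3)  # damaged RNA
dna = deaminate("ATGCCA", 3)  # damaged T-DNA

print(alien_uracils(rna, "ACGU"))  # nothing stands out in RNA
print(alien_uracils(dna, "ACGT"))  # the damaged position stands out in DNA
```

In the RNA case the damaged base looks like any other uracil, so the error is silent; in T-DNA the same lesion is the only uracil in the molecule and can be targeted for removal.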
The above scenario nicely explains why, through Darwinian competition, cell populations with DNA genomes finally eliminated cells with RNA genomes. However, this does not explain why the first organisms with a modified RNA (U-DNA), and later on with T-DNA, were successfully selected over the wild-type organisms of that time. Indeed, the possibility to have a large genome or to repair cytosine deamination could not have been realized in that individual. In both cases, efficient DNA repair (to remove uracil from DNA) and replication proteins able to replicate large DNA genomes should have evolved first in order for the cell to take advantage of the presence of DNA.15 To explain the origin of DNA, it is thus necessary to consider an advantage that could have been directly selected in the organism in which the transition occurred.
In order to solve this problem, it has recently been proposed that U-DNA first appeared in a virus, making this first U-DNA organism resistant to the RNAses of its host (fig. 2).6,7 Indeed, ribose reduction led to a drastic modification in the structure of the double helix (from the A to the B form) that explains why RNAses are usually inactive on DNA and DNAses inactive on RNA. Similarly, thymidylate synthase could have appeared later on in a virus with U-DNA, to make its genome resistant to cellular U-DNAses (fig. 2). The same process would have led to the modifications observed in modern DNA viruses (further base methylation in many viral genomes or hydroxymethylation of cytosines in T-even bacteriophages). These modifications are clearly designed to protect viral DNA against host DNAses. Interestingly, thymidylate synthases of the ThyA family are homologous to the DNA modification enzymes of T-even bacteriophages, the dCMP hydroxymethyl-transferases.21 Hydroxymethyl (HMC)-dCTP is directly incorporated into HMC-DNA by the viral polymerase (fig. 1).11 Restriction-modification systems could be descendants of such viral mechanisms for genome protection, some of them being stolen later on by cells themselves.

If DNA replication and repair mechanisms also originated in viruses, it is easy to imagine that enzymes to correct cytosine deamination are of viral origin, and were later on transferred to cells, a prerequisite to understand the selective advantage of DNA cells over RNA cells in terms of faithful replication (see a discussion of this problem in ref. 15). Several scenarios are possible for the transfer of a DNA genome from a virus to a cell: either a cell succeeded in capturing several viral enzymes at once to change its genetic material from RNA to DNA, or a large DNA provirus, living in a carrier state inside an RNA cell, finally took over all functions of its host by retro-transcription, subsequently eliminating the labile RNA genomes.
The idea that viruses have played a critical role in the origin of DNA is in line with the previous conception that retroviruses were relics of the RNA/DNA world transition.22 In particular, production of DNA from an RNA genome in Hepadnavirus could reflect the ancient pathway leading from RNA to DNA.23 The invention of DNA by an RNA virus seems more likely than the invention of DNA by an RNA cell for protection against viral RNAses, because it was probably easier for a virus than for a cell to change the chemical nature of its genome at once. This is exemplified by the fact that viruses have managed to multiply with very different types of genetic material (ssRNA, dsRNA, ssDNA, dsDNA, modified DNA) whereas, apart from localized methylation, all types of cells have the same kind of dsDNA genomes.
The hypothesis of a viral origin for DNA could explain why many DNA viruses encode their own ribonucleotide reductase and/or thymidylate synthase. This is usually interpreted as the recruitment of cellular enzymes by viruses, but, if DNA appeared in viruses, the opposite could be true as well. Many viral ribonucleotide reductases and thymidylate synthases branch far off from ribonucleotide reductases and thymidylate synthases of their hosts in phylogenetic trees, suggesting that the viral versions of these enzymes are indeed as ancient as their cellular versions (fig. 3A, 3B, 3C). Unfortunately, the direction of ancient transfer of these enzymes (either from cells to viruses or from viruses to cells) is difficult to determine, considering possible artifacts of long branch attraction that can be produced by differences in evolutionary rates between cellular and viral enzymes, i.e., viral sequences can be artificially separated from cellular ones because the latter have evolved more slowly and thus have conserved more common ancestral positions.

Figure 3A, 3B, 3C
Phylogenetic trees of the ribonucleotide reductases of Class I and II (A); type II DNA topoisomerase of the A family (B, left), thymidylate synthases of the ThyA family (B, right) and DNA polymerases of the B family (RNA-primed) (from ref. …)
As previously mentioned, there is also a striking evolutionary connection at the structural level between most viral RNA dependent RNA replicases and some modern DNA polymerases.12,24 Interestingly, an ancient origin of viral DNA replication mechanisms (possibly predating cellular ones) (fig. 2) would explain why enzymes involved in viral DNA replication are often very different from their cellular counterparts (see ref. 25 for the case of DNA polymerases) (see below for further discussion of this point).
These speculations on the origin of DNA fit well with hypotheses on viral origin that no longer consider viruses as fragments of genetic material recently escaped from their hosts, but as ancient players in life evolution, possibly predating the divergence between the three domains of life.26,27 The idea that viruses originated before LUCA has been recently supported by the discovery of structural and/or functional similarities between viruses infecting different cellular domains of life, such as those detected between some archaeal viruses (Lipothrixvirus and Rudivirus) and several large eukaryal DNA viruses (Poxviruses, ASFV, Chlorella viruses),28 between Adenoviruses (eukaryal virus) and bacterial Tectiviruses,29 or between eukaryal Flavivirus and bacterial Cystoviruses.30
More information: Jan Attig et al. LTR retroelement expansion of the human cancer transcriptome and immunopeptidome revealed by de novo transcript assembly, Genome Research (2019). DOI: 10.1101/gr.248922.119
Journal information: Genome Research
Provided by The Francis Crick Institute