The Institut Pasteur's bioinformatics research teams have been closely involved in the response to COVID-19, working to analyze the sequencing data produced worldwide. On January 29, 2020, the Institut Pasteur, which is responsible for monitoring respiratory viruses in France, was the first in Europe to sequence the whole genome of the coronavirus known as SARS-CoV-2. Since then, it has sequenced more than 200 genomes. The results are used in phylogeographic and phylodynamic research (respectively tracing the geographical spread of the virus and using phylogenetics to determine the dynamics of the outbreak). A phylogenetic analysis conducted by the Institut Pasteur on around a hundred genomes from samples collected from patients in France between January 24 and March 24, 2020 revealed several early introductions of SARS-CoV-2 without local transmission, emphasizing the efficacy of the measures taken to prevent the spread of the virus from symptomatic cases. Given the scale of the pandemic, however, the race for knowledge calls for caution: scientific results must not be over-interpreted.
On April 28, an article was published in the journal PNAS on phylogenetic network analysis of SARS-CoV-2 genomes. The article proposed a history of the pandemic and the existence of three subtypes affecting different populations. Its conclusions were picked up by the mainstream press, but the method used in the study had a number of shortcomings. In a collective response published in the same journal on May 7, a number of scientists pointed to the difficulty of rooting the pandemic with the currently available data and emphasized the limitations of phylogeographic approaches given the sampling bias. Moreover, they highlighted the danger of over-interpreting results that are based on several hypotheses and limited data.
The original article by Forster et al. used a phylogenetic network method that is widespread in human genetics because it can incorporate recombination, but is very rarely used in molecular epidemiology. Virus recombination does occur, for example in HIV and coronaviruses in general, but with SARS-CoV-2, which only emerged approximately six months ago, no recombination in human hosts has been observed. Phylogenetic trees, rather than networks, would therefore seem to be the most suitable approach. Another particularity of the method employed by Forster et al. is that it is an exploratory method that provides a visual representation of data, rather than an inferential method that can be used to test a hypothesis and confirm or reject it with a certain degree of statistical confidence. So the scientists' results are primarily interpreted visually, and there are no indicators as to their robustness. A further limitation of the study is the very small number of genomes used (160), despite the fact that on the day the article was published there were 6,000 genomes available. At the very least, the analysis should have been performed a second time with different samples to verify the stability of the results. Nevertheless, the article was indeed published, with many interpretations, especially the existence of three subtypes: A, "the ancestral type" from bats and pangolins; B, "derived from A" and "adapted to a large section of the East Asian population"; and C, derived from B and affecting Europe and America. These interpretations were subsequently reported in several newspapers in the United States.
This over-interpretation of a small-scale study (160 genomes out of the thousands available) provoked a reaction from the international community. A group of nearly 40 international scientists wrote a letter, published in the same journal around 10 days after the article itself (two other letters were published the same day on the same topic). The letter reveals two major shortcomings in the study by Forster et al.:
- The difficulty of rooting the evolutionary history of the pandemic based on the sequences currently available. The human sequences have barely evolved since December 2019, when the first samples were taken. There are one to two mutations each month for an RNA genome that is around 30,000 nucleotides long. The sequences that are furthest removed from the very first Chinese sequences have around 25 mutations at the most. Conversely, the closest animal virus genome, from bats, has approximately 1,200 differences compared with the human virus. Given the random appearance of the mutations, we cannot say whether a given human sequence is significantly closer to the bat sequence and therefore represents THE ancestral sequence, as affirmed by Forster et al. This difficulty is compounded by the fact that the most frequent sequence found in China in December was also found perfectly conserved in countries including Taiwan, Japan, the United States and the United Kingdom until recently. So it is difficult to root the pandemic and identify its geographical origin. It would undoubtedly be possible to do so based on the course of events and documented cases of transmission, but much caution is needed.
- The sampling bias. In the study by Forster et al. there were a large number of Chinese sequences, but very few Italian sequences, for example. Today nearly half of the public sequences come from the United Kingdom. In both cases, the representations are biased, and phylogeographic methods are sensitive to this bias. They work on the basis of ancestral reconstruction: typically the root of a subtree for which the vast majority of leaves come from a given country will be assigned to that country, so the reconstruction inevitably inherits any bias in representation. If we now perform a naive analysis with half of the sequences coming from the United Kingdom, we will conclude that the pandemic "clearly" comes from the United Kingdom. It is not easy to compensate for this bias, but a robustness study is needed to ensure that the results are not overly sensitive to variations in geographical representation.
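The effect of sampling bias can be illustrated with a deliberately crude sketch. It assumes the majority rule described above (a subtree root is assigned to the country supplying most of its leaves) and uses made-up leaf counts; the function names and numbers are hypothetical. Real phylogeographic methods reconstruct ancestral states along a tree, which this toy example does not attempt.

```python
from collections import Counter

def naive_root_location(leaf_countries):
    """Assign a subtree root to the most frequent leaf country (majority rule).

    Returns the chosen country and the fraction of leaves supporting it.
    """
    counts = Counter(leaf_countries)
    country, n = counts.most_common(1)[0]
    return country, n / len(leaf_countries)

def downsample_evenly(leaf_countries, per_country):
    """Keep at most `per_country` leaves per country (a crude bias correction)."""
    kept = Counter()
    balanced = []
    for country in leaf_countries:
        if kept[country] < per_country:
            kept[country] += 1
            balanced.append(country)
    return balanced

# Hypothetical leaf labels: half of the sequences come from one country,
# mimicking the over-representation described in the text.
leaves = (["United Kingdom"] * 50 + ["China"] * 20
          + ["United States"] * 20 + ["Italy"] * 10)

# Naive call on the biased sample: the over-sampled country wins.
biased_call = naive_root_location(leaves)

# After equalizing representation, no country dominates any more,
# so the earlier "clear" answer was driven by sampling alone.
balanced = downsample_evenly(leaves, per_country=10)
balanced_counts = Counter(balanced)

print(biased_call)
print(dict(balanced_counts))
```

Repeating the analysis under several such resampling schemes and checking whether the inferred origin changes is one simple form of the robustness study called for above.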
Phylogenetic analyses provide a wealth of information for the field of epidemiology. For SARS-CoV-2, they have been used by scientists to find the genomes of the closest animal viruses, to rule out the hypothesis that the virus was created by humans in a laboratory, and to obtain a broad idea of how the virus has moved between countries and continents. But they are based on hypotheses (such as geographical prevalence) and incomplete data (for example the rarity of Italian sequences). So we need to bear these limitations in mind, make them clear in scientific articles and reiterate them to journalists working for mainstream press outlets. In this respect phylogenetics is no different from other scientific approaches, but caution is particularly crucial when it comes to sensitive topics like the COVID-19 pandemic.
Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable, Proc Natl Acad Sci USA, May 7, 2020
Mavian C, Pond SK, Marini S, Magalis BR, Vandamme AM, Dellicour S, Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, Trovão NS, Boucher C, Zhang Y, Scheuermann RH, Gascuel O, Lam TT, Suchard MA, Abecasis A, Wilkinson E, de Oliveira T, Bento AI, Schmidt HA, Martin D, Hadfield J, Faria N, Grubaugh ND, Neher RA, Baele G, Lemey P, Stadler T, Albert J, Crandall KA, Leitner T, Stamatakis A, Prosperi M, Salemi M.