A publication in Nature by the Institut Pasteur and the CNRS*, in association with researchers from South Africa, proposes a new phylogenetic bootstrap method. The paper was inspired by one of the most cited publications in the history of science (more than 35,000 citations), written by Joseph Felsenstein in 1985. Felsenstein described the first phylogenetic bootstrap technique, which for more than 30 years has proven to be extremely useful and relevant in a huge number of fields. With the emergence of big data in biology and high-throughput sequencing, however, the limitations of Felsenstein's method have become apparent, as it is often unable to reveal the signals contained in large datasets. The method proposed by Olivier Gascuel's team addresses this weakness. The article in Nature demonstrates the accuracy of the proposed method for large alignments of sequences from mammals and HIV and for simulated datasets.
*Center of Bioinformatics, Biostatistics and Integrative Biology, USR 3756, IP and CNRS (INSB, INS2I and INEE)
In the 1930s, Theodosius Dobzhansky was one of the first scientists to understand the links between Darwin's theories and genetics. In 1973, he published a famous essay entitled Nothing in Biology Makes Sense Except in the Light of Evolution. The emergence of big data from DNA sequencing has confirmed his predictions: evolutionary methods have become essential for studying and understanding biological entities at all levels, from molecular research, functional genomics and protein families right up to populations and ecosystems, and including efforts to understand and monitor disease outbreaks.
Evolutionary trees and phylogenetic reconstruction
A vital tool in this research is phylogenetic reconstruction. By using DNA or protein sequences, scientists are able to reconstruct evolutionary trees in the Darwinian sense, showing the lineage and genealogy of various biological entities, named "taxonomic units" or "taxa". "Homologous" sequences are descended from a common ancestral sequence. Variations in these sequences are used to reconstruct their history and evolutionary relationships, and by extension those of the taxa they belong to. Phylogenetic reconstruction methods have changed a great deal in the past 30 years, and we now have highly sophisticated algorithms to infer phylogenetic trees from large sets of homologous sequences. Phylogenetic algorithms and software are essential tools for bioinformatics, especially PhyML (Guindon and Gascuel 2003; Guindon et al. 2010), which has been cited more than 20,000 times (see Google Scholar).
The reliability of evolutionary tree branches
Once phylogenies have been reconstructed, the question arises as to their reliability. Which parts of the tree can be considered as certain, and which parts only reflect the noise inherent in any data? A similar situation arises with numerical estimations, for which error bars and confidence intervals are calculated. But here the challenge is greater because the estimation is in the form of a tree, a mathematical object that is much more complex than a simple numerical value. Joseph Felsenstein was the first to propose a useful method, inspired by the conventional statistical bootstrapping technique introduced and developed by Bradley Efron in the early 1980s. The technique involves generating variability by resampling the original data, then using these new pseudo-samples, or "bootstrap samples", to calculate new estimations, producing a distribution of estimated values rather than a single estimation that corresponds to the original sample.
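The resampling idea described above can be sketched for a simple numerical estimate. This is a minimal illustration, not code from the paper: the data values, the function names `bootstrap_means` and `percentile_interval`, and the choice of a percentile interval are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_replicates=1000, seed=0):
    """Resample `sample` with replacement and return the mean of each pseudo-sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_replicates):
        # A bootstrap sample has the same size as the original sample.
        pseudo_sample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(pseudo_sample))
    return means

def percentile_interval(values, level=0.95):
    """Return a simple percentile interval from the bootstrap distribution."""
    ordered = sorted(values)
    low_index = int((1 - level) / 2 * (len(ordered) - 1))
    high_index = int((1 + level) / 2 * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

# Hypothetical measurements, invented for the example.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(f"original mean = {statistics.fmean(data):.2f}")
print(f"95% bootstrap interval = [{low:.2f}, {high:.2f}]")
```

The output is a distribution of 1,000 estimated means rather than a single value, which is what makes an error bar possible.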
The usefulness of Felsenstein's method
The approach proposed by Felsenstein is similar: the pseudo-samples (taken from the initial sequence set) are used to infer "bootstrap trees", which are compared to the original tree. Branches of the original tree that occur in a high proportion of bootstrap trees have a high level of statistical support, and conversely branches that occur rarely or not at all have low statistical support. The usefulness, simplicity and interpretability of this method led to its widespread use in evolutionary research, to such an extent that it is generally required for the publication of phylogenies. With 35,000 citations, Felsenstein's article is ranked in the top 100 most cited scientific papers of all time. In 2017, it was cited more than 2,000 times (see Google Scholar).
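For sequence data, the pseudo-samples are usually built by drawing the columns (sites) of the alignment with replacement. The sketch below shows only this resampling step, assuming an alignment stored as a dictionary of equal-length strings; the taxon names, the sequences and the function name `resample_alignment` are hypothetical, and the inference of a tree from each pseudo-alignment (for example with a program such as PhyML) is left out.

```python
import random

def resample_alignment(alignment, rng):
    """Return a pseudo-alignment of the same shape, with sites drawn with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # The same column indices are used for every taxon, so each column stays intact.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}

# Toy alignment, invented for the example.
alignment = {
    "taxon_A": "ACGTACGT",
    "taxon_B": "ACGTACGA",
    "taxon_C": "ACGAACGA",
    "taxon_D": "TCGAACGA",
}

rng = random.Random(42)
pseudo_alignments = [resample_alignment(alignment, rng) for _ in range(100)]
print(pseudo_alignments[0])
```

Each pseudo-alignment would then be passed to the same tree inference method as the original data, giving one bootstrap tree per replicate.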
A weakness in Felsenstein's method in the big data era
However, recent high-throughput sequencing techniques frequently generate large datasets containing hundreds or thousands of sequences, and Felsenstein's bootstrap proportions (termed FBPs) tend to be very low for phylogenetic trees with a high number of taxa. This decrease can be explained by the very nature of Felsenstein's bootstrap technique. A branch of a bootstrap tree is only taken into account if it corresponds exactly to a branch of the original tree. A branch of a tree splits the taxa under study into two subsets situated on either side of the branch; two branches are considered identical if they induce the same bipartition. In Felsenstein's method, a difference of just one taxon means that the bootstrap branch will not be counted, despite being nearly identical to the original branch. The standard approach is to remove phylogenetically unstable taxa and restart the analysis, but this approach is statistically questionable and computationally expensive. Moreover, with large trees, all branches are likely to be (slightly) erroneous, and a significant proportion of taxa may be relatively unstable.
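The exact-match rule can be made concrete with a small sketch. It assumes that each branch is given as the set of taxa on one of its sides and that each bootstrap tree is given as a list of some of its internal branches; the toy taxa and the function names `bipartition` and `fbp` are invented for the illustration.

```python
def bipartition(side, all_taxa):
    """Canonical form of a branch: the side that does not contain a fixed reference taxon."""
    side = frozenset(side)
    reference = min(all_taxa)  # arbitrary but fixed choice of reference taxon
    return frozenset(all_taxa) - side if reference in side else side

def fbp(original_branch, bootstrap_trees, all_taxa):
    """Proportion of bootstrap trees containing exactly the same bipartition."""
    target = bipartition(original_branch, all_taxa)
    hits = sum(
        any(bipartition(branch, all_taxa) == target for branch in tree)
        for tree in bootstrap_trees
    )
    return hits / len(bootstrap_trees)

taxa = {"A", "B", "C", "D", "E", "F"}
original_branch = {"A", "B", "C"}  # the branch separating ABC from DEF

bootstrap_trees = [
    [{"A", "B", "C"}, {"A", "B"}],       # exact match
    [{"D", "E", "F"}, {"E", "F"}],       # same bipartition, seen from the other side
    [{"A", "B"}, {"A", "B", "C", "D"}],  # one taxon away: not counted
    [{"A", "B", "D"}, {"A", "B"}],       # one taxon away: not counted
]

print(fbp(original_branch, bootstrap_trees, taxa))  # 0.5
```

The last two bootstrap trees each contain a branch that differs from the original by a single taxon, yet they contribute nothing to the support.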
A method to reveal the signals contained in large datasets
The paper published in Nature proposes a new version of the phylogenetic bootstrap, in which the presence of original branches in bootstrap trees is measured using a gradual "transfer" distance, unlike Felsenstein's technique, which is based on a binary presence/absence index. The transfer distance is standardized, then averaged over all the bootstrap trees. This yields the TBE (transfer bootstrap expectation) branch support, which by construction is at least as high as the FBP support. When combined with a statistically sound inference method, TBE very rarely supports highly erroneous branches. The results on sequence datasets from mammals and HIV, as well as simulated datasets, clearly demonstrate the usefulness of the approach, especially for deep branches and large trees, where branches known to be essentially correct are supported by TBE but not by FBP. TBE supports are easily interpreted as fractions of unstable taxa, and the ability of TBE to identify the most unstable taxa (such as recombinant HIV sequences) means that these taxa can be studied in more detail with a view to understanding why they are phylogenetically unstable and revising the branch supports and overall phylogeny. The method was implemented in a web server, and the C source code has been made available on an open source basis.
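A minimal sketch of the transfer idea follows. It is not the authors' C implementation, and it rests on two stated assumptions: the transfer distance between two branches is taken to be the minimum number of taxa that must change sides to make their bipartitions identical, and the standardization divides the distance by p - 1, where p is the size of the smaller side of the original branch. The function names and toy trees are hypothetical, and the brute-force search over bootstrap branches is chosen for readability rather than speed.

```python
def transfer_distance(side_a, side_b, all_taxa):
    """Minimum number of taxa to move so that the two bipartitions become identical."""
    a, b = frozenset(side_a), frozenset(side_b)
    moved = len(a ^ b)  # taxa that disagree when side_a is matched with side_b
    # Matching side_a with the complement of side_b needs len(all_taxa) - moved moves.
    return min(moved, len(all_taxa) - moved)

def tbe(original_branch, bootstrap_trees, all_taxa):
    """Transfer-based support of one original branch (assumed normalization: p - 1)."""
    side = frozenset(original_branch)
    p = min(len(side), len(all_taxa) - len(side))
    if p < 2:
        raise ValueError("support is only defined here for internal branches")
    total = 0
    for tree in bootstrap_trees:
        closest = min(transfer_distance(side, branch, all_taxa) for branch in tree)
        # A leaf branch inside the smaller side is always at distance p - 1,
        # so the distance is capped there even though only internal branches are listed.
        total += min(closest, p - 1)
    return 1 - (total / len(bootstrap_trees)) / (p - 1)

taxa = {"A", "B", "C", "D", "E", "F"}
original_branch = {"A", "B", "C"}

bootstrap_trees = [
    [{"A", "B", "C"}, {"A", "B"}],       # identical branch present: distance 0
    [{"D", "E", "F"}, {"E", "F"}],       # identical branch present: distance 0
    [{"A", "B"}, {"A", "B", "C", "D"}],  # closest branch is one taxon away
    [{"A", "B", "D"}, {"A", "B"}],       # closest branch is one taxon away
]

print(tbe(original_branch, bootstrap_trees, taxa))  # 0.75
```

In this toy case only two of the four bootstrap trees contain the exact bipartition, so the binary presence/absence index gives 0.5, while the gradual measure gives 0.75 because the other two trees contain a branch that is a single taxon away.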
Renewing Felsenstein’s Phylogenetic Bootstrap in the Era of Big Data, Nature, April 26, 2018.
F. Lemoine1,2, J.-B. Domelevo Entfellner3,4, E. Wilkinson5, D. Correia1, M. Dávila Felipe1, T. De Oliveira5,6 & O. Gascuel1,7*
1. Evolutionary Bioinformatics Unit, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France.
2. Bioinformatics and Biostatistics HUB, C3BI USR 3756, Institut Pasteur & CNRS, Paris, France.
3. Department of Computer Science, University of the Western Cape, Cape Town, South Africa.
4. South African MRC Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa.
5. KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa.
6. Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa.
7. Methods and Algorithms for Bioinformatics, LIRMM UMR 5506, University of Montpellier & CNRS, Montpellier, France.