Microbial evolutionary genomics

The evolutionary roles of repeats

Organisms often face stress in nature, whether from a lack of nutrients, the arrival of competitors or predators, or the presence of toxic substances. The survival of stressed organisms can be facilitated by the induction of a stress response that is specific to the individual stressor (or the class of stress encountered), by a general stress response, or by the generation of variability allowing the recovery of an adaptive change. Dedicated stress responses are stably kept only if the stresses they tackle are sufficiently frequent. Not all stresses can be tackled in a deterministic way. The reaction to the stress must sometimes include a stochastic element allowing the generation of a range of responses. In such cases, adaptation depends on strategies that generate the stochastic variability required for natural selection. Much of this variability is generated by intra-chromosomal recombination between repeated DNA sequences. This is why I’ve been studying the presence and evolution of repeats in bacterial genomes. Recombination between repeats has been shown to play a major role in the host-parasite association between humans and bacterial pathogens. Recombination in the genomes of the latter allows them to vary the proteins that are targeted by the immune system and to exploit different host polymorphisms and tissues. Interestingly, the immune system also uses recombination to generate variability allowing it to counteract the action of pathogens. Thus, the arms race between bacteria and the immune system leads to stress in both organisms, which is partly tackled in a similar way: by stimulating recombination capable of generating adaptive changes. Our work has focused on the census of repeats in genomes, on the role of repeats in sequence variation in pathogens and on the role of repeats in generating mutator genotypes.

Repeats census in prokaryotic genomes

Most (but not all) bacterial genomes contain many repeats large enough to engage in homologous recombination (Figure 1). We have searched for these in over 700 genomes of prokaryotes with a set of tools that were recently put together in the programs Repseek and Repeatoire. These analyses have revealed the immense diversity of repeats in genomes, from those created by selfish elements to the ones used for protection against selfish elements, from those arising from transient gene amplifications to the ones leading to stable duplications. Experimental work has shown that some repeats do not carry any adaptive value, while others allow functional diversification and increased expression. All repeats carry some potential to disorganize and destabilize genomes. Since recombination and selection for repeats vary between genomes, the number and types of repeats are also quite diverse and in line with ecological variables, such as host-dependent associations or population sizes, and with genetic variables, such as the recombination machinery. From an evolutionary point of view, repeats represent both opportunities and problems. We have therefore described how repeats are created and how they can be found in genomes.

Figure 1 - Identification of repeats classed as IS, Phage, rDNA and intergenic regions (IG). 50% of the identified repeats could not be confidently classified and thus are absent from the graph. (a) Repeat frequency. (b) Average size. (c) Repeat coverage. Repeat coverage is the fraction of the summed lengths of all repeats in a given category (from Treangen, FEMS Mic rev, 08).
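
To make the idea of a repeat census concrete, here is a minimal sketch of how exact direct repeats can be located in a DNA sequence by indexing fixed-length seeds and extending them. It is only an illustration under simple assumptions (exact matches, one strand, no statistical threshold) and is not the algorithm implemented in Repseek or Repeatoire; the function name and the min_len parameter are hypothetical.

```python
from collections import defaultdict


def find_exact_repeats(seq, min_len):
    """Return sorted (start1, start2, length) triples for exact direct
    repeats of at least min_len bases, each extended as far as possible.
    The two copies of a repeat may overlap (e.g. in tandem arrays)."""
    seq = seq.upper()
    # Index every seed of length min_len by its start positions.
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)

    repeats = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]  # i < j by construction
                # A seed pair that can be extended to the left belongs to a
                # longer repeat, which is reported from its leftmost seed.
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue
                length = min_len
                # Extend to the right while both copies still agree.
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                repeats.append((i, j, length))
    return sorted(repeats)


if __name__ == "__main__":
    unit = "GATCCGGATTAC"
    toy = "AA" + unit + "TTGCT" + unit + "GG"
    print(find_exact_repeats(toy, 8))  # [(2, 19, 12)]
```

Real genomes call for approximate matching, both strands and a length threshold based on statistical significance, which is what dedicated tools provide.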

Intragenic repeats and protein quaternary structures

The biologically active state of many proteins requires their prior homo-oligomerisation. Such complexes are typically symmetric, a feature which has been proposed to increase their stability and facilitate the evolution of allosteric regulation. We wished to examine the possibility that similar structures and properties could arise from genetic amplifications leading to internal symmetric repeats. For this we identified internal structural repeats in a non-redundant PDB subset using Swelfe. When testing whether repeats in proteins tend to be symmetric, we found that around half of the large internal repeats are symmetric, most frequently around a rotation axis of 180°. These repeats were most likely created by genetic amplification processes because they show significant sequence similarity. Symmetric repeats tend to have a fixed number of copies corresponding to their rotational symmetry order, i.e. 2 for a 180° rotation axis, whereas asymmetric repeats are found in longer proteins and show copy-number variability. When possible, we confirmed that proteins with symmetric repeats folding as an n-mer have homologues lacking the repeat with a higher oligomerisation number corresponding to the rotation symmetry order of the repeat. Phylogenetic analyses of these protein families suggest that typically, but not always, symmetric repeats arise in one single event from proteins that are homo-oligomers. These results suggest that oligomerisation and amplification of internal sequences can interplay in evolutionary terms because they result in functional analogues when the latter exhibit rotational symmetry.

Figure 2 - Example of two proteins with a 2-fold symmetry and their homologues lacking the repeat. a: superimposition of 1ddz and 1i6o; b: 1i6o, beta-carbonic anhydrase from Escherichia coli (chains in blue, slate, forest and limegreen); c: 1ddz, beta-carbonic anhydrase from Porphyridium purpureum (chains in yellow and orange); d: superimposition of 1h9m and 1fr3; e: 1fr3, molybdate/tungstate binding protein from Sporomusa ovata (chains in blue, slate, forest, limegreen, cyan and greencyan); f: 1h9m, molybdate-binding protein from Azotobacter vinelandii (chains in yellow, yelloworange and orange).
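
The link between a rotation angle and a symmetry order can be illustrated with a short sketch. Assuming that one copy of a repeat is superimposed onto the next by a rigid-body rotation described by a 3x3 matrix R, the rotation angle follows from the standard relation cos(theta) = (trace(R) - 1) / 2, and a repeat is compatible with an n-fold symmetry when theta is close to 360°/n. The function names, the tolerance and the maximum order below are illustrative assumptions, not the criteria used in our analysis or in Swelfe.

```python
import math


def rotation_angle_deg(R):
    """Rotation angle (0-180 degrees) of a 3x3 rotation matrix given as
    nested lists, using cos(theta) = (trace(R) - 1) / 2."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to [-1, 1] to guard against rounding errors.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def symmetry_order(angle_deg, tolerance_deg=10.0, max_order=8):
    """Return the order n (2..max_order) whose ideal angle 360/n is closest
    to angle_deg, or None if even the closest one is outside the tolerance.
    Both thresholds are arbitrary choices made for this sketch."""
    best = min(range(2, max_order + 1),
               key=lambda n: abs(angle_deg - 360.0 / n))
    if abs(angle_deg - 360.0 / best) <= tolerance_deg:
        return best
    return None


if __name__ == "__main__":
    # A 180 degree rotation about the z axis, i.e. a 2-fold axis.
    two_fold = [[-1.0, 0.0, 0.0],
                [0.0, -1.0, 0.0],
                [0.0, 0.0, 1.0]]
    angle = rotation_angle_deg(two_fold)
    print(angle, symmetry_order(angle))
```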

Evolution of transposable elements

Insertion sequences (ISs) are the smallest and most frequent transposable elements in prokaryotes, where they play an important evolutionary role by promoting gene inactivation and chromosome rearrangements. Their genomic abundance varies by several orders of magnitude for reasons that are largely unknown and widely speculated upon. We thus used genome data to test many of the previously proposed hypotheses, notably that IS abundance correlates with the frequency of horizontal gene transfer, genome size, pathogenicity, non-obligatory ecological associations and human-association. We re-annotated ISs in 262 prokaryotic genomes and tested these hypotheses, showing that, when using appropriate controls, there is no empirical basis for IS-family specificity, pathogenicity or human-association to influence IS abundance or density. Horizontal gene transfer seems necessary for the presence of ISs, but cannot alone explain the absence of ISs in more than 20% of the organisms, some of which show high rates of horizontal gene transfer. Gene transfer is also not a significant determinant of the abundance of IS elements in genomes, suggesting that IS abundance is controlled at the level of transposition and ensuing natural selection and not at the level of infection. Prokaryotes engaging in obligatory associations have fewer ISs when controlled for genome size, but this may be caused by some being sexually isolated. Surprisingly, genome size is the only significant predictor of IS numbers and density. Alone, it explains over 40% of the variance of IS abundance. Since we find that genome size and IS abundance correlate negatively with minimal doubling times, we conclude that selection for rapid replication cannot account for the few ISs found in small genomes. Instead, we show evidence that IS numbers are controlled by the frequency of highly deleterious insertion targets. Indeed, IS abundance increases quickly with genome size, which is the exact inverse of the trend found for the density of genes under strong selection, such as essential genes. Hence, for ISs, the bigger the genome the better.
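
The statement that one variable "explains" a fraction of the variance of another refers to the coefficient of determination (R²) of a regression. The sketch below computes it for a simple least-squares line. It is a generic illustration: the function name is hypothetical, the numbers in the example are invented, and the published analysis may have used other transformations or statistical models.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.
    Returns (slope, intercept, r_squared), where r_squared is the
    fraction of the variance of y explained by x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / syy


if __name__ == "__main__":
    # Invented values: genome sizes (Mb) and IS counts for six genomes.
    genome_size_mb = [0.6, 1.8, 2.9, 4.1, 5.2, 6.3]
    is_count = [0, 2, 15, 9, 40, 31]
    slope, intercept, r2 = linear_fit_r2(genome_size_mb, is_count)
    print(f"R^2 = {r2:.2f}")
```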

Intra-species variability of gene repertoires

The availability of 20 E. coli genomes and a close outgroup allowed us, for the first time, to reconstruct the evolutionary events within the species. We showed that only ~2000 genes are present in all 20 genomes even though the average genome contains ~4600 genes. Thus, the vast majority of the 18 000 orthologous families in the set are present in some, but not all, E. coli genomes. The average acquired DNA fragment in E. coli contains 4.3 genes whereas the losses average only 3 genes. Therefore, gains correspond to larger fragments and losses to more frequent events. The inference of ancestral genomes allows us to include time explicitly in the analysis, to analyse multiple genomes simultaneously and thus to time precisely the introduction or elimination of genetic information along the evolution of the lineages. An example is the recurrent loss or gain of genes in E. coli genomes. The reconstruction of the ancestral genomes allowed us to single out genes that are lost only within a group of non-monophyletic lineages sharing phenotypic traits. For example, we were able to show that genes associated with metabolism were systematically lost in parallel in the different lineages leading to Shigella.

Figure 3 - Frequency of pan-genome genes within the 20 E. coli genomes included in our analysis. At one extreme of the x-axis one finds the genes present in a single genome, which are regarded as strain-specific genes (9,054 genes: 51% of the pan-genome), while at the opposite end of the scale are the genes found in all 20 genomes, representing the E. coli core genome (1,976 genes: 11% of the pan-genome). Coloured rectangles represent the proportion of IS-like elements (yellow), prophage-like elements (green) and genes of unknown/unclassified function (white). Black rectangles represent genes for which one can assign a function (from Touchon, Plos Genetics, 09).
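
The notions of core genome, pan-genome and gene frequency spectrum used above can be sketched in a few lines, assuming that each genome has already been reduced to the set of orthologous families it contains. The function names and the toy identifiers are hypothetical; building the families themselves is the difficult part and is not shown.

```python
from collections import Counter


def pan_and_core(genomes):
    """genomes maps a genome name to the collection of gene families it
    carries. Returns (pan, core): families present in at least one genome
    and families present in every genome."""
    gene_sets = [set(genes) for genes in genomes.values()]
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return pan, core


def frequency_spectrum(genomes):
    """Return a Counter mapping k to the number of gene families present
    in exactly k genomes (k = 1: strain-specific, k = number of genomes:
    core)."""
    occurrences = Counter()
    for genes in genomes.values():
        occurrences.update(set(genes))  # count each family once per genome
    return Counter(occurrences.values())


if __name__ == "__main__":
    toy = {
        "strain_A": {"g1", "g2", "g3"},
        "strain_B": {"g1", "g2", "g4"},
        "strain_C": {"g1", "g5"},
    }
    pan, core = pan_and_core(toy)
    print(len(pan), sorted(core))            # 5 ['g1']
    print(sorted(frequency_spectrum(toy).items()))  # [(1, 3), (2, 1), (3, 1)]
```

The spectrum returned by frequency_spectrum is the kind of distribution plotted in Figure 3, with strain-specific families at one end and the core genome at the other.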

We find that less than half of E. coli genes are present in the first 20 sequenced genomes. E. coli K12 and S. enterica typhimurium have a divergence time estimated at around one million years, i.e. 10^8 generations, and a rearrangement rate higher than 10^-4 per generation. Yet, the relative order of the orthologous genes is practically identical. How can such high genome dynamics result in an organised genome? While 51% of the locations between contiguous core genes show no single insertion or deletion in any of the 21 genomes, we found 133 locations with an average of more than 5 non-core protein coding genes per genome. These locations accumulate 71% of the non-core pan-genome. This analysis revealed that in most genomes gene acquisition and loss take place at precisely the same locations, i.e. between the same two contiguous genes of the core genome. Therefore, hotspots correspond to regions of abundant and parallel insertions and deletions of genetic material. While the existence of large insertions and deletions in E. coli has been abundantly described, our data show that these events systematically take place at the same regions in different genomes.
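
The hotspot criterion described above (locations between contiguous core genes that hold, on average, more than 5 non-core genes per genome) can be sketched as follows. This is a simplified illustration that assumes every genome is given as an ordered list of gene family identifiers, that core genes appear in the same order in all genomes, and that the chromosome can be treated as linear; the function names and the min_mean parameter are hypothetical.

```python
from collections import defaultdict


def accessory_counts(gene_order, core):
    """Count non-core genes between each pair of contiguous core genes.
    Returns {(left_core, right_core): number_of_non_core_genes}."""
    counts = {}
    previous_core = None
    run = 0
    for gene in gene_order:
        if gene in core:
            if previous_core is not None:
                counts[(previous_core, gene)] = run
            previous_core = gene
            run = 0
        else:
            run += 1
    return counts


def find_hotspots(genomes, core, min_mean=5.0):
    """Return {interval: mean non-core genes per genome} for intervals whose
    mean exceeds min_mean. genomes maps a name to an ordered gene list."""
    totals = defaultdict(int)
    for order in genomes.values():
        for interval, n in accessory_counts(order, core).items():
            totals[interval] += n
    n_genomes = len(genomes)
    return {interval: total / n_genomes
            for interval, total in totals.items()
            if total / n_genomes > min_mean}


if __name__ == "__main__":
    core = {"c1", "c2", "c3"}
    toy = {
        "strain_A": ["c1"] + [f"a{i}" for i in range(8)] + ["c2", "c3"],
        "strain_B": ["c1"] + [f"b{i}" for i in range(4)] + ["c2", "b9", "c3"],
    }
    print(find_hotspots(toy, core))  # {('c1', 'c2'): 6.0}
```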

What leads to such hotspots? We found that 83% of the hotspots showed no tRNA at the edge of the element and most also lack integrases. Nearly two thirds of the hotspots (62%) lack prophages in all genomes. This seriously challenges the widely held view that E. coli integration hotspots are mostly determined by the bias of phage-like integrases to insert at tRNAs. What else could create such hotspots? Selection for the integrity of genetic elements, e.g. genes, and for genome organisation, e.g. operons, reduces the number of locations where large insertions can occur without significant loss of fitness. Once a permissive region acquires a large element, and since most transferred DNA has no adaptive value, subsequent integration in the region becomes more likely because the region offers a larger target for neutral insertion. The insertion of a large element in a permissive region will then result in a founder effect that amplifies the likelihood that the permissive region becomes a hotspot.

Figure 4 - Number of genes (ranging from 0 to 200) in indels along modern strains considering the ancestral gene order of the core genome. The numbers in the x-axis represent the order of genes in the core genome, which has the same order as E. coli K-12 MG1655 (from Touchon, Plos Genetics, 09).

Relevant references from the lab:

Abraham, AL, J Pothier, EP Rocha. 2009. An alternative to homo-oligomerisation: The creation of local symmetry in proteins by internal amplification. J Mol Biol 394:522-534.
Abraham, AL, EP Rocha, J Pothier. 2008. Swelfe: a detector of internal repeats in sequences and structures. Bioinformatics 24:1536-1537.
Achaz, G, F Boyer, EPC Rocha, A Viari, E Coissac. 2007. Repseek, a tool to retrieve approximate repeats from large DNA sequences. Bioinformatics 23:119-121.
Achaz, G, EPC Rocha, P Netter, E Coissac. 2002. Origin and fate of repeats in bacteria. Nucleic Acids Res 30:2987-2994.
Iverson-Cabral, SL, SG Astete, CR Cohen, EP Rocha, PA Totten. 2006. Intrastrain heterogeneity of the mgpB gene in Mycoplasma genitalium is extensive in vitro and in vivo and suggests that variation is generated via recombination with repetitive chromosomal sequences. Infect Immun 74:3715-3726.
Rocha, EPC. 2003. An appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction. Genome Res 13:1123-1132.
Rocha, EPC, A Danchin, A Viari. 1999a. Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol Biol Evol 16:1219-1230.
Rocha, EPC, A Danchin, A Viari. 1999b. Functional and evolutionary roles of long repeats in prokaryotes. Res Microbiol 150:725-733.
Rocha, EPC, I Matic, F Taddei. 2002. Over-representation of close repeats in stress response genes: a strategy to increase versatility under stressful conditions? Nucleic Acids Res 30:1886-1894.
Rocha, EPC, O Pradillon, H Bui, C Sayada, E Denamur. 2002. A new family of highly variable proteins in the Chlamydophila pneumoniae genome. Nucleic Acids Res 30:4351-4360.
Touchon, M, C Hoede, O Tenaillon, et al. 2009. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 5:e1000344.
Touchon, M, EP Rocha. 2007. Causes of insertion sequences abundance in prokaryotic genomes. Mol Biol Evol 24:969-981.
Treangen, TJ, AL Abraham, M Touchon, EP Rocha. 2009a. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS Microbiol Rev 33:539-571.
Treangen, TJ, AE Darling, G Achaz, MA Ragan, X Messeguer, EPC Rocha. 2009b. A novel heuristic for local multiple alignment of interspersed DNA repeats. IEEE/ACM Trans Comput Biol Bioinf 6:180-189.