Microbial evolutionary genomics

Trade-offs between dynamics and organization

The stability of genomes results from a mutation-selection balance. As a result of different selection pressures, effective population sizes, and rearrangement rates, some genomes are significantly more stable than others. Rearrangements in Escherichia coli are very frequent in the lab, but few have been fixed since its divergence from Salmonella enterica about 100 MY ago. This is usually thought to result from natural selection eliminating these deleterious events. Hence, the frequency of rearrangement events and the subsequent purging of deleterious ones by natural selection determine the long-term conservation of gene order in bacterial genomes.

Measuring genome stability

Most characterizations of genome stability have relied on comparisons of a small number of closely related genomes. This poses a problem if one wants to compare stability among very distant genomes, which is strictly necessary to test the association of stability with ecological and genetic aspects. There are currently two major methodological approaches to the study of gene order. One approach aims at determining the rearrangement distance between two genomes. This distance is an estimate of the number of rearrangement events that have taken place since the lineages diverged, and it allows the analysis of evolutionary scenarios. However, this approach is not yet adapted to the analysis of large sets of distant bacterial genomes. Hence, most large-scale analyses of gene order evolution have used pairwise comparisons of gene order. Based on these, we have developed an index of genome stability that models the loss of gene order through time while accounting for the organization of genomes into operons.

Gene order conservation (GOC) is defined as the probability that the orthologues of a pair of contiguous genes are also contiguous in an extant genome at a given evolutionary distance. GOC measures the similarity of gene order among the orthologues of two genomes, i.e. after removing the effects of gene transfer and gene deletion. By analysing over a hundred genomes and fitting different models, we found the best model to be the one where some pairs of genes are allowed to separate quickly, whereas others separate slowly (Figure 1). Presumably this results from different intensities of selection. Indeed, when one analyses pairs of genes inside and between contiguous operons, one finds that the former are much more conserved. Thus, at a local level the organization of genes into operons is the major constraint acting against the fixation of genome rearrangements. Yet, even for pairs of genes between operons, some selection seems to take place. For example, in E. coli the rate of rearrangements that disrupt contiguity between genes in different operons is three orders of magnitude lower than the rate observed in the laboratory.
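The definition of GOC above can be illustrated with a minimal Python sketch. It is not the published implementation: the function name, the gene identifiers, the one-to-one orthologue map and the `circular` option are assumptions made for illustration. Genes without an orthologue in the other genome are dropped before contiguity is assessed, which corresponds to removing the effects of gene transfer and gene deletion.

```python
def gene_order_conservation(order_a, order_b, orthologs, circular=True):
    """Fraction of contiguous orthologue pairs of genome A that are also
    contiguous in genome B (illustrative sketch, hypothetical interface).

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: dict mapping a gene of A to its single orthologue in B
               (a one-to-one mapping is assumed).
    """
    # Restrict genome B to genes that are orthologues of genes in A.
    targets = {orthologs[g] for g in order_a if g in orthologs}
    shared_b = [g for g in order_b if g in targets]
    pos_b = {g: i for i, g in enumerate(shared_b)}

    # Restrict genome A to genes whose orthologue is present in B.
    shared_a = [g for g in order_a if g in orthologs and orthologs[g] in pos_b]
    n = len(shared_a)
    if n < 2:
        raise ValueError("at least two shared orthologues are required")

    # Contiguous pairs in A, after removal of genes lacking orthologues.
    pairs = list(zip(shared_a, shared_a[1:]))
    if circular and n > 2:
        pairs.append((shared_a[-1], shared_a[0]))

    conserved = 0
    for g1, g2 in pairs:
        gap = abs(pos_b[orthologs[g1]] - pos_b[orthologs[g2]])
        # Neighbours in B are one position apart, or at the two ends of a
        # circular chromosome.
        if gap == 1 or (circular and gap == len(shared_b) - 1):
            conserved += 1
    return conserved / len(pairs)
```

Treating the chromosome as circular adds the pair formed by the last and first shared genes; with `circular=False` only internal pairs are counted.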

Figure 1- Left. Non-linear regressions of GOC as a function of the phylogenetic distance between pairs of genomes (from Rocha, Mol Biol Evol, 2006). Pairwise phylogenetic distances were estimated from 16S rRNA subunit sequences, using maximum likelihood under the HKY+Γ model. Right. The similarity between genomes in terms of a given trait, e.g. gene order, decreases with a characteristic shape as divergence time increases, until changes reach saturation. If one genome of a set systematically deviates from the average trend, this indicates excessive conservation or divergence in terms of the trait. This may be associated with different rates of change in different genomes, e.g. different rearrangement rates, or with selection, e.g. different degrees of selection for genome organization.

Each genome participates in a subset of the pairwise comparisons used in the non-linear regression analyses. The stability of each genome is then defined as the average of the residuals of the non-linear regression of the two-parameter model over the comparisons in which the genome participates (Figure 1). Hence, the stability of a genome, e.g. B. subtilis, is the average of the residuals for all pairwise comparisons in which B. subtilis participates. The stability thus calculated matches previous reports based on simple pairwise comparisons between genomes.
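The averaging of residuals per genome can be sketched as follows. This is a minimal illustration under stated assumptions, not the published code: the function name and the tuple layout are hypothetical, the fitted model is passed in as an arbitrary callable because its functional form is not given here, and residuals are taken as observed minus predicted GOC, so that positive values indicate more conservation than expected.

```python
from collections import defaultdict


def genome_stability(comparisons, model):
    """Average regression residual per genome (illustrative sketch).

    comparisons: iterable of (genome_a, genome_b, distance, observed_goc).
    model: callable returning the GOC predicted by the fitted regression
           at a given phylogenetic distance (assumed to be fitted elsewhere).
    """
    residuals = defaultdict(list)
    for genome_a, genome_b, distance, observed_goc in comparisons:
        # Assumed sign convention: observed minus predicted.
        residual = observed_goc - model(distance)
        # Each pairwise comparison contributes to both genomes involved.
        residuals[genome_a].append(residual)
        residuals[genome_b].append(residual)
    return {genome: sum(r) / len(r) for genome, r in residuals.items()}
```

Under this convention, a genome whose comparisons lie mostly above the fitted curve receives a positive stability, and one lying mostly below it a negative stability.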

Co-variates of stability

Lifestyle has often been associated with genome stability, with endomutualists being regarded as stable and pathogens as unstable. Using the abovementioned measure of stability, one finds no significant differences between endomutualists and other bacteria. On the other hand, pathogenic bacteria were found to be slightly more, not less, stable than free-living bacteria. This highlights the value of comparative genomics for testing evolutionary theories, where anecdotal observations are sometimes taken for rules in the absence of hard data. We also found that genomes containing higher densities of repeats are less stable, as expected but not previously demonstrated.

Figure 2- Regression of the stability (named deviation in a previous version of the method; Rocha, Trends Genet, 2003) as a function of putative gene order breakpoints (PGOB), i.e. the density of repeated elements.

Inverted repeats and genome stability

Since different positioning of repeats leads to different rearrangements, one might expect the distribution of repeats in genomes to be non-random. Although both direct and inverse repeats may generate sequence diversity by recombination, only the latter lead to chromosome inversions. Therefore, chromosome organisation can be partly reconciled with the existence of repeats if these are mostly in the direct orientation. Indeed, in more than 75% of the genomes there are significantly more direct repeats than inverse repeats, and we found no genome with significantly more inverse than direct repeats. This is consistent with selection for the presence of repeats that generate genetic variation at certain loci, counter-balanced by purifying selection against elements that disrupt chromosomal organisation. This trade-off is particularly remarkable in the comparison of Mycoplasma genitalium and Mycoplasma pneumoniae. These genomes rely on homologous recombination to produce genetic variability, but because they encode 80% of their genes on the leading strand, inversions are expected to be strongly counter-selected. Direct repeats outnumber inverse repeats by a factor larger than 9 in M. pneumoniae and larger than 50 in M. genitalium. As a result, translocations, but not inversions, are observed. Repeats also tend to be placed in the chromosomes around the origin of replication more symmetrically than expected. To what extent such repeats are the cause and/or the consequence of frequent symmetric recombination remains to be determined. In any case, and as described above, this distribution of repeats produces chromosome inversions that are less disorganising. All these results suggest that the trade-off between the creative and disorganising consequences of repeats leads to biased repeat positioning in genomes.
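One simple way to ask whether a genome contains significantly more direct than inverse repeats is an exact one-sided binomial test. The sketch below is an illustration only: the statistical test actually used in the analyses is not stated here, and both the function name and the null hypothesis (each repeat is equally likely to be direct or inverse) are assumptions.

```python
from math import comb


def direct_repeat_excess_pvalue(n_direct, n_inverse, p_direct=0.5):
    """One-sided exact binomial p-value for an excess of direct repeats.

    Returns the probability of observing at least n_direct direct repeats
    among n_direct + n_inverse repeats when each repeat is direct with
    probability p_direct (assumed null hypothesis, 0.5 by default).
    """
    n = n_direct + n_inverse
    if n == 0:
        raise ValueError("no repeats to test")
    # Sum the binomial probabilities of the upper tail.
    return sum(
        comb(n, k) * p_direct**k * (1 - p_direct) ** (n - k)
        for k in range(n_direct, n + 1)
    )
```

A small p-value indicates an excess of direct repeats relative to the assumed null; applying the test to many genomes would additionally require a correction for multiple testing.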

Mechanisms and functions of homologous recombination

The current knowledge of the mechanisms of homologous recombination, as with so many other topics in fundamental molecular microbiology, depends mostly on work done in E. coli and, to a lesser extent, in B. subtilis. About 20 genes are known to be involved in homologous recombination in E. coli and around 12 in B. subtilis. Interestingly, the mechanisms acting in both bacteria share many resemblances. As such, one would expect that the knowledge of homologous recombination mechanisms in these organisms could enlighten us about mechanisms operating in other genomes. Unfortunately, we have shown that this is not necessarily the case.

Homologous recombination follows two major pathways, RecBCD/AddAB and RecFOR, which both provide a 3'-terminated ssDNA molecule coated with RecA to allow the pairing of the heterologous DNA strands. We found that RecBCD/AddAB, which is responsible for a larger share of the recombination activity in E. coli, shows a narrower phylogenetic distribution. RecFOR is more widespread, but RecF is absent from many genomes. Surprisingly, many of the bacteria lacking elements of the RecBCD/AddAB and RecFOR pathways are known to engage in homologous recombination. Thus, while most genomes contain RecA and the elements required for the migration and resolution of Holliday junctions, many lack known proteins for the helicase/nuclease/RecA-coating activity. This suggests that some elements of the recombination machinery are still to be identified or that such genes are less necessary than commonly thought. Consequently, the recombination capacity of a cell cannot be systematically inferred from genome analysis alone.
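The kind of presence/absence survey described above can be sketched as a simple lookup over the genes annotated in a genome. This is a hypothetical illustration: the function name is invented, gene names are assumed to follow the E. coli and B. subtilis nomenclature, and homologue detection (the hard part in practice) is assumed to have been done beforehand.

```python
# Components of the pathways named in the text, using standard gene names.
PRESYNAPTIC_PATHWAYS = {
    "RecBCD": {"recB", "recC", "recD"},
    "AddAB": {"addA", "addB"},
    "RecFOR": {"recF", "recO", "recR"},
}


def recombination_profile(annotated_genes):
    """Report which known pathway components are present in a genome.

    annotated_genes: iterable of gene names detected in the genome
                     (assumed to come from a prior homology search).
    """
    genes = set(annotated_genes)
    profile = {"recA": "recA" in genes}
    for pathway, components in PRESYNAPTIC_PATHWAYS.items():
        missing = sorted(components - genes)
        profile[pathway] = {"complete": not missing, "missing": missing}
    return profile
```

As the text stresses, an incomplete profile should not be read as an inability to recombine, since many bacteria lacking these known elements still engage in homologous recombination.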

Relevant references from the lab:

Achaz, G, E Coissac, P Netter, EPC Rocha. 2003. Associations between inverted repeats and the structural evolution of bacterial genomes. Genetics 164:1279-1289.
Fischer, G, EPC Rocha, F Brunet, M Vergassola, B Dujon. 2006. Highly variable rates of genome rearrangements between Hemiascomycetous yeast lineages. PLoS Genet 2:e32.
Rocha, EPC. 2008. Evolutionary patterns in prokaryotic genomes. Curr Opin Microbiol 11:454-460.
Rocha, EPC. 2003. DNA repeats lead to the accelerated loss of gene order in Bacteria. Trends Genet 19:600-604.
Rocha, EPC. 2004. Order and disorder in bacterial genomes. Curr Opin Microbiol 7:519-527.
Rocha, EPC. 2006. Inference and Analysis of the Relative Stability of Bacterial Chromosomes. Mol Biol Evol 23:513-522.
Rocha, EPC, A Blanchard. 2002. Genomic repeats, genome plasticity and the dynamics of Mycoplasma evolution. Nucleic Acids Res 30:2031-2042.
Rocha, EPC, E Cornet, B Michel. 2005. Comparative and Evolutionary Analysis of the Bacterial Homologous Recombination Systems. PLoS Genet 1:e15.
Rocha, EPC, A Danchin, A Viari. 1999. Functional and evolutionary roles of long repeats in prokaryotes. Res Microbiol 150:725-733.