Microbial evolutionary genomics

Genome organization

All cellular processes interacting directly or indirectly with DNA affect and shape genome structure. The underlying molecular cause is that such processes impose constraints and/or lead to the selection of favourable configurations of genomic objects. Naturally, if two processes interact in the chromosome, then the affected regions are constrained by both processes and by their interaction, which requires fine-tuned organization. The resulting picture is that, at the crossroads between these interactions, genomes become highly organized by processes such as transcription, replication and segregation. A classical example of organization by transcription is the grouping of functionally related genes into operons. Longer-range organizational levels, resulting from the dynamic interaction of the chromosome and gene expression with the cell factory, involve the association between gene expression and cell compartmentalization, chromosome segregation or cell differentiation. In E. coli, genes involved in sulphur metabolism cluster together, possibly to allow compartmentalization of this metabolism and of its toxic intermediates. Translation-related genes also cluster at a supra-operonic level in many bacterial genomes, and genomic islands frequently contain genes involved in pathogenicity and antibiotic resistance, but also in symbiosis and metabolic pathways. This implies that a proper description of prokaryotic genome organisation must include supra-operonic organisation, and it is consistent with the neighbourhood conservation of genes in different operons between distantly related genomes. Our major aim is to understand how the interaction between cellular processes shapes chromosome organisation.
 

Figure 1- Elements associated with the organization of the bacterial chromosome. Green and red distinguish the leading and the lagging strands. Ori and Ter mark the origin and terminus of replication; encircled arrows indicate the direction of replication fork progression. Besides these elements, several mutational biases related to replication have been described in bacterial genomes (reviewed in Rocha, Annu Rev Genetics, 08).

Gene strand bias

Chromosomal replication often takes place during periods of intense transcription, so collisions between the DNA and RNA polymerases (DNAP and RNAP) are inevitable. Although the RNAP transcription rate varies with the growth phase, it is usually in the range of 40-50 nt/s, about 20 times slower than DNAP in E. coli. Transcription of genes coded in the lagging strand leads to head-on collisions, whereas transcription of leading strand genes leads to co-oriented collisions (Figure 2). Because the probabilities and consequences of these two kinds of collision differ, genes are distributed asymmetrically between the two strands.
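To make the orders of magnitude concrete, the following minimal sketch derives the DNAP speed implied by the figures quoted above (RNAP at 40-50 nt/s, about 20 times slower than DNAP) and the speed at which the two polymerases approach each other in each orientation. It assumes both enzymes move at constant speed, which is a simplification; the function and constant names are hypothetical.

```python
# Figures quoted in the text: RNAP at 40-50 nt/s, about 20 times slower than DNAP.
RNAP_SPEEDS_NT_PER_S = (40.0, 50.0)
DNAP_TO_RNAP_RATIO = 20.0


def closing_speeds(rnap_speed, ratio=DNAP_TO_RNAP_RATIO):
    """Return (DNAP speed, head-on closing speed, co-oriented closing speed).

    Assumption of this sketch: both polymerases move at constant speed.
    """
    dnap_speed = ratio * rnap_speed
    head_on = dnap_speed + rnap_speed      # polymerases move towards each other
    co_oriented = dnap_speed - rnap_speed  # the fork catches up with RNAP from behind
    return dnap_speed, head_on, co_oriented


for v in RNAP_SPEEDS_NT_PER_S:
    dnap, head_on, co = closing_speeds(v)
    print(f"RNAP {v:.0f} nt/s: DNAP {dnap:.0f} nt/s, "
          f"head-on {head_on:.0f} nt/s, co-oriented {co:.0f} nt/s")
```

In this simple picture the fork always catches up with a co-oriented RNAP, and the closing speeds in the two orientations differ by only about 10%.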

 
Figure 2- Differential outcome of DNAP and RNAP collisions when genes are transcribed in the leading or the lagging strands (from Rocha, Microbiology, 04).

The frequency of leading strand genes is ~75% in B. subtilis and ~55% in E. coli. On average, 78% of the genes of Firmicutes (including Mycoplasmas) are in the leading strand, compared with 58% for the other genomes. Interestingly, we have found that the group with higher biases coincides with the group of genomes containing two different (and probably strand-dedicated) DNAP α-subunits at the replication fork (PolC and the homologue of E. coli DnaE). This suggests that compositionally different DNAPs are correlated with different levels of gene strand bias, but also that expression levels can hardly account for all the observed trends in gene strand bias. The number of genes that are highly expressed in bacteria is typically low and cannot justify the high frequency of leading strand genes in Firmicutes. Furthermore, according to the polymerase collision model, gene strand bias should be higher in fast-growing bacteria, where transcription and replication (hence collisions) are very frequent and where fast growth is an important component of global fitness. However, the correlation between gene strand bias and minimal growth rates is not significant. We have shown that expression is not a determinant of gene strand bias in B. subtilis and E. coli. In B. subtilis, the frequencies of leading strand genes among essential genes (96%) and non-essential genes (74%) are very different, and this difference is independent of expression levels (Figure 3). Qualitatively similar results are found in E. coli when high expression is defined using codon usage biases, transcriptome or proteome data. These results seem to hold in most other bacterial genomes when essentiality is assigned by homology. In all cases, essential genes are more biased than non-essential genes, and among non-essential genes there is rarely a significant effect attributable to expression level. Finally, when comparing the location of orthologues in closely related genomes, essential genes are conserved in the leading strand more often than the other genes.
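As an illustration of how a leading strand frequency can be computed, the sketch below classifies genes on a circular chromosome from their position, their coding strand and the positions of the origin and terminus. It is a minimal sketch, not the procedure used in the cited papers: it assumes bidirectional replication from a single origin, coordinates that increase clockwise, and a gene position given by a single coordinate. The `Gene` class, the function names and the toy genes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    position: int  # a single coordinate on the published (+) strand
    strand: str    # '+' or '-'


def on_clockwise_replichore(position, ori, ter, genome_length):
    """True if position lies on the arc going from ori to ter by increasing coordinates."""
    return (position - ori) % genome_length < (ter - ori) % genome_length


def is_leading(gene, ori, ter, genome_length):
    """A gene is on the leading strand when it is transcribed in the direction of the fork.

    On the clockwise replichore the fork moves towards increasing coordinates,
    so '+' genes are co-oriented with it; on the other replichore '-' genes are.
    """
    clockwise = on_clockwise_replichore(gene.position, ori, ter, genome_length)
    return (gene.strand == "+") == clockwise


def leading_fraction(genes, ori, ter, genome_length):
    genes = list(genes)
    if not genes:
        raise ValueError("no genes given")
    n_leading = sum(is_leading(g, ori, ter, genome_length) for g in genes)
    return n_leading / len(genes)


# Toy chromosome of 1000 nt with ori at 0 and ter at 500 (made-up values).
toy_genes = [
    Gene("a", 100, "+"),  # clockwise replichore, co-oriented: leading
    Gene("b", 300, "-"),  # clockwise replichore, head-on: lagging
    Gene("c", 700, "-"),  # other replichore, co-oriented: leading
    Gene("d", 900, "-"),  # other replichore, co-oriented: leading
]
print(leading_fraction(toy_genes, ori=0, ter=500, genome_length=1000))
```

The modular arithmetic lets the same code handle genomes whose origin is not at coordinate 0.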


Figure 3- Distribution of genes in the leading (black bar) and lagging (white bar) strands in the genome of B. subtilis, classed according to essentiality and expressiveness (from Rocha, Nature Genetics, 03).
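The kind of stratification summarised in Figure 3 can be sketched as a tabulation of the leading strand frequency within each combination of essentiality and expression class. The code below is only an illustrative sketch: the record layout and function name are hypothetical, the toy records are made up and are not the B. subtilis data, and no significance test is included.

```python
from collections import defaultdict


def leading_fraction_by_class(records):
    """Fraction of leading strand genes in each (essential, highly expressed) class.

    records: iterable of (is_leading, is_essential, is_highly_expressed) booleans.
    Returns a dict keyed by (is_essential, is_highly_expressed).
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [leading genes, all genes]
    for is_leading, is_essential, is_highly_expressed in records:
        cell = counts[(is_essential, is_highly_expressed)]
        cell[0] += int(is_leading)
        cell[1] += 1
    return {key: leading / total for key, (leading, total) in counts.items()}


# Made-up records, for illustration only.
toy_records = (
    [(True, True, True)] * 4
    + [(True, True, False)] * 3 + [(False, True, False)]
    + [(True, False, True)] * 2 + [(False, False, True)] * 2
    + [(True, False, False)] * 2 + [(False, False, False)] * 2
)

for (essential, high), fraction in sorted(leading_fraction_by_class(toy_records).items()):
    print(f"essential={essential!s:5} highly_expressed={high!s:5} leading={fraction:.2f}")
```

Comparing rows that share the same essentiality class shows the effect of expression alone, and comparing rows that share the same expression class shows the effect of essentiality alone.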

If the major problem associated with collisions between polymerases were replication slow-down, then one would expect higher biases among the genes leading to higher collision rates, i.e. among highly expressed genes. Since this is not what we observe, the problem of collisions seems to be related not to their rate (i.e. to expression levels), but to the function of the gene being expressed at the moment of collision. Gene strand bias for essential genes has been interpreted as resulting from selection against mutagenesis caused by replication fork restart by homologous recombination. Yet, we have shown that highly expressed genes, not essential genes, are the least tolerant to mutagenesis. Hence, if selection acted on avoidance of local mutagenesis, it would result in gene strand bias of highly expressed genes, not of essential genes. In co-oriented collisions the transcript may be finished, whereas head-on collisions result in aborted transcripts (Figure 2). The latter may be translated into truncated non-functional peptides, which is particularly deleterious for essential functions. If this model proves correct, it suggests that truncated transcripts may be more poisonous than previously thought, in spite of the ubiquitous presence of tmRNA systems. Alternatively, collisions might increase stochastic noise in gene expression, and such noise is more deleterious if genes are essential. This last factor will be particularly important if head-on collisions lead to more frequent replication fork arrests, as these will render the genomic locus unavailable for transcription for a substantial period of time.

Compositional strand bias

During the synthesis of the Okazaki fragments, the leading strand is kept single-stranded while the newly formed lagging strand is being synthesised. In contrast, the lagging strand stays double-stranded while the newly formed leading strand is being synthesised. Some types of mutations are more frequent in ssDNA, which leads to compositional differences between the replicating strands. Around 90% of bacterial genomes present such biases. Genes that switch between replicating strands following chromosomal rearrangements evolve faster and quickly acquire the composition of the new replicating strand. Thus, strand bias is probably neutral and evolves fast.
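Compositional differences between the replicating strands are commonly summarised by the GC skew, (G - C)/(G + C), computed in windows along the published strand. The following is a minimal sketch of that calculation; the function names, window sizes and toy sequence are hypothetical, and returning 0.0 for windows without G or C is a convention chosen here.

```python
def gc_skew(sequence):
    """GC skew (G - C) / (G + C) of a sequence; 0.0 if it contains no G or C."""
    sequence = sequence.upper()
    g = sequence.count("G")
    c = sequence.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc_skew(sequence, window, step):
    """GC skew in successive windows of fixed size along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        gc_skew(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Toy sequence: G-rich first half, C-rich second half.
toy_sequence = "GGGCGGAT" + "CCCGCCAT"
print(windowed_gc_skew(toy_sequence, window=8, step=8))
```

Applied to a whole chromosome, the sign of the windowed skew typically changes near the origin and the terminus of replication, because the published strand is leading on one replichore and lagging on the other.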

Many hypotheses have been put forward to explain replication-associated compositional strand bias, but none seems entirely satisfactory. In the light of our recent results using very extensive genome alignments, the somewhat paradoxical explanation is that this ubiquitous bias is multi-factorial. One should note that the most frequently cited reason for compositional strand bias, cytosine deamination in ssDNA, could explain a large fraction of strand bias in four out of seven genomes if it accounts for all or a large fraction of C->T substitution asymmetries (Figure 4). Yet, it totally fails to explain the bias in the other three genomes, where C->T changes are symmetric. This is all the more significant because two of these three genomes are those showing the strongest GC skews. The seemingly inevitable conclusion is that an apparently homogeneous compositional bias (GC skew), grounded in a fundamental and highly conserved cellular process (replication), can still have a multi-factorial origin in which each factor has very different relevance in different genomes. A puzzling remaining question is then why all these different mutational biases lead to higher GC skew in the leading than in the lagging strand in so many diverse genomes.
 

Figure 4 - Difference between the pairs of symmetric substitutions in different genomes. +/- : significantly different from 0 (p<0.05) (from Rocha, Genome Research, 06).
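The quantity plotted in Figure 4 compares each substitution with its complementary counterpart: in the absence of strand bias, the rate of X->Y on a given strand should equal the rate of the complementary change, so C->T is paired with G->A, and so on for six pairs in total. The sketch below enumerates these pairs and computes their differences. It is only an illustration under stated assumptions: the rates are taken as given for one strand, the toy values are made up, no significance test is performed, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def symmetric_pairs():
    """The six pairs (substitution, complementary substitution), e.g. C->T with G->A."""
    pairs = []
    seen = set()
    for src in "ACGT":
        for dst in "ACGT":
            if src == dst:
                continue
            mirror = (COMPLEMENT[src], COMPLEMENT[dst])
            if mirror in seen:
                continue  # this pair was already listed from its other member
            seen.add((src, dst))
            pairs.append(((src, dst), mirror))
    return pairs


def pair_differences(rates):
    """rate(X->Y) - rate(comp(X)->comp(Y)) for each pair; rates is keyed by (src, dst)."""
    return {
        f"{a}->{b} vs {c}->{d}": rates[(a, b)] - rates[(c, d)]
        for (a, b), (c, d) in symmetric_pairs()
    }


# Made-up rates: every substitution equally frequent except an excess of C->T.
toy_rates = {(s, d): 1.0 for s in "ACGT" for d in "ACGT" if s != d}
toy_rates[("C", "T")] = 1.5

for label, difference in pair_differences(toy_rates).items():
    print(f"{label}: {difference:+.2f}")
```

With these toy rates only the C->T versus G->A pair departs from zero, which is the pattern expected if cytosine deamination were the sole source of asymmetry.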

Relevant references from the lab

Rocha, EPC, A Danchin, A Viari. 1999. Universal replication bias in bacteria. Mol Microbiol 32:11-16.
Rocha, EPC, A Sekowska, A Danchin. 2000. Sulphur islands in the Escherichia coli genome: markers of the cell’s architecture? FEBS Lett 476:8-11.
Rocha, EPC, A Danchin. 2001. Ongoing evolution of strand composition in bacterial genomes. Mol Biol Evol 18:1789-1799.
Rocha, EPC. 2002. Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes? Trends Microbiol 10:393-396.
Rocha, EPC, A Danchin. 2003a. Essentiality, not expressiveness, drives gene strand bias in bacteria. Nat Genet 34:377-378.
Rocha, EPC, A Danchin. 2003b. Gene essentiality as a determinant of chromosomal organization in Bacteria. Nucleic Acids Res 31:6570-6577.
Rocha, EPC, J Fralick, G Vediyappan, A Danchin, V Norris. 2003. A strand-specific model for chromosome segregation in bacteria. Mol Microbiol 49:895-903.
Rocha, EPC. 2004. The replication-related organisation of the bacterial chromosome. Microbiology 150:1609-1627.
Rocha, EP, M Touchon, EJ Feil. 2006. Similar compositional biases are caused by very different mutational effects. Genome Res 16:1537-1547.
Touchon, M, EP Rocha. 2007. From GC skews to wavelets: A gentle guide to the analysis of compositional asymmetries in genomic data. Biochimie.
Rocha, EPC. 2008. The organisation of the bacterial genome. Annu Rev Genet 42:211-233.