Computational genomics

Figure 1. Left panel B: RNA gel shifts demonstrating the formation of a pairing complex between the ncRNA RliI (one of the nine novel ncRNAs that we discovered in L. monocytogenes) and one of the targets that we predicted computationally, as shown in the right panel. The pairing score S is proportional to the strength of the pairing and the value of the pairing score between RliI and the lmo1035 mRNA clearly deviates from the random blue curve. Left panel C: In vivo demonstration of the regulati

Non-coding RNAs in bacteria

Non-coding RNAs (ncRNAs) have recently emerged as ubiquitous and versatile players in regulation, in both prokaryotes and eukaryotes. Their role appears increasingly relevant in situations that require a rapid response and adaptation to variable environmental and/or developmental conditions. In bacteria, a growing number of regulatory processes are being found to hinge on RNAs. In addition to antisense RNAs, thermosensors and riboswitches, which act in cis, a large class of ncRNAs acts in trans by pairing to mRNAs and regulating their degradation and/or translation rates. Several questions on the identification of ncRNAs and their functional role remain open and, in collaboration with P. Cossart and her group (Institut Pasteur), we are developing a project on these issues for the bacterial pathogen L. monocytogenes. The results obtained so far are briefly summarized below.

A set of 12 ncRNAs was identified using bioinformatic genomic screens based on the conservation of primary intergenic sequences, their folding properties and the analysis of orphan transcription terminators. Three of those ncRNAs are conserved in all bacteria (RnpB, SsrA and SsrS), while nine were previously unknown. Among the latter, five are absent from the non-pathogenic species Listeria innocua, suggesting a possible role in virulence, and one of them features a striking series of 29 nucleotide repetitions, well conserved and duplicated in the related species L. ivanovii. The transcription of the ncRNAs was confirmed by Northern blots, and 5’ RACE experiments were used to determine transcriptional starts and processing sites. About 30 additional ncRNAs were then found using tiling array data for L. monocytogenes. Furthermore, the data revealed differential expression of two ncRNAs (RliB, which we identified in [1], and a new one identified by tiling arrays) in the presence of human blood. Both ncRNAs affect the virulence of Listeria in mice, and we have predicted several targets for them. The putative targets are involved in iron metabolism and are thus promising candidates for a role in virulence. They are currently being tested experimentally.

The aspect that interests us most concerns the mechanisms that ensure the specificity of regulation by ncRNAs acting in trans and, more specifically, the prediction of the genes they target. This information helps to identify the functional role of ncRNAs, a major problem in the field: for the great majority of known ncRNAs no functional information is available, and their deletion and/or over-expression has no major effect in standard phenotypic tests. To predict the mRNA targets of ncRNAs, we developed a novel computational method, motivated by the fact that targets generally escape detection by standard alignment tools, e.g. BLAST or variants thereof. Indeed, the pairing between ncRNAs and their mRNA targets typically involves bulges and internal loops, which standard alignment methods do not handle. The new method computes a score that quantifies the quality of the best pairing between subsequences of the ncRNA and of the putative target sequence. Scores are based on thermodynamic pairing energies, and the cost of bulges and internal loops is calibrated on hybrids validated experimentally in vivo (DsrA and Spot42 in E. coli and RNAIII in S. aureus). Alignments are calculated efficiently using dynamic programming and, most importantly, the scoring system can deal with genomes of high AT content, e.g. > 60% in Listeria. The statistical significance of the results is assessed by measuring the null probability distribution of the scores obtained for random sequences with the same nucleotide transition probabilities as the real genome. In L. monocytogenes, three of the novel ncRNAs had significant mRNA targets, which we tested experimentally both in vivo, by overexpressing the corresponding ncRNA, and in vitro, by RNA gel shifts, to confirm the pairing. The predictions are in excellent agreement with the experiments, and the target prediction method that we developed appears to be an effective tool in the search for bacterial ncRNA functions.
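The scoring and significance steps described above can be illustrated with a minimal sketch. It is not the published method: the pair scores and the bulge and internal-loop costs below are placeholder values rather than thermodynamic energies, and the function names (pairing_score, markov_sample, empirical_p_value) are hypothetical. The sketch only shows the general idea of a local dynamic-programming alignment of antiparallel strands, with a null distribution drawn from random sequences that share the first-order transition frequencies of a reference sequence.

```python
import random

# Placeholder pair scores (assumed values, not the published parameters).
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3,
              ("A", "U"): 2, ("U", "A"): 2,
              ("G", "U"): 1, ("U", "G"): 1}
MISMATCH = -2  # assumed cost of an unpaired position in an internal loop
BULGE = -3     # assumed cost of one bulged nucleotide


def pairing_score(ncrna, mrna):
    """Best local antiparallel pairing score between two RNA sequences."""
    target = mrna[::-1]  # reverse the target so that the strands run antiparallel
    prev = [0] * (len(target) + 1)
    best = 0
    for a in ncrna:
        cur = [0]
        for j, b in enumerate(target, start=1):
            paired = prev[j - 1] + PAIR_SCORE.get((a, b), MISMATCH)
            cur.append(max(0, paired, prev[j] + BULGE, cur[j - 1] + BULGE))
        best = max(best, max(cur))
        prev = cur
    return best


def markov_sample(reference, length, rng):
    """Random sequence with the first-order transition frequencies of reference."""
    successors = {}
    for x, y in zip(reference, reference[1:]):
        successors.setdefault(x, []).append(y)
    seq = [rng.choice(reference)]
    while len(seq) < length:
        options = successors.get(seq[-1])
        # Fall back to single-nucleotide frequencies if no successor was observed.
        seq.append(rng.choice(options if options else reference))
    return "".join(seq)


def empirical_p_value(ncrna, mrna, reference, n_random=200, seed=0):
    """Observed score and the fraction of random targets scoring at least as high."""
    rng = random.Random(seed)
    observed = pairing_score(ncrna, mrna)
    hits = sum(
        pairing_score(ncrna, markov_sample(reference, len(mrna), rng)) >= observed
        for _ in range(n_random)
    )
    return observed, (hits + 1) / (n_random + 1)
```

In such a sketch the reference passed to empirical_p_value would be a stretch of the genome of interest, so that the random targets inherit its composition, for instance a high AT content.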

[1]. Identification of new noncoding RNAs in Listeria monocytogenes and prediction of mRNA targets. P. Mandin*, F. Repoila*, M. Vergassola*, T. Geissmann & P. Cossart. Nucleic Acids Research 35: 962-974, 2007 (*Equal contributions).

[2]. The Listeria transcriptional landscape from saprophytism to virulence. Toledo-Arana A, Dussurget O, Nikitas G, Sesto N, Guet-Revillet H, Balestrino D, Loh E, Gripenland J, Tiensuu T, Vaitkevicius K, Barthelemy M, Vergassola M, Nahori MA, Soubigou G, Régnault B, Coppée JY, Lecuit M, Johansson J, Cossart P. Nature 459: 950-956, 2009.

Codon bias

In [1] we developed a novel clustering method based on maximizing the information on the posterior probability distributions of codon usage in the various clusters. The number of clusters is selected by choosing the configuration with the maximum stability relative to a null model in which the codon counts for the genes are randomly generated from the posterior distribution for a single cluster. The method avoids the problems of general-purpose clustering methods because it weights the probability of the whole configuration of clusters and is not based on pairwise distances among genes. Applying the method to E. coli and B. subtilis shows that correlations in codon usage extend further than can be accounted for by operon-related constraints. In other words, genes with similar codon usage tend to be close on the chromosome. We have argued that, in addition to the known nucleotide correlations, a contribution to those correlations stems from selective pressure acting at the translation level on rare codons. A consequence is that the expression level of proteins should have a context-dependent contribution, i.e. it should depend on the genes in the chromosomal neighborhood.
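As an illustration of the single-cluster null model mentioned above, the following minimal sketch generates synthetic codon counts for each gene from a posterior built on the pooled counts. Several choices are assumptions made for the example and are not taken from [1]: a uniform Dirichlet prior, a single family of codons, one posterior draw per gene, and the hypothetical function names sample_dirichlet and null_codon_counts.

```python
import random
from collections import Counter


def sample_dirichlet(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def null_codon_counts(gene_counts, codons, prior=1.0, seed=0):
    """Synthetic per-gene codon counts under a single-cluster null model.

    gene_counts is a list of {codon: count} dictionaries, one per gene.
    Each synthetic gene keeps the total number of codons of the real gene.
    """
    rng = random.Random(seed)
    pooled = Counter()
    for counts in gene_counts:
        pooled.update(counts)
    # Posterior parameters: pooled counts plus the (assumed) uniform prior.
    alphas = [pooled[c] + prior for c in codons]
    synthetic = []
    for counts in gene_counts:
        n = sum(counts.get(c, 0) for c in codons)
        probs = sample_dirichlet(alphas, rng)  # assumed: one draw per gene
        drawn = Counter(rng.choices(codons, weights=probs, k=n))
        synthetic.append({c: drawn[c] for c in codons})
    return synthetic
```

Clustering the synthetic counts in the same way as the real ones would then give a baseline against which the stability of a configuration of clusters can be compared.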
Another aspect of codon bias, explored in [2], is the fact that some phages carry a few tRNA copies in their genome. This is intriguing, as phages rely on their bacterial hosts for translation. Is this due to random insertions of host DNA, or does it compensate for the codon bias of the host? In the former case, the tRNAs in phages should be of the most common type in their hosts; in the latter case, one would expect the tRNAs in phages to be the rarest in their hosts. In [2], we found that the tRNAs present in phages tend to correspond to codons that are highly used by the phage genes and, at the same time, rare in the host genome. We also analyzed the differences between temperate and virulent phages. Virulent phages contain more tRNAs than temperate ones, and they show higher codon usage biases and larger compositional differences with respect to the host genome.
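The comparison between phage and host codon usage can be sketched as follows. This is only an illustration of the idea and not the analysis carried out in [2]: the function names (codon_frequencies, phage_enriched_codons), the pseudocount and the log-ratio ranking are assumptions made for the example.

```python
import math
from collections import Counter

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_frequencies(coding_sequences, pseudocount=1.0):
    """Relative codon frequencies over a set of in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    # The pseudocount (an assumed value) avoids zero frequencies.
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def phage_enriched_codons(phage_cds, host_cds, top=5):
    """Codons ranked by log2(phage frequency / host frequency), highest first."""
    phage = codon_frequencies(phage_cds)
    host = codon_frequencies(host_cds)
    ratio = {c: math.log2(phage[c] / host[c]) for c in CODONS}
    return sorted(ratio, key=ratio.get, reverse=True)[:top]
```

Under the compensation hypothesis, the codons read by phage-encoded tRNAs would be expected near the top of such a ranking; under the random-insertion hypothesis they would instead match the most common host codons.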

[1]. Codon usage domains over bacterial chromosomes. M. Bailly-Bechet, A. Danchin, M. Iqbal, M. Marsili & M. Vergassola PLoS Computational Biology 2 (4): 263-275, 2006. Selected by Faculty 1000 Biology.

[2]. Causes for the intriguing presence of tRNAs in phages. M. Bailly-Bechet, M. Vergassola & E.P.C. Rocha. Genome Research 17: 1486-1495, 2007.

Future objectives in computational genomics

In addition to continuing our work on ncRNAs, we shall initiate a new project on computational methods for population genetics data. In particular, the program organized by M. Kreitman, L. Quintana-Murci and M. Vergassola at KITP in the fall of 2008 was an excellent occasion to get acquainted with the population genetics community and to identify relevant issues. Among them, computational methods for populations undergoing recombination seem particularly challenging and in need of basic contributions. The reconstruction of evolutionary histories poses a non-trivial challenge, which is currently dealt with by standard Monte Carlo methods and/or summary statistics that present rather strong analogies with problems and methods employed in statistical physics. Existing methods yield maps that give a good qualitative picture of recombination rates, but they are far from perfect and several points remain unexplained. In particular, several regions of the Drosophila genome that are currently tagged as low-recombination regions are characterized by relatively rapid evolutionary dynamics. It is unclear whether this is due to a limitation of the recombination rate estimates or to the presence of different genetic mechanisms, e.g. gene conversion. The project will be developed in collaboration with P. Andolfatto (Princeton).