Unit: Genomic analysis
Director: Ivan MOSZER
Platform 4, "Genomic Integration and Analysis" (PF4), covers a broad range of activities: assembly and initial annotation of new genome sequences, curation of annotated data collections, development of genomic database software, genome analysis and molecular phylogeny. A large fraction of this work is carried out in collaboration with other groups of the Institut Pasteur: research units, the Génopole, the computing center and the teaching department.
Assembly and annotation of new genome sequences (L. Frangeul, H. Nedjari, M. Brugnon)
We are developing the CAAT-Box software ("Contig-Assembly and Annotation Tool-Box"), a set of methods for tracking the assembly phases of a genome sequencing project and for starting the annotation phase as early as the finishing step (for further details about CAAT-Box, see previous activity reports). This software supports several collaborations with groups involved in large-scale sequencing projects; two prominent examples are described below.
The Unité de Génétique Moléculaire des Levures (B. Dujon) has undertaken the complete genome sequencing of Candida glabrata, in the framework of a Génopole project. We assembled the 180,000 sequences produced by both the Institut Pasteur Génopole (PF1 Genomics, C. Bouchier) and the Évry Génopole, obtaining over 700 contigs. In collaboration with PF1, we corrected the low-quality regions of these contigs and developed several strategies to close the gaps. We thus obtained 18 contigs (13 Mbases) that almost entirely span the 13 chromosomes. In collaboration with the Laboratoire Bordelais de Recherche en Informatique (Université Bordeaux 1, D. Sherman), CAAT-Box was used to annotate the C. glabrata genome and three other yeast genomes (Yarrowia lipolytica, Debaryomyces hansenii and Kluyveromyces thermotolerans), in the framework of the Génolevures 2 project.
Following the public release of the complete genome sequences of seven strains of Staphylococcus aureus (N315, Mu50, MW2, MRSA, MSSA, 8325, COL), we have undertaken, in collaboration with the Unité des Staphylocoques (N. El Solh), an exhaustive comparison of the 14,000 corresponding CDS. This allowed us to build a list of CDS of interest: genes that diverge between strains, or genes potentially involved in the pathogenicity of the bacterium. We then used CAAT-Box to determine the sequences of specific primers, in order to design a macro-array containing 465 amplicons corresponding to these CDS. Using this membrane, we studied the expression of these CDS in 80 clinical strains of S. aureus.
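The first step of such a comparison, finding CDS that are not shared by every strain, can be illustrated with a minimal Python sketch. It assumes that the CDS have already been clustered into groups of equivalent genes and that each strain is described by the set of group identifiers it contains. The function name, the group identifiers and the toy data are hypothetical, and this is not the procedure implemented in CAAT-Box.

```python
from collections import defaultdict


def variable_cds(presence):
    """Return the CDS groups that are missing from at least one strain.

    presence: dict mapping a strain name to the set of CDS group
    identifiers found in that strain (hypothetical input format).
    The result maps each variable group to the sorted list of
    strains that carry it.
    """
    carriers = defaultdict(set)
    for strain, groups in presence.items():
        for group in groups:
            carriers[group].add(strain)
    n_strains = len(presence)
    return {
        group: sorted(strains)
        for group, strains in carriers.items()
        if len(strains) < n_strains
    }


# Toy data: three strains and three made-up CDS groups.
toy_presence = {
    "N315": {"grpA", "grpB"},
    "Mu50": {"grpA", "grpC"},
    "COL": {"grpA", "grpB"},
}
toy_variable = variable_cds(toy_presence)
```

In this toy case, grpA is found in all three strains and is therefore excluded, whereas grpB and grpC are reported together with the strains that carry them. A real analysis would also have to account for sequence divergence within a group, which this presence/absence sketch ignores.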
Annotation and re-annotation (C. Boursaux-Eude, P. Dehoux, M. Pryor, O. Garcia)
We have completed the re-annotation of the genome sequence of Mycobacterium leprae, three years after the first submission. Each of the previously described coding sequences was re-evaluated and re-annotated in order to assign a new or more precise function where possible, based on the combined results of in silico analysis and literature searches. A large virulence plasmid (170 kb) identified in Mycobacterium ulcerans was also annotated.
In order to define global strategies for genome annotation, we organised a one-day workshop dedicated to this topic, gathering most of the scientists involved in such activities in France. We also evaluated several annotation tools, among them Genostar, a software package developed by a consortium of which the Institut Pasteur is a member.
Finally, we are a member of the international consortium that sequenced the genome of Anopheles gambiae, the main vector of malaria. In this framework, we are collaborating with the Unité de Biochimie et de Biologie Moléculaire des Insectes (P. Brey) to improve the accuracy of the genomic sequence and the identification of the proteins it encodes. We are also studying in silico membrane transporters that resemble drug-resistance proteins. These studies could help to decipher protein-toxin interactions and to improve insecticides.
GenoList: a bacterial multi-genome database (S. Moreira, L. Hummel, S. Grandino, P. Casel, H. Madaoui)
GenoList is a database dedicated to the query and analysis of genomic data from bacterial species. The previous version of GenoList was accessible through several independent single-genome Web servers. We thoroughly revised the conceptual schema of GenoList, integrating new information related to multi-genome features and modeling the data in a way that is both more efficient computationally and more realistic with respect to biological processes. We also designed a new query interface and added new tools dedicated to comparative genomic analysis. For example, subtractive genome analysis tools such as DiffTool and FindTarget will be directly accessible from within the GenoList interface. Throughout the development of the application, we pursue three main goals: a user-friendly presentation of the data, an intuitive browsing model, and analysis tools tightly integrated into the application and closely linked to the data.
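The general principle of a subtractive genome analysis can be sketched in a few lines of Python: keep the gene families shared by all genomes of interest, then remove those found in any genome to be excluded. The sketch assumes that genes have already been grouped into families by some similarity criterion. The function name and the family identifiers are hypothetical, and nothing here describes how DiffTool or FindTarget are actually implemented.

```python
def subtractive_set(include, exclude):
    """Families present in every 'include' genome and in no 'exclude' genome.

    include, exclude: lists of sets of gene family identifiers,
    one set per genome (hypothetical input format).
    """
    if not include:
        return set()
    shared = set.intersection(*include)  # present in all genomes of interest
    unwanted = set().union(*exclude)     # present in at least one excluded genome
    return shared - unwanted


# Toy data: two genomes of interest and one genome to subtract.
toy_targets = subtractive_set(
    include=[{"fam1", "fam2", "fam3"}, {"fam2", "fam3", "fam4"}],
    exclude=[{"fam3", "fam5"}],
)
```

Here fam2 and fam3 are shared by both genomes of interest, and fam3 is removed because it also occurs in the excluded genome. Working on precomputed families keeps the subtraction itself a plain set operation; the difficult part in practice is the similarity threshold used to define the families.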
GenoScript: a transcriptome database (S. Moreira, G. Lacourrège, L. Marisa)
GenoScript is a database dedicated to the submission, query and analysis of transcriptome experiments performed using DNA chip technology. We are currently developing the second version of this application, which brings many improvements: the submission process is more flexible, experiments can be modified after submission, data consistency is checked more efficiently, etc. (coll. G. Lacourrège, PF2 Puces à ADN, Génopole IP). The major improvement is the integration of statistical analysis tools (coll. L. Marisa, Unité de Génétique des Génomes Bactériens). We have connected the statistical environment R to GenoScript through a client-server architecture. An interface was built as a front-end that guides the user through the successive steps of the analysis and proposes relevant statistical methods. We also modified the database structure so that information specific to each project can be stored without side effects on other projects (e.g. glass slide vs. membrane experiments, eukaryotic vs. bacterial organisms). This work is partly motivated by requests from our collaborators (PF2, J.-Y. Coppée; Unité de Génétique des Génomes Bactériens, A. Danchin; Unité d'Immunologie Moléculaire des Parasites, P. David; Laboratoire de Génomique des Microorganismes Pathogènes, P. Glaser), and thus meets a strong need for such applications.
Phylogeny for genome exploration (C. Dauga, N. Beaume)
We are developing phylogenetic strategies to identify lateral gene transfers, genetic recombination events and gene duplications, based on the identification of gene evolution models and on the choice of the best approaches for extracting phylogenetic information from sequences. This requires the use of phylogenetic concepts and an understanding of the limits of phylogenetic tree-building methods.
A new project to detect lateral gene transfers between neighbouring species is being developed. Four phylogenetic tests (the Incongruence Length Difference test, two topology tests, Kishino & Hasegawa (KH) and Shimodaira & Hasegawa (SH), and a coevolution test) were evaluated on four samples of bacterial genes acquired by horizontal transfer. The KH test and the SH test both detected transfers between species from different domains as well as within-species transfers. The Incongruence Length Difference test and the coevolution test performed poorly and detected other evolutionary events. All tests were influenced by long branches in the phylogenetic trees. These tests are now being evaluated in the presence of other phylogenetic artifacts. The best tests will be selected and combined in order to automate transfer detection for whole-genome analysis.
In collaboration with H. de Reuse (Unité de Pathogénie Bactérienne des Muqueuses), we determined the evolutionary mode of amiE and amiF, two genes coding for aliphatic amidases in Helicobacter pylori. Phylogenetic studies and recent genomic data strongly suggest that these genes were acquired through genetic transfers. These strategies are also useful for molecular identification and for epidemiological studies of infectious diseases. We are collaborating with J. Raymond (Hôpital Saint Vincent de Paul) and A. Labigne (Unité de Pathogénie Bactérienne des Muqueuses) on the phylogeny of genes evolving by genetic drift and genetic recombination in H. pylori, in order to understand the mode of transmission within a family. We also collaborate with T. Garrigues and A.-B. Failloux (Unité Insectes et Maladies Infectieuses) to characterize a new sub-population of vipers (Vipera aspis aspis) with neurotoxic venom, and to study two variable genes of Aedes albopictus in order to determine the geographic origin of recently imported strains.
Metabolic reconstruction (D. Simon)
A PhD thesis was started on the definition of new strategies for the ab initio reconstruction of metabolic pathways and networks from annotated genomic data. Developed in the framework of a European project on functional genomics in Mycobacterium tuberculosis, the rules thus defined will first be applied to this organism and then generalised to other bacteria.
Teaching activities (C. Boursaux-Eude, C. Dauga, L. Frangeul)
Teaching is one of the missions of PF4:
Setting up an introduction to bioinformatics for the IP General Microbiology course;
Participation in the IP Genome Analysis course (data processing and phylogenetic analysis);
Participation in the IP Informatics in Biology course (supervision of two trainees);
Lectures and practical classes for the Bioinformatics curriculum of Paris 7 university;
Co-organisation and phylogeny lecture: ICRO-UNESCO Training Course - "Microbial Identification in Clinical and Environmental Settings", Kénitra and Rabat, Morocco;
Co-organisation and phylogeny lecture: European Training Workshop of Gut Health Support Action - "Molecular characterization of the human intestinal microbiota", Paris.
Keywords: annotation, databases, genome, transcriptome, phylogeny, metabolism