Unit: Genome Analysis and Integration (Platform)
Director: Ivan MOSZER
The activities of Platform 4 (PF4), "Genomic Integration and Analysis", are diverse: assembly and initial annotation of new genome sequences, curation of annotated data collections, software development for genomic databases, genome analysis, and molecular phylogeny. A large fraction of this work is performed in collaboration with other groups of the Institut Pasteur: research units, Genopole, computing center, and teaching department.
Assembly and annotation of new genome sequences (L. Frangeul, S. Bun)
We are developing the CAAT-Box software ("Contig-Assembly and Annotation Tool-Box"), which contains a set of methods for the follow-up of the assembly phases of a genome sequencing project, and for initiating the annotation phase before the completion of the full genome sequence (finishing stage).
Several collaborations with groups involved in large-scale sequencing projects are maintained through this software, such as the two prominent examples described below. We are participating in the genome sequencing project of the cyanobacterium Microcystis aeruginosa, in collaboration with the Unité des Cyanobactéries (N. Tandeau de Marsac) and the Genopole PF1 Genomics (C. Bouchier): finishing is in progress, but is hampered by the numerous repeats present in this genome. In collaboration with the Unité de Génétique Moléculaire des Levures (B. Dujon) and in the framework of the Génolevures 2 project, we are participating in the annotation of the genome of the yeast Debaryomyces hansenii, and in its comparison with the genome of Candida albicans.
Meanwhile, two additional modules are being developed in CAAT-Box: the first is intended for the systematic detection of gene repeats, while the second integrates various methods for predicting links between contigs, and displays a synthesis of the results together with the super-contigs obtained.
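As an illustration of how predicted links between contigs can be turned into super-contigs, the following minimal Python sketch groups contigs that are connected, directly or indirectly, by at least one predicted link. It is not the CAAT-Box implementation: the function name `build_supercontigs`, the contig identifiers, and the representation of links as simple pairs are assumptions made for the example, and the ordering and orientation of contigs within a super-contig are ignored.

```python
from collections import defaultdict


def build_supercontigs(contigs, links):
    """Group contigs into super-contigs (connected components of the link graph).

    contigs: iterable of contig identifiers (hypothetical names).
    links:   iterable of (contig_a, contig_b) pairs, each one a predicted link.
    """
    contigs = list(contigs)
    parent = {c: c for c in contigs}

    def find(c):
        # Walk up to the representative of the group, shortening the path on the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: five contigs and three predicted links.
contigs = ["c1", "c2", "c3", "c4", "c5"]
links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
supercontigs = build_supercontigs(contigs, links)
print(supercontigs)
```

In practice, each link would carry the method that predicted it and a confidence value, so that conflicting predictions can be filtered before grouping.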
Annotation of bacterial genomes (C. Boursaux-Eude)
We have started the annotation of the saprophytic bacterium Leptospira biflexa, with M. Picardeau (Laboratoire des Spirochètes) and the Genopole PF1 (C. Bouchier). We have chosen to use MaGe, developed by C. Médigue and colleagues at the Génoscope, as the annotation tool. The small chromosome (277.7 kbp) was annotated during the assembly phase, while annotation of the large chromosome is in progress. Of particular interest was the annotation of the genes potentially involved in iron metabolism. At the end of 2005, about 70% of the genome was annotated. Other projects for the annotation and re-annotation of bacterial genomes are being evaluated, such as the annotation of the genome of a Helicobacter pylori strain involved in MALT lymphoma.
Bacterial multi-genome database (P. Lechat, L. Hummel)
GenoList is a database dedicated to the query and analysis of genomic data from bacterial species (http://genolist.pasteur.fr/). The current version integrates genome data from 66 species (with associated plasmids). A user interface has been developed for querying and navigating the data, including sequence analysis tools (BLAST, pattern search) and subtractive genome analysis tools (DiffTool, FindTarget), so that the user can browse the data in a powerful and intuitive way. Recent work focused particularly on the rigorous import of genome annotation, and on the optimization of comparative analysis features, which are usually CPU-demanding. A collaboration with an industrial partner has led to the development of specific functions. Additional functionalities are being developed (e.g. circular and multi-genome graphical maps), and new genomes will be integrated according to the needs of biologist users.
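The general idea behind subtractive genome analysis can be sketched in a few lines of Python: keep the genes of one genome for which no homolog is reported in a set of other genomes. This is only an illustration of the principle, not a description of how DiffTool or FindTarget work; the function name `subtract_genomes`, the gene and genome identifiers, and the precomputed `hits` table (for instance derived from similarity searches) are all assumptions.

```python
def subtract_genomes(target_genes, hits, excluded_genomes):
    """Return the genes of the target genome with no homolog in any excluded genome.

    target_genes:     list of gene identifiers in the genome of interest.
    hits:             dict mapping a gene identifier to the set of genome names
                      in which a homolog was found (assumed to be precomputed).
    excluded_genomes: genome names that must NOT contain a homolog.
    """
    excluded = set(excluded_genomes)
    return [g for g in target_genes if not (hits.get(g, set()) & excluded)]


# Hypothetical data: three genes and the genomes where a homolog was detected.
target_genes = ["geneA", "geneB", "geneC"]
hits = {
    "geneA": {"genome2"},
    "geneB": {"genome3"},
    # geneC has no reported homolog in any other genome
}
specific = subtract_genomes(target_genes, hits, ["genome2"])
print(specific)
```

The expensive part of such an analysis is building the table of homologs, which is consistent with the comparative features being CPU-demanding; the subtraction itself is a simple set operation.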
Transcriptome database (S. Moreira, C. Laurent, P. Latouche)
We have developed GenoScript, a database dedicated to data obtained from DNA chip profiling experiments (http://genoscript.pasteur.fr). GenoScript allows the user to enter experiments and associated results, and to query and perform statistical analysis on the data. The interface of the database can be easily customized by adding specific information related to a particular project. The data can be queried either through predefined requests, or through an interface for building complex multicriteria queries. More recently, a new interface for visualizing results according to classifications (functional, metabolic, etc.) has been developed. A module for the statistical analysis of experiments with a "basic" design is available (coll. M.-A. Dillies, PF2 "DNA Chips", G. Guigon, PF8 "Santé Publique", and C. Laurent, Unité de Génétique des Génomes Bactériens), through the use of R functions from the Bioconductor library.
Genotypic biodiversity of parasitic strains (D. Dioum, S. Moreira)
In the framework of a Genopole project coordinated by R. Jambou (IP Senegal) and O. Puijalon (Unité d'Immunologie Moléculaire des Parasites), and involving several Instituts Pasteur from the Réseau International (Cambodia, Madagascar, French Guiana), we have developed a database dedicated to molecular epidemiology data. The underlying multicenter project aims to identify correlations between genotypic variations of Plasmodium strains and clinical criteria, especially strain sensitivity to anti-malarial drugs. The application contains a module for the entry of epidemiological, clinical, and molecular data (including sequences of target genes) through Web forms or Excel files, a module for the detection of polymorphisms, and a module for multicriteria queries. Results can be formatted and exported in order to perform further data analysis and generate graphical reports.
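To make the polymorphism detection step concrete, here is a minimal Python sketch that reports the positions at which a sample sequence differs from a reference sequence of a target gene. It is not the method used in the application: it assumes the two sequences are already aligned and of equal length, and the function name `find_polymorphisms` and the example sequences are hypothetical.

```python
def find_polymorphisms(reference, sample):
    """List substitutions between two aligned sequences of equal length.

    Returns a list of (position, reference_base, sample_base) tuples,
    with positions counted from 1.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i + 1, ref_base, smp_base)
        for i, (ref_base, smp_base) in enumerate(zip(reference, sample))
        if ref_base != smp_base
    ]


# Hypothetical aligned fragments of a target gene.
reference = "ATGCGT"
sample = "ATGAGT"
variants = find_polymorphisms(reference, sample)
print(variants)  # [(4, 'C', 'A')]
```

Each reported position could then be stored alongside the clinical and epidemiological record of the strain, which is what makes the multicriteria queries mentioned above possible.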
Phylogeny for genome exploration (S. Goupil, C. Dauga)
Our aim is to create original strategies for comparative genomics, based on phylogenetic concepts and methods. These strategies are dedicated to the identification of genes acquired by genetic transfers (xenologs) or generated by duplications (paralogs). Accurately identifying these genetic mechanisms helps biologists to recognize genes involved in the adaptation to new environments, and to better understand genome evolution. Several simulation programs were used first to obtain sequences evolving by random recombinations, and then to evaluate methods to detect and represent genetic exchanges. We are collaborating with J. Raymond (Hôpital Saint Vincent de Paul) and A. Labigne (Unité de Pathogénie Bactérienne des Muqueuses) to understand the genetics and transmission of H. pylori strains within a family, using phylogenetic trees based on highly variable genes and macroarray analysis.
Phylogenetic approaches for epidemiological studies of infectious diseases (X. Bebin, C. Dauga)
Recently, we have implemented strategies and tools to study the molecular identification, phylogeography and evolutionary dynamics of populations of insect vectors of infectious agents. In collaboration with A.-B. Failloux (Unité de Génétique Moléculaire des Bunyaviridae), we have studied the geographic origin of Aedes albopictus, a possible vector of the dengue virus recently introduced into France. With this research unit and the Réseau International des Instituts Pasteur, we are also contributing to an evolutionary ecology project for studying the vectors of the Rift Valley Hemorrhagic Fever.
Analysis of the Anopheles gambiae genome (P. Dehoux, E. Perlade, P. Lechat, C. Dauga)
We are a member of the international consortium that sequenced the genome of A. gambiae, the main vector of malaria. In this framework, we are collaborating with the Unité de Biochimie et de Biologie Moléculaire des Insectes (C. Roth, P. Brey - GPH "Anopheles") to improve its annotation, notably through the use of full-length cDNA libraries and in silico analysis of the proteome. We are particularly studying genes expressed in the salivary glands of the mosquito, and more generally secreted proteins of A. gambiae, which are potential targets for inhibiting parasite transmission. Finally, we are using phylogenetic methods to distinguish between orthologs and paralogs, in order to perform a comparative genomic study (with Drosophila melanogaster) of serine proteases, a superfamily of genes involved in anti-infectious responses and intracellular signaling in A. gambiae.
Metabolic pathway reconstruction (D. Simon)
We are developing a new approach for the ab initio reconstruction of metabolic pathways from annotated genomic data. The rationale consists in determining possible chains of reactions, then combining and filtering them according to various criteria in order to define realistic metabolic pathways. We thus do not rely on generic pathways defined for a limited number of organisms: in this way we expect to discover new alternative pathways. Developed in the framework of a European project on functional genomics in Mycobacterium tuberculosis, the rules thus defined are first applied to this organism, and then generalized to other bacteria.
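The chaining step can be illustrated by a small Python sketch that enumerates chains of reactions leading from one metabolite to another, where the product of each reaction is the substrate of the next. This is a simplified illustration and not the approach developed by the platform: each reaction is assumed to have a single substrate and a single product, the only filtering criterion is a maximum number of steps, and the function name `enumerate_pathways`, the reaction names, and the metabolite names are hypothetical.

```python
def enumerate_pathways(reactions, source, target, max_steps=4):
    """Enumerate chains of reactions leading from `source` to `target`.

    reactions: dict mapping a reaction name to a (substrate, product) pair.
    Returns a list of pathways, each one a list of reaction names.
    Chains longer than `max_steps` or revisiting a metabolite are discarded.
    """
    pathways = []

    def extend(metabolite, path, seen):
        if metabolite == target and path:
            pathways.append(list(path))
            return
        if len(path) == max_steps:
            return  # filtering criterion: chain too long
        for name, (substrate, product) in reactions.items():
            if substrate == metabolite and product not in seen:
                path.append(name)
                seen.add(product)
                extend(product, path, seen)
                seen.remove(product)
                path.pop()

    extend(source, [], {source})
    return pathways


# Hypothetical reaction set inferred from an annotated genome.
reactions = {
    "R1": ("A", "B"),
    "R2": ("B", "C"),
    "R3": ("A", "C"),
    "R4": ("C", "D"),
    "R5": ("B", "D"),
}
for pathway in enumerate_pathways(reactions, "A", "D", max_steps=3):
    print(" -> ".join(pathway))
```

With this toy reaction set, three alternative routes from A to D are found; additional filters (for example on cofactors or on the presence of the corresponding genes) would be applied at the point where the length check is made.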
Teaching activities (C. Boursaux-Eude, C. Dauga, P. Dehoux, L. Frangeul)
Co-organization of the IP Genome Analysis course (data processing and analysis);
Participation in the IP Informatics in Biology course (supervision of two trainees);
Lectures and class work for the Bioinformatics teaching program at Paris 7 University;
Phylogeny lecture and class work for the Master Specialized in Bioinformatics (Institut Informatique d'Entreprise - CNAM Essonne);
Phylogeny lecture for the Master Biology and Health, Specialty Environment and Infectious Agent Interactions (Évry & Versailles - St Quentin en Yvelines);
Participation in the IP Medical Entomology course.
Keywords: annotation, database, genome, transcriptome, biodiversity, phylogeny, metabolism