Unit: Genome Analysis and Integration (Platform)
Director: Ivan MOSZER
Platform 4 (PF4), "Genomic Integration and Analysis", carries out a range of activities: assembly and initial annotation of new genome sequences, curation of annotated data collections, software development for genomic databases, genome analysis, and molecular phylogeny. A large fraction of this work is performed in collaboration with other groups of the Institut Pasteur: research units, the Genopole, the computing center, and the teaching department.
Assembly and annotation of new genome sequences (L. Frangeul, S. Bun)
We are developing the CAAT-Box software ("Contig-Assembly and Annotation Tool-Box"), a set of methods for monitoring the assembly phases of a genome sequencing project and for starting the annotation phase as early as the finishing step.
Several collaborations with groups involved in large-scale sequencing projects are maintained through this software; the two most prominent examples are described below. We are participating in the genome sequencing project of the cyanobacterium Microcystis aeruginosa, in collaboration with the Unité des Cyanobactéries (N. Tandeau de Marsac) and the Genopole PF1 Genomics (C. Bouchier): finishing is in progress, and the annotation of the first contigs is starting with CAAT-Box.
The Unité de Génétique Moléculaire des Levures (B. Dujon) has undertaken the complete genome sequencing of Candida glabrata, in collaboration with PF1 and PF4. This genome sequence was recently published, together with those of three other yeasts (Yarrowia lipolytica, Debaryomyces hansenii and Kluyveromyces thermotolerans), in the framework of the Génolevures 2 project. The sequences are still being annotated and checked by comparing their colinearity with a BAC library.
Annotation and re-annotation: tools, standards, and projects (C. Boursaux-Eude)
We are still evaluating the bioinformatics platform for exploratory genomics, GenoStar, especially the annotation module. Our comments are sent to the GenoStar consortium.
There are no official annotation rules to date. We are therefore trying to define a consistent annotation syntax, in order to make our internal annotation strategies more homogeneous. For instance, we found it relevant to add a number of qualifiers to the official "Feature Table" of the INSD, or to make existing ones more precise (e.g. to indicate the localization of a protein feature, as well as its type, identification method, etc.).
We have started a new collaboration on the genome sequencing project of the saprophytic bacterium Leptospira biflexa, with M. Picardeau (Laboratoire des Spirochètes) and the Genopole PF1 (C. Bouchier). We have chosen MaGe as the annotation tool; it is developed by C. Médigue and coll. at the Atelier de Génomique Comparative (Génoscope, Évry). The annotation will be compared with those of two pathogenic serovars of Leptospira interrogans, published in 2003 and 2004. Sequencing the saprophytic strain should lead to a better understanding of virulence mechanisms through comparative genomic studies. Furthermore, the genetic tools currently used for spirochetes have been developed in L. biflexa only.
Bacterial multi-genome database (P. Lechat, L. Hummel, P. Casel, S. Moreira)
GenoList is a database dedicated to the query and analysis of genomic data from bacterial species. The new version makes it possible to integrate any number of genomes. We first imported the genomes available in the mono-genome version of GenoList (http://genolist.pasteur.fr/), with the help of C. Jorge (Pôle Informatique). A procedure for rigorously importing other public genomes is being developed, together with a new query interface and new tools dedicated to comparative genomic analysis. For example, subtractive genome analysis tools such as DiffTool and FindTarget are directly accessible from within the GenoList interface. Throughout these developments, we aim for a tight integration of data and analysis tools, in order to create a powerful and intuitive browsing interface.
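As an illustration of what subtractive genome analysis means in general, the sketch below keeps the gene families found in one genome and absent from a set of others. It is not the DiffTool or FindTarget implementation, which this report does not describe; the function name, the genome names, and the family identifiers are all hypothetical, and it assumes that genes have already been grouped into families by some upstream similarity search.

```python
from typing import Dict, Iterable, Set


def subtract_genomes(
    gene_families: Dict[str, Set[str]],
    reference: str,
    excluded: Iterable[str],
) -> Set[str]:
    """Return the families present in `reference` and absent from every genome in `excluded`."""
    result = set(gene_families[reference])  # copy, so the input is not modified
    for name in excluded:
        result -= gene_families[name]
    return result


# Hypothetical toy data: genome name -> identifiers of the gene families it encodes.
toy = {
    "pathogen_A": {"fam1", "fam2", "fam3", "fam4"},
    "pathogen_B": {"fam1", "fam2", "fam4"},
    "saprophyte": {"fam1", "fam3"},
}

# Families of pathogen_A that the saprophyte lacks.
specific = subtract_genomes(toy, "pathogen_A", ["saprophyte"])
```

A real tool would work from sequence similarity thresholds rather than precomputed family labels; the set difference only conveys the principle.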
Transcriptome database (S. Moreira, C. Laurent)
We have developed GenoScript, a database dedicated to handling data from DNA chip profiling experiments. We have worked in particular on the experiment submission module, in order to facilitate the data entry process (from the initial culture to raw expression values). Furthermore, each user of the database can easily customize the interface by adding specific information related to their own project. We have also developed a module for the statistical analysis of experiments with a "basic" design (coll. M.-A. Dillies and G. Guigon, PF2 "DNA Chips", and C. Laurent, Unité de Génétique des Génomes Bactériens), through user-friendly interfaces that call R functions from the Bioconductor library (http://www.bioconductor.org/). Future developments will cover new methods for graphical data visualization and the creation of a new module for multicriteria queries.
Genotypic biodiversity of parasitic strains (D. Dioum, S. Moreira)
In the framework of a Genopole project coordinated by R. Jambou (IP Senegal) and O. Puijalon (Unité d'Immunologie Moléculaire des Parasites), and involving several Instituts Pasteur from the Réseau International (Cambodia, Madagascar, Guyana), we have developed a database dedicated to the storage and analysis of molecular epidemiology data. The underlying multicenter project aims to identify correlations between genotypic variations of Plasmodium strains and clinical criteria, especially strain sensitivity to antimalarial drugs. The first version of the application is currently being tested; it contains a module for the entry of epidemiological, clinical, and molecular data (including sequences of target genes) and a module for the detection of polymorphisms. Future developments will concern the integration of statistical and bioinformatics tools, and the creation of a new module for complex queries.
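The report does not say how the polymorphism detection module works. As a minimal sketch of the general idea, assuming the target gene sequences have already been aligned to a common length, one can scan the alignment column by column and report the positions where isolates carry different bases. The function name, isolate names, and sequences below are hypothetical.

```python
from typing import Dict, List, Tuple


def find_polymorphic_sites(alignment: Dict[str, str]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (1-based position, {isolate: base}) for each column with more than one base."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    length = lengths.pop()

    sites = []
    for pos in range(length):
        column = {name: seq[pos].upper() for name, seq in alignment.items()}
        # Gaps and ambiguity codes are ignored when counting distinct alleles.
        alleles = {base for base in column.values() if base in "ACGT"}
        if len(alleles) > 1:
            sites.append((pos + 1, column))
    return sites


# Hypothetical aligned fragments of a target gene from three isolates.
toy_alignment = {
    "isolate_1": "ATGCCA",
    "isolate_2": "ATGTCA",
    "isolate_3": "ATGCCG",
}
polymorphic = find_polymorphic_sites(toy_alignment)
positions = [pos for pos, _ in polymorphic]
```

The per-isolate bases returned for each site are what would then be compared with clinical criteria such as drug sensitivity.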
Phylogeny for genome exploration (C. Dauga, A.-L. Abraham, S. Goupil)
We are developing phylogenetic strategies based on the evolutionary modes of genes and the choice of the best approaches to extract phylogenetic information from sequences.
We are seeking to automate, for whole-genome analysis, phylogenetic tools that detect lateral gene transfers, duplications, and recombinations. The Incongruence Length Difference test, two topology tests (Kishino-Hasegawa, KH, and Shimodaira-Hasegawa, SH), and a coevolution test were evaluated for the detection of lateral gene transfers between closely related species. The resolution of these tests was also evaluated on duplications and hidden paralogies, using examples taken from the literature or sequences obtained through simulation. All tests were influenced by artifacts occurring when building phylogenetic trees, especially the presence of long branches. However, when used in combination, they proved relevant in every situation tested.
Our phylogenetic approaches are also useful for molecular identification and epidemiological studies of infectious diseases. We are collaborating with J. Raymond (Hôpital Saint Vincent de Paul) and A. Labigne (Unité de Pathogénie Bactérienne des Muqueuses) to build the phylogeny of genes from Helicobacter pylori, in order to understand the genetics and transmission of strains within a family. With T. Garrigues and A.-B. Failloux (Unité Insectes et Maladies Infectieuses), we have characterized a new population of vipers with neurotoxic venom (Vipera aspis aspis) from the South-East of France, and studied the geographic origin of Aedes albopictus, a possible vector of dengue fever recently introduced into France.
Analysis of the Anopheles gambiae genome (P. Dehoux, E. Perlade, C. Dauga)
We are a member of the international consortium that sequenced the genome of A. gambiae, the main vector of malaria. In this framework, we are collaborating with the Unité de Biochimie et de Biologie Moléculaire des Insectes (C. Roth, P. Brey) to improve its annotation. We are studying in silico the membrane transporters that resemble drug-resistance proteins, and the secretome of the mosquito. We have also started a phylogenetic study of serine proteases, a superfamily of genes involved in anti-infectious responses.
Metabolic pathway reconstruction (D. Simon, G. Meurice, L. Frangeul)
We are developing a new approach for the ab initio reconstruction of metabolic pathways from annotated genomic data. The principle is to determine the possible chains of reactions, then to combine and filter them according to various criteria in order to define realistic metabolic pathways. We therefore do not rely on generic pathways defined for a limited number of organisms; in this way we expect to discover new alternative pathways. The rules are being developed in the framework of a European project on functional genomics in Mycobacterium tuberculosis; they will be applied to this organism first, and then generalized to other bacteria.
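The chaining step can be pictured with a small sketch. It is not the method developed by the platform, whose rules and filtering criteria are not given here: it assumes, for simplicity, that each reaction converts a single substrate into a single product, and it uses a maximum chain length as the only filter. Reaction identifiers and compound names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Reaction = Tuple[str, str, str]  # (reaction id, substrate, product)


def enumerate_chains(
    reactions: List[Reaction], source: str, target: str, max_steps: int
) -> List[List[str]]:
    """List every chain of at most `max_steps` reactions leading from `source` to `target`."""
    by_substrate: Dict[str, List[Reaction]] = {}
    for reaction in reactions:
        by_substrate.setdefault(reaction[1], []).append(reaction)

    chains: List[List[str]] = []

    def extend(compound: str, path: List[str], visited: Set[str]) -> None:
        if compound == target and path:
            chains.append(list(path))
            return
        if len(path) == max_steps:  # filter: discard chains that are too long
            return
        for reaction_id, _, product in by_substrate.get(compound, []):
            if product in visited:  # never revisit a compound (no cycles)
                continue
            path.append(reaction_id)
            visited.add(product)
            extend(product, path, visited)
            path.pop()
            visited.remove(product)

    extend(source, [], {source})
    return chains


# Hypothetical reactions that an annotated genome is assumed to encode.
toy_reactions = [
    ("R1", "A", "B"),
    ("R2", "B", "C"),
    ("R3", "A", "C"),
    ("R4", "C", "D"),
    ("R5", "B", "D"),
]
all_chains = enumerate_chains(toy_reactions, "A", "D", max_steps=3)
short_chains = enumerate_chains(toy_reactions, "A", "D", max_steps=2)
```

Because the search starts from the reactions present in the genome rather than from reference pathway maps, alternative routes such as R1 followed by R5 appear alongside the longer one.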
We have also developed software that searches for protein motifs in the translation of large DNA sequences (whole chromosomes or sets of contigs). This tool, called AbIMoS, makes it possible to predict possible metabolic pathways without pre-existing annotations, even in unfinished genomes.
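To make the idea of searching motifs in translated DNA concrete, here is a minimal sketch that translates a sequence in its six reading frames with the standard genetic code and scans each translation with a regular expression. It is not the AbIMoS algorithm, which is not detailed in this report; the function names are hypothetical, and the motif used in the example is only an illustrative pattern.

```python
import re
from itertools import product
from typing import Iterator, Tuple

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_motif_hits(dna: str, motif: str) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (strand, frame, 1-based protein position, matched peptide) for each hit."""
    dna = dna.upper()
    pattern = re.compile(motif)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for match in pattern.finditer(protein):
                yield strand, frame + 1, match.start() + 1, match.group()


# Hypothetical contig encoding the peptide MGAAAAGKSL on the forward strand.
contig = "ATG" + "GGT" + "GCT" * 4 + "GGT" + "AAA" + "TCT" + "CTG"
hits = list(six_frame_motif_hits(contig, r"G.{4}GK[ST]"))
```

Since no gene prediction is needed beforehand, the same scan can be run on a set of unfinished contigs; positions are reported in protein coordinates of the given strand and frame.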
Teaching activities (C. Boursaux-Eude, C. Dauga, P. Dehoux, L. Frangeul)
Organization of one half of the IP Genome Analysis course (data processing and phylogenetic analysis);
Participation in the IP Informatics in Biology course (supervision of two trainees);
Lectures and class work in bioinformatics at Paris 7 University;
Phylogeny lecture and class work for the Master Specialized in Bioinformatics (Institut Informatique d'Entreprise - CNAM Essonne);
Phylogeny lecture for the Master in Biology, Specialty Hosts Infectious Agent Interactions (Évry & Versailles - St Quentin en Yvelines);
Participation in the IP Medical Entomology course.
Keywords: annotation, database, genome, transcriptome, biodiversity, phylogeny, metabolism