Microbial evolutionary genomics

Repeatoire: De novo detection and multiple alignment of repeat families

Section 1: Introduction
Repeatoire is a bioinformatics tool for locating DNA repeats in sequenced (but not necessarily assembled) genomes. Specifically, our focus is on repeat families, i.e. sets of related repeats that share a similar role and origin. A non-trivial question is what exactly constitutes a DNA repeat. On the surface, a repeat is a subsequence of a given genome that resembles another subsequence in the same genome. Because short matches arise by chance in any long sequence, one is usually interested in repeats that would be unexpected in a random assembly of the genetic text. The biology behind these repeats makes them considerably more interesting and complex. Biologically, repeats are a source of functional overlap and sequence recombination. Repeats may include entire genes or operons, in which case functional redundancy arises from the overlap of functions between the copies. Recombination also depends heavily on repeats, as most recombination processes require some level of similarity between the sequences involved. Our intention with Repeatoire is therefore to identify biologically relevant interspersed repeats that may be drivers of recombination, sources of functional overlap, or markers of other interesting phenomena.
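To make the surface definition of a repeat concrete, the following minimal Python sketch indexes every exact k-mer that occurs more than once in a sequence and estimates how many such coincidences a random sequence would produce. It is an illustration only, not Repeatoire's own seed-finding code: the function names, the restriction to exact forward-strand matches, and the uniform, independent base model are all assumptions made for this example.

```python
from collections import defaultdict


def repeated_kmers(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer that occurs more than once to its 0-based start positions.

    Hypothetical helper: exact, forward-strand matches only.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def expected_random_pairs(n: int, k: int) -> float:
    """Expected number of position pairs sharing an identical k-mer in a
    random sequence of length n, assuming independent, uniform bases."""
    m = n - k + 1                      # number of k-mer start positions
    return m * (m - 1) / 2 * 0.25 ** k


if __name__ == "__main__":
    genome = "ACGTTTACGTAAACGT"
    print(repeated_kmers(genome, 4))
    print(expected_random_pairs(1000, 10))
```

Comparing the observed number of repeated k-mers with the random expectation gives a rough sense of which repeats are unexpected by chance; approximate (inexact) repeats, which matter most biologically, need the alignment-based approach summarized in Section 3.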

Section 2: Download and License
We have implemented our method in a program, Repeatoire, which is available for Linux, Windows, and Mac OS X and can be downloaded from this website (see below). Daily development snapshots containing the latest changes can be found at: http://gel.ahabs.wisc.edu/mauve/snapshots. Source code can be downloaded using Subversion from the following repository: http://mauve.svn.sourceforge.net/svnroot/mauve/repeatoire/. The Mobyle web server can be accessed at: http://mobyle.pasteur.fr/cgi-bin/portal.py?form=repeatoire. License: GNU General Public License (GPL).

Section 3: Algorithm Summary
Pairwise local sequence alignment methods have been the prevailing technique for identifying homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions may contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) that predicts the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and of previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localization of the boundaries of homology. The described methods have been implemented in freely available software, Repeatoire.
Reference: Todd J. Treangen, Aaron E. Darling, Guillaume Achaz, Mark A. Ragan, Xavier Messeguer, Eduardo P.C. Rocha, "A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 2, pp. 180-189.
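The homology-filtering idea in the summary above can be sketched with a small two-state HMM. This is a simplified illustration under stated assumptions, not the model used in Repeatoire: it handles only an ungapped pairwise alignment, it takes its "homologous" emissions from the Jukes-Cantor model (chosen here as the simplest time-reversible substitution model), and the distance `d`, the self-transition probability `stay`, and all function names are hypothetical values picked for the example.

```python
import math


def jc_joint(a: str, b: str, d: float) -> float:
    """Joint probability of the aligned pair (a, b) under Jukes-Cantor,
    with d the expected number of substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e          # P(b == a | a)
    p_diff = 0.25 - 0.25 * e          # P(b == x | a) for each x != a
    return 0.25 * (p_same if a == b else p_diff)


def posterior_homology(x: str, y: str, d: float = 0.2, stay: float = 0.95) -> list[float]:
    """Per-column posterior probability of the 'homologous' state,
    computed with the scaled forward-backward algorithm.

    State 0 = homologous, state 1 = unrelated (independent uniform bases).
    """
    assert len(x) == len(y), "x and y must be aligned without gaps"
    n = len(x)
    emit = [(jc_joint(a, b, d), 0.25 * 0.25) for a, b in zip(x, y)]
    trans = [[stay, 1.0 - stay], [1.0 - stay, stay]]
    init = [0.5, 0.5]

    # Forward pass, rescaled at every column to avoid underflow.
    fwd, prev = [], None
    for t in range(n):
        if t == 0:
            row = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            row = [emit[t][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        z = sum(row)
        prev = [v / z for v in row]
        fwd.append(prev)

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in (0, 1))
               for s in (0, 1)]
        z = sum(row)
        bwd[t] = [v / z for v in row]

    # Combine and normalize per column.
    post = []
    for t in range(n):
        h = fwd[t][0] * bwd[t][0]
        u = fwd[t][1] * bwd[t][1]
        post.append(h / (h + u))
    return post


if __name__ == "__main__":
    x = "ACGTACGTACGTACGTACGT" + "AAAAAAAAAAAAAAAAAAAA"
    y = "ACGTACGTACGTACGTACGT" + "CGTCGTCGTCGTCGTCGTCG"
    for column, p in enumerate(posterior_homology(x, y)):
        print(column, round(p, 3))
```

In a sketch like this, columns whose posterior falls below a chosen cutoff would be trimmed from the extension, which is how the boundaries of homology get localized. Swapping `jc_joint` for the joint probabilities of another time-reversible model changes only the emission table.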

Section 4: User guide

Section 5: Download

Platform              Program binary
Linux 32 bit          repeatoire
Linux 64 bit          repeatoire
Mac OS X Universal
Windows               repeatoire

Main Contributors:
Todd J Treangen
Aaron Darling
Eduardo PC Rocha