A common set of questions I receive is: how do I
- screen my sequence of interest against the repeatome of an organism.
- identify if my sequence of interest has new repeats.
These two questions can be answered from a biologist's perspective using existing tools. A computer scientist might find some interesting problems associated with these questions, but that is beyond the scope of this short post.
Professor Wikipedia gives a great definition of the term I just used, repeatome (sarcasm); however, it does link to a paper which may give you better background. In essence, the repeatome is the catalog of repetitive DNA sequences in a genome. The repeatome (no more italics, because you should know it by now) can be constructed using two paradigms:
- comparative biology, or using a database of known repeats to annotate the unknown sequence. The most common tool used is RepeatMasker.
- de novo, which uses some type of algorithm and k-mer frequency to find repetitive regions. The most common tool is RepeatModeler.
The two approaches are different but each yields a repeatome, although they won’t always agree. RepeatMasker uses a database, usually Repbase, and a mapping tool like BLAST to annotate the supplied sequence. RepeatModeler first identifies repeat motifs using various algorithms, then this database is supplied to RepeatMasker to annotate the genome.
Answers
The questions are very similar, and we can answer them using RepeatMasker and then RepeatModeler. First we want to know if the query overlaps with the existing repeatome. If the sequence is small, there is a web interface for RepeatMasker. If the sequence is large, then it must be run on the command line like so:
RepeatMasker -gff sequence.fasta
Assuming you have RepeatMasker and Repbase installed and configured this command will annotate “sequence.fasta” and place several files in the directory where the command was executed. The two important files are:
- sequence.fasta.masked
- sequence.fasta.out.gff
The masked file is the FASTA file with the repeats masked out. By default the repeats are replaced with N; run RepeatMasker with -xsmall if you want them in lowercase letters instead. The GFF file is the annotation file, which tells you which bases are considered to be repeats and what those repeats are. This should answer question 1. To discover new repeats for question 2, we need to use RepeatModeler.
If you wish to run RepeatModeler, the following commands will be used:
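If you just want a quick number out of the GFF file, something like the following may help. It is only a rough sketch: `summarize_repeats` is a name I made up, and it assumes the usual tab-separated GFF layout with start and end in columns 4 and 5. Overlapping annotations are counted twice, so treat the base count as an upper bound.

```shell
# Hypothetical helper: tally the annotations in a RepeatMasker GFF file.
# Assumes tab-separated GFF with start in column 4 and end in column 5.
# Overlapping features are counted more than once.
summarize_repeats() {
  awk -F'\t' '
    # skip comment/header lines, keep lines that have coordinates
    !/^#/ && NF >= 5 { n++; bases += $5 - $4 + 1 }
    END { printf "features=%d bases=%d\n", n, bases }
  ' "$1"
}
```

Call it with the GFF file that RepeatMasker wrote as the only argument. If it reports zero features, the query did not hit anything in the repeat database.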
/opt/RepeatModeler/BuildDatabase -name sequence sequence.fasta
RepeatModeler -database sequence
RepeatMasker -gff -lib RMXYZ/consensi.fa.classified sequence.fasta
The first command creates a database for the RepeatModeler tools. The second command executes the de novo repeat finding tools and generates a database of repeat motifs in the file “consensi.fa.classified”. This file is then used as the database for RepeatMasker. If there are entries in the GFF output by RepeatMasker, then those are the de novo repeats in your sequence.
Hope this helps!
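Before feeding the de novo library back into RepeatMasker, it can be handy to see what RepeatModeler actually found. The sketch below tallies the consensus sequences by class. `count_families` is a hypothetical name, and it assumes each FASTA header looks like `>name#Class/Family` followed by optional free text, which is how I have seen the classified library written; check your own file first.

```shell
# Hypothetical helper: count consensus sequences per classification.
# Assumes headers of the form ">name#Class/Family optional text".
# Headers without a "#" are reported as "unclassified".
count_families() {
  awk '
    /^>/ {
      split($1, a, "#")                       # a[2] is the text after "#"
      cls = (a[2] == "" ? "unclassified" : a[2])
      n[cls]++
    }
    END { for (c in n) print n[c], c }
  ' "$1" | sort -k1,1nr -k2,2                 # largest classes first
}
```

Run it as `count_families RMXYZ/consensi.fa.classified`, with RMXYZ replaced by your RepeatModeler output directory. Each output line is a count followed by the class name.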
