"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program." (http://www.repeatmasker.org/)
RepeatMasker is available to all OSC users without restriction.
The following versions of RepeatMasker are available on OSC systems:
On the Glenn Cluster RepeatMasker is accessed by executing the following commands:
module load biosoftw
module load RepeatMasker
RepeatMasker will be added to the user's PATH and can be run with the command:
RepeatMasker [-options] <seqfiles(s) in fasta format>
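A quick way to confirm that the modules loaded correctly is to check whether the RepeatMasker command is reachable through PATH. The sketch below assumes a POSIX shell; `have_cmd` is a hypothetical helper name used only for illustration.

```shell
# have_cmd is a hypothetical helper, not part of RepeatMasker or the module system.
# It succeeds when the named command can be found through PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd RepeatMasker; then
    echo "RepeatMasker found at: $(command -v RepeatMasker)"
else
    echo "RepeatMasker not on PATH; load the biosoftw and RepeatMasker modules first"
fi
```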
-h(elp)  Detailed help

Default settings are for masking all type of repeats in a primate sequence.

-pa(rallel) [number]  The number of processors to use in parallel (only works for batch files or sequences over 50 kb)
-s  Slow search; 0-5% more sensitive, 2-3 times slower than default
-q  Quick search; 5-10% less sensitive, 2-5 times faster than default
-qq  Rush job; about 10% less sensitive, 4->10 times faster than default (quick searches are fine under most circumstances)

repeat options

-nolow /-low  Does not mask low_complexity DNA or simple repeats
-noint /-int  Only masks low complex/simple repeats (no interspersed repeats)
-norna  Does not mask small RNA (pseudo) genes
-alu  Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]  Masks only those repeats < x percent diverged from consensus seq
-lib [filename]  Allows use of a custom library (e.g. from another species)
-cutoff [number]  Sets cutoff score for masking repeats when using -lib (default 225)
-species <query species>  Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be contained in the RepeatMasker repeat database. Some examples are:
  -species human
  -species mouse
  -species rattus
  -species "ciona savignyi"
  -species arabidopsis
Other commonly used species: mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu, danio, "ciona intestinalis" drosophila, anopheles, elegans, diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

-is_only  Only clips E coli insertion elements out of fasta and .qual files
-is_clip  Clips IS elements before analysis (default: IS only reported)
-no_is  Skips bacterial insertion element check
-rodspec  Only checks for rodent specific repeats (no repeatmasker run)
-primspec  Only checks for primate specific repeats (no repeatmasker run)

-gc [number]  Use matrices calculated for 'number' percentage background GC level
-gccalc  RepeatMasker calculates the GC content even for batch files/small seqs
-frag [number]  Maximum sequence length masked without fragmenting (default 40000, 300000 for DeCypher)
-maxsize [nr]  Maximum length for which IS- or repeat clipped sequences can be produced (default 4000000). Memory requirements go up with higher maxsize.
-nocut  Skips the steps in which repeats are excised
-noisy  Prints search engine progress report to screen (defaults to .stderr file)
-nopost  Do not postprocess the results of the run ( i.e. call ProcessRepeats). NOTE: This options should only be used when ProcessRepeats will be run manually on the results.

-dir [directory name]  Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
-a(lignments)  Writes alignments in .align output file; (not working with -wublast)
-inv  Alignments are presented in the orientation of the repeat (with option -a)
-lcambig  Outputs ambiguous DNA transposon fragments using a lower case name. All other repeats are listed in upper case. Ambiguous fragments match multiple repeat elements and can only be called based on flanking repeat information.
-small  Returns complete .masked sequence in lower case
-xsmall  Returns repetitive regions in lowercase (rest capitals) rather than masked
-x  Returns repetitive regions masked with Xs rather than Ns
-poly  Reports simple repeats that may be polymorphic (in file.poly)
-source  Includes for each annotation the HSP "evidence". Currently this option is only available with the "-html" output format listed below.
-html  Creates an additional output file in xhtml format.
-ace  Creates an additional output file in ACeDB format
-gff  Creates an additional Gene Feature Finding format output
-u  Creates an additional annotation file not processed by ProcessRepeats
-xm  Creates an additional output file in cross_match format (for parsing)
-fixed  Creates an (old style) annotation file with fixed width columns
-no_id  Leaves out final column with unique ID for each element (was default)
-e(xcln)  Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in the query
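The options listed above can be combined on a single command line. The following is a minimal sketch rather than a recommended configuration: the `run` wrapper and the input file names (`mouse_chr.fa`, `contigs.fa`) are hypothetical, and only flags from the option list are used. By default the wrapper only prints each command, so the flags can be reviewed before anything is executed.

```shell
# Dry-run helper: print the command by default, execute it only when DRY_RUN=0.
# "run" and the FASTA file names are hypothetical, chosen for illustration.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Quick search of a mouse sequence; repeats returned in lower case,
# output written to the current directory.
run RepeatMasker -q -species mouse -xsmall -dir . mouse_chr.fa

# Default (primate) settings on 4 processors, plus an extra GFF output file.
run RepeatMasker -pa 4 -gff contigs.fa
```

Setting `DRY_RUN=0` before running the script would execute the commands instead of printing them.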
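#PBS -N RepeatMasker_test
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=4

# Make RepeatMasker available on the PATH
module load biosoftw
module load RepeatMasker

# Copy the example genome into the working directory
cp /usr/local/biosoftw/bowtie-0.12.7/genomes/NC_008253.fna .

# Run on 4 processors, matching ppn=4 above
RepeatMasker -pa 4 NC_008253.fna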
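Once the job has finished, one simple sanity check is to measure how much of the sequence was masked. The sketch below assumes default masking (repeats replaced by Ns) and assumes the masked sequence is written to the query file name with `.masked` appended; `masked_fraction` is a hypothetical helper, not part of RepeatMasker. Ns that were already present in the input are counted as well, so the figure is an upper bound.

```shell
# masked_fraction is a hypothetical helper, not part of RepeatMasker.
# It prints the percentage of sequence characters that are N in a FASTA file,
# ignoring header lines that start with ">".
masked_fraction() {
    awk '
        /^>/ { next }                   # skip FASTA header lines
        {
            total += length($0)         # count every residue on this line
            masked += gsub(/N/, "")     # gsub returns how many Ns it removed
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * masked / total
        }
    ' "$1"
}

# Assumed output name: the query file name with ".masked" appended.
if [ -f NC_008253.fna.masked ]; then
    masked_fraction NC_008253.fna.masked
fi
```

If a run was made with -x or -xsmall, the pattern would need to match X or lower-case letters instead of N.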
The following commands result in errors:
RepeatMasker -w, RepeatMasker -de, RepeatMasker -e.