"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program." (http://www.repeatmasker.org/)
Bioinformatics & Biology
"TreeBeST is an original tree builder for constrained neighbour-joining and tree merge, an efficient tool capable of duplication/loss/ortholog inference, and a versatile program facili- tating many tree-building routines, such as tree rooting, alignment filtering and tree plot- ting. TreeBeST stands for ‚Äò(gene) Tree Building guided by Species Tree‚Äô. It is previously known as NJTREE as the first piece of codes of this project aimed to build a eighbour-joining tree.
TreeBeST is the core engine of TreeFam (Tree Families Database) project initiated by Richard Durbin. The basic idea of this project is to build a full tree constrained by a manually verified seed tree. The tree builder must know how to utilize the prior knowledge provided by human experts. This demand disqualifies any existing softwares. Given this fact, we devised a new algorithm to control the joining step of traditional neighbour-joining. This is origin the constrained neighbour-joining.
When trees are built, they are only meaningful to biologists. Computers generate trees, but they do not understand them. To understand gene trees, a computer must be equipped with some biological knowledges, the species tree. It will teach a computer how to discriminate a speciation from a duplication event and how to find orthologs, provided a correct gene tree.
Unfortunately, gene trees are not always correct. Since the advent of UPGMA algorithm in 1958, we have tried to find a ideal model for nearly half a century. But we failed. Evolution is so complex a thing. A model best fits in one lineage might mean a disaster in another. A unified model is far from being discovered. TreeBeST aims at improving the accuracy of tree building, but it does not try to set up a new model in a traditional way. Instead, it integrates two existing models with the help of species tree, finding the subtree that best fits the models and merging them together to build a new tree incorporating the advantages of the both. This is the tree algorithm." (treebest.pdf)
“Proper identification of repetitive sequences is an essential step in genome analysis. The RECON package performs de novo identification and classification of repeat sequence families from genomic sequences. The underlying algorithm is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Specifically, our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. RECON should be useful for first-pass automatic classification of repeats in newly sequenced genomes.” (http://selab.janelia.org/recon.html)
"RAxML is a fast implementation of maximum-likelihood (ML) phylogeny estimation that operates on both nucleotide and protein sequence alignments."
mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. mpiBLAST takes advantage of distributed computational resources, i.e., a cluster, through explicit MPI communication and thereby utilizes all available resources unlike standard NCBI BLAST which can only take advantage of shared-memory multi-processors (SMPs).
"Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB (for unpaired alignment) or 2.9 GB (for paired-end or colorspace alignment). Multiple processors can be used simultaneously to achieve greater alignment speed. Bowtie can also output alignments in the standard SAM format, allowing Bowtie to interoperate with other tools supporting SAM, including the SAMtools consensus, SNP, and indel callers. Bowtie runs on the command line." (http://bowtie-bio.sourceforge.net/manual.shtml)
PAUP is a leading program for performing phylogenetic analysis for bioinformatics sequences. PAUP currently runs as a single processor program. No further enhancements are suggested.
MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.
Profile hidden Markov models (profile HMMs) can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. HMMER uses profile HMMs, and can be useful in situations like:
- if you are working with an evolutionarily diverse protein family, a BLAST search with any individual sequence may not find the rest of the sequences in the family.
- the top hits in a BLAST search are hypothetical sequences from genome projects.
- your protein consists of several domains which are of different types.
HMMER (pronounced 'hammer', as in a more precise mining tool than BLAST) was developed by Sean Eddy at Washington University in St. Louis.
HMMER is a very cpu-intensive program and is parallelized using threads, so that each instance of hmmsearch or the other search programs can use all the cpus available on a node. HMMER on OSC clusters are intended for those who need to run HMMER searches on large numbers of query sequences.
EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Also, as extensive libraries are provided with the package, it is a platform to allow other scientists to develop and release software in true open source spirit. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole.
Within EMBOSS you will find around hundreds of programs (applications) covering areas such as:
- Sequence alignment,
- Rapid database searching with sequence patterns,
- Protein motif identification, including domain analysis,
- Nucleotide sequence pattern analysis---for example to identify CpG islands or repeats,
- Codon usage analysis for small genomes,
- Rapid identification of sequence patterns in large scale sequence sets,
- Presentation tools for publication