mpiBLAST is a freely available, open-source, parallel implementation of NCBI BLAST. It uses explicit MPI communication to spread a search across distributed computational resources such as a cluster, whereas standard NCBI BLAST can only take advantage of shared-memory multiprocessors (SMPs).
Availability & Restrictions
mpiBLAST is available without restriction to all OSC users.
The following versions of mpiBLAST are available on OSC systems:
To load the mpiBLAST software on the Glenn system, use the following commands:
module load biosoftw
module load mpiblast
On the Oakley system, use the following command:
module load mpiblast
Once the mpiblast module is loaded, the following commands are available:
mpiblast
mpiblast_cleanup
mpiformatdb
Formatting a database
Before processing BLAST queries, the sequence database must be formatted with mpiformatdb. The command line syntax looks like this:
mpiformatdb -N 16 -i nt -o T
The above command would format the nt database into 16 fragments. Note that currently mpiformatdb does not support multiple input files.
mpiformatdb places the formatted database fragments in the same directory as the FASTA database. To specify a different target location, use the "-n" option, as in NCBI formatdb.
Querying the database
The mpiblast command line syntax is nearly identical to that of NCBI's blastall program. Running a query with 18 processes would look like:
mpiexec -n 18 mpiblast -p blastn -d nt -i blast_query.fas -o blast_results.txt
The above command would query the sequences in blast_query.fas against the nt database and write the results to blast_results.txt in the current working directory. By default, mpiBLAST reads configuration information from ~/.ncbirc. mpiBLAST needs at least 3 processes to perform a search: two processes are dedicated to scheduling tasks and coordinating file output, and any additional processes perform the actual search tasks.
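The process accounting described above can be checked before a job is launched. The following is a minimal sketch, assuming only the rule stated here (two processes are reserved, the rest are workers); the function name mpiblast_workers is hypothetical and is not part of the mpiBLAST package.

```shell
# Hypothetical helper (not shipped with mpiBLAST): given the value passed
# to "mpiexec -n", print how many processes are left to act as workers.
# Assumes the rule stated above: 2 processes are reserved for scheduling
# and output, and at least 3 processes are required in total.
mpiblast_workers() {
    mw_nprocs=$1
    if [ "$mw_nprocs" -lt 3 ]; then
        echo "mpiBLAST needs at least 3 processes, got $mw_nprocs" >&2
        return 1
    fi
    echo $(( mw_nprocs - 2 ))
}

# The 18-process query shown above leaves 16 workers.
mpiblast_workers 18
```

Running the helper with a value below 3 prints a message on standard error and returns a non-zero status, so it can guard an mpiexec call in a script.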
Extra options to mpiblast
Enable hierarchical scheduling with multiple masters. The partition size equals the number of workers in a partition plus 1 (the master process). For example, a partition size of 17 creates partitions consisting of 16 workers and 1 master. An individual output file will be generated for each partition. By default, mpiBLAST uses one partition. This option is only available for version 1.6 or above.
Specify how database fragments are replicated within a partition. Suppose the total number of database fragments is F, the number of MPI processes in a partition is N, and the replica-group-size is G, then in total (N-1)/G database replicas will be distributed in the partition (the master process does not host any database fragments), and each worker process will host F/G fragments. In other words, a database replica will be distributed to every G MPI processes.
Specify the number of query sequences that will be fetched from the supermaster to the master at a time. The default value is 5. This parameter controls the granularity of load balancing between different partitions. This option is only available for version 1.6 or above.
Enable the high-performance parallel output solution. Note the current implementation of parallel-write does not require a parallel file system.
Enable workers to cache database fragments in memory instead of local storage. This is recommended on diskless platforms where no local storage is attached to each processor. Enabled by default on Blue Gene systems.
Distribute database fragments to workers before the search begins. Especially useful in reducing data input time when multiple database replicas need to be distributed to workers.
Enable output of the search statistics in the pairwise and XML output format. This could cause performance degradation on some diskless systems such as Blue Gene.
Removes the local copy of the database from each node before terminating execution.
Sets the method of copying files that each worker will use. Default = "cp"
- cp : use standard file system "cp" command. Additional option is --concurrent.
- rcp : use rsh "rcp" command. Additional option is --concurrent.
- scp : use ssh "scp" command. Additional option is --concurrent.
- mpi : use MPI_Send/MPI_Recv to copy files. Additional option is --mpi-size.
- none : do not copy files, instead use shared storage as local storage.
Produces verbose debugging output for each node, optionally logs the output to a file.
Reports execution time profile.
Print the mpiBLAST version.
Please refer to the README file in the mpiBLAST package for a performance tuning guide.
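The replica arithmetic given above for the replica-group-size option (F fragments, N processes per partition, group size G) can be worked through with a short sketch. The function name replica_layout is hypothetical, and the requirement that G divide both N-1 and F evenly is an assumption made to keep the example simple; the text above does not say how remainders are handled.

```shell
# Hypothetical helper (not part of mpiBLAST): apply the formulas above.
#   replicas per partition = (N - 1) / G   (the master hosts no fragments)
#   fragments per worker   = F / G
# Assumption: G divides both N-1 and F evenly; other cases are rejected.
replica_layout() {
    rl_f=$1   # F: total number of database fragments
    rl_n=$2   # N: MPI processes in the partition, master included
    rl_g=$3   # G: replica group size
    rl_workers=$(( rl_n - 1 ))
    if [ "$rl_g" -le 0 ] || [ $(( rl_workers % rl_g )) -ne 0 ] || [ $(( rl_f % rl_g )) -ne 0 ]; then
        echo "expected G > 0 dividing both N-1 and F evenly" >&2
        return 1
    fi
    echo "replicas=$(( rl_workers / rl_g )) fragments_per_worker=$(( rl_f / rl_g ))"
}

# A partition of size 17 (16 workers plus 1 master), 16 fragments, G = 4.
replica_layout 16 17 4
```

With these inputs each group of 4 workers holds one full copy of the 16 fragments, 4 fragments per worker, and the partition holds 4 such copies.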
Removing a database
The --removedb command line option causes mpiBLAST to do all work in a temporary directory that is removed from each node's local storage directory upon successful termination. For example:
mpiexec -n 18 mpiblast -p blastx -d yeast.aa -i ech_10k.fas -o results.txt --removedb
The above command would perform an 18-process (16 worker) search of the yeast.aa database, writing the output to results.txt. Upon completion, worker nodes would delete the yeast.aa database fragments from their local storage.
Databases can also be removed without performing a search in the following manner:
mpiexec -n 18 mpiblast_cleanup
Below is a sample batch script for running an mpiBLAST job. It asks for 12 processors (one node with 12 cores) and 30 minutes of walltime.
#PBS -l walltime=30:00
#PBS -l nodes=1:ppn=12
#PBS -N mpiBLAST
#PBS -j oe

cp /usr/local/mpiblast/1.6.0/.ncbirc ./
module load mpiblast

# copy data over to $TMPDIR on compute node
cd $PBS_O_WORKDIR
cp query.fasta $TMPDIR
cp db/benchmark.fasta* $TMPDIR

# Break the database into 10 pieces
cd $TMPDIR
/usr/bin/time mpiformatdb -N 10 -i benchmark.fasta -o T -p T
cp benchmark.fasta* /nfs/proj01/PZS0002/biosoftw/db/

# run mpiblast
/usr/bin/time mpiexec -n 12 mpiblast -p blastp -d benchmark.fasta -i query.fasta -o blast_results.txt

# Copy output back to working directory
mkdir $PBS_O_WORKDIR/$PBS_JOBID
cp blast_results.txt $PBS_O_WORKDIR/$PBS_JOBID
cd $PBS_O_WORKDIR
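The relationship between the PBS request and the mpiexec process count in the script above can be sketched as follows. This is an illustration only: the function name pbs_mpiblast_plan is hypothetical, and it assumes one MPI process per requested core together with the two reserved processes described in the section on querying the database.

```shell
# Hypothetical helper (not part of mpiBLAST or PBS): for a request of the
# form nodes=<nodes>:ppn=<ppn>, print the process count to pass to
# "mpiexec -n" and the number of workers that leaves.
# Assumptions: one MPI process per requested core; 2 processes reserved
# for scheduling and output; at least 3 processes in total.
pbs_mpiblast_plan() {
    plan_nodes=$1
    plan_ppn=$2
    plan_procs=$(( plan_nodes * plan_ppn ))
    if [ "$plan_procs" -lt 3 ]; then
        echo "mpiBLAST needs at least 3 processes, request gives $plan_procs" >&2
        return 1
    fi
    echo "procs=$plan_procs workers=$(( plan_procs - 2 ))"
}

# The sample script requests nodes=1:ppn=12.
pbs_mpiblast_plan 1 12
```

For the sample script this gives 12 processes and 10 workers, which happens to equal the 10 fragments created by the mpiformatdb -N 10 step.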