mpiexec

Introduction

mpiexec is a replacement for the mpirun script that is part of the mpich package. It is used to initialize a parallel job from within a pbs batch or interactive environment. It also generates the environment variables and configuration files necessary to initialize a parallel program using the GM message-passing library for Myrinet.

mpiexec uses the task manager library tm(3B), of pbs(1B), to spawn copies of the executable on all the nodes in a pbs allocation. It is functionally equivalent to

rsh node "cd $cwd; $SHELL -c exec executable arguments"

using the current working directory from where mpiexec is invoked, and the shell specified in the environment, or from the password file.
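
To make the equivalence concrete, the following sketch prints (without running) one such rsh command line per allocated processor. It is only an illustration of the description above, not how mpiexec is implemented, since mpiexec spawns through the tm library rather than rsh. The function name print_equivalent_rsh and the /bin/sh fallback are assumptions of this sketch.

```shell
# print_equivalent_rsh NODEFILE EXECUTABLE [ARGS...]
# Print, without running them, the rsh command lines that the spawning
# step is described as equivalent to, one per line of NODEFILE.
print_equivalent_rsh() {
    nodefile=$1
    shift
    cwd=$(pwd)
    shell=${SHELL:-/bin/sh}   # fallback is an assumption of this sketch
    while read -r node; do
        printf 'rsh %s "cd %s; %s -c exec %s"\n' "$node" "$cwd" "$shell" "$*"
    done < "$nodefile"
}

# Inside a PBS job, $PBS_NODEFILE names the file listing the allocated nodes:
#   print_equivalent_rsh "$PBS_NODEFILE" a.out -b 4
```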

Unless you specify the -bg option, the standard input of the mpiexec process is forwarded to task number zero in the parallel job, allowing for use of the construct

mpiexec mycode < inputfile

This behavior can be modified using the -nostdin or -allstdin flags. Standard output and error are also forwarded to mpiexec, allowing redirection of the outputs of all processes. This can be turned off using -nostdout so that the standard output and error streams go through the normal PBS mechanisms, to the batch job output files, or to your terminal in the case of an interactive job. See qsub(1) for more information.

Version

Version 0.66 is currently available at OSC.

Availability

mpiexec is available on the Itanium 2 Cluster, Altix and the Glenn Cluster.

Usage

mpiexec is in the standard execute path as /usr/local/bin/mpiexec. It is executed as:

mpiexec options... executable args
mpiexec options... -config configfile

Options

All options may be introduced using either a single dash or double dashes, as is common in most gnu utilities. Options may be shortened so long as they remain unambiguous. Options which require arguments may take them as separate words in the argument list, or separated from the option by an equals sign, again as is popular in gnu utilities.

  • -n numproc
    Use only the specified number of processes. Default is to use all which were provided in the pbs environment.
  • -verbose
    Talk more about what mpiexec is doing.
  • -nostdin
    Do not connect the standard input stream of process 0 to the mpiexec process. If the process attempts to read from stdin, it will see an end-of-file.
  • -allstdin
    Send the standard input stream of mpiexec to all processes. Each character typed to mpiexec (or read from a file) is duplicated numproc times, and sent to each process. This permits every process to read, for example, configuration information from the input stream.
  • -nostdout
    Do not connect the standard output and error streams of each process back to the mpiexec process. Output on these streams will go through the normal PBS mechanisms instead, to wit: files of the form job.ojobid and job.ejobid for batch jobs, and directly to the controlling terminal for interactive jobs.
  • -comm type
    Specify the communication library used by your code. Each MPI library has different mechanisms for starting all the processes of a parallel job, thus you must specify to mpiexec which library you use so that it can set up the environment of the processes correctly. The argument type must be one of: mpich-gm, mpich-p4, lam, emp, none; although the code may not have been compiled with support for some of those.
  • -pernode (SMP only)
    Allocate only one process per compute node. For SMP nodes, only one processor will be allocated a job. This flag is used to implement multiple level parallelism with MPI between nodes, and threads within a node, assuming the code is set up to do that. Like -perif, this flag also utilizes only some of the processors allocated to the job by pbs.
  • -perif (MPICH/GM with multiple myrinet cards only)
    Allocate only one process per myrinet interface. If the pbs node list specification includes processors which would share a myrinet interface, only one of the conflicting processors in each set is chosen to run a job. This flag can be used to ensure maximum communication bandwidth available to each process, at the expense of wasting some processors.
  • -no-shmem (MPICH/GM on SMP only)
    Instruct GM not to use shared memory communications between processes on the same SMP node. Using shared memory (the default) has generally higher throughput and always lower latency, but might result in heavy cache misses.
  • -gige (MPICH/P4 or EMP only)
    Use an alternate hostname for message passing. Processes will be spawned using a separate namespace for their message passing communications. This is necessary if you use, say, one ethernet card and namespace within PBS, and another ethernet card for message passing. This currently assumes hostnames are of the form node<num>, and translates them to gige<num>. Later versions will hopefully be more flexible.
  • -totalview, -tv
    Debug using totalview. The process on node zero attempts to open an X window to $DISPLAY, and all processes are attached by totalview threads. See totalview(1) for more information.
  • -kill
    If any one of the processes dies, wait a little, then kill all the other processes in the parallel job. Your message passing library should handle this for you in most circumstances.
  • -config configfile
    Process executable and arguments are specified in the given configuration file. This flag permits the use of heterogeneous jobs using multiple executables, architectures, and command line arguments. No executable is given on the command line when using the -config flag. If configfile is "-", then the configuration is read from standard input. In this case the flag -nostdin is mandatory, as it is not possible to separate the contents of the configuration file from process input.
  • -version
    Display the mpiexec version number and configure arguments.
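
The rule that options may be shortened while they remain unambiguous can be illustrated with a small prefix matcher. This is a sketch of the rule as stated, not mpiexec's actual option parser; the function name resolve_option is hypothetical, and letting an exact match win over longer names is an assumption of the sketch.

```shell
# resolve_option ABBREV NAME...
# Print the single NAME that ABBREV abbreviates. An exact match wins
# (an assumption of this sketch); otherwise ABBREV must be a prefix of
# exactly one NAME, or the function fails with a message on stderr.
resolve_option() {
    abbrev=$1
    shift
    match=
    count=0
    for name in "$@"; do
        if [ "$name" = "$abbrev" ]; then
            echo "$name"
            return 0
        fi
        case $name in
            "$abbrev"*) match=$name; count=$((count + 1)) ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "$match"
    else
        echo "ambiguous or unknown: $abbrev" >&2
        return 1
    fi
}

# "conf" is a prefix of config only, so it resolves.
resolve_option conf comm config verbose version   # prints: config
```

Under this rule "co" would be rejected because it is a prefix of both comm and config, and "ver" because it is a prefix of both verbose and version.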

CONFIG File

Each line of a configuration file contains a node specification and a command line, separated by a single colon (:). A command line consists of an executable name and arguments to be passed to that executable, just like when running mpiexec without a config file. A node specification can be either:

  • -n numproc
    Run the executable on a certain number of processors.
  • nodespec
    Run the executable on the named nodes specified by nodespec

A nodespec is a space-separated list of hostnames. Each element in the list is interpreted as a case-insensitive standard shell wildcard pattern (see glob(7) and fnmatch(3)), so a single element may match multiple hostnames. It is not an error to specify nodes in the nodespec which are not actually part of the pbs allocation. This allows a single generic configuration file to be used in multiple situations.

config file Example

node03 node04 node1* : myexec -s 4
-n 5 : otherexe -f 2 -large

Run the code myexec on node03, node04, and any machine with a hostname matching node1*. Pick five other nodes on which to run otherexe.
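
The wildcard matching of a nodespec can be approximated with the shell's own pattern matching. The sketch below is an illustration only: the function name match_nodespec and the list of allocated hosts are hypothetical, and lowercasing both sides with tr is a simplification of case-insensitive matching that suits plain patterns such as node1*.

```shell
# match_nodespec PATTERN... < hostnames
# Print each hostname read from stdin that matches at least one pattern,
# comparing case-insensitively by lowercasing both sides.
match_nodespec() {
    while read -r host; do
        lhost=$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')
        for pat in "$@"; do
            lpat=$(printf '%s' "$pat" | tr '[:upper:]' '[:lower:]')
            case $lhost in
                $lpat) echo "$host"; break ;;
            esac
        done
    done
}

# Hypothetical allocation. node01 and node20 match no pattern and are
# skipped; a pattern that matches no allocated host is simply unused.
printf '%s\n' node01 node03 NODE04 node10 node12 node20 |
    match_nodespec 'node03' 'node04' 'node1*'
```

Quoting each pattern at the call site keeps the invoking shell from expanding node1* against file names in the current directory.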

Note that each node listed in a node specification is chosen only once to run a given process. If using multi-processor nodes, and you do want to run two or more copies of the code on a given node, list that node twice, or duplicate the config file entry. Also note that node-anonymous specifications (e.g., -n 6) may choose other processors on a node which already has processes assigned; use the -pernode flag on the command line if you want node-exclusive behavior.

There is no way to run more than one process per processor using mpiexec. You must explicitly spawn threads in your code if you wish to do this. The presence of a -n argument on the command line limits the total number of processors available to the configuration file selection process, just as the flags -perif and -pernode limit the available nodes.
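
As a worked instance of these placement rules, the sketch below writes a small config file from a shell here-document. The host name node03 and the executables myexec and otherexe are placeholders carried over from the example above, and the file name my.config is hypothetical.

```shell
# Write a config file that runs two copies of myexec on node03 (the node
# is listed twice, once per copy) and two copies of otherexe on whichever
# allocated processors remain, possibly on nodes that already have
# processes assigned.
cat > my.config <<'EOF'
node03 node03 : myexec -s 4
-n 2 : otherexe -f 2 -large
EOF

# Inside a PBS job the file would then be passed to mpiexec as:
#   mpiexec -config my.config
```

This layout needs at least four allocated processors, two of them on node03, because mpiexec does not run more than one process per processor.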

Examples

mpiexec a.out

Run the executable a.out as a parallel mpi code on each processor allocated by pbs.

mpiexec -n 2 a.out -b 4

Run the code with arguments -b 4 on only two processors.

mpiexec -pernode -conf my.config

Run only one process on each node, using the nodes and executables listed in the configuration file my.config.

mpiexec mycode >out 2>err

Using a sh-compatible shell, send the standard output of all processes to the file out, and the standard error to err.

mpiexec mycode >& output

Using a csh-compatible shell, combine the standard output and error streams of all processes to the file output.

mpiexec mycode | sort > output

Sort the output of the processes. Standard error will appear as the standard error of the mpiexec process.

mpiexec -comm none -pernode mkdir /tmp/my-temp-dir

Run the standard unix command mkdir on each of the SMP nodes in your PBS allocation for this job.

mpiexec -comm mpich-p4 mycode-p4

Run a code compiled using MPICH/P4, even though your system administrator has chosen MPICH/GM as a default.

Environment Variables

mpiexec uses PBS_JOBID as deposited in the environment by pbs to contact the pbs daemons. When looking for the executable to run, the PATH environment variable is consulted and the current working directory is searched as well. Jobs are started using SHELL on all the nodes. For totalview debugging runs, the settings in DISPLAY and LM_LICENSE_FILE may be important.

Note that mpiexec does pass all variables in the environment which it was given, but PBS will not copy your entire environment for batch jobs at job submission time unless you invoke qsub using the -V argument.
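
Because mpiexec depends on PBS_JOBID, a wrapper script can check for it before trying to launch anything. The function below is a minimal sketch of such a guard; the name require_pbs_job is hypothetical and is not part of mpiexec or PBS.

```shell
# require_pbs_job
# Succeed only when PBS_JOBID is set and non-empty, that is, when the
# caller is running inside a PBS batch or interactive job.
require_pbs_job() {
    if [ -z "${PBS_JOBID:-}" ]; then
        echo "PBS_JOBID is not set; run this inside a PBS batch or interactive job" >&2
        return 1
    fi
}

# Typical use inside a job script:
#   require_pbs_job && mpiexec a.out
```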

Files

$HOME/.gmpiconf.job

Contains the processor to Myrinet device/port mapping needed by the MPICH/GM communications library. Created and deleted, if no errors occurred, by mpiexec for each job.

/tmp/gmpi-shmem.PBS_JOBID

Shared memory file used by GM on a node. Automatically created and destroyed by GM.

Errors

tm: not connected

A fatal error occurred in communications between the mpiexec process and the local pbs_mom. This might occur due to bugs in pbs_mom, and is not recoverable.

Documentation

The mpiexec command is documented as a man page: man mpiexec.

See Also
mpirun(1), pbs(1B), tm(3B), qsub(1B), totalview(1)

Authors
Pete Wyckoff <pw@osc.edu>
Dave Heisterberg
Doug Johnson <djohnson@osc.edu>