Batch Processing at OSC

OSC has recently switched schedulers from PBS to Slurm.
Please see the slurm migration pages for information about how to convert commands.

Batch processing

Efficiently using computing resources at OSC requires using the batch processing system. Batch processing refers to submitting requests to the system to use computing resources.

The only access to significant resources on the HPC machines is through the batch process. This guide will provide an overview of OSC's computing environment, and provide some instruction for how to use the batch system to accomplish your computing goals.

If you need additional assistance, please do not hesitate to contact OSC Help.

Batch System Concepts

The only access to significant resources on the HPC machines is through the batch process.

Why use a batch system?

Access to the OSC clusters is through a system of login nodes. These nodes are reserved solely for the purpose of managing your files and submitting jobs to the batch system. Acceptable activities include editing/creating files, uploading and downloading files of moderate size, and managing your batch jobs. You may also compile and link small-to-moderate size programs on the login nodes.

CPU time and memory usage are severely limited on the login nodes. There are typically many users on the login nodes at one time. Extensive calculations would degrade the responsiveness of those nodes.

If a process started on a login node uses too much CPU time or memory, it may be killed without warning.

The batch system allows users to submit jobs requesting the resources (nodes, processors, memory, GPUs) that they need. The jobs are queued and then run as resources become available. The scheduling policies in place on the system are an attempt to balance the desire for short queue waits against the need for efficient system utilization.

Interactive vs. batch

When you type commands in a login shell and see a response displayed, you are working interactively. To run a batch job, you put the commands into a text file instead of typing them at the prompt. You submit this file to the batch system, which will run it as soon as resources become available. The output you would normally see on your display goes into a log file. You can check the status of your job interactively and/or receive emails when it begins and ends execution.

Terminology

The batch system used at OSC is SLURM. A central manager, slurmctld, monitors resources and work. You’ll need to understand the terms cluster, node, and processor (core) in order to request resources for your job. See HPC basics if you need this background information.

The words “parallel” and “serial” as used by SLURM can be a little misleading. From the point of view of the batch system a serial job is one that uses just one node, regardless of how many processors it uses on that node. Similarly, a parallel job is one that uses more than one node. More standard terminology considers a job to be parallel if it involves multiple processes.

Batch processing overview

Here is a very brief overview of how to use the batch system.

Choose a cluster

Before you start preparing a job script you should decide which cluster you want your job to run on, Owens or Pitzer. This decision will probably be based on the resources available on each system. Remember which cluster you’re using because the batch systems are independent.

Prepare a job script

Your job script is a text file that includes SLURM directives as well as the commands you want executed. The directives tell the batch system what resources you need, among other things. The commands can be anything you would type at the login prompt. You can prepare the script using any editor.

Submit the job

You submit your job to the batch system using the sbatch command, with the name of the script file as the argument. The sbatch command responds with the job ID that was given to your job, typically a 6- or 7-digit number.
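
The job ID is worth capturing if you want to refer to the job later in the same shell session. The following is a minimal sketch, assuming sbatch prints its usual confirmation line of the form "Submitted batch job <id>"; the helper name parse_job_id and the script name myscript.sh are hypothetical, and a sample string stands in for real sbatch output so the sketch runs anywhere.

```bash
#!/bin/bash
# Extract the job ID from sbatch's confirmation line.
# Assumed format: "Submitted batch job <id>"
parse_job_id() {
  # The ID is the fourth whitespace-separated field of the line.
  printf '%s\n' "$1" | awk '{ print $4 }'
}

# Sample text stands in for real sbatch output (an assumption for this demo).
sample_output="Submitted batch job 1234567"
job_id="$(parse_job_id "$sample_output")"
echo "Job ID: $job_id"

# In a real session (myscript.sh is a hypothetical script name):
#   job_id="$(parse_job_id "$(sbatch myscript.sh)")"
```

As an alternative, sbatch --parsable prints the job ID without the surrounding text.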

Wait for the job to run

Your job may wait in the queue for minutes or days before it runs, depending on system load and the resources requested. It may then run for minutes or days. You can monitor your job’s progress or just wait for an email telling you it has finished.

Retrieve your output

The log file (screen output) from your job will be in the directory you submitted the job from by default. Any other output files will be wherever your script put them.

Batch Execution Environment

Shell and initialization

Your batch script executes in a shell on a compute node. The environment is identical to what you get when you connect to a login node except that you have access to all the resources requested by your job. The shell that Slurm uses is determined by the first line of the job script (it is by default #!/bin/bash). The appropriate “dot-files” ( .login , .profile , .cshrc ) will be executed, the same as when you log in. (For information on overriding the default shell, see the Job Scripts section.)

The job begins in the directory that it was submitted from. You can use the cd command to change to a different directory. The environment variable $SLURM_SUBMIT_DIR makes it easy to return to the directory from which you submitted the job:

cd $SLURM_SUBMIT_DIR

Modules

There are dozens of software packages available on OSC’s systems, many of them with multiple versions. You control what software is available in your environment by loading the module for the software you need. Each module sets certain environment variables required by the software.

If you are running software that was installed by OSC, you should check the software documentation page to find out what modules to load.

Several modules are automatically loaded for you when you log in or start a batch script. These default modules include:

modules required by the batch system
the Intel compiler suite
an MPI package compatible with the default compiler (for parallel computing)

The module command has a number of subcommands. For more details, type module help.

Certain modules are incompatible with each other and should never be loaded at the same time. Examples are different versions of the same software or multiple installations of a library built with different compilers.

Note to those who build or install their own software: Be sure to load the same modules when you run your software that you had loaded when you built it, including the compiler module.

Each module has both a name and a version number. When more than one version is available for the same name, one of them is designated as the default. For example, the following modules are available for the Intel compilers on Owens: (Note: The versions shown might be out of date but the concept is the same.)

intel/12.1.0 (default)
intel/12.1.4.319

If you specify just the name, it refers to the default version or the currently loaded version, depending on the context. If you want a different version, you must give the entire string including the version information.

You can have only one compiler module loaded at a time, either intel, pgi, or gnu. The intel module is loaded initially; to change to pgi or gnu, do a module swap (see example below).

Some software libraries have multiple installations built for use with different compilers. The module system will load the one compatible with the compiler you have loaded. If you swap compilers, all the compiler-dependent modules will also be swapped.

Special note to gnu compiler users: While the gnu compilers are always in your path, you should load the gnu compiler module to ensure you are linking to the correct library versions.

To list the modules you have loaded:

module list

To see all modules that are compatible with your currently loaded modules:

module avail

To see all modules whose names start with fftw:

module avail fftw

To see all possible modules:

module spider

To see all possible modules whose names start with fftw:

module spider fftw

To load the fftw3 module that is compatible with your current compiler:

module load fftw3

To unload the fftw3 module:

module unload fftw3

To load the default version of the abaqus module (not compiler-dependent):

module load abaqus

To load a different version of the abaqus module:

module load abaqus/6.8-4

To unload whatever abaqus module you have loaded:

module unload abaqus

To unload all modules:

module purge

To reset to default starting modules:

module reset

To swap the intel compilers for the pgi compilers (unloads intel, loads pgi):

module swap intel pgi

To swap the default version of the intel compilers for a different version:

module swap intel intel/12.1.4.319

To display help information for the mkl module:

module help mkl

To display the commands run by the mkl module:

module show mkl

To use a locally installed module, first import the module directory:

module use [/path/to/modulefiles]

And then load the module:

module load localmodule

Slurm environment variables

Your batch execution environment has all the environment variables that your login environment has plus several that are set by the batch system. This section gives examples for using some of them. For more information see man sbatch.

Directories

Several directories may be useful in your job.

The absolute path of the directory your job was submitted from is $SLURM_SUBMIT_DIR.

Each job has a temporary directory, $TMPDIR , on the local disk of each node assigned to it. Access to this directory is much faster than access to your home or project directory. The files in this directory are not visible from all the nodes in a parallel job; each node has its own directory. The batch system creates this directory when your job starts and deletes it when your job ends. To copy file input.dat to $TMPDIR on your job’s first node:

cp input.dat $TMPDIR

For a parallel job, to copy file input.dat to $TMPDIR on all your job’s nodes:

sbcast input.dat $TMPDIR/input.dat

Each job also has a temporary directory, $PFSDIR, on the parallel scratch file system, if you request it by adding the "pfsdir" attribute to the batch request (--gres=pfsdir). This is a single directory shared by all the nodes a job is running on. Access is faster than access to your home or project directory but not as fast as $TMPDIR. The batch system creates this directory when your job starts and deletes it when your job ends. To copy the file output.dat from this directory to the directory you submitted your job from:

cp $PFSDIR/output.dat $SLURM_SUBMIT_DIR

The $HOME environment variable refers to your home directory. It is not set by the batch system but is useful in some job scripts. It is better to use $HOME than to hardcode the path to your home directory. To access a file in your home directory:

cat $HOME/myfile

Job information

A list of the nodes and cores assigned to your job is obtained using srun hostname |sort -n

For GPU jobs, the number of GPUs assigned to your job on each node is in the variable $SLURM_GPUS_ON_NODE. To display it:

echo $SLURM_GPUS_ON_NODE

If you use a job array, each job in the array gets its identifier within the array in the variable $SLURM_ARRAY_TASK_ID ($SLURM_ARRAY_JOB_ID holds the ID of the array as a whole). To pass a file name parameterized by the array ID into your application:

./a.out input_$SLURM_ARRAY_TASK_ID.dat
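
As a small illustration of a job array script, here is a hedged sketch that builds the input file name from the per-task index, which Slurm provides in $SLURM_ARRAY_TASK_ID. The array range 1-4 and the file naming scheme are assumptions made for the example; outside a batch job the variable is unset, so the sketch falls back to index 1.

```bash
#!/bin/bash
#SBATCH --array=1-4          # hypothetical range: four array tasks

# Per-task index within the array; default to 1 when run outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: input_1.dat, input_2.dat, ...
input_file="input_${task_id}.dat"
echo "Array task ${task_id} would process ${input_file}"

# In a real job the application would be launched here, for example:
#   ./a.out "$input_file"
```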

To display the numeric job identifier assigned by the batch system:

echo $SLURM_JOB_ID

To display the job name:

echo $SLURM_JOB_NAME

Use fast storage

If your job does a lot of file-based input and output, your choice of file system can make a huge difference in the performance of the job.

Shared file systems

Your home directory is located on shared file systems, providing long-term storage that is accessible from all OSC systems. Shared file systems are relatively slow. They cannot handle heavy loads such as those generated by large parallel jobs or many simultaneous serial jobs. You should minimize the I/O your jobs do on the shared file systems. It is usually best to copy your input data to fast temporary storage, run your program there, and copy your results back to your home directory.
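
The copy-in, run, copy-out pattern can be sketched as follows. This is an illustration under stated assumptions rather than a production script: the file names are hypothetical, tr stands in for a real program, and temporary stand-in directories are used when the Slurm variables are unset so that the sketch runs outside a batch job.

```bash
#!/bin/bash
# Stage-in / compute / stage-out sketch.
# Outside a batch job, fall back to throwaway directories (demo assumption).
submit_dir="${SLURM_SUBMIT_DIR:-$(mktemp -d)}"
work_dir="$(mktemp -d "${TMPDIR:-/tmp}/stage_demo.XXXXXX")"

# Hypothetical input file; a real job would already have its data.
printf 'alpha\nbeta\n' > "$submit_dir/input.dat"

# 1. Stage in: copy the input to fast temporary storage.
cp "$submit_dir/input.dat" "$work_dir/"

# 2. Compute inside the temporary directory (tr stands in for a real program).
( cd "$work_dir" && tr 'a-z' 'A-Z' < input.dat > output.dat )

# 3. Stage out: copy the results back to long-term storage.
cp "$work_dir/output.dat" "$submit_dir/"

# 4. Clean up the demo directory (the batch system removes $TMPDIR itself).
rm -rf "$work_dir"
echo "Results are in $submit_dir/output.dat"
```

In a real job the working directory would simply be $TMPDIR, so only the heavy I/O in step 2 touches the fast storage.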

Batch-managed directories

Batch-managed directories are temporary directories that exist only for the duration of a job. They exist on two types of storage: disks local to the compute nodes and a parallel scratch file system.

A big advantage of batch-managed directories is that the batch system deletes them when a job ends, preventing clutter on the disk.

A disadvantage of batch-managed directories is that you can’t access them after your job ends. Be sure to include commands in your script to copy any files you need to long-term storage. To avoid losing your files if your job ends abnormally, for example by hitting its walltime limit, include a trap command in your script (Note: trap commands do not work in csh and tcsh shell batch scripts). The following example creates a subdirectory in $SLURM_SUBMIT_DIR and copies everything from $TMPDIR into it in case of abnormal termination.

trap "cd $SLURM_SUBMIT_DIR;mkdir $SLURM_JOB_ID;cp -R $TMPDIR/* $SLURM_JOB_ID;exit" TERM

If a node your job is running on crashes, the trap command may not be executed. It may be possible to recover your batch-managed directories in this case. Contact OSC Help for assistance. For other details on retrieving files from unexpectedly terminated jobs, see this FAQ.

Local disk space

The fastest storage is on a disk local to the node your job is running on, accessed through the environment variable $TMPDIR . The main drawback to local storage is that each node of a parallel job has its own directory and cannot access the files on other nodes.

Local disk space should be used only through the batch-managed directory created for your job. Please do not use /tmp directly because your files won’t be cleaned up properly.

Parallel file system

The parallel file system, including project directory and scratch directory, is faster than the shared file systems for large-scale I/O and can handle a much higher load. It is efficient for reading and writing data in large blocks and should not be used for I/O involving many small accesses.

The scratch file system can be used through the batch-managed directory created for your job. The path for this directory is in the environment variable $PFSDIR . You should use it when your files must be accessible by all the nodes in your job and also when your files are too large for the local disk.

You may also create a directory for yourself in the scratch file system and use it the way you would use any other directory. This directory will not be backed up; files are subject to deletion after some number of months.

Note: You should not copy your executable files to $PFSDIR. They should be run from your home directories or from $TMPDIR.

Job Scripts

Known Issue

Combining the --ntasks and --ntasks-per-node options in a job script can cause unexpected resource allocations and placement due to a bug in Slurm 23. OSC users are strongly encouraged to review their job scripts for jobs that request both --ntasks and --ntasks-per-node. Jobs should request either --ntasks or --ntasks-per-node, not both.

A job script is a text file containing job setup information for the batch system followed by commands to be executed. It can be created using any text editor and may be given any name. Some people like to name their scripts something like myscript.job or myscript.sh, but myscript works just as well.

A job script is simply a shell script. It consists of Slurm directives, comments, and executable statements. The # character indicates a comment, although lines beginning with #SBATCH are interpreted as Slurm directives. Blank lines can be included for readability.

SBATCH header lines
Resource limits
Executable section
Considerations for parallel jobs
Batch script examples

SBATCH header lines

A job script must start with a shebang #! (#!/bin/bash is commonly used but you can choose others) followed by several lines starting with #SBATCH. These are Slurm SBATCH directives or header lines. They provide job setup information used by Slurm, including resource requests, email options, and more. The header lines may appear in any order, but they must precede any executable lines in your script. Alternatively, you may provide these directives (without the #SBATCH notation) on the command line with the sbatch command.

$ sbatch --job-name=test_job myscript.sh

Resource limits

The options described below are used to request resources, including nodes, memory, time, and software flags.

Walltime

The walltime limit is the maximum time your job will be allowed to run, given in minutes or hours:minutes:seconds. This is elapsed time. If your job exceeds the requested time, the batch system will kill it. If your job ends early, you will be charged only for the time used.

The default value for walltime is 1:00:00 (one hour).

To request 20 hours of wall clock time:

#SBATCH --time=20:00:00

It is important to carefully estimate the time your job will take. An underestimate will lead to your job being killed. A large overestimate may prevent your job from being backfilled or fitting into an empty time slot.

Tasks, cores (cpu), nodes and GPUs

Resource limits specify not just the number of nodes but also the properties of those nodes. The properties differ between clusters but may include the number of cores per node, the number of GPUs per node (gpus), and the type of node.

SLURM uses the term task, which can be thought of as the number of processes started.

Getting the number of tasks and the number of cores per task right is important when using an MPI launcher such as srun.

Serial job

A serial job in this context refers to a job requesting resources that are included in a single node. For example, if a node contains 40 cores, a job that requests 20 of those cores and another job that requests all 40 cores are both serial jobs.

To request one CPU core (sequential job), do not add any SLURM directives. The default is one node, one core, and one task.

To request 6 CPU cores on one node, in a single process:

#SBATCH --cpus-per-task=6

Parallel job

To request 4 nodes and run one task on each, with each task using 40 cores:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40

To request 4 nodes with 10 tasks per node (the default is 1 core per task, unless --cpus-per-task is used to set it manually):

#SBATCH --nodes=4 --ntasks-per-node=10
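
The arithmetic behind a request like this can be checked inside the job itself. The sketch below assumes the variables SLURM_JOB_NUM_NODES, SLURM_NTASKS_PER_NODE, and SLURM_CPUS_PER_TASK are set by the batch system when the corresponding options are given; the fallback values (4 nodes, 10 tasks per node, 1 core per task) are only there so the sketch runs outside a batch job.

```bash
#!/bin/bash
# Work out how many tasks and cores a request amounts to.
# Fallbacks mirror the example request above and are demo assumptions.
nodes="${SLURM_JOB_NUM_NODES:-4}"
tasks_per_node="${SLURM_NTASKS_PER_NODE:-10}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-1}"

# Total tasks = nodes x tasks per node.
total_tasks=$(( nodes * tasks_per_node ))

# Total cores in use = total tasks x cores per task.
total_cores=$(( total_tasks * cpus_per_task ))

echo "nodes=$nodes tasks=$total_tasks cores=$total_cores"
```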

Under our current scheduling policy a parallel job (which uses more than one node) is always given full nodes. You can easily use just part of each node even if the entire nodes are allocated (see the section srun in parallel jobs).

Compute nodes on the Pitzer cluster have 40 or 48 cores per node. A job can be restricted to 40-core (or 48-core) nodes by using --constraint:

#SBATCH --constraint=40core

GPU job

To request 2 nodes with 2 GPUs each (2-GPU nodes are only available on Pitzer):

#SBATCH --nodes=2
#SBATCH --gpus-per-node=2

To request one node with 12 cores and 2 GPUs:

#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --gpus-per-node=2

Memory

The memory limit is the amount of memory needed on each node; Slurm's --mem option is a per-node request. There is no need to specify a memory limit unless you need a large-memory node or your memory requirements are disproportionate to the number of cores you are requesting. For parallel jobs you should usually request whole nodes and omit the memory limit.

Default units are megabytes, but values are usually given with an explicit suffix in megabytes (--mem=4000MB) or gigabytes (--mem=4GB).

To request 4GB memory (see note below):

#SBATCH --mem=4gb

#SBATCH --mem=4000mb

To request 24GB memory:

#SBATCH --mem=24000mb

Note: The amount of memory available per node is slightly less than the nominal amount. If you want to request a fraction of the memory on a node, we recommend you give the amount in MB, not GB; 24000MB is less than 24GB. (Powers of 2 vs. powers of 10 -- ask a computer science major.)
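
The difference between the two units is easy to work out. This sketch assumes the scheduler counts 1 GB as 1024 MB (powers of 2), which is the reading the note above implies.

```bash
#!/bin/bash
# Compare a 24GB request with a 24000MB request, assuming 1 GB = 1024 MB.
gb=24
gb_in_mb=$(( gb * 1024 ))          # what "24GB" amounts to in MB
request_mb=24000                   # the recommended MB-based request
headroom_mb=$(( gb_in_mb - request_mb ))

echo "${gb}GB = ${gb_in_mb}MB; requesting ${request_mb}MB leaves ${headroom_mb}MB of headroom"
```

The headroom is what lets a 24000MB request fit on a node whose usable memory is slightly below its nominal 24GB.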

Software licenses

If you are using a software package with a limited number of licenses, you should include the license requirement in your script. See the OSC documentation for the specific software package for details.

Example requesting five abaqus licenses:

#SBATCH --licenses=abaqus@osc:5

Job name

You can optionally give your job a meaningful name. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input. The job name is used as part of the name of the job log files; it also appears in lists of queued and running jobs. The name may be up to 15 characters in length, no spaces are allowed, and the first character must be alphabetic.

Example:

#SBATCH --job-name=my_first_job

Mail options

You may choose to receive email when your job begins, when it ends, and/or when it fails. The email will be sent to the address we have on record for you. You should use only one --mail-type=<type> directive and include all the options you want.

To receive an email when your job begins, ends or fails:

#SBATCH --mail-type=BEGIN,END,FAIL

To receive an email for all types:

#SBATCH --mail-type=ALL

The default email recipient is the submitting user, but you can include other users or email addresses:

#SBATCH --mail-user=osu1234,osu4321,username@osu.edu

Job log files

By default, Slurm directs both standard output and standard error to one log file. For job 123456, the log file will be named slurm-123456.out. You can specify a name for the log file.

#SBATCH --output=myjob.out.%j

where the %j is replaced by the job ID.

Identify Project

Job scripts are required to specify a project account.

Get a list of current projects by using the OSCfinger command and looking in the SLURM accounts section:

OSCfinger userex
Login: userex                                     Name: User Example
Directory: /users/PAS1234/userex (CREATED)        Shell: /bin/bash
E-mail: user-ex@osc.edu
Contact Type: REGULAR
Primary Group: pas1234
Groups: pas1234,pas4321
Institution: Ohio Supercomputer Center
Password Changed: Dec 11 2020 21:05               Password Expires: Jan 12 2021 01:05 AM
Login Disabled: FALSE                             Password Expired: FALSE
SLURM Enabled: TRUE
SLURM Clusters: owens,pitzer
SLURM Accounts: pas1234,pas4321 <<===== Look at me !!
SLURM Default Account: pas1234
Current Logins:

To specify an account use:

#SBATCH --account=PAS4321

For more details on errors you may see when submitting a job, see messages from sbatch.

Executable section

The executable section of your script comes after the header lines. The content of this section depends entirely on what you want your job to do. We mention just two commands that you might find useful in some circumstances. They should be placed at the top of the executable section if you use them.

Command logging

The set -x command (set echo in csh) is useful for debugging your script. It causes each command in the batch file to be printed to the log file as it is executed, with a + in front of it. Without this command, only the actual display output appears in the log file.

To echo commands in bash or ksh:

set -x

To echo commands in tcsh or csh:

set echo on

Signal handling

Slurm sends signals to a job that is about to be killed, first to allow it to stop gracefully and then to terminate it immediately. This happens, for example, when the job runs out of walltime or is killed for exceeding its memory. In both cases, the job may stop before all the commands in the job script have been executed.

The sbatch flag --signal can be used to have a signal sent to the job ahead of its time limit, so that commands in a handler can be run when the signal is received.

Below is an example:

#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  cd $SLURM_SUBMIT_DIR
  mkdir $SLURM_JOB_ID
  cp -R $TMPDIR/* $SLURM_JOB_ID
  exit
}

trap my_handler USR1
trap my_handler TERM

my_process &
wait

Signal handling is typically used to copy output files from a temporary directory to a home or project directory. The example above creates a directory in $SLURM_SUBMIT_DIR and copies everything from $TMPDIR into it. The handler executes only if the job terminates abnormally. In some cases, even with signal handling, the job still may not be able to execute the handler.

The & and wait are needed after starting the process so that the user-defined signal can be handled while the process is still running. See the signal handling section of the Slurm migration issues page for details.
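
The effect of the background-plus-wait pattern can be tried without Slurm. In the sketch below, sleep stands in for my_process and a short-lived subshell stands in for the batch system by sending USR1 to the script after one second; both are assumptions made for the demonstration. Because the shell is sitting in wait rather than in a foreground command, the handler runs as soon as the signal arrives.

```bash
#!/bin/bash
# Demo of why a job script starts its work with "&" and then calls "wait".
caught=no
my_handler() {
  caught=yes
  echo "Handler ran while the work was still in progress"
}
trap my_handler USR1

# PID of this shell, the target of the signal.
self="${BASHPID:-$$}"

# Stand-in for my_process, started in the background.
sleep 5 &
work_pid=$!

# Stand-in for the batch system: deliver USR1 after one second.
( sleep 1; kill -USR1 "$self" ) &

# wait is interrupted by the trapped signal, so the handler runs promptly.
wait "$work_pid" || true

# Stop the stand-in workload, which would otherwise keep running.
kill "$work_pid" 2>/dev/null || true
echo "caught=$caught"
```

Had the workload been run in the foreground, the shell would have deferred the handler until the workload finished, which in a real job may be too late.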

For other details on retrieving files from unexpectedly terminated jobs see this FAQ.

Considerations for parallel jobs

Each processor on our system is fast, but the real power of supercomputing comes from putting multiple processors to work on a task. This section addresses issues related to multithreading and parallel processing as they affect your batch script. For a more general discussion of parallel computing see another document.

Multithreading involves a single process, or program, that uses multiple threads to take advantage of multiple cores on a single node. The most common approach to multithreading on HPC systems is OpenMP. The threads of a process share a single memory space.

The more general form of parallel processing involves multiple processes, usually copies of the same program, which may run on a single node or on multiple nodes. These processes have separate memory spaces. When they need to communicate or share data, these processes typically use the Message-Passing Interface (MPI).

A program may use multiple levels of parallelism, employing MPI to communicate between nodes and OpenMP to utilize multiple processors on each node.

For more details on building and running MPI/OpenMP software, see the programming environment pages for Pitzer cluster and Owens cluster.

While many executables will run on any of our clusters, MPI programs must be built on the system they will run on. Most scientific programs will run faster if they are built on the system where they’re going to run.

Script issues in parallel jobs

In a parallel job your script executes on just the first node assigned to the job, so it’s important to understand how to make your job execute properly in a parallel environment. These notes apply to jobs running on multiple nodes.

You can think of the commands (executable lines) in your script as falling into four categories.

Commands that affect only the shell environment. These include such things as cd, module, and export (or setenv). You don’t have to worry about these. The commands are executed on just the first node, but the batch system takes care of transferring the environment to the other nodes.
Commands that you want to have execute on only one node. These might include date or echo. (Do you really want to see the date printed 20 times in a 20-node job?) They might also include cp if your parallel program expects files to be available only on the first node. You don’t have to do anything special for these commands.
Commands that have parallel execution, including knowledge of the batch system, built in. These include sbcast (parallel file copy) and some application software installed by OSC. You should consult the software documentation for correct parallel usage of application software.
Any other command or program that you want to have execute in parallel must be run using srun. Otherwise, it will run on only one node, while the other nodes assigned to the job will remain idle. See examples below.

srun

The srun command runs a parallel job on a cluster managed by Slurm. It is highly recommended to use srun when you run a parallel job with the MPI libraries installed at OSC, including MVAPICH2, Intel MPI and OpenMPI.

The srun command has the form:

srun [srun-options] progname [prog-args]

where srun-options is a list of options to srun, progname is the program you want to run, and prog-args is a list of arguments to the program. Note that if the program is not in your path or not in your current working directory, you must specify the path as part of the name.

By default, srun runs as many copies of progname as there are tasks assigned to the job. For example, if your job requested --ntasks-per-node=8, the following command would run 8 a.out processes (with one core per task by default):

 srun a.out

The example above can be modified to pass arguments to a.out. The following example shows two arguments:

 srun a.out abc.dat 123

If the program is multithreaded, or if it uses a lot of memory, it may be desirable to run fewer processes per node. You can specify --ntasks-per-node to do this. If the job in the above example had requested --nodes=4, the following command would run 8 copies of a.out, two on each node:

 srun --ntasks-per-node=2 --cpus-per-task=20 a.out abc.dat 123
# start 2 tasks on each node, and each task is allocated 20 cores

System commands can also be run with srun. The following commands create a directory named data in the $TMPDIR directory on each node:

cd $TMPDIR
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 mkdir data

sbcast and sgather

If you use $TMPDIR in a parallel job, you probably want to copy files to or from all the nodes. The sbcast and sgather commands are used for this task.

To copy one file into the directory $TMPDIR on all nodes allocated to your job:

sbcast myprog $TMPDIR/myprog

To copy one file from the directory $TMPDIR on all nodes allocated to your job:

sgather -k $TMPDIR/mydata all_data

where the option -k will keep the file on the node, and all_data is the name of the file to be created, with the source node name appended as a suffix. This means that you will see files all_data.node1_name, all_data.node2_name and so on in the current working directory.

To recursively copy a directory from all nodes to the directory where the job is submitted:

sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/mydata

where mydata is the name of the directory to be created, with the source node name appended as a suffix.

You CANNOT use wildcard (*) as the name of the file or directory for sbcast and sgather.

Environment variables for MPI

If your program combines MPI and OpenMP (or another multithreading technique), you should disable processor affinity by setting the environment variable MV2_ENABLE_AFFINITY to 0 in your script. If you don’t disable affinity, all your threads will run on the same core, negating any benefit from multithreading.

To set the environment variable in bash, include this line in your script:

export MV2_ENABLE_AFFINITY=0

To set the environment variable in csh, include this line in your script:

setenv MV2_ENABLE_AFFINITY 0

Environment variables for OpenMP

The number of threads used by an OpenMP program is typically controlled by the environment variable $OMP_NUM_THREADS. If this variable isn't set, the number of threads defaults to the number of cores you requested per node, although it can be overridden by the program.

If your job runs just one process per node and is the only job running on the node, the default behavior is what you want. Otherwise, you should set $OMP_NUM_THREADS to a value that ensures that the total number of threads for all your processes on the node does not exceed the number of cores per node your job requested.

For example, to set the environment variable to a value of 40 in bash, include this line in your script:

export OMP_NUM_THREADS=40

For example, to set the environment variable to a value of 40 in csh, include this line in your script:

setenv OMP_NUM_THREADS 40

Note: Some programs ignore $OMP_NUM_THREADS and determine the number of threads programmatically.
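
One way to keep the thread count within the cores available is to derive it from the allocation instead of hardcoding it. This is a sketch under assumptions: it reads SLURM_CPUS_ON_NODE and SLURM_NTASKS_PER_NODE when the batch system has set them, and otherwise falls back to hypothetical values (40 cores, 2 processes per node) so it runs outside a job.

```bash
#!/bin/bash
# Divide the cores on a node evenly among the processes running there.
# Fallback values are demo assumptions for running outside a batch job.
cores_per_node="${SLURM_CPUS_ON_NODE:-40}"
procs_per_node="${SLURM_NTASKS_PER_NODE:-2}"

threads=$(( cores_per_node / procs_per_node ))
# Never request fewer than one thread.
if [ "$threads" -lt 1 ]; then
  threads=1
fi

export OMP_NUM_THREADS="$threads"
echo "Running $procs_per_node processes per node with $OMP_NUM_THREADS threads each"
```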

Batch script examples

Simple sequential job

The following is an example of a single-task sequential job that uses $TMPDIR as its working area. It assumes that the program mysci has already been built. The script copies its input file from the submission directory into $TMPDIR, runs the code in $TMPDIR, and copies the output files back to the submission directory.

#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=myscience
#SBATCH --time=40:00:00

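# Copy the input file to node-local scratch and work there
cp mysci.in $TMPDIR
cd $TMPDIR
# The executable stays in the submission directory, so call it by its full path
/usr/bin/time $SLURM_SUBMIT_DIR/mysci > mysci.hist
# Copy the results back to the submission directory
cp mysci.hist mysci.out $SLURM_SUBMIT_DIR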

Serial job with OpenMP

The following example runs a multi-threaded program with 8 cores:

#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=1:00:00
#SBATCH --ntasks-per-node=8

cp a.out $TMPDIR
cd $TMPDIR
export OMP_NUM_THREADS=8
./a.out > my_results
cp my_results $SLURM_SUBMIT_DIR

Simple parallel job

Here is an example of a parallel job that uses 4 nodes, running one process per core. To illustrate the module command, this example assumes a.out was built with the GNU compiler. The module swap command is necessary when running MPI programs built with a compiler other than Intel.

#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=10:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28

module swap intel gnu
sbcast a.out $TMPDIR/a.out
cd $TMPDIR
srun a.out
sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/my_mpi_output

Notice that --ntasks-per-node is set to 28, the number of cores on a standard compute node in the Owens cluster.
Refer to the core counts of other clusters and node types when adjusting this value. The Cluster computing page is a good place to start.

Parallel job with MPI and OpenMP

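This example is a hybrid (MPI + OpenMP) job. It runs a fixed number of MPI processes per node with X threads per process, where the total number of threads on a node must be less than or equal to the number of physical cores per node (see the note below). As written, the script starts 2 MPI processes per node with 14 threads each, which fills the 28 cores of an Owens node. The assumption here is that the code was written to support multilevel parallelism. The executable is named hybrid-program.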

#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=20:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28

export OMP_NUM_THREADS=14
export MV2_CPU_BINDING_POLICY=hybrid
sbcast hybrid-program $TMPDIR/hybrid-program
cd $TMPDIR
srun --ntasks-per-node=2 --cpus-per-task=14 hybrid-program
sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/my_hybrid_output

Note that compute nodes on the Pitzer cluster have 40 or 48 cores per node and compute nodes on the Owens cluster have 28 cores per node. If you run one MPI process per node and want X to be all physical cores per node, independent of the cluster, use the environment variable SLURM_CPUS_ON_NODE:

export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE

Job Submission

Job scripts are submitted to the batch system using the sbatch command. Be sure to submit your job on the system you want your job to run on, or use the --cluster=<system> option to specify one.

Standard batch job

Most jobs on our system are submitted as scripts with no command-line options. If your script is in a file named myscript:

sbatch myscript

In response to this command you’ll see a line with your job ID:

Submitted batch job 123456

You’ll use this job ID (numeric part only) in monitoring your job. You can find it again using the squeue -u <username> command.

When you submit a job, the script is copied by the batch system. Any changes you make subsequently to the script file will not affect the job. Your input files and executables, on the other hand, are not picked up until the job starts running.

Interactive batch

The batch system supports an interactive batch mode. This mode is useful for debugging parallel programs or running a GUI program that’s too large for the login node. The resource limits (memory, CPU) for an interactive batch job are the same as the standard batch limits.

Interactive batch jobs are generally invoked without a script file.

Custom sinteractive command

OSC has developed a script to make starting an interactive session simpler.

The sinteractive command takes simple options and starts an interactive batch session automatically. However, its behavior can be counterintuitive with respect to numbers of tasks and CPUs. In addition, jobs launched with sinteractive can show environmental differences compared to jobs launched via other means. As an alternative, try, e.g.:

salloc -A <proj-code> --time=500

Simple serial

The example below demonstrates using sinteractive to start a serial interactive job:

sinteractive -A <proj-code>

If no resource options are specified, a single-core job is submitted by default.

Simple parallel (single node)

To request a simple parallel job of 4 cores on a single node:

sinteractive -A <proj-code> -c 4

To set up the session for OpenMP executables, then enter this command:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

Parallel (multiple nodes)

To request 2 whole nodes on Pitzer with a total of 96 cores between both nodes:

sinteractive -A <proj-code> -N 2 -n 96

But note that the Slurm variables SLURM_CPUS_PER_TASK, SLURM_NTASKS, and SLURM_TASKS_PER_NODE are all 1 in such a session, so subsequent srun commands that launch parallel executables must explicitly specify the desired numbers of tasks and CPUs. Unless you really need to run in the debug queues, it is generally simpler to start with an appropriate salloc command.

Use sinteractive --help to view all the options available and their default values.

Using salloc and srun

An example of using salloc and srun:

salloc --account=pas1234 --x11 --nodes=2 --ntasks-per-node=28 --time=1:00:00

The salloc command requests the resources for an interactive job. The --x11 flag enables X11 forwarding, which is necessary for running a GUI. You will need to have an X11 server running on your computer to use X11 forwarding; see the getting connected page. The remaining flags in this example are resource requests with the same meaning as the corresponding header lines in a batch file.

After you enter this line, you’ll see something like the following:

salloc: Pending job allocation 123456
salloc: job 123456 queued and waiting for resources

Your job will be queued just like any job. When the job runs, you’ll see the following lines:

salloc: job 123456 has been allocated resources
salloc: Granted job allocation 123456
salloc: Waiting for resource configuration
salloc: Nodes o0001 are ready for job

At this point, you have an interactive login shell on one of the compute nodes, which you can treat like any other login shell.

It is important to remember that OSC systems are optimized for batch processing, not interactive computing. If the system load is high, your job may wait for hours in the queue, making interactive batch impractical. Requesting a walltime limit of one hour or less is recommended because your job can run on nodes reserved for debugging.

Job arrays

If you submit many similar jobs at the same time, you should consider using a job array. With a single sbatch command, you can submit multiple jobs that will use the same script. Each job has a unique identifier, $SLURM_ARRAY_TASK_ID, which can be used to parameterize its behavior.

Individual jobs in a job array are scheduled independently, but some job management tasks can be performed on the entire array.

To submit an array of jobs numbered from 1 to 100, all using the script sim.job:

sbatch --array=1-100 sim.job

The script would use the environment variable $SLURM_ARRAY_TASK_ID, possibly as an input argument to an application or as part of a file name.
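
As an illustration of the file-name approach, a job script could derive per-task input and output names from the array index. This is only a sketch: the input_NNN.dat and result_NNN.out naming scheme is hypothetical, and the fallback index of 1 is there so the lines can be tried outside an array job.

```shell
# Index of this array member; Slurm sets SLURM_ARRAY_TASK_ID inside an array job.
# The fallback of 1 is an assumption for running this outside the batch system.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Hypothetical naming scheme: zero-padded index in the file names.
input_file=$(printf 'input_%03d.dat' "$task_id")
output_file=$(printf 'result_%03d.out' "$task_id")

echo "Task $task_id reads $input_file and writes $output_file"
```

Each member of the array then works on its own files, so the 100 jobs from the sbatch example above do not overwrite each other's results.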

Job dependencies

It is possible to set conditions on when a job can start. The most common of these is a dependency relationship between jobs.

For example, to ensure that the job being submitted (with script sim.job) does not start until after job 123456 has finished:

sbatch --dependency=afterany:123456 sim.job
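
When jobs are chained from a script, the job ID for the dependency can be taken from the line that sbatch prints at submission. The sketch below uses the sample message shown earlier in place of a real submission so that it runs anywhere; first.job is a hypothetical script name.

```shell
# In a real workflow this text would come from: submit_output=$(sbatch first.job)
# Here the sample message from this guide stands in for it.
submit_output="Submitted batch job 123456"

# Keep only the last word of the message, which is the numeric job ID.
job_id=${submit_output##* }

# Build the dependency option for the follow-up job.
dependency_opt="--dependency=afterany:${job_id}"
echo "sbatch $dependency_opt sim.job"
```

The final line only prints the command it would run; remove the echo and the quotes to actually submit the dependent job.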

Job variables

It is possible to provide a list of environment variables that are exported to the job.

For example, to pass the variable and its value to the job with the script sim.job, use the command:

sbatch --export=var=value sim.job
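
Inside the job script, the exported variable is read like any other environment variable. A minimal sketch, assuming the variable is named var as in the command above; the fallback default_value is a hypothetical choice so that the script still runs if nothing was exported:

```shell
# Use the value passed at submission, or a fallback if var is unset or empty.
run_setting=${var:-default_value}
echo "Running with var=$run_setting"
```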

Many other options are available, some quite complicated; for more information, see the sbatch online manual by using the command:

man sbatch

Monitoring and Managing Your Job

Several commands allow you to check job status, monitor execution, collect performance statistics or even delete your job, if necessary.

Status of queued jobs

There are many possible reasons for a long queue wait — read on to learn how to check job status and for more about how job scheduling works.

squeue

Use the squeue command to check the status of your jobs, including whether your job is queued or running and information about requested resources. If the job is running, you can view elapsed time and resources used.

Here are some examples for user usr1234 and job 123456.

By itself, squeue lists all jobs in the system.

To list all the jobs belonging to a particular user:

squeue -u usr1234

To list the status of a particular job:

squeue -j 123456

To get more detail about a particular job:

squeue -j 123456 -l

You may also filter output by the state of a job. To view only running jobs use:

squeue -u usr1234 -t RUNNING

Other states are listed in the JOB STATE CODES section of the squeue man page (man squeue).

Additionally, the codes shown when you use the -l option are explained in the JOB REASON CODES section of man squeue. These codes describe the nodes allocated to running jobs or the reasons a job is pending, which may include:

Reason code "MaxCpuPerAccount": A user or group has reached the limit on the number of cores allowed. The rest of the user or group's jobs will be pending until the number of cores in use decreases.
Reason code "Dependency": Dependencies among jobs or conditions that must be met before a job can run have not yet been satisfied.

You can place a hold on your own job using scontrol hold jobid. If you do not understand the state of your job, contact OSC Help for assistance.

To list blocked jobs:

squeue -u usr1234 -t PENDING

The --start option estimates the start time for a pending job. Unfortunately, these estimates are not at all accurate except for the highest priority job in the queue.

Why isn’t my job running?

There are many reasons that your job may have to wait in the queue longer than you would like, including:

System load is high.
A downtime has been scheduled and jobs that cannot complete by the start of that downtime are not being started. Check the system notices posted on the OSC Events page or the message of the day, displayed when you log in.
You or your group are at the maximum processor count or running job count and your job is being held.
Your job is requesting specialized resources, such as GPU nodes or large memory nodes or certain software licenses, that are in high demand and not available.
Your job is requesting a lot of resources. It takes time for the resources to become available.
Your job is requesting incompatible or nonexistent resources and can never run.
Your job is unnecessarily stuck in batch hold because of system problems (very rare).

Priority, backfill and debug reservations

Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting and more.

During each scheduling iteration, the scheduler will identify the highest priority job that cannot currently be run and find a time in the future to reserve for it. Once that is done, the scheduler will then try to backfill as many lower priority jobs as it can without affecting the highest priority job's start time. This keeps the overall utilization of the system high while still allowing reasonable turnaround time for high priority jobs. Short jobs and jobs requesting few resources are the easiest to backfill.

A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less, primarily for debugging purposes.

Observing a running job

You can monitor a running batch job as easily as you can monitor a program running interactively. Simply view the output file in read only mode to check the current output of the job.
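
For example, the following sketch prints the most recent output of a job. It assumes the default output file name slurm-<jobid>.out in the directory where the job was submitted, and it uses the example job ID 123456.

```shell
job_id=123456                     # example job ID used throughout this guide
log_file="slurm-${job_id}.out"    # default output file name in the submit directory

# Print the last 20 lines if the file exists yet. For a live view you could
# instead run: tail -f "$log_file" (Ctrl-C stops following; the job keeps running).
if [ -f "$log_file" ]; then
    tail -n 20 "$log_file"
else
    echo "No output yet in $log_file"
fi
```

Both tail and less only read the file, so they do not interfere with the running job.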

Node status

You may check the status of a node while the job is running by visiting the OSC grafana page and using the "cluster metrics" report.

Managing your jobs

Deleting a job

Situations may arise that call for deletion of a job from the SLURM queue, such as incorrect resource limits, missing or incorrect input files or commands or a program taking too long to run (infinite loop).

The command to delete a batch job is scancel. It applies to both queued and running jobs.

Example:

scancel 123456

If you cannot delete one of your jobs, it may be because of a hardware problem or system software crash. In this case you should contact OSC Help.

Altering a queued job

You can alter certain attributes of a job in the queue using the scontrol update command. Use this command to make a change without losing your place in the queue. Please note that you cannot make any alterations to the executable portion of the script, nor can you make any changes after the job starts running.

The syntax is:

scontrol update job=<jobid> <args>

The arguments consist of one or more job attributes in name=value form.

For example, to change the walltime limit on job 123456 to 5 hours and have email sent when the job ends (only):

scontrol update job=123456 timeLimit=5:00:00 mailType=End

Placing a hold on a queued job

If you want to prevent a job from running but leave it in the queue, you can place a hold on it using the scontrol hold command. The job will remain pending until you release it with the scontrol release command. A hold can be useful if you need to modify the input file for a job without losing your place in the queue.

Examples:

scontrol hold 123456
scontrol release 123456

Job statistics

Include the following commands in your batch script as appropriate to collect job statistics or performance information.

A simple way to view job information is to use this command at the end of the job:

scontrol show job=$SLURM_JOB_ID

XDMoD tool

You can use the online interactive tool XDMoD to look at usage statistics for jobs. See XDMoD overview for more information.

date

The date command prints the current date and time. It can be informative to include it at the beginning and end of the executable portion of your script as a rough measure of time spent in the job.
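
A minimal sketch of this pattern is shown below. The sleep command is only a placeholder for the real work of the job, and the variable names are illustrative.

```shell
start_seconds=$(date +%s)      # seconds since the epoch at the start
echo "Job started: $(date)"

sleep 1                        # placeholder for the real work of the job

end_seconds=$(date +%s)
echo "Job finished: $(date)"

# Difference between the two timestamps gives the elapsed wall time.
elapsed_seconds=$(( end_seconds - start_seconds ))
echo "Elapsed wall time: ${elapsed_seconds} s"
```

The result is accurate only to the second, which is usually enough for comparing against the walltime requested for the job.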

time

The time utility is used to measure the performance of a single command. It can be used for serial or parallel processes. Add /usr/bin/time to the beginning of a command in the batch script:

/usr/bin/time myprog arg1 arg2

The result is provided in the following format:

user time (CPU time spent running your program)
system time (CPU time spent by your program in system calls)
elapsed time (wallclock)
percent CPU used
memory, pagefault and swap statistics
I/O statistics

These results are appended to the job's error log file. Note: Use the full path “/usr/bin/time” to get all the information shown.

Scheduling Policies and Limits

The batch scheduler is configured with a number of scheduling policies to keep in mind. The policies attempt to balance the competing objectives of reasonable queue wait times and efficient system utilization. The details of these policies differ slightly on each system. Exceptions to the limits can be made under certain circumstances; contact oschelp@osc.edu for details.

Hardware limits

Each system differs in the number of processors (cores) and the amount of memory and disk they have per node. We commonly find jobs waiting in the queue that cannot be run on the system where they were submitted because their resource requests exceed the limits of the available hardware. Jobs never migrate between systems, so please pay attention to these limits.

Notice in particular the large number of standard nodes and the small number of large-memory nodes. Your jobs are likely to wait in the queue much longer for a large-memory node than for a standard node. Users often inadvertently request slightly more memory than is available on a standard node and end up waiting for one of the scarce large-memory nodes, so check your requests carefully.

See cluster computing for details on the number of nodes for each type.

Walltime limits per job

Serial jobs (that is, jobs which request only one node) can run for up to 168 hours, while parallel jobs may run for up to 96 hours.

Users who can demonstrate a need for longer serial job time may request access to the longserial queue, which allows single-node jobs of up to 336 hours. Longserial access is not automatic. Factors that will be considered include how efficiently the jobs use OSC resources and whether they can be broken into smaller tasks that can be run separately.

Limits per user and group

These limits are applied separately on each system.

An individual user can have up to 128 concurrently running jobs and/or up to 2040 processor cores in use on Pitzer. All the users in a particular group/project can among them have up to 192 concurrently running jobs and/or up to 2040 processor cores in use on Pitzer. Jobs submitted in excess of these limits are queued but blocked by the scheduler until other jobs exit and free up resources.

A user may have no more than 1000 jobs submitted to each of the parallel and serial job queues. Jobs submitted in excess of this limit will be rejected.

Priority

The priority of a job is influenced by a large number of factors, including the processor count requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days. However, having the highest priority does not necessarily mean that a job will run immediately, as there must also be enough processors and memory available to run it.

GPU Jobs

All GPU nodes are reserved for jobs that request GPUs. Short non-GPU jobs are allowed to backfill on these nodes to allow for better utilization of cluster resources.

Slurm Directives Summary

Slurm directives may appear as header lines in a batch script or as options on the sbatch command line. They specify the resource requirements of your job and various other attributes. Many of the directives are discussed in more detail elsewhere in this document. The online manual page for sbatch (man sbatch) describes many of them.

Slurm options specified on the command line take precedence over Slurm options in a job script.

Slurm header lines must come before any executable lines in your script. Their syntax is:

#SBATCH [option]

where option can be one of the options in the table below (there are others which can be found in the manual). For example, to request 4 nodes with 40 processors per node:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=40
#SBATCH --constraint=40core

The syntax for including an option on the command line is:

sbatch [option]

For example, the following line submits the script myscript.job but adds the --time directive:

sbatch --time=00:30:00 myscript.job

Description and examples of sbatch options
Option	Description
--time=dd-hh:mm:ss	Requests the amount of time needed for the job. Default is one hour.
--nodes=n	Number of nodes to request. Default is one node.
--ntasks-per-node=m	Number of cores on a single node or number of tasks per requested node. Default is a single core.
--gpus-per-node=g	Number of GPUs per node. Default is none.
--mem=xgb	Specify the (RAM) main memory required per node.
--licenses=pkg@osc:N	Request use of N licenses for the software flag pkg.
--job-name=my_name	Sets the job name, which appears in status listings and is used as the prefix in the job’s output and error log files. The job name must not contain spaces.
--mail-type=BEGIN	Sends mail to the user when the job starts. Other mail-type options include END and FAIL.
--mail-user=<email>	Email address(es) separated by commas to send notifications to based on the mail type.
--x11	Enable x11 forwarding for use of graphical applications.
--account=PEX1234	Use the specified project account for job resource charging.
--cluster=pitzer	Explicitly specify which cluster to submit the job to.
--partition=p	Request a specific partition for the resource allocation instead of letting the batch system assign a default partition.
--gres=pfsdir	Request use of $PFSDIR. See scratch space for details.
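
The dd-hh:mm:ss form accepted by --time can be produced from a number of hours with a small helper. This is only a sketch: hours_to_walltime is a hypothetical shell function, not a Slurm command, and it handles whole hours only.

```shell
# Convert a whole number of hours into the dd-hh:mm:ss form used by --time.
# hours_to_walltime is a hypothetical helper, not part of Slurm.
hours_to_walltime() {
    local total_hours=$1
    local days=$(( total_hours / 24 ))    # whole days
    local hours=$(( total_hours % 24 ))   # remaining hours
    printf '%d-%02d:00:00\n' "$days" "$hours"
}

hours_to_walltime 168   # prints 7-00:00:00
hours_to_walltime 30    # prints 1-06:00:00
```

For instance, a request of 168 hours would be written as --time=7-00:00:00.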

Slurm defaults

It is also possible to create a file that tells Slurm to automatically apply certain directives to jobs.

To start, create the file ~/.slurm/defaults

One option is to have the file automatically use a certain project account for job submissions. Simply add the following line to ~/.slurm/defaults

account=PEX1234

The account can also be separated by cluster.

owens:account=PEX1234
pitzer:account=PEX4321

Defaults can also be restricted so that they apply only to the sbatch command.

sbatch:*:account=PEX1234

Finally, many of the options available for the sbatch command can be set as a default. Here are some examples.

# always request two cores
ntasks-per-node=2
# on pitzer only, request a 2 hour time limit
pitzer:time=2:00:00

The per-cluster defaults apply only if you are logged into that cluster and submit there. Using the --cluster=pitzer option while on Owens will not use the defaults defined for Pitzer.
Using default options may make the sinteractive command unusable, as well as interactive session requests from OnDemand.
Please contact OSC Help if there are questions.

Batch Environment Variable Summary

The batch system provides several environment variables that you may want to use in your job script. This section is a summary of the most useful of these variables. Many of them are discussed in more detail elsewhere in this document. The ones beginning with SLURM_ are described in the online manual page for sbatch (man sbatch).

Environment Variable	Description
`$TMPDIR`	The absolute path and name of the temporary directory created for this job on the local file system of each node
`$PFSDIR`	The absolute path and name of the temporary directory created for this job on the parallel file system
`$SLURM_SUBMIT_DIR`	The absolute path of the directory from which the batch script was started
`$SLURM_GPUS_ON_NODE`	Number of GPUs allocated to the job on each node (works with --exclusive jobs).
`$SLURM_ARRAY_TASK_ID`	Unique identifier assigned to each member of a job array
`$SLURM_JOB_ID`	The job identifier assigned to the job by the batch system
`$SLURM_JOB_NAME`	The job name supplied by the user

The following environment variables are often used in batch scripts but are not directly related to the batch system.

Environment Variable	Description	Comments
`$OMP_NUM_THREADS`	The number of threads to be used in an OpenMP program	See the discussion of OpenMP elsewhere in this document. Set in your script. Not all OpenMP programs use this value.
`$MV2_ENABLE_AFFINITY`	Thread affinity option for MVAPICH2.	Set this variable to 0 in your script if your program uses both MPI and multithreading. Not needed with MPI-1.
`$HOME`	The absolute path of your home directory.	Use this variable to avoid hard-coding your home directory path in your script.

Batch-Related Command Summary

This section summarizes two groups of batch-related commands: commands that are run on the login nodes to manage your jobs and commands that are run only inside a batch script. Only the most common options are described here.

Many of these commands are discussed in more detail elsewhere in this document. All have online manual pages (example: man sbatch ) unless otherwise noted.

In describing the usage of the commands we use square brackets [like this] to indicate optional arguments. The brackets are not part of the command.

Important note: The batch systems on Pitzer and Owens are entirely separate. Be sure to submit your jobs on a login node for the system you want them to run on. All monitoring while the job is queued or running must also be done on that same system. Your job output, of course, will be visible from both systems.

Commands for managing your jobs

These commands are typically run from a login node to manage your batch jobs. The batch systems on Pitzer and Owens are completely separate, so the commands must be run on the system where the job is to be run.

sbatch

The sbatch command is used to submit a job to the batch system.

Usage	Description	Example
`sbatch [ options ] script`	Submit a script for a batch job. The options list is rarely used but can augment or override the directives in the header lines of the script.	`sbatch sim.job`
`sbatch --array=array_request [ options ] script`	Submit an array of jobs	`sbatch --array=1-100 sim.job`
`sinteractive [ options ]`	Submit an interactive batch job	`sinteractive -n 4`

squeue

The squeue command is used to display the status of batch jobs.

Usage	Description	Example
`squeue`	Display all jobs currently in the batch system.	`squeue`
`squeue -j jobid`	Display information about job jobid.	`squeue -j 123456`
`squeue -j jobid -l`	Display long status information about job jobid.	`squeue -j 123456 -l`
`squeue -u username [-l]`	Display information about all the jobs belonging to user username.	`squeue -u usr1234`

scancel

The scancel command may be used to delete a queued or running job.

Usage	Description	Example
`scancel jobid`	Delete job `jobid`.	`scancel 123456`
`scancel jobid`	Delete all jobs in job array `jobid`.	`scancel 123456`
`scancel jobid_jobnumber`	Delete `jobnumber` within job array `jobid`.	`scancel 123456_14`

slurm output file

A running job writes its stdout and stderr to an output file, which you can view to check the job's output. By default it is located in the directory where the job was submitted and is named slurm-<jobid>.out

The output file can also be renamed and saved in any valid dir using the option --output=<filename pattern>

Environment variables are not currently expanded in the #SBATCH lines of a job script, so an output path that uses one can only be specified on the sbatch command line at job submission. For example:
sbatch --output=$HOME/test_slurm.out <job-script> works
#SBATCH --output=$HOME/test_slurm.out does NOT work in a job script
See slurm migration issues for details.

Do not delete or modify the output file while your job is running. This could have adverse effects on your running job.

scontrol

The scontrol command may be used to modify the attributes of a queued (not running) job. Not all attributes can be altered.

Usage	Description	Example
`scontrol update jobid=<jobid> [ option ]`	Alter one or more attributes of a queued job. The options you can modify are a subset of the directives that can be used when submitting a job.	`scontrol update jobid=123456 timeLimit=5:00:00`

This command can also be used inside a job like so:
scontrol show job=$SLURM_JOB_ID

scontrol hold/release

The scontrol hold command allows you to place a hold on a queued job. The job will be prevented from running until you release the hold with the scontrol release command.

Usage	Description	Example
`scontrol hold jobid`	Place a user hold on job `jobid`	`scontrol hold 123456`
`scontrol release jobid`	Release a user hold previously placed on job `jobid`	`scontrol release 123456`

scontrol show

The scontrol show command can be used to provide details about a job that is running.

scontrol show job=$SLURM_JOB_ID

Usage	Description	Example
`scontrol show job=<jobid>`	Check the details of a running job.	`scontrol show job=123456`

estimating start time

The squeue command can try to estimate when a queued job will start running. It is extremely unreliable, often making large errors in either direction.

Usage	Description	Example
`squeue -j jobid --Format=username,jobid,account,startTime`	Display estimate of start time.	`squeue -j 123456 --Format=username,jobid,account,startTime`

Commands used only inside a batch job

These commands can only be used inside a batch job.

srun

Generally used to start MPI processes within a job. It accepts most of the options that are also available to the sbatch command.

Usage	Example
srun <prog>	srun --ntasks-per-node=4 a.out

sbcast/sgather

Tools for copying files to/from all nodes allocated in a job.

Usage
sbcast <src_file> <nodelocaldir>/<dest_file>
sgather <src_file> <shareddir>/<dest_file>
sgather -r <src_dir> <shareddir>/<dest_dir>

Note: sbcast does not have a recursive cast option, meaning you can't use sbcast -r to scatter multiple files in a directory. Instead, you may use a loop command similar to this:

cd <directory that has the files>

for FILE in * 
do
    sbcast -p $FILE $TMPDIR/some_directory/$FILE
done

mpiexec

Use the mpiexec command to run a parallel program or to run multiple processes simultaneously within a job. It is a replacement program for the script mpirun, which is part of the mpich package.
The OSC version of mpiexec is customized to work with our batch environment. There are other mpiexec programs in existence, but it is imperative that you use the one provided with our system.

Usage	Description	Example
`mpiexec progname [ args ]`	Run the executable program `progname` in parallel, with as many processes as there are processors (cores) assigned to the job (nodes*ppn).	`mpiexec myprog` `mpiexec yourprog abc.dat 123`
`mpiexec -ppn 1 progname [ args ]`	Run only one process per node.	`mpiexec -ppn 1 myprog`
`mpiexec -ppn num progname [ args ]`	Run the specified number of processes on each node.	`mpiexec -ppn 3 myprog`
`mpiexec -tv [ options ] progname [ args ]`	Run the program with the TotalView parallel debugger.	`mpiexec -tv myprog`
`mpiexec -n num progname [ args ]` `mpiexec -np num progname [ args ]`	Run only the specified number of processes. ( `-n` and `-np` are equivalent.) Does not spread processes out evenly across nodes.	`mpiexec -n 3 myprog`

The options above apply to the MVAPICH2 and IntelMPI installations at OSC. See the OpenMPI software page for mpiexec usage with OpenMPI.

pbsdcp

The pbsdcp command is a distributed copy command for the Slurm environment. It copies files to or from each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as $TMPDIR.

Options are -r for recursive and -p to preserve modification times and modes.

Usage	Description	Example
`pbsdcp [-s] [ options ] srcfiles target`	“Scatter”. Copy one or more files from shared storage to the target directory on each node (local storage). The `-s` flag is optional.	`pbsdcp -s infile1 infile2 $TMPDIR` `pbsdcp model.* $TMPDIR`
`pbsdcp -g [ options ] srcfiles target`	“Gather”. Copy the source files from each node to the shared target directory. Wildcards must be enclosed in quotes.	`pbsdcp -g '$TMPDIR/outfile*' $SLURM_SUBMIT_DIR`

Note: In gather mode, if files on different nodes have the same name, they will overwrite each other. In the -g example above, the file names may have the form outfile001, outfile002, etc., with each node producing a different set of files.
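
One way to keep gathered file names distinct is to include the node's host name in each name. The sketch below shows the idea only; the outfile.<hostname> scheme is an assumption, not something pbsdcp does for you.

```shell
# Build an output file name that includes this node's host name, so files
# gathered from different nodes do not overwrite each other.
# "outfile" is the example prefix used above; the suffix scheme is hypothetical.
node_name=$(uname -n)
out_file="outfile.${node_name}"
echo "This node would write to $out_file"
```

A program running on each node would write to $TMPDIR/$out_file, and a quoted wildcard such as '$TMPDIR/outfile*' in the gather step would then match one uniquely named file per node.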

License software flag usage information

We have licensed applications such as ansys, abaqus, and Schrodinger. These applications have a license server with a limited number of licenses, and you need to check out licenses each time you use the software. One problem is that the job scheduler, Slurm, doesn't communicate with the license server. As a result, a job can be launched even if there are not enough licenses available, and it then fails due to insufficient licenses.

To prevent this from happening, you need to add the software flag to your job script. The software flag registers your license requests with the Slurm license pool so that Slurm can avoid launching jobs when not enough licenses are available.

Additionally, we sometimes restrict the number of licenses per group for a specific software package so that multiple groups can use the software.

The syntax for software flags is

#SBATCH -L {software flag}@osc:N

where N is the number of licenses requested. If you need more than one software flag, you can use

#SBATCH -L {software flag1}@osc:N,{software flag2}@osc:M

For example, if you need 1 ansys and 10 ansyspar license features, then you can use

#SBATCH -L ansys@osc:1,ansyspar@osc:10

For interactive jobs, you can use, for example,

sinteractive -A {project account} -L ansys@osc:1
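
When several software flags are needed, the comma-separated value can be assembled mechanically. The following is a minimal sketch, assuming the `{software flag}@osc:N` syntax described above; `license_spec` is a hypothetical helper name, not an OSC tool.

```shell
#!/bin/bash
# Hypothetical helper: join "flag count" pairs into the value passed to -L.
# Each pair becomes <flag>@osc:<count>; pairs are separated by commas.
license_spec() {
  local spec=""
  while [ "$#" -ge 2 ]; do
    spec="${spec:+$spec,}$1@osc:$2"
    shift 2
  done
  printf '%s\n' "$spec"
}

# Example: 1 ansys and 10 ansyspar license features.
spec="$(license_spec ansys 1 ansyspar 10)"
echo "#SBATCH -L $spec"
```

The printed line matches the directive form shown in the example above and can be pasted into a job script.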

When you use the OnDemand VDI, Desktop, or Schrodinger apps, you can put software flags in the "Licenses" field. For OnDemand Abaqus/CAE, COMSOL Multiphysics, and Stata, the software flags are added automatically. For OnDemand Ansys Workbench, please check "Reserve ANSYS Parallel Licenses" if you need "ansyspar" license features.

We have the full list of software associated with software flags in the table below. For more information, please click the link on the software name.

Software	Software flag	Note
abaqus	abaqus(350), abaquscae(10)
ansys	ansys(50), ansyspar(900)
comsol	comsolscript(3)
schrodinger	epik(10), glide(20)[16], ligprep(10), macromodel(10), qikprep(10)
starccm	starccm(80), starccmpar(4,000)
stata	stata(5)
usearch	usearch(1)
ls-dyna, mpp-dyna	lsdyna(1,000)

*The number within the parentheses refers to the total number of licenses for each software flag

*The number within the brackets refers to the number of licenses per group for each software flag

It is critical that you follow these instructions, because incomplete license requests can affect other users' jobs as well. We actively monitor software flag usage, and we will reach out to you if your jobs do not follow these instructions. Failing to make corrections may result in temporary removal from the license server. We have a Grafana dashboard showing license and software flag usage. Software flag requests are shown as "SLURM", and actual license usage as "License Server".

License usage checking tool

If you want to check your license usage, you can use ~support/bin/myLicenseCheck.

  usage: ~support/bin/myLicenseCheck [-h,--help] SOFTWARE

    -h, --help      print help messages
    SOFTWARE        supported software: ansys, abaqus, comsol, schrodinger, and starccm.

This tool tells you how many licenses you are actually using from the license server and how many licenses you have requested from Slurm. However, it does not report usage per job. If you want to check a specific job, please make sure that it is your only running job while you use the tool.

For assistance

Contact OSC Help for assistance if there are any questions.

Messages from sbatch

sbatch messages

shell warning

Submitting a job without specifying the proper shell will return a warning like below:

sbatch: WARNING: Job script lacks first line beginning with #! shell. Injecting '#!/bin/bash' as first line of job script.

Errors

If an error is encountered, the job is rejected.

Not specifying a project account

It is required to specify an account for a job to run. Please use the --account=<project-code> option to do this.

sbatch: error: ERROR: Job invalid: Must specify account for job
sbatch: error: Job submit/allocate failed: Unspecified error

Incorrect resource configuration

If one makes a request for a node that doesn't exist, the job is rejected.

salloc: error: Job submit/allocate failed: Requested node configuration is not available

An example is requesting a regular compute node while also requesting more memory than a compute node has.

Specifying the wrong account

If a user tries to set the --account option with a project that they are not on, then the job is rejected.

sbatch: error: Job submit/allocate failed: Invalid account or account/partition combination specified

Using a restricted project in a slurm job

If a user submits a job and uses a project that is restricted, the following message will be shown and the job will not be submitted:

sbatch: error: AssocGrpSubmitJobsLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Leading whitespace in job name

Leading whitespace is not supported in SLURM job names. Your job will be rejected with an error message if you submit a job with leading whitespace in the job name:

sbatch: error: Invalid directive found in batch script: name

You can fix this by removing leading whitespace in the job name.

Script is empty or only contains whitespace

An empty file cannot be submitted; this includes files that contain only whitespace.

sbatch: error: Batch script is empty!

sbatch: error: Batch script contains only whitespace!

Troubleshooting Batch Problems

License problems

If you get a license error when you try to run a third-party software application, it means either the licenses are all in use or you’re not on the access list for the license. Very rarely there could be a problem with the license server. You should read the software page for the application you’re trying to use and make sure you’ve complied with all the procedures and are correctly requesting the license. Contact OSC Help with any questions.

My job is running slower than it should

Here are a few of the reasons your job may be running slowly:

Your job has exceeded available physical memory and is swapping to disk. This is always a bad thing in an HPC environment as it can slow down your job dramatically. Either cut down on memory usage, request more memory, or spread a parallel job out over more nodes.
Your job isn’t using all the nodes and/or cores you intended it to use. This is usually a problem with your batch script.
Your job is spawning more threads than the number of cores you requested. Context switching involves enough overhead to slow your job.
You are doing too much I/O to the network file servers (home and project directories), or you are doing an excessive number of small I/O operations to the parallel file server. An I/O-bound program will suffer severe slowdowns with improperly configured I/O.
You didn’t optimize your program sufficiently.
You got unlucky and are being hurt by someone else’s misbehaving job. As much as we try to isolate jobs from each other, sometimes a job can cause system-level problems. If you have run your job before and know that it usually runs faster, OSC staff can check for problems.
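
For the thread-count item in the list above, one common approach for OpenMP programs is to tie the number of threads to the cores requested for each task. This is a minimal sketch, assuming the job requested cores with the sbatch option `--cpus-per-task` (Slurm only sets `SLURM_CPUS_PER_TASK` in that case) and that the program honors the standard `OMP_NUM_THREADS` variable; the fallback of 1 is an assumption chosen for safety.

```shell
#!/bin/bash
# SLURM_CPUS_PER_TASK is only set when the job requested --cpus-per-task;
# fall back to a single thread otherwise.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "running with $OMP_NUM_THREADS OpenMP thread(s) per task"
```

Programs that use a different threading library usually have their own equivalent setting, so check the software's documentation.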

Someone deleted my job!

If your job is misbehaving, it may be necessary for OSC staff to delete it. Common problems are using up all the virtual memory on a node or performing excessive I/O to a network file server. If this happens you will be contacted by OSC Help with an explanation of the problem and suggestions for fixing it. We appreciate your cooperation in this situation because, much as we try to prevent it, one user’s jobs can interfere with the operation of the system.

Occasionally a problem not caused by your job will cause an unrecoverable situation and your job will have to be deleted. You will be contacted if this happens.

Why can’t I delete my job?

If you can’t delete your job, it usually means a node your job was running on has crashed and the job is no longer running. OSC staff will delete the job.

My job is stuck.

There are multiple reasons that your job may appear to be stuck. If a node that your job is running on crashes, your job may remain in the running job queue long after it should have finished. In this case you will be contacted by OSC and will probably have to resubmit your job.

If you conclude that your job is stuck based on what you see in the slurm output file, it’s possible that the problem is an illusion. This comment applies primarily to code you develop yourself. If you print progress information, for example, “Input complete” and “Setup complete”, the output may be buffered for efficiency, meaning it’s not written to disk immediately, so it won’t show up. To have it written immediately, you’ll have to flush the buffer; most programming languages provide a way to do this.
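
If changing the program is not practical, buffering can sometimes be adjusted from the job script instead. The sketch below assumes GNU coreutils `stdbuf` is installed and that the program uses the C library's default stdio buffering; programs that manage their own buffers are not affected. `run_line_buffered` is a hypothetical helper name.

```shell
#!/bin/bash
# Run a command with line-buffered stdout/stderr when GNU stdbuf is available;
# otherwise run the command unchanged.
run_line_buffered() {
  if command -v stdbuf >/dev/null 2>&1; then
    stdbuf -oL -eL "$@"
  else
    "$@"
  fi
}

# Stand-in for a real program (for example ./my_program, a hypothetical name).
run_line_buffered echo "Input complete"
```

Line buffering adds some I/O overhead, so it is best reserved for progress messages rather than bulk output.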

My job crashed. Can I recover my data?

If your job failed due to a hardware failure or system problem, it may be possible to recover your data from $TMPDIR. If the failure was due to hitting the walltime limit, the data in $TMPDIR would have been deleted immediately. Contact OSC Help for more information.

The trap command can be used in your script to save your data in case your job terminates abnormally.
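
A minimal sketch of this idea follows. The directory layout and the names `WORK_DIR`, `SAVE_DIR`, and `save_results` are hypothetical; the fallbacks only exist so the fragment can be tried outside a job. Whether and when the batch shell receives a particular signal depends on how the job is signaled, which the "Signal handling in slurm" section of this guide discusses.

```shell
#!/bin/bash
# Hypothetical layout: the job works in WORK_DIR on node-local storage and
# copies its files to SAVE_DIR on shared storage if it is interrupted.
WORK_DIR="${TMPDIR:-/tmp}/work.${SLURM_JOB_ID:-$$}"
SAVE_DIR="${SLURM_SUBMIT_DIR:-$PWD}/saved.${SLURM_JOB_ID:-$$}"
mkdir -p "$WORK_DIR"

save_results() {
  mkdir -p "$SAVE_DIR"
  cp -r "$WORK_DIR"/. "$SAVE_DIR"/
}

# On SIGTERM, save the working files and stop with a non-zero status.
trap 'save_results; exit 1' TERM

# ... the real work would go here, writing into "$WORK_DIR" ...
echo "partial result" > "$WORK_DIR/result.txt"
```

Keeping the handler short matters, because the time available between the signal and the end of the job is limited.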

Contacting OSC Help

If you are having a problem with the batch system on any of OSC's machines, you should send email to oschelp@osc.edu. Including the following information will assist HPC Client Services staff in diagnosing your problem quickly:

Name
OSC User ID (username)
Name of the system you are using
Job ID
Job script
Job output and/or error messages (preferably in context)

Or use the support request page.

batch email notifications

Occasionally, jobs that experience problems may generate emails from staff or automated systems at the center with some information about the nature of the problem. This page provides additional information about the various emails sent, and steps that can be taken to address the problem.

batch emails

All emails from OSC about jobs will come from slurm@osc.edu, oschelp@osc.edu, or another email address with the domain @osc.edu.

regular job emails

These emails can be turned on/off using the appropriate Slurm directives. Other email addresses can also be specified. See the mail options section of the job scripts page.

Email type	Description
job began/end	Job began or ended. These are normal emails.
job aborted	Job has ended in an abnormal state.

other emails

There is no option to turn these emails off, as they require us to contact the user that submitted the job. We can work with you if they will be expected. Please contact OSC Help in this case.

Email type	Description
Deleted by administrator	OSC staff may delete running jobs if: The job is using so much memory that it threatens to crash the node it is running on. The job is using more resources than it requested and is interfering with other jobs running on the same node. The job is causing excessive load on some part of the system, typically a network file server. The job is still running at the start of a scheduled downtime. OSC staff may delete queued jobs if: The job requests non-existent resources. The job was intended for one system but was submitted on another. The job can never run because it requests combinations of resources that are disallowed by policy. The user’s credentials are blocked on the system the job was submitted on.
Emails exceed expected volume	Job emails may be delayed if too many are queued to be sent to a single email address. This is to prevent OSC from being blacklisted by the email server.
failure due to hardware/software problem	The node(s) or software that a job was using had a critical issue and the job failed.
overuse of physical memory (RAM)	The node that was in use crashed because it ran out of memory. See the out-of-memory (OOM) or excessive memory usage page for more information.
Job requeued	A job may be requeued explicitly by a system administrator or after a node failure.
GPFS unmount	An issue with GPFS may have affected the job. This includes directories located in: `/fs/ess`
Filling up /tmp	Job failed after exhausting the space in a node's local /tmp directory. Please either request an entire node or use scratch.

For assistance

Contact OSC Help for assistance if there are any questions.

Slurm Migration

Overview

Slurm, which stands for Simple Linux Utility for Resource Management, is a widely used open-source HPC resource management and scheduling system that originated at Lawrence Livermore National Laboratory.

OSC has decided to implement Slurm for job scheduling and resource management over the course of 2020, replacing the Torque resource manager and Moab scheduling system that it currently uses.

Phases of Slurm Migration

It is expected that on Jan 1, 2021, both Pitzer and Owens clusters will be using Slurm. OSC will be switching to Slurm on Pitzer with the deployment of the new Pitzer hardware in September 2020. Owens migration to Slurm will occur later this fall.

PBS Compatibility Layer

During the Slurm migration, OSC enables the PBS compatibility layer provided by Slurm in order to make the transition as smooth as possible. Therefore, PBS batch scripts that used to work in the previous Torque/Moab environment mostly still work in Slurm. However, we encourage you to start converting your PBS batch scripts to Slurm scripts because

The PBS compatibility layer usually handles basic cases, and may not be able to handle some complicated cases
Slurm has many features that are not available in Moab/Torque, and the layer will not provide access to those features
OSC may turn off the PBS compatibility layer in the future

Please check the following pages on how to submit a Slurm job:

How to prepare Slurm job scripts
How to submit, monitor and manage jobs
Step-by-step instructions on how to submit jobs
Slurm migration issues
Slides for Sept 23, 2020 Workshop

How to Prepare Slurm Job Scripts

Known Issue

Combining the --ntasks and --ntasks-per-node options in a job script can cause unexpected resource allocation and task placement due to a bug in Slurm 23. OSC users are strongly encouraged to review their job scripts for jobs that request both --ntasks and --ntasks-per-node. Jobs should request either --ntasks or --ntasks-per-node, not both.
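
Reviewing many scripts by hand is tedious, so a quick text search can help find candidates. The sketch below is a heuristic, not an OSC tool: `uses_both_task_options` is a hypothetical helper that only inspects `#SBATCH` lines, and it does not see options passed on the sbatch command line.

```shell
#!/bin/bash
# Return success (0) if a job script has #SBATCH directives requesting both
# --ntasks (or its short form -n) and --ntasks-per-node.
uses_both_task_options() {
  local script="$1"
  grep -Eq '^#SBATCH.*[[:space:]](--ntasks[=[:space:]]|-n[[:space:]]*[0-9])' "$script" &&
    grep -Eq '^#SBATCH.*[[:space:]]--ntasks-per-node' "$script"
}

# Example: report every *.sh file in the current directory that uses both.
for f in ./*.sh; do
  [ -f "$f" ] || continue
  if uses_both_task_options "$f"; then
    echo "review: $f"
  fi
done
```

Any script it reports should be edited so that it requests only one of the two options.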

As the first step, you can submit your PBS batch script as you did before to see whether it works or not. If it does not work, you can either follow this page for step-by-step instructions, or read the tables below to convert your PBS script to Slurm script by yourself. Once the job script is prepared, you can refer to this page to submit and manage your jobs.

Job Submission Options

Use	Torque/Moab	Slurm Equivalent
Script directive	`#PBS`	`#SBATCH`
Job name	`-N <name>`	`--job-name=<name>`
Project account	`-A <account>`	`--account=<account>`
Queue or partition	`-q queuename`	`--partition=queuename`
Wall time limit	`-l walltime=hh:mm:ss`	`--time=hh:mm:ss`
Node count	`-l nodes=N`	`--nodes=N`
Process count per node	`-l ppn=M`	`--ntasks-per-node=M`
Memory limit	`-l mem=Xgb`	`--mem=Xgb` (it is MB by default)
Request GPUs	`-l nodes=N:ppn=M:gpus=G`	`--nodes=N --ntasks-per-node=M --gpus-per-node=G`
Request GPUs in default mode	`-l nodes=N:ppn=M:gpus=G:default`	`--nodes=N --ntasks-per-node=M --gpus-per-node=G --gpu_cmode=shared`
Require pfsdir	`-l nodes=N:ppn=M:pfsdir`	`--nodes=N --ntasks-per-node=M --gres=pfsdir`
Require 'vis'	`-l nodes=N:ppn=M:gpus=G:vis`	`--nodes=N --ntasks-per-node=M --gpus-per-node=G --gres=vis`
Require special property	`-l nodes=N:ppn=M:property`	`--nodes=N --ntasks-per-node=M --constraint=property`
Job array	`-t <array indexes>`	`--array=<indexes>`
Standard output file	`-o <file path>`	`--output=<file path>/<file name> (path must exist, and you must specify the name of the file)`
Standard error file	`-e <file path>`	`--error=<file path>/<file name> (path must exist, and you must specify the name of the file)`
Job dependency	`-W depend=after:jobID[:jobID...]` `-W depend=afterok:jobID[:jobID...]` `-W depend=afternotok:jobID[:jobID...]` `-W depend=afterany:jobID[:jobID...]`	`--dependency=after:jobID[:jobID...]` `--dependency=afterok:jobID[:jobID...]` `--dependency=afternotok:jobID[:jobID...]` `--dependency=afterany:jobID[:jobID...]`
Request event notification	`-m <events>`	`--mail-type=<events>` `Note: multiple mail-type requests may be specified in a comma-separated list:` `--mail-type=BEGIN,END,NONE,FAIL`
Email address	`-M <email address>`	`--mail-user=<email address>`
Software flag	`-l software=pkg1+1%pkg2+4`	`--licenses=pkg1@osc:1,pkg2@osc:4`
Require reservation	`-l advres=rsvid`	`--reservation=rsvid`

Job Environment Variables

Info	Torque/Moab Environment Variable	Slurm Equivalent
Job ID	`$PBS_JOBID`	`$SLURM_JOB_ID`
Job name	`$PBS_JOBNAME`	`$SLURM_JOB_NAME`
Queue name	`$PBS_QUEUE`	`$SLURM_JOB_PARTITION`
Submit directory	`$PBS_O_WORKDIR`	`$SLURM_SUBMIT_DIR`
Node file	`cat $PBS_NODEFILE`	`srun hostname | sort -n`
Number of processes	`$PBS_NP`	`$SLURM_NTASKS`
Number of nodes allocated	`$PBS_NUM_NODES`	`$SLURM_JOB_NUM_NODES`
Number of processes per node	`$PBS_NUM_PPN`	`$SLURM_TASKS_PER_NODE`
Walltime	`$PBS_WALLTIME`	`$SLURM_TIME_LIMIT`
Job array ID	`$PBS_ARRAYID`	`$SLURM_ARRAY_JOB_ID`
Job array index	`$PBS_ARRAY_INDEX`	`$SLURM_ARRAY_TASK_ID`

Environment Variables Specific to OSC

Environment variable	Description
`$TMPDIR`	Path to a node-specific temporary directory (/tmp) for a given job
`$PFSDIR`	Path to the scratch storage; only present if --gres request includes pfsdir.
`$SLURM_GPUS_ON_NODE`	Number of GPUs allocated to the job on each node (works with --exclusive jobs)
`$SLURM_JOB_GRES`	The job's GRES request
`$SLURM_JOB_CONSTRAINT`	The job's constraint request
`$SLURM_TIME_LIMIT`	Job walltime in seconds
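
Because `$SLURM_TIME_LIMIT` is given in seconds, a script that wants to log the limit in the hh:mm:ss form used by --time has to convert it. A minimal sketch, where `seconds_to_hms` is a hypothetical helper and the 3600-second fallback is only there so the fragment can be tried outside a job:

```shell
#!/bin/bash
# Convert a number of seconds to hh:mm:ss.
seconds_to_hms() {
  local s="$1"
  printf '%02d:%02d:%02d\n' "$((s / 3600))" "$(((s % 3600) / 60))" "$((s % 60))"
}

# Fall back to 3600 seconds when not running inside a job.
seconds_to_hms "${SLURM_TIME_LIMIT:-3600}"
```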

Commands in a Batch Job

Use	Torque/Moab Command	Slurm Equivalent
Launch a parallel program inside a job	`mpiexec <args>`	`srun <args>`
Scatter a file to node-local file systems	`pbsdcp <file> <nodelocaldir>`	`sbcast <src_file> <nodelocaldir>/<dest_file>`
Gather node-local files to a shared file system	`pbsdcp -g <file> <shareddir>`	`sgather <src_file> <shareddir>/<dest_file>` `sgather -r <src_dir> <shareddir>/<dest_dir>`

* Note: sbcast does not have a recursive cast option, meaning you can't use sbcast -r to scatter multiple files in a directory. Instead, you may use a loop command similar to this:

cd <the directory that has the files>

for FILE in *
do
    sbcast -p "$FILE" "$TMPDIR/some_directory/$FILE"
done

How to Submit, Monitor and Manage Jobs

Submit Jobs

Use	Torque/Moab Command	Slurm Equivalent
Submit batch job	`qsub <jobscript>`	`sbatch <jobscript>`
Submit interactive job	`qsub -I [options]`	`sinteractive [options]` `salloc [options]`

Notice: If a node fails, then the running job will be automatically resubmitted to the queue and will only be charged for the resubmission time and not the failed time.
One can use the --mail-type=ALL option in their script to receive notifications about their jobs. Please see the slurm sbatch man page for more information.
Another option is to disable the resubmission using --no-requeue so that the job does not get resubmitted on node failure.
A final note is that if the job does not get requeued after a failure, then there will be a charge incurred for the time that the job ran before it failed.
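
A job script can also check for itself whether it is running after a requeue. The sketch below assumes the `SLURM_RESTART_COUNT` variable described in the sbatch man page, which Slurm sets when a job has been restarted or requeued; what a job does on restart (here just a message) is left as a placeholder.

```shell
#!/bin/bash
#SBATCH --mail-type=ALL
# To disable automatic resubmission instead, the --no-requeue option
# mentioned in the notice can be added as another directive.

# Treat the restart count as 0 when the variable is unset.
restarts="${SLURM_RESTART_COUNT:-0}"
if [ "$restarts" -gt 0 ]; then
  echo "job was requeued $restarts time(s)"
else
  echo "first run of this job"
fi
```

A job that writes checkpoints could use this check to decide whether to resume from the last checkpoint rather than start over.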

Interactive jobs

Submitting interactive jobs is a bit different in Slurm. When the job is ready, one is logged into the login node they submitted the job from. From there, one can then login to one of the reserved nodes.

You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~] $
# can now start executing commands interactively

Or, you can use salloc as:

[user@pitzer-login04 ~] $ salloc -t 00:05:00 --ntasks-per-node=3
salloc: Pending job allocation 14209
salloc: job 14209 queued and waiting for resources
salloc: job 14209 has been allocated resources
salloc: Granted job allocation 14209
salloc: Waiting for resource configuration
salloc: Nodes p0593 are ready for job

# normal login display
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
14210 serial-48     bash     usee  R       0:06      1 p0593
[user@pitzer-login04 ~]$ srun --jobid=14210 --pty /bin/bash
# normal login display
[user@p0593 ~] $
# can now start executing commands interactively

Manage Jobs

Use	Torque/Moab Command	Slurm Equivalent
Delete a job*	`qdel <jobid>`	`scancel <jobid>`
Hold a job	`qhold <jobid>`	`scontrol hold <jobid>`
Release a job	`qrls <jobid>`	`scontrol release <jobid>`

* Users may delete their own jobs. A PI or project admin may delete jobs submitted to a project they administer.

Monitor Jobs

Use	Torque/Moab Command	Slurm Equivalent
Job list summary	`qstat` or `showq`	`squeue`
Detailed job information	`qstat -f <jobid>` or `checkjob <jobid>`	`sstat -a <jobid>` or `scontrol show job <jobid>`
Job information by a user	`qstat -u <user>`	`squeue -u <user>`
View job script (system admin only)	`js <jobid>`	`jobscript <jobid>`
Show expected start time	`showstart <job ID>`	`squeue --start --jobs=<jobid>`

Steps on How to Submit Jobs

How to Submit Interactive jobs

There are different ways to submit interactive jobs.

Using `qsub`

The qsub command is patched locally to handle interactive jobs, so in most cases you can use the qsub command as before:

[xwang@pitzer-login04 ~]$ qsub -I -l nodes=1 -A PZS0712
salloc: Pending job allocation 15387
salloc: job 15387 queued and waiting for resources
salloc: job 15387 has been allocated resources
salloc: Granted job allocation 15387
salloc: Waiting for resource configuration
salloc: Nodes p0601 are ready for job
...
[xwang@p0601 ~]$ 
# can now start executing commands interactively

Using `sinteractive`

You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~] $
# can now start executing commands interactively

Using `salloc`

Using salloc is a little more complicated. Below is a simple example:

[user@pitzer-login04 ~] $ salloc -t 00:30:00 --ntasks-per-node=3 srun --pty /bin/bash
salloc: Pending job allocation 2337639
salloc: job 2337639 queued and waiting for resources
salloc: job 2337639 has been allocated resources
salloc: Granted job allocation 2337639
salloc: Waiting for resource configuration
salloc: Nodes p0002 are ready for job

# normal login display
[user@p0002 ~]$
# can now start executing commands interactively

How to Submit Non-interactive jobs

Submit PBS job Script

Since we have the compatibility layer installed, your current PBS scripts may still work as they are, so you should test them by submitting them as you did before to see whether they submit and run successfully. Below is a simple PBS job script pbs_job.txt that calls for a parallel run:

#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712

cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results

Submit this script on Pitzer using the command qsub pbs_job.txt, and the job is scheduled successfully as shown below:

[xwang@pitzer-login04 slurm]$ qsub pbs_job.txt 
14177

Check the Job

You can use the jobscript command to check the job information:

[xwang@pitzer-login04 slurm]$ jobscript 14177
-------------------- BEGIN jobid=14177 --------------------
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712

cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results

-------------------- END jobid=14177 --------------------

Please note that there is an extra line #!/bin/bash added at the beginning of the job script in the output. This line is added by Slurm's qsub compatibility script because Slurm job scripts must have #!<SHELL> as their first line.

You will get this message explicitly if you submit the script using the command sbatch pbs_job.txt

[xwang@pitzer-login04 slurm]$ sbatch pbs_job.txt 
sbatch: WARNING: Job script lacks first line beginning with #! shell. Injecting '#!/bin/bash' as first line of job script.
Submitted batch job 14180

Alternative Way: Convert PBS Script to Slurm Script

An alternative is to convert the PBS job script (pbs_job.txt) to a Slurm script (slurm_job.txt) before submitting the job. The table below compares the two scripts (see this page for more information on the job submission options):

Explanations	Torque	Slurm
Line that specifies the shell	No need	#!/bin/bash
Resource specification	#PBS -l walltime=1:00:00 #PBS -l nodes=2:ppn=40 #PBS -N hello #PBS -A PZS0712	#SBATCH --time=1:00:00 #SBATCH --nodes=2 --ntasks-per-node=40 #SBATCH --job-name=hello #SBATCH --account=PZS0712
Variables, paths, and modules	cd $PBS_O_WORKDIR module load intel	cd $SLURM_SUBMIT_DIR module load intel
Launch and run application	mpicc -O2 hello.c -o hello mpiexec ./hello > hello_results	mpicc -O2 hello.c -o hello srun ./hello > hello_results

In this example, the line cd $SLURM_SUBMIT_DIR can be omitted in the Slurm script because your Slurm job always starts in your submission directory, which is different from Torque/Moab environment where your job always starts in your home directory.
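
For reference, the Slurm column of the table can be assembled into a complete file. The fragment below simply writes those lines to slurm_job.txt using a here-document; nothing in it goes beyond the table, and the `cd $SLURM_SUBMIT_DIR` line is kept even though it is optional.

```shell
#!/bin/bash
# Write the converted job script to slurm_job.txt. The quoted 'EOF' keeps
# variables such as $SLURM_SUBMIT_DIR unexpanded in the file.
cat > slurm_job.txt <<'EOF'
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH --nodes=2 --ntasks-per-node=40
#SBATCH --job-name=hello
#SBATCH --account=PZS0712

cd $SLURM_SUBMIT_DIR
module load intel
mpicc -O2 hello.c -o hello
srun ./hello > hello_results
EOF

echo "wrote slurm_job.txt"
```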

Once the script is ready, you submit the script using the command sbatch slurm_job.txt

[xwang@pitzer-login04 slurm]$ sbatch slurm_job.txt 
Submitted batch job 14215

Slurm Migration Issues

This page documents the known issues for migrating jobs from Torque to Slurm.

$PBS_NODEFILE and $SLURM_JOB_NODELIST

Please be aware that $PBS_NODEFILE is a file while $SLURM_JOB_NODELIST is a string variable.

The analog on Slurm to cat $PBS_NODEFILE is srun hostname | sort -n
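
If a program expects a file of hostnames, one can be generated from the string variable. This is a minimal sketch using `scontrol show hostnames`, a standard Slurm command that expands a node list into one name per line; the `nodefile` path is a hypothetical choice, and the fallback branch exists only so the fragment runs outside a job.

```shell
#!/bin/bash
# Write one hostname per line to a file. Note that this lists each node once,
# whereas $PBS_NODEFILE repeated a node once per requested process slot.
nodefile="${TMPDIR:-/tmp}/nodes.${SLURM_JOB_ID:-$$}"
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
  scontrol show hostnames "$SLURM_JOB_NODELIST" > "$nodefile"
else
  uname -n > "$nodefile"   # fallback when run outside a Slurm job
fi
echo "node list written to $nodefile"
```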

Environment variables are not evaluated in job script directives

Environment variables do not work in a slurm directive inside a job script.

The job script job.txt including #SBATCH --output=$HOME/jobtest.out won't work in Slurm. Please use the following instead:

sbatch --output=$HOME/jobtest.out job.txt

Using mpiexec with Intel MPI

Intel MPI (all versions through 2019.x) is configured to support PMI and Hydra process managers. It is recommended to use srun as the MPI program launcher. This is a possible symptom of using mpiexec/mpirun:

srun: error: PMK_KVS_Barrier duplicate request from task 0

as well as:

MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found

If you prefer using mpiexec/mpirun with SLURM, please add the following code to the batch script before running any MPI executable:

unset I_MPI_PMI_LIBRARY 
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0   # the option -ppn only works if you set this before

Executables with a certain MPI library using SLURM PMI2 interface

e.g.

Stopping mpi4py python processes during an interactive job session only from a login node:

$ salloc -t 15:00 --ntasks-per-node=4
salloc: Pending job allocation 20822
salloc: job 20822 queued and waiting for resources
salloc: job 20822 has been allocated resources
salloc: Granted job allocation 20822
salloc: Waiting for resource configuration
salloc: Nodes p0511 are ready for job
# don't login to one of the allocated nodes, stay on the login node
$ module load python/3.7-2019.10
$ source activate testing
(testing) $ srun --quit-on-interrupt python mpi4py-test.py
# enter <ctrl-c>
^Csrun: sending Ctrl-C to job 20822.5
Hello World (from process 0)
process 0 is sleeping...
Hello World (from process 2)
process 2 is sleeping...
Hello World (from process 3)
process 3 is sleeping...
Hello World (from process 1)
process 1 is sleeping...
Traceback (most recent call last):
File "mpi4py-test.py", line 16, in <module>
time.sleep(15)
KeyboardInterrupt
Traceback (most recent call last):
File "mpi4py-test.py", line 16, in <module>
time.sleep(15)
KeyboardInterrupt
Traceback (most recent call last):
File "mpi4py-test.py", line 16, in <module>
time.sleep(15)
KeyboardInterrupt
Traceback (most recent call last):
File "mpi4py-test.py", line 16, in <module>
time.sleep(15)
KeyboardInterrupt
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 20822.5 ON p0511 CANCELLED AT 2020-09-04T10:13:44 ***
# still in the job and able to restart the processes
(testing)

pbsdcp with Slurm

pbsdcp with the gather option sometimes does not work correctly. We suggest using sbcast for scatter and sgather for gather instead of pbsdcp. Please be aware that there is no wildcard (*) option for sbcast / sgather, and there is no recursive option for sbcast. In addition, the destination file/directory must exist.

Here are some simple examples:

sbcast <src_file> <nodelocaldir>/<dest_file>
sgather <src_file> <shareddir>/<dest_file>
sgather -r --keep <src_dir> <shareddir>/<dest_dir>

Signal handling in slurm

The below script needs to use a wait command for the user-defined signal USR1 to be received by the process.

The sleep process is backgrounded using & wait so that the bash shell can receive signals and execute the trap commands instead of ignoring the signals while the sleep process is running.

#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  exit
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait

reference: https://bugs.schedmd.com/show_bug.cgi?id=9715

'mail' does not work; use 'sendmail'

The 'mail' command does not work in a batch job; use 'sendmail' instead:

sendmail user@example.com <<EOF
subject: Output path from $SLURM_JOB_ID
from: user@example.com
...
EOF

'srun' with no arguments allocates a single task when using 'sinteractive'

srun with no arguments allocates a single task when you use sinteractive to request an interactive job, even if you request more than one task. Please pass the needed arguments to srun:

[xwang@owens-login04 ~]$ sinteractive -n 2 -A PZS0712
...
[xwang@o0019 ~]$ srun hostname
o0019.ten.osc.edu
[xwang@o0019 ~]$ srun -n 2 hostname
o0019.ten.osc.edu
o0019.ten.osc.edu

Be careful not to overwrite a Slurm batch output file for a running job

Unlike a PBS batch output file, which lived in a user-non-writeable directory while the job was running, a Slurm batch output file resides under the user's home directory while the job is running. File operations, such as editing and copying, are permitted. Please be careful to avoid such operations while the job is running. In particular, this batch script idiom is no longer correct (e.g., for the default job output file of name $SLURM_SUBMIT_DIR/slurm-jobid.out):

cd $SLURM_SUBMIT_DIR
cp -r * $TMPDIR
cd $TMPDIR
...
cp *.out* $SLURM_SUBMIT_DIR
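
One way to keep the idiom while protecting the live output file is to skip Slurm's own output files when copying results back. The following is a sketch under the assumption that the job uses the default output name slurm-jobid.out; `copy_results_back` is a hypothetical helper, and a job with a custom --output name would need a different pattern.

```shell
#!/bin/bash
# Copy *.out* files from src to dest, skipping Slurm's own batch output files
# (default name slurm-<jobid>.out) so a stale copy cannot overwrite the live one.
copy_results_back() {
  local src="$1" dest="$2" f
  for f in "$src"/*.out*; do
    [ -e "$f" ] || continue
    case "${f##*/}" in
      slurm-*.out) continue ;;
    esac
    cp -r "$f" "$dest"/
  done
}

# Inside a job this would typically be called as:
#   copy_results_back "$TMPDIR" "$SLURM_SUBMIT_DIR"
```

Another option is to avoid copying the output file into $TMPDIR in the first place, for example by copying only the specific input files the job needs.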
