Batch Execution Environment

Shell and initialization

Your batch script executes in a shell on a compute node. The environment is identical to what you get when you connect to a login node, except that you have access to all the resources requested by your job. The shell that Slurm uses is determined by the first line of the job script (by default, #!/bin/bash). The appropriate “dot-files” (.login, .profile, .cshrc) will be executed, the same as when you log in. (For information on overriding the default shell, see the Job Scripts section.)

The job begins in the directory that it was submitted from. You can use the cd command to change to a different directory. The environment variable $SLURM_SUBMIT_DIR makes it easy to return to the directory from which you submitted the job:

cd $SLURM_SUBMIT_DIR
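
For reference, a minimal batch script might look like the following sketch; the job name, time limit, node count, and account code are placeholders to adjust for your own job:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=00:30:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --account=PAS0000

cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOB_ID running in $(pwd)"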

Modules

There are dozens of software packages available on OSC’s systems, many of them with multiple versions. You control what software is available in your environment by loading the module for the software you need. Each module sets certain environment variables required by the software.

If you are running software that was installed by OSC, you should check the software documentation page to find out what modules to load.

Several modules are automatically loaded for you when you log in or start a batch script. These default modules include

  • modules required by the batch system
  • the Intel compiler suite
  • an MPI package compatible with the default compiler (for parallel computing)

The module command has a number of subcommands. For more details, type module help.

Certain modules are incompatible with each other and should never be loaded at the same time. Examples are different versions of the same software or multiple installations of a library built with different compilers.

Note to those who build or install their own software: Be sure to load the same modules when you run your software that you had loaded when you built it, including the compiler module.
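
For example, if you compiled a program with a particular compiler module loaded, load that same module in your job script before running the executable (my_program here is a hypothetical name):

module load intel
./my_program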

Each module has both a name and a version number. When more than one version is available for the same name, one of them is designated as the default. For example, the following modules are available for the Intel compilers on Owens: (Note: The versions shown might be out of date but the concept is the same.)

  • intel/12.1.0 (default)
  • intel/12.1.4.319

If you specify just the name, it refers to the default version or the currently loaded version, depending on the context. If you want a different version, you must give the entire string including the version information.

You can have only one compiler module loaded at a time, either intel, pgi, or gnu. The intel module is loaded initially; to change to pgi or gnu, do a module swap (see example below).

Some software libraries have multiple installations built for use with different compilers. The module system will load the one compatible with the compiler you have loaded. If you swap compilers, all the compiler-dependent modules will also be swapped.

Special note to gnu compiler users: While the gnu compilers are always in your path, you should load the gnu compiler module to ensure you are linking to the correct library versions.

To list the modules you have loaded:

module list

To see all modules that are compatible with your currently loaded modules:

module avail

To see all modules whose names start with fftw:

module avail fftw

To see all possible modules:

module spider

To see all possible modules whose names start with fftw:

module spider fftw

To load the fftw3 module that is compatible with your current compiler:

module load fftw3

To unload the fftw3 module:

module unload fftw3

To load the default version of the abaqus module (not compiler-dependent):

module load abaqus

To load a different version of the abaqus module:

module load abaqus/6.8-4

To unload whatever abaqus module you have loaded:

module unload abaqus

To unload all modules:

module purge

To reset to default starting modules:

module reset

To swap the intel compilers for the pgi compilers (unloads intel, loads pgi):

module swap intel pgi

To swap the default version of the intel compilers for a different version:

module swap intel intel/12.1.4.319

To display help information for the mkl module:

module help mkl

To display the commands run by the mkl module:

module show mkl

To use a locally installed module, first import the module directory:

module use [/path/to/modulefiles]

And then load the module:

module load localmodule

Slurm environment variables

Your batch execution environment has all the environment variables that your login environment has, plus several that are set by the batch system. This section gives examples of using some of them. For more information, see man sbatch.
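
One quick way to see every Slurm-related variable defined in your job's environment is to print them from within the job script:

env | grep ^SLURM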

Directories

Several directories may be useful in your job.

The absolute path of the directory your job was submitted from is $SLURM_SUBMIT_DIR.

Each job has a temporary directory, $TMPDIR, on the local disk of each node assigned to it. Access to this directory is much faster than access to your home or project directory. The files in this directory are not visible from all the nodes in a parallel job; each node has its own directory. The batch system creates this directory when your job starts and deletes it when your job ends. To copy the file input.dat to $TMPDIR on your job’s first node:

cp input.dat $TMPDIR

For a parallel job, to copy the file input.dat to $TMPDIR on all of your job’s nodes:

sbcast input.dat $TMPDIR/input.dat

Each job also has a temporary directory, $PFSDIR, on the parallel scratch file system, if you request it by adding --gres=pfsdir to your batch request. This is a single directory shared by all the nodes a job is running on. Access is faster than access to your home or project directory but not as fast as $TMPDIR. The batch system creates this directory when your job starts and deletes it when your job ends. To copy the file output.dat from this directory to the directory you submitted your job from:

cp $PFSDIR/output.dat $SLURM_SUBMIT_DIR
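
To have $PFSDIR created for your job, include the gres request in your script header (it can also be given on the sbatch command line):

#SBATCH --gres=pfsdir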

The $HOME environment variable refers to your home directory. It is not set by the batch system but is useful in some job scripts. It is better to use $HOME than to hardcode the path to your home directory. To access a file in your home directory:

cat $HOME/myfile

Job information

To get a list of the nodes and cores assigned to your job:

srun hostname | sort -n

For GPU jobs, the number of GPUs allocated to your job on the node is in the environment variable $SLURM_GPUS_ON_NODE. To display it:

echo $SLURM_GPUS_ON_NODE

If you use a job array, each job in the array gets its identifier within the array in the variable $SLURM_ARRAY_TASK_ID. To pass a file name parameterized by the array ID into your application:

./a.out input_$SLURM_ARRAY_TASK_ID.dat
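
As a sketch, an array of ten jobs with array IDs 1 through 10 could be submitted with the --array option of sbatch; the script name myjob.sh is a placeholder:

sbatch --array=1-10 myjob.sh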

To display the numeric job identifier assigned by the batch system:

echo $SLURM_JOB_ID

To display the job name:

echo $SLURM_JOB_NAME

Use fast storage

If your job does a lot of file-based input and output, your choice of file system can make a huge difference in the performance of the job.

Shared file systems

Your home directory is located on shared file systems, providing long-term storage that is accessible from all OSC systems. Shared file systems are relatively slow. They cannot handle heavy loads such as those generated by large parallel jobs or many simultaneous serial jobs. You should minimize the I/O your jobs do on the shared file systems. It is usually best to copy your input data to fast temporary storage, run your program there, and copy your results back to your home directory.
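
A rough sketch of this pattern, assuming a hypothetical executable my_program in your submit directory that reads input.dat and writes output.dat:

cd $SLURM_SUBMIT_DIR
cp input.dat $TMPDIR
cd $TMPDIR
$SLURM_SUBMIT_DIR/my_program
cp output.dat $SLURM_SUBMIT_DIR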

Batch-managed directories

Batch-managed directories are temporary directories that exist only for the duration of a job. They exist on two types of storage: disks local to the compute nodes and a parallel scratch file system.

A big advantage of batch-managed directories is that the batch system deletes them when a job ends, preventing clutter on the disk.

A disadvantage of batch-managed directories is that you can’t access them after your job ends. Be sure to include commands in your script to copy any files you need to long-term storage. To avoid losing your files if your job ends abnormally, for example by hitting its walltime limit, include a trap command in your script (Note: trap commands do not work in csh and tcsh batch scripts). The following example creates a subdirectory in $SLURM_SUBMIT_DIR and copies everything from $TMPDIR into it in case of abnormal termination.

trap "cd $SLURM_SUBMIT_DIR;mkdir $SLURM_JOB_ID;cp -R $TMPDIR/* $SLURM_SUBMIT_DIR;exit" TERM

If a node your job is running on crashes, the trap command may not be executed. It may be possible to recover your batch-managed directories in this case. Contact OSC Help for assistance. For other details on retrieving files from unexpectedly terminated jobs, see this FAQ.

Local disk space

The fastest storage is on a disk local to the node your job is running on, accessed through the environment variable $TMPDIR . The main drawback to local storage is that each node of a parallel job has its own directory and cannot access the files on other nodes. 

Local disk space should be used only through the batch-managed directory created for your job. Please do not use /tmp directly because your files won’t be cleaned up properly.

Parallel file system

The parallel file system, including project directory and scratch directory, is faster than the shared file systems for large-scale I/O and can handle a much higher load. It is efficient for reading and writing data in large blocks and should not be used for I/O involving many small accesses.

The scratch file system can be used through the batch-managed directory created for your job. The path for this directory is in the environment variable $PFSDIR . You should use it when your files must be accessible by all the nodes in your job and also when your files are too large for the local disk.

You may also create a directory for yourself in the scratch file system and use it the way you would use any other directory. This directory will not be backed up; files are subject to deletion after some number of months.

Note: You should not copy your executable files to $PFSDIR. They should be run from your home directory or from $TMPDIR.
