Monitoring and Managing Your Job

There are several commands available that allow you to check the status of your job, monitor execution of a running job, and collect performance statistics for your job. You can also delete a job if necessary.

Status of queued jobs

You can monitor the batch queues and check the status of your job using the command squeue. This section also addresses the question of why a job may have a long queue wait and explains a little about how job scheduling works.

squeue

Use the squeue command to check the status of your jobs. You can see whether your job is queued or running, along with information about requested resources. If the job is running you can see elapsed time and resources used.

Here are some examples for user usr1234 and job 123456.

By itself, squeue lists all jobs in the system.

To list all the jobs belonging to a particular user:

squeue -u usr1234

To list the status of a particular job:

squeue -j 123456

To get more details about a particular job:

squeue -j 123456 -l

The output can also be filtered by the state of a job.

To view only running jobs use:

squeue -u usr1234 -t RUNNING

Other states are listed in the JOB STATE CODES section of the squeue man page.

The man page also describes JOB REASON CODES, which explain why a job is pending or, for a running job, which nodes it was allocated. Use the -l option to view this information; an example of displaying the reason column follows the list below. There are several reasons a job may be pending.

  • If a user or group has reached the limit on the number of cores allowed, the rest of their jobs will be pending with a reason code of MaxCpuPerAccount.
  • If a user sets up dependencies among jobs or conditions that have to be met before a job can run, the jobs will be pending until the dependencies or conditions are met. The reason code will be Dependency.
  • You can place a hold on your own job using scontrol hold jobid.
  • If you see one of your jobs in a state you don't understand, contact OSC Help for assistance.
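
For example, to show the state and reason for each of your pending jobs, you can give squeue a custom output format (the %T and %r fields are the job state and reason, respectively; see the squeue man page for the full list of fields):

squeue -u usr1234 -t PENDING -o "%.10i %.20j %.10T %r"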

To list blocked jobs:

squeue -u usr1234 -t PENDING

The --start option gives an estimate for the start time of a job that is pending. Unfortunately, these estimates are not at all accurate except for the highest priority job in the queue.
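
For example, to ask for the estimated start time of job 123456:

squeue -j 123456 --start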

Why isn’t my job running?

There are many reasons that your job may have to wait in the queue longer than you would like. Here are some of them.

  • System load is high. It’s frustrating for everyone!
  • A system downtime has been scheduled and jobs are being held. Check the message of the day, which is displayed every time you log in, or the system notices posted on the OSC website.
  • You or your group have used a lot of resources in the last few days, causing your job priority to be lowered (“fairness policy”).
  • You or your group are at the maximum processor count or running job count and your job is being held.
  • Your job is requesting specialized resources that are in high demand, such as large memory or certain software licenses.
  • Your job is requesting a lot of resources. It takes time for the resources to become available.
  • Your job is requesting incompatible or nonexistent resources and can never run.
  • Your job is unnecessarily stuck in a batch hold because of system problems (very rare!).

Priority, backfill, and debug reservations

Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days.

During each scheduling iteration, the scheduler will identify the highest priority job that cannot currently be run and find a time in the future to reserve for it. Once that is done, the scheduler will then try to backfill as many lower priority jobs as it can without affecting the highest priority job's start time. This keeps the overall utilization of the system high while still allowing reasonable turnaround time for high priority jobs. Short jobs and jobs requesting few resources are the easiest to backfill.

A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less, primarily for debugging purposes.
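
As a rough sketch, a debugging job targeting these nodes simply needs to request a walltime of an hour or less (job.sh is a placeholder for your own batch script):

sbatch --nodes=1 --time=1:00:00 job.sh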

Observing a running job

You can monitor a running batch job almost as easily as you can monitor a program running interactively. All that is needed is to view the output file in read-only mode to check the current output of the job.
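
For example, assuming the default SLURM output file name, you can follow the output of job 123456 as it is written:

tail -f slurm-123456.out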

Node status

It may be useful to check the status of a node while the job is running. This can be done by visiting the OSC grafana page and using the 'cluster metrics' report.

Managing your jobs

Deleting a job

Situations may arise in which you want to delete one of your jobs from the SLURM queue. Perhaps you set the resource limits incorrectly, neglected to copy an input file, or had incorrect or missing commands in the batch file. Or maybe the program is taking too long to run (infinite loop).

The command to delete a batch job is scancel. It applies to both queued and running jobs.

Example:

scancel 123456

If you are unable to delete one of your jobs, it may be because of a hardware problem or system software crash. In this case you should contact OSC Help.

Altering a queued job

You can alter certain attributes of your job while it’s in the queue using the scontrol update command. This can be useful if you want to make a change without losing your place in the queue. You cannot make any alterations to the executable portion of the script, nor can you make any changes after the job starts running.

The syntax is:

scontrol update job=<jobid> <args>

The <args> portion consists of one or more SLURM directives in the form of command-line options.

For example, to change the walltime limit on job 123456 to 5 hours and have email sent when the job ends (only):

scontrol update job=123456 timeLimit=5:00:00 mailType=End

Placing a hold on a queued job

If you want to prevent a job from running but leave it in the queue, you can place a hold on it using the scontrol hold command. The job will remain pending until you release it with the scontrol release command. A hold can be useful if you need to modify the input file for a job, for example, but you don’t want to lose your place in the queue.

Examples:

scontrol hold 123456
scontrol release 123456

Job statistics

There are commands you can include in your batch script to collect job statistics or performance information.

A simple way to view job information is to use this command at the end of the job:

scontrol show job=$SLURM_JOB_ID

XDMoD tool

The online interactive tool XDMoD can be used to look at the usage statistics for jobs.

See XDMoD overview for more information on XDMoD.

date

The date command prints the current date and time. It can be informative to include it at the beginning and end of the executable portion of your script as a rough measure of time spent in the job.
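
For example, you might bracket the main command in your script with two date commands:

date
myprog arg1 arg2
date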

time

The time utility is used to measure the performance of a single command. It can be used for serial or parallel processes. Add /usr/bin/time to the beginning of a command in the batch script:

/usr/bin/time myprog arg1 arg2

The result is provided in the following format:

  1. user time (CPU time spent running your program)
  2. system time (CPU time spent by your program in system calls)
  3. elapsed time (wallclock)
  4. % CPU used
  5. memory, pagefault and swap statistics
  6. I/O statistics

These results are appended to the job's error log file. Note: Use the full path “/usr/bin/time” to get all the information shown.