Monitoring and Managing Your Job

Several commands allow you to check job status, monitor execution, collect performance statistics or even delete your job, if necessary.

Status of queued jobs

There are many possible reasons for a long queue wait — read on to learn how to check job status and for more about how job scheduling works.

squeue

Use the squeue command to check the status of your jobs, including whether your job is queued or running and information about requested resources. If the job is running, you can view elapsed time and resources used.

Here are some examples for user usr1234 and job 123456.

By itself, squeue lists all jobs in the system.

To list all the jobs belonging to a particular user:

squeue -u usr1234

To list the status of a particular job, in standard or alternate (more useful) format:

squeue -j 123456

To get more detail about a particular job:

squeue -j 123456 -l

You may also filter output by the state of a job. To view only running jobs use:

squeue -u usr1234 -t RUNNING

Other states can be seen in the JOB STATE CODES section of squeue man page using man squeue.

Additionally, JOB REASON CODES may be retrieved using the  -l with the command man squeue. These codes describe the nodes allocated to running jobs or the reasons a job is pending, which may include:

  • Reason code "MaxCpuPerAccount": A user or group has reached the limit on the number of cores allowed. The rest of the user or group's jobs will be pending until the number of cores in use decreases.
  • Reason code "Dependency": Dependencies among jobs or conditions that must be met before a job can run have not yet been satisfied.

You can place a hold on your own job using scontrol hold jobid. If you do not understand the state of your job, contact OSC Help for assistance.

To list blocked jobs:

squeue -u usr1234 -t PENDING

The --start option estimates the start time for a pending job. Unfortunately, these estimates are not at all accurate except for the highest priority job in the queue.

Why isn’t my job running?

There are many reasons that your job may have to wait in the queue longer than you would like, including:

  • System load is high.
  • A system downtime has been scheduled and jobs are being held. Check the system notices posted on the OSC Events page or the message of the day, displayed when you log in.
  • You or your group are at the maximum processor count or running job count and your job is being held.
  • Your job is requesting specialized resources, such as GPU nodes or large memory nodes or certain software licenses, that are in high demand and not available.
  • Your job is requesting a lot of resources. It takes time for the resources to become available.
  • Your job is requesting incompatible or nonexistent resources and can never run.
  • Job is unnecessarily stuck in batch hold because of system problems (very rare).

Priority, backfill and debug reservations

Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting and more.

During each scheduling iteration, the scheduler will identify the highest priority job that cannot currently be run and find a time in the future to reserve for it. Once that is done, the scheduler will then try to backfill as many lower priority jobs as it can without affecting the highest priority job's start time. This keeps the overall utilization of the system high while still allowing reasonable turnaround time for high priority jobs. Short jobs and jobs requesting few resources are the easiest to backfill.

A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less, primarily for debugging purposes.

Observing a running job

You can monitor a running batch job as easily as you can monitor a program running interactively. Simply view the output file in read only mode to check the current output of the job.

Node status

You may check the status of a node while the job is running by visiting the OSC grafana page and using the "cluster metrics" report.

Managing your jobs

Deleting a job

Situations may arise that call for deletion of a job from the SLURM queue, such as incorrect resource limits, missing or incorrect input files or commands or a program taking too long to run (infinite loop).

The command to delete a batch job is scancel. It applies to both queued and running jobs.

Example:

scancel 123456

If you cannot delete one of your jobs, it may be because of a hardware problem or system software crash. In this case you should contact OSC Help.

Altering a queued job

You can alter certain attributes of a job in the queue using the scontrol update command. Use this command to make a change without losing your place in the queue. Please note that you cannot make any alterations to the executable portion of the script, nor can you make any changes after the job starts running.

The syntax is:

scontrol update job=<jobid> <args>

The optional arguments consist of one or more SLURM directives in the form of command-line options.

For example, to change the walltime limit on job 123456 to 5 hours and have email sent when the job ends (only):

scontrol update job=123456 timeLimit=5:00:00 mailType=End

Placing a hold on a queued job

If you want to prevent a job from running but leave it in the queue, you can place a hold on it using the scontrol hold command. The job will remain pending until you release it with the scontrol release command. A hold can be useful if you need to modify the input file for a job without losing your place in the queue.

Examples:

scontrol hold 123456
scontrol release 123456

Job statistics

Include the following commands in your batch script as appropriate to collect job statistics or performance information.

A simple way to view job information is to use this command at the end of the job:

scontrol show job=$SLURM_JOB_ID

XDMoD tool

You can use the online interactive tool XDMoD to look at usage statistics for jobs. See XDMoD overview for more information.

date

The date command prints the current date and time. It can be informative to include it at the beginning and end of the executable portion of your script as a rough measure of time spent in the job.

time

The time utility is used to measure the performance of a single command. It can be used for serial or parallel processes. Add /usr/bin/time to the beginning of a command in the batch script:

/usr/bin/time myprog arg1 arg2

The result is provided in the following format:

  1. user time (CPU time spent running your program)
  2. system time (CPU time spent by your program in system calls)
  3. elapsed time (wallclock)
  4. percent CPU used
  5. memory, pagefault and swap statistics
  6. I/O statistics

These results are appended to the job's error log file. Note: Use the full path “/usr/bin/time” to get all the information shown.

Supercomputer: