Several commands allow you to check job status, monitor execution, collect performance statistics or even delete your job, if necessary.
Status of queued jobs
There are many possible reasons for a long queue wait. Read on to learn how to check job status and how job scheduling works.
Use the squeue command to check the status of your jobs, including whether your job is queued or running and information about requested resources. If the job is running, you can view elapsed time and resources used.
Here are some examples for user usr1234 and job 123456.
Run without arguments, squeue lists all jobs in the system.
To list all the jobs belonging to a particular user:
squeue -u usr1234
To list the status of a particular job in the standard format:
squeue -j 123456
To get more detail about a particular job:
squeue -j 123456 -l
You may also filter output by the state of a job. To view only running jobs use:
squeue -u usr1234 -t RUNNING
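To see all states at once rather than filtering one state at a time, the state column can be tallied. The following is a minimal sketch, not an OSC-provided tool: `count_states` is a hypothetical helper name, and the sketch assumes the `-h` (no header) and `-o "%T"` (print only the job state) options described in the squeue man page.

```shell
#!/bin/sh
# count_states: hypothetical helper. Reads one job state per line on stdin
# and prints "STATE COUNT" pairs, sorted by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    # -h suppresses the header line; -o "%T" prints only the job state.
    squeue -u "${USER:-$(id -un)}" -h -o "%T" | count_states
fi
```

On a login node this prints one line per state, such as a count of RUNNING and a count of PENDING jobs for the current user.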
Other states are listed in the JOB STATE CODES section of the squeue man page, which you can view with man squeue. Additionally, the NODELIST(REASON) column of squeue output (for example, with the -l option) shows the nodes allocated to running jobs or the reason a job is pending. The reason codes are documented in the JOB REASON CODES section of the man page and may include:
- Reason code "MaxCpuPerAccount": A user or group has reached the limit on the number of cores allowed. The rest of the user or group's jobs will be pending until the number of cores in use decreases.
- Reason code "Dependency": Dependencies among jobs or conditions that must be met before a job can run have not yet been satisfied.
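To find which of your pending jobs share a given reason code, the reason column can be filtered. This is a rough sketch under stated assumptions: `jobs_with_reason` is a hypothetical helper name, and the squeue call relies on the `%i` (job ID) and `%r` (reason) format fields from the squeue man page. Reasons that contain spaces are matched on their first word only.

```shell
#!/bin/sh
# jobs_with_reason: hypothetical helper. Reads "JOBID REASON" lines on stdin
# and prints the job IDs whose reason matches the first argument.
jobs_with_reason() {
    awk -v r="$1" '$2 == r { print $1 }'
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    # -t PENDING limits output to pending jobs; "%i %r" prints job ID and reason.
    squeue -u "${USER:-$(id -un)}" -t PENDING -h -o "%i %r" \
        | jobs_with_reason Dependency
fi
```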
You can place a hold on your own job using scontrol hold jobid. If you do not understand the state of your job, contact OSC Help for assistance.
To list blocked jobs:
squeue -u usr1234 -t PENDING
The --start option of squeue estimates the start time for a pending job. Unfortunately, these estimates are not at all accurate except for the highest priority job in the queue.
Why isn’t my job running?
There are many reasons that your job may have to wait in the queue longer than you would like, including:
- System load is high.
- A system downtime has been scheduled and jobs are being held. Check the system notices posted on the OSC Events page or the message of the day, displayed when you log in.
- You or your group are at the maximum processor count or running job count and your job is being held.
- Your job is requesting specialized resources, such as GPU nodes or large memory nodes or certain software licenses, that are in high demand and not available.
- Your job is requesting a lot of resources. It takes time for the resources to become available.
- Your job is requesting incompatible or nonexistent resources and can never run.
- Your job is unnecessarily stuck in batch hold because of system problems (very rare).
Priority, backfill and debug reservations
Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting and more.
During each scheduling iteration, the scheduler will identify the highest priority job that cannot currently be run and find a time in the future to reserve for it. Once that is done, the scheduler will then try to backfill as many lower priority jobs as it can without affecting the highest priority job's start time. This keeps the overall utilization of the system high while still allowing reasonable turnaround time for high priority jobs. Short jobs and jobs requesting few resources are the easiest to backfill.
A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less, primarily for debugging purposes.
Observing a running job
You can monitor a running batch job as easily as you can monitor a program running interactively. Simply view the output file in read only mode to check the current output of the job.
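As an illustration of checking the output file without modifying it, here is a small sketch. `show_recent` is a hypothetical helper name. The file name `slurm-123456.out` assumes Slurm's default output naming (`slurm-<jobid>.out` in the submission directory), so adjust it if your script sets its own output file.

```shell
#!/bin/sh
# show_recent: hypothetical helper. Prints the last N lines (default 20)
# of a job's output file; the file is only read, never written.
show_recent() {
    file=$1
    lines=${2:-20}
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    tail -n "$lines" "$file"
}

# Example use:            show_recent slurm-123456.out 5
# Follow output as it grows: tail -f slurm-123456.out
# Page through read-only:    less slurm-123456.out
```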
You may check the status of a node while the job is running by visiting the OSC grafana page and using the "cluster metrics" report.
Managing your jobs
Deleting a job
Situations may arise that call for deletion of a job from the SLURM queue, such as incorrect resource limits, missing or incorrect input files or commands or a program taking too long to run (infinite loop).
The command to delete a batch job is scancel. It applies to both queued and running jobs.
If you cannot delete one of your jobs, it may be because of a hardware problem or system software crash. In this case you should contact OSC Help.
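Because deletion cannot be undone, it can help to review the exact command before running it. The sketch below is one possible approach, not an OSC tool: `cancel_cmd` is a hypothetical helper that only prints the scancel command for the job IDs given. The filter form in the comments uses scancel's `-u` (user) and `-t` (state) options.

```shell
#!/bin/sh
# cancel_cmd: hypothetical helper. Prints the scancel command for the given
# job IDs instead of running it, so the list can be checked first.
cancel_cmd() {
    if [ "$#" -eq 0 ]; then
        echo "usage: cancel_cmd JOBID [JOBID...]" >&2
        return 1
    fi
    echo "scancel $*"
}

cancel_cmd 123456 123457
# To actually delete the jobs, run the printed command: scancel 123456 123457
# scancel also accepts filters, e.g. all pending jobs of one user:
#   scancel -u usr1234 -t PENDING
```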
Altering a queued job
You can alter certain attributes of a job in the queue using the scontrol update command. Use this command to make a change without losing your place in the queue. Please note that you cannot make any alterations to the executable portion of the script, nor can you make any changes after the job starts running.
The syntax, following the form of the example below, is:
scontrol update job=<jobid> <args>
The optional arguments consist of one or more SLURM directives in the form of command-line options.
For example, to change the walltime limit on job 123456 to 5 hours and have email sent when the job ends (only):
scontrol update job=123456 timeLimit=5:00:00 mailType=End
Placing a hold on a queued job
If you want to prevent a job from running but leave it in the queue, you can place a hold on it using the scontrol hold command. The job will remain pending until you release it with the scontrol release command. A hold can be useful if you need to modify the input file for a job without losing your place in the queue.
scontrol hold 123456
scontrol release 123456
Include the following commands in your batch script as appropriate to collect job statistics or performance information.
A simple way to view job information is to use this command at the end of the job:
scontrol show job=$SLURM_JOB_ID
The date command prints the current date and time. It can be informative to include it at the beginning and end of the executable portion of your script as a rough measure of time spent in the job.
The time utility is used to measure the performance of a single command. It can be used for serial or parallel processes. Add /usr/bin/time to the beginning of a command in the batch script:
/usr/bin/time myprog arg1 arg2
The result is provided in the following format:
- user time (CPU time spent running your program)
- system time (CPU time spent by your program in system calls)
- elapsed time (wallclock)
- percent CPU used
- memory, pagefault and swap statistics
- I/O statistics
These results are appended to the job's error log file. Note: Use the full path /usr/bin/time to get all the information shown; the shell's built-in time reports less detail.
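The pieces above (date, /usr/bin/time and scontrol show job) can be combined in one batch script. The following is only a sketch under stated assumptions: `run_timed` is a hypothetical wrapper name, the `#SBATCH` values are placeholders, and `sleep 1` stands in for your own program such as `myprog arg1 arg2`.

```shell
#!/bin/sh
# Placeholder resource requests; adjust for your own job.
#SBATCH --time=00:10:00
#SBATCH --ntasks=1

# run_timed: hypothetical wrapper. Prints a timestamp before and after a
# command and runs it under /usr/bin/time when that utility is installed.
run_timed() {
    echo "Start: $(date)"
    if [ -x /usr/bin/time ]; then
        # Timing statistics are written to stderr (the job's error log).
        /usr/bin/time "$@"
    else
        "$@"
    fi
    status=$?
    echo "End: $(date)"
    return "$status"
}

# Stand-in command; replace with e.g.: run_timed myprog arg1 arg2
run_timed sleep 1

# Inside a Slurm job, print the scheduler's record of this job.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v scontrol >/dev/null 2>&1; then
    scontrol show job="$SLURM_JOB_ID"
fi
```

The wrapper returns the exit status of the wrapped command, so a failing program still causes a non-zero status that later steps in the script can check.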