batch email notifications

Occasionally, jobs that experience problems may generate emails from OSC staff or automated systems at the center describing the nature of the problem. This page provides additional information about the various emails sent and the steps that can be taken to address the problem.

batch emails

All emails from OSC about jobs will come from slurm@osc.edu, oschelp@osc.edu, or an email address with the domain @osc.edu.

regular job emails

These emails can be turned on or off using the appropriate Slurm directives, and other email addresses can also be specified. See the mail options section of the job scripts page; a sketch is shown below.
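
The following is a minimal sketch of the relevant #SBATCH mail directives (the job details and email address are placeholders):

    #!/bin/bash
    #SBATCH --job-name=example_job
    #SBATCH --time=1:00:00
    # Send an email when the job begins, ends, or fails
    #SBATCH --mail-type=BEGIN,END,FAIL
    # Deliver the notifications to this placeholder address
    #SBATCH --mail-user=username@example.edu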

The following regular job emails may be sent:

  • Job began/end: the job began or ended. These are normal emails.
  • Job aborted: the job ended in an abnormal state.

other emails

There is no option to turn these emails off, as they describe situations in which we need to contact the user who submitted the job. If you expect one of these situations to occur, please contact OSC Help so we can work with you.

The following emails may also be sent:

Deleted by administrator

OSC staff may delete running jobs if:

  • The job is using so much memory that it threatens to crash the node it is running on.
  • The job is using more resources than it requested and is interfering with other jobs running on the same node.
  • The job is causing excessive load on some part of the system, typically a network file server.
  • The job is still running at the start of a scheduled downtime.

OSC staff may delete queued jobs if:

  • The job requests non-existent resources.
  • The job was intended for one system but was submitted on another.
  • The job can never run because it requests combinations of resources that are disallowed by policy.
  • The user’s credentials are blocked on the system the job was submitted on.

Emails exceed expected volume

Job emails may be delayed if too many are queued to be sent to a single email address. This is to prevent OSC from being blacklisted by the email server.

Failure due to hardware/software problem

The node(s) or software that the job was using had a critical issue and the job failed.

Overuse of physical memory (RAM)

The node that was in use crashed because it ran out of memory.

See the out-of-memory (OOM) or excessive memory usage page for more information; a sketch of a larger memory request follows.
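
If the job genuinely needs more memory, one common remedy is to request a larger allocation with one of the standard Slurm memory directives. A minimal sketch (the sizes are placeholder values; available memory varies by node type):

    # Request 4 GB of memory per allocated CPU core (placeholder value)
    #SBATCH --mem-per-cpu=4G

    # Or request a total amount of memory per node for the job (placeholder value)
    #SBATCH --mem=40G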

Job requeued

A job may be requeued explicitly by a system administrator or after a node failure.
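
If a job should not be restarted automatically after a node failure, its eligibility for requeueing can be controlled with standard Slurm directives; a minimal sketch (the default behavior depends on the cluster's configuration):

    # Allow the batch job to be requeued after a node failure
    #SBATCH --requeue

    # Or prevent the job from being requeued automatically
    #SBATCH --no-requeue
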
GPFS unmount

An issue with GPFS may have affected the job. This includes directories located in:

  • /fs/ess
Filling up /tmp

The job failed after exhausting the space in the node's local /tmp directory.

Please request either an entire node or use the scratch file system instead; a sketch is shown below.
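
As a hedged sketch, under the assumption that OSC's scratch storage is mounted at /fs/scratch (check OSC's storage documentation for the correct path), a job can either request its node(s) exclusively or direct temporary files to scratch:

    # Request the node(s) exclusively so no other job competes for local /tmp
    #SBATCH --exclusive

    # Alternatively, point temporary files at a scratch directory
    # (placeholder path; adjust to your project's scratch location)
    export TMPDIR=/fs/scratch/$USER/$SLURM_JOB_ID
    mkdir -p "$TMPDIR"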

For assistance

Contact OSC Help if you have any questions.