Knowledge Base

This knowledge base is a collection of important, useful information about OSC systems that does not fit into a guide or tutorial, and is too long to be answered in a simple FAQ.

Compilation Guide

As a general recommendation, we suggest selecting the newest compilers available for a new project. For repeatability, you may not want to change compilers in the middle of an experiment.

Owens Compilers

The Haswell and Broadwell processors that make up Owens support the Advanced Vector Extensions (AVX2) instruction set, but you must set the correct compiler flags to take advantage of it. AVX2 has the potential to speed up your code by a factor of 4 or more, depending on the compiler and options you would otherwise use.

With the Intel compilers, use -xHost and -O2 or higher. With the gnu compilers, use -march=native and -O3. The PGI compilers by default use the highest available instruction set, so no additional flags are necessary.

This advice assumes that you are building and running your code on Owens. The executables will not be portable.
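As an illustrative sketch, the flag recommendations above can be collected in one place. The build command, source file, and output name below are placeholders; only the flag values come from this guide:

```shell
# Pick the recommended Owens vectorization flags for a given compiler family.
# mycode.c and mycode are placeholder names.
compiler=gcc   # one of: icc, gcc, pgcc
case "$compiler" in
  icc)  flags="-O2 -xHost" ;;         # Intel: target the host CPU (AVX2 on Owens)
  gcc)  flags="-O3 -march=native" ;;  # GNU: likewise generates AVX2 code for the build host
  pgcc) flags="-fast" ;;              # PGI: uses the highest instruction set by default
esac
echo "$compiler $flags -o mycode mycode.c"
```

Because -xHost and -march=native tie the executable to the build host, run the compile on a node of the same hardware generation you will compute on.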

Intel (recommended)

            NON-MPI   MPI
FORTRAN 90  ifort     mpif90
C           icc       mpicc
C++         icpc      mpicxx

Recommended Optimization Options

The -O2 -xHost options are recommended with the Intel compilers. (For more options, see the "man" pages for the compilers.)

OpenMP

Add this flag to any of the above: -qopenmp or -openmp

PGI

            NON-MPI              MPI
FORTRAN 90  pgfortran or pgf90   mpif90
C           pgcc                 mpicc
C++         pgc++                mpicxx

Recommended Optimization Options

The -fast option is appropriate with all PGI compilers. (For more options, see the "man" pages for the compilers.)

Note: The PGI compilers can generate code for accelerators such as GPUs. Description of these capabilities is beyond the scope of this guide.

OpenMP

Add this flag to any of the above:  -mp

GNU

            NON-MPI    MPI
FORTRAN 90  gfortran   mpif90
C           gcc        mpicc
C++         g++        mpicxx

Recommended Optimization Options

The -O2 -march=native options are recommended with the GNU compilers. (For more options, see the "man" pages for the compilers.)

OpenMP

Add this flag to any of the above:  -fopenmp


Ruby Compilers

Intel (recommended)

            NON-MPI   MPI
FORTRAN 90  ifort     mpif90
C           icc       mpicc
C++         icpc      mpicxx

Recommended Optimization Options

The -O2 -xHost options are recommended with the Intel compilers. (For more options, see the "man" pages for the compilers.)

OpenMP

Add this flag to any of the above: -qopenmp or -openmp

PGI

            NON-MPI              MPI
FORTRAN 90  pgfortran or pgf90   mpif90
C           pgcc                 mpicc
C++         pgc++                mpicxx

NOTE: The C++ compiler used to be pgCC, but newer versions of PGI do not support this name.

Recommended Optimization Options

The -fast option is appropriate with all PGI compilers. (For more options, see the "man" pages for the compilers.)

Note: The PGI compilers can generate code for accelerators such as GPUs. Description of these capabilities is beyond the scope of this guide.

OpenMP

Add this flag to any of the above: -mp

GNU

            NON-MPI    MPI
FORTRAN 90  gfortran   mpif90
C           gcc        mpicc
C++         g++        mpicxx

Recommended Optimization Options

The -O2 -march=native options are recommended with the GNU compilers. (For more options, see the "man" pages for the compilers.)

OpenMP

Add this flag to any of the above: -fopenmp


Oakley Compilers

Intel (Recommended)

          non-MPI   MPI
Fortran   ifort     mpif90
C         icc       mpicc
C++       icpc      mpicxx

Recommended Optimization Options

Sequential (not numerically sensitive)   -fast
Sequential (numerically sensitive)       -ipo -O2 -static -xHost
MPI (not numerically sensitive)          -ipo -O3 -no-prec-div -xHost
MPI (numerically sensitive)              -ipo -O2 -xHost

Note: The -fast flag is equivalent to -ipo -O3 -no-prec-div -static -xHost.
Note: Other options are available for code with extreme numerical sensitivity; their description is beyond the scope of this guide.
Note: Intel 14.0.0.080 has a bug related to generation of portable code. Add the flag -msse3 to get around it.

OpenMP

Add this flag to any of the above: -qopenmp or -openmp

PGI

                   non-MPI              MPI
Fortran 90 or 95   pgfortran or pgf90   mpif90
Fortran 77         pgf77                mpif77
C                  pgcc                 mpicc
C++                pgc++                mpicxx

NOTE: The C++ compiler used to be pgCC, but newer versions of PGI do not support this name.

Recommended Optimization Options

The -fast option is appropriate with all PGI compilers. (For more options, see the "man" pages for the compilers.)

Note: The PGI compilers can generate code for accelerators such as GPUs. Description of these capabilities is beyond the scope of this guide.

OpenMP

Add this flag to any of the above: -mp

GNU

                   non-MPI    MPI
Fortran 90 or 95   gfortran   mpif90
Fortran 77         g77        mpif77
C                  gcc        mpicc
C++                g++        mpicxx

Recommended Optimization Options

The -O3 -march=native options are recommended with the GNU compilers. (For more options, see the "man" pages for the compilers.)

OpenMP

Add this flag to any of the above (except g77 and mpif77): -fopenmp

Further Reading:

Intel Compiler Page

PGI Compiler Page

GNU Compiler Page


Firewall and Proxy Settings

Connections to OSC

In order for users to access OSC resources through the web, your firewall rules should allow connections to the following IP ranges. Otherwise, users may be blocked or denied access to our services.

  • 192.148.248.0/24
  • 192.148.247.0/24
  • 192.157.5.0/25

The following TCP ports should be opened:

  • 80 (HTTP)
  • 443 (HTTPS)
  • 22 (SSH)

The following domain should be allowed:

  • *.osc.edu

Users who are unsure whether their network is blocking these hosts or ports should contact their local IT administrator.

Connections from OSC

All outbound network traffic from OSC's compute nodes is routed through a network address translation (NAT) host or one of two backup servers:

  • nat.osc.edu (192.157.5.13)
  • 192.148.248.35
  • 192.148.248.186

IT and Network Administrators

Please use the above information to assist users in accessing our resources.

Occasionally, new services may be stood up using hosts and ports not described here. If you believe our list needs correcting, please let us know at oschelp@osc.edu.


Messages from qsub

We have been adding some output from qsub that should aid you in creating better job scripts. We've documented the various messages here.

NOTE

A "NOTE" message is informational; your job has been submitted, but qsub made some assumptions about your job that you may not have intended.

No account/project specified

Your job did not specify a project to charge against, but qsub was able to select one for you. Typically, this will be because your username can only charge against one project, but it may be because you specified a preference by setting the OSC_DEFAULT_ACCOUNT environment variable. The output should indicate which project was assumed to be the correct one; if it was not correct, you should delete the job and resubmit after setting the correct project in the job script using the -A flag. For example:

#PBS -A PZS0530

Replace PZS0530 with the correct project code. Explicitly setting the -A flag will cause this informational message to not appear.

No memory limit set

Your job did not specify an explicit memory limit. Since we limit access to memory based on the number of cores requested, qsub set this limit on your behalf; the message indicates what the memory limit was set to.

You can suppress this informational message by explicitly setting the memory limit. For example:

#PBS -l mem=4gb

Please remember that the memory to core ratios are different on each cluster we operate. Please review the main documentation page for the cluster you are using for more information.

ERROR

An "ERROR" message indicates that your job was not submitted to the queue. Typically, this is because qsub is unsure of how to resolve an ambiguous setting in your job parameters. You will need to fix the problem in your job script and resubmit.

You have not specified an account and have more than one available

Your username has the ability to charge jobs to more than one project, and qsub is unable to determine which one this job should be charged against. You can fix this by specifying the project using the -A flag. For example, you should add this line to your job script:

#PBS -A PZS0530

If you get this error, qsub will inform you of which projects you can charge against. Select the appropriate project, and replace "PZS0530" in the example above with the correct code.

You can tell qsub which project should be charged when no charge code is specified in the job script by setting the OSC_DEFAULT_ACCOUNT environment variable. For example, if you use the "bash" shell, you could add the line export OSC_DEFAULT_ACCOUNT=PZS0530 to your shell startup file, again replacing PZS0530 with the correct project code.
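A minimal sketch of that bash setting (PZS0530 is the placeholder project code used throughout this article):

```shell
# Default project for qsub when no -A flag is given in the job script.
# Replace PZS0530 with your own project code.
export OSC_DEFAULT_ACCOUNT=PZS0530
echo "default project: $OSC_DEFAULT_ACCOUNT"
```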


Migrating jobs from Glenn to Oakley or Ruby

This page includes a summary of differences to keep in mind when migrating jobs from Glenn to one of our other clusters.

Hardware

Most Oakley nodes have 12 cores and 48GB memory. There are eight large-memory nodes with 12 cores and 192GB memory, and one huge-memory node with 32 cores and 1TB of memory. Most Ruby nodes have 20 cores and 64GB of memory. There is one huge-memory node with 32 cores and 1TB of memory. By contrast, most Glenn nodes have 8 cores and 24GB memory, with eight nodes having 16 cores and 64GB memory.

Module System

Oakley and Ruby use a different module system than Glenn. It looks very similar, but it enforces module dependencies, and thus may prevent certain module combinations from being loaded that were permitted on Glenn. For example, only one compiler may be loaded at a time.

module avail will only show modules compatible with your currently loaded modules, but not all installed modules on the system. To see all modules on the cluster, use the command module spider. Both module avail and module spider can take a partial module name as a search parameter, such as module spider dyna.

Version numbers are indicated with a slash “/” rather than a dash “-” and need not be specified if you want the default version.

Compilers

Like Glenn, Oakley and Ruby support three compilers: Intel, PGI, and gnu. Unlike Glenn, Oakley and Ruby only let you have one compiler module loaded at any one time. The default is Intel. To switch to a different compiler, use module swap intel gnu or module swap intel pgi.

Important note: The gnu compilers are part of the Linux distribution, so they’re always available. It’s important to use the gnu module, however, to link with the correct libraries for MVAPICH, MKL, etc.

MPI

MPI-2 is available on Oakley and Ruby through the MVAPICH2 modules. The MVAPICH2 libraries are linked differently than on Glenn, requiring you to have the correct compiler and MVAPICH2 modules loaded at execution time as well as at compile time. (This doesn’t apply if you’re using a software package that was installed by OSC.)

Software you build and/or install

If your software uses any libraries installed by OSC, including MVAPICH, you will have to rebuild it. If you link to certain libraries, including MVAPICH, MKL, and others, you must have the same compiler module loaded at run time that you do at build time. Please refer to the compilation guide in our Knowledge Base for guidance on optimizing your compilations for our hardware.

OSC installed software

Most of the software installed on Glenn is also installed on Oakley or Ruby, although old versions may no longer be available. We recommend migrating to a newer version of the application if at all possible. Please review the software documentation to see what versions are available, and examine sample batch scripts.

Accounting

All OSC clusters currently use the same core-hour to RU conversion factor. Oakley will charge you for the number of cores proportional to the amount of memory your job requests, while Ruby only accepts full-node jobs. Please review the system documentation for each cluster.

“all” replaced by “pdsh”

The “all” command is not available on Oakley or Ruby; “pdsh” is available on all clusters.

  • pdsh -j jobid command
  • pdsh -g feature command
  • pdsh -w nodelist command


Out-of-Memory (OOM) or Excessive Memory Usage

Problem description

A common problem on our systems is for a user job to run a node out of memory or to use more than its allocated share of memory if the node is shared with other jobs.

If a job exhausts both the physical memory and the swap space on a node, it causes the node to crash. With a parallel job, there may be many nodes that crash. When a node crashes, the systems staff has to manually reboot and clean up the node. If other jobs were running on the same node, the users have to be notified that their jobs failed.

If your job requests less than a full node, for example, -l nodes=1:ppn=1, it may be scheduled on a node with other running jobs. In this case, your job is entitled to a memory allocation proportional to the number of cores requested. For example, if a system has 4GB per core and you request one core, it is your responsibility to make sure your job uses no more than 4GB. Otherwise your job will interfere with the execution of other jobs.
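The proportional-share rule can be sketched as simple arithmetic (the 4GB-per-core figure is the example from this paragraph, not a universal constant):

```shell
# Memory share for a partial-node request: cores requested times GB per core.
ppn=1              # cores requested, e.g. -l nodes=1:ppn=1
mem_per_core_gb=4  # example figure from this article
entitled=$((ppn * mem_per_core_gb))
echo "entitled to ${entitled}GB of memory"
```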

The memory limit you set in PBS does not work the way one might expect. The only thing the -l mem=xxx flag is good for is requesting a large-memory node. It does not cause your job to be allocated the requested amount of memory, nor does it limit your job’s memory usage.

Note that even if your job isn’t causing problems, swapping is extremely inefficient. Your job will run orders of magnitude slower than it would with effective memory management.

Background

Each node has a fixed amount of physical memory and a fixed amount of disk space designated as swap space. If your program and data don’t fit in physical memory, the virtual memory system writes pages from physical memory to disk as necessary and reads in the pages it needs. This is called swapping. If you use up all the memory and all the swap space, the node crashes with an out-of-memory error.

This explanation really applies to the total memory usage of all programs running on the system. If someone else’s program is using too much memory, it may be pages from your program that get swapped out, and vice versa. This is the reason we aggressively terminate programs using more than their share of memory when there are other jobs on the node.

In the world of high performance computing, swapping is almost always undesirable. If your program does a lot of swapping, it will spend most of its time doing disk I/O and won’t get much computation done. You should consider the suggestions below.

You can find the amount of memory on our systems by following the links on our Supercomputers page. You can see the memory and swap values for a node by running the Linux command free on the node. As shown below, a standard node on Oakley has 48GB physical memory and 46GB swap space.

[n0123]$ free -mo
             total       used       free     shared    buffers     cached
Mem:         48386       2782      45603          0        161       1395
Swap:        46874          0      46874

Suggested solutions

Here are some suggestions for fixing jobs that use too much memory. Feel free to contact OSC Help for assistance with any of these options.

Some of these remedies involve requesting more processors (cores) for your job. As a general rule we require you to request a number of processors proportional to the amount of memory you require. You need to think in terms of using some fraction of a node rather than treating processors and memory separately. If some of the processors remain idle, that’s not a problem. Memory is just as valuable a resource as processors.

Request whole node or more processors

Jobs requesting less than a whole node are those that have nodes=1 with ppn<12 on Oakley or ppn<8 on Glenn, for example nodes=1:ppn=1. These jobs can be problematic for two reasons. First, they are entitled to use an amount of memory proportional to the ppn value requested; if they use more they interfere with other jobs. Second, if they cause a node to crash, it typically affects multiple jobs and multiple users.

If you’re sure about your memory usage, it’s fine to request just the number of processors you need, as long as it’s enough to cover the amount of memory you need. If you’re not sure, play it safe and request all the processors on the node.

Standard Oakley nodes have 4GB per core; standard Glenn nodes have 3GB per core.

Reduce memory usage

Consider whether your job’s memory usage is reasonable in light of the work it’s doing. The code itself typically doesn’t require much memory, so you need to look mostly at the data size.

If you’re developing the code yourself, look for memory leaks. In MATLAB look for large arrays that can be cleared.

An out-of-core algorithm will typically use disk more efficiently than an in-memory algorithm that relies on swapping. Some third-party software gives you a choice of algorithms or allows you to set a limit on the memory the algorithm will use.

Use more nodes for a parallel job

If you have a parallel job you can get more total memory by requesting more nodes. Depending on the characteristics of your code you may also need to run fewer processes per node.

Here’s an example. Suppose your job on Oakley includes the following lines:

#PBS -l nodes=5:ppn=12
…
mpiexec mycode

This job uses 5 nodes, so it has 5*48=240GB total memory available to it. The mpiexec command by default runs one process per core, which in this case is 5*12=60 copies of mycode.

If this job uses too much memory you can spread those 60 processes over more nodes. The following lines request 10 nodes, giving you a total of 10*48=480GB total memory. The -ppn 6 option on the mpiexec command says to run 6 processes per node instead of 12, for a total of 60 as before.

#PBS -l nodes=10:ppn=12
…
mpiexec -ppn 6 mycode

Since parallel jobs are always assigned whole nodes, the following lines will also run 6 processes per node on 10 nodes.

#PBS -l nodes=10:ppn=6
…
mpiexec mycode
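The arithmetic behind both variants can be checked directly (node size and process counts are the Oakley figures from the example above):

```shell
# 60 processes spread over 10 nodes instead of 5: same process count, double the memory.
nodes=10
mem_per_node_gb=48   # standard Oakley node
procs_per_node=6
echo "total memory:    $((nodes * mem_per_node_gb))GB"
echo "total processes: $((nodes * procs_per_node))"
```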

Request large-memory nodes

Oakley has eight nodes with 192GB each, four times the memory of a standard node. Oakley also has one huge-memory node with 1TB of memory; it has 32 cores.

Since there are so few of these nodes, compared to hundreds of standard nodes, jobs requesting them will often have a long wait in the queue. The wait will be worthwhile, though, if these nodes solve your memory problem.

To use the large-memory nodes on Oakley, request between 48gb and 192gb memory and 1 to 12 processors per node. Remember to request a number of processors per node proportional to your memory requirements. In most cases you’ll want to request the whole node (ppn=12). You can request up to 8 nodes but the more you request the longer your queue wait is likely to be.

Example:

#PBS -l nodes=1:ppn=12
#PBS -l mem=192gb
…

To use the huge-memory node on Oakley you must request the whole node (ppn=32). Let the memory default.

#PBS -l nodes=1:ppn=32
…

Put a virtual memory limit on your job

The sections above are intended to help you get your job running correctly. This section is about forcing your job to fail gracefully if it consumes too much memory. If your memory usage is unpredictable, it is preferable to terminate the job when it exceeds a memory usage limit rather than allow it to crowd other jobs or crash a node.

The memory limit enforced by PBS is ineffective because it only limits physical memory usage (resident set size or RSS). When your job reaches its memory limit it simply starts using virtual memory, or swap. PBS allows you to put a limit on virtual memory, but that has problems also.

We will use Linux terminology. Each process has several virtual memory values associated with it. VmSize is virtual memory size; VmRSS is resident set size, or physical memory used; VmSwap is swap space used. The number we care about is the total memory used by the process, which is VmRSS + VmSwap. What PBS allows a job to limit is VmRSS (using -l mem=xxx) or VmSize (using -l vmem=xxx).

The relationship among VmSize, VmRSS, and VmSwap is:  VmSize >= VmRSS+VmSwap. For many programs this bound is fairly tight; for others VmSize can be much larger than the memory actually used.

If the bound is reasonably tight, -l vmem=4gb provides an effective mechanism for limiting memory usage to 4gb (for example). If the bound is not tight, VmSize may prevent the program from starting even if VmRSS+VmSwap would have been perfectly reasonable. Java and some FORTRAN 77 programs in particular have this problem.
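On any Linux system you can inspect these fields for a running process under /proc. This sketch looks at the current shell itself; note that VmSwap may be absent on kernels without swap accounting enabled:

```shell
# Print the virtual memory fields discussed above for the current shell.
# $$ is the shell's own PID; substitute another PID to inspect a different process.
grep -E '^Vm(Size|RSS|Swap):' /proc/$$/status
```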

The vmem limit in PBS is for the entire job, not just one node, so it isn’t useful with parallel (multinode) jobs. PBS also has a per-process virtual memory limit, pvmem. This limit is trickier to use, but it can be useful in some cases.

Here are suggestions for some specific cases.

Serial (single-node) job using program written in C/C++

This case applies to programs written in any language if VmSize is not much larger than VmRSS+VmSwap. If your program doesn’t use any swap space, this means that vmem as reported by qstat -f or the ja command (see below) is not much larger than mem as reported by the same tools.

Set the vmem limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Oakley:

#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb

Parallel (multinode) job using program written in C/C++

This suggestion applies if your processes use approximately equal amounts of memory. See also the comments about other languages under the previous case.

Set the pvmem limit equal to, or slightly larger than, the amount of physical memory on the node divided by the number of processes per node. Example for Oakley, running 12 processes per node:

#PBS -l nodes=5:ppn=12
#PBS -l pvmem=4gb
…
mpiexec mycode
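The pvmem value in that example follows from dividing node memory by processes per node (Oakley figures):

```shell
# pvmem rule of thumb: physical memory per node divided by processes per node.
node_mem_gb=48
procs_per_node=12
echo "pvmem=$((node_mem_gb / procs_per_node))gb"
```
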

Serial (single-node) job using program written in Java

I’ve only slightly tested this suggestion so far, so please provide feedback to judithg@osc.edu.

Start Java with a virtual memory limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Oakley:

#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
…
java -Xms4096m -Xmx4096m MyJavaCode

Other situations

If you have other situations that aren’t covered here, please share them. Contact judithg@osc.edu.

How to monitor your memory usage

qstat -f

While your job is running the command qstat -f jobid will tell you the peak physical and virtual memory usage of the job so far. For a parallel job, these numbers are the aggregate usage across all nodes of the job. The values reported by qstat may lag the true values by a couple of minutes.

free

For parallel (multinode) jobs you can check your per-node memory usage while your job is running by using pdsh -j jobid free -mo on Oakley or all -j jobid free -mo on Glenn.

ja

You can put the command ja (job accounting) at the end of your batch script to capture the resource usage reported by qstat -f. The information will be written to your job output log, job_name.o123456.

OnDemand

You can also view node status graphically via the OSC OnDemand Portal (ondemand.osc.edu). Under "Jobs" select "Active Jobs". Click on "Job Status" and scroll down to see memory usage. This shows the total memory usage for the node; if your job is not the only one running there, it may be hard to interpret.

Below is a typical graph for jobs using too much memory. It shows two jobs that ran back-to-back on the same node. The first peak is a job that used all the available physical memory (blue) and a large amount of swap (purple). It completed successfully without crashing the node. The second job followed the same pattern but actually crashed the node.

Notes

If it appears that your job is close to crashing a node, we may preemptively delete the job.

If your job is interfering with other jobs by using more memory than it should be, we may delete the job.

In extreme cases OSC staff may restrict your ability to submit jobs. If you crash a large number of nodes or continue to submit problem jobs after we have notified you of the situation, this may be the only way to protect the system and our other users. If this happens, we will restore your privileges as soon as you demonstrate that you have resolved the problem.

For assistance

OSC has staff available to help you resolve your memory issues. See our Support Services page for contact information.

System Email

Occasionally, jobs that experience problems may generate emails from staff or automated systems at the center with some information about the nature of the problem. These pages provide additional information about the various emails sent, and steps that can be taken to address the problem.

Batch job aborted

Purpose

Notify you when your job terminates abnormally.

Sample subject line

PBS JOB 944666.oak-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 935619.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/5
Aborted by PBS Server
Job exceeded some resource limit (walltime, mem, etc.). Job was aborted See Administrator for help

Sent under these circumstances

These are fully automated emails sent by the batch system.

Some reasons a job might terminate abnormally:

  • The job exceeded its allotted walltime, memory, virtual memory, or other limited resource. More information is available in your job log file, e.g., jobname.o123456.
  • An unexpected system problem caused your job to fail.

To turn off the emails

There is no way to turn them off at this time.

To prevent these problems

For advice on monitoring and controlling resource usage, see Monitoring and Managing Your Job.

There’s not much you can do about system failures, which fortunately are rare.

Notes

Under some circumstances you can retrieve your job output log if your job aborts due to a system failure. Contact oschelp@osc.edu for assistance.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Batch job begin or end

Purpose

Notify you when your job begins or ends.

Sample subject line

PBS JOB 944666.oak-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 944666.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/1
Begun execution
 
PBS Job Id: 944666.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/1
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=2228kb
resources_used.vmem=211324kb
resources_used.walltime=00:01:00

Sent under these circumstances

These are fully automated emails sent by the batch system. You control them through the headers in your job script. The following line requests emails at the beginning, ending, and abnormal termination of your job.

#PBS -m abe

To turn off the emails

Remove the -m option from your script and/or command line or use -m n. See PBS Directives Summary.

Notes

You can add the following command at the end of your script to have resource information written to your job output log:

ja

For more information

See PBS Directives Summary.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Batch job deleted by an administrator

Purpose

Notify you when your job is deleted by an administrator.

Sample subject line

PBS JOB 9657213.opt-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 9657213.opt-batch.osc.edu
Job Name:   mailtest.job
job deleted
Job deleted at request of staff@opt-login04.osc.edu Job using too much memory. Contact oschelp@osc.edu.

Sent under these circumstances

These emails are sent automatically, but the administrator can add a note with the reason.

Some reasons a running job might be deleted:

  • The job is using so much memory that it threatens to crash the node it is running on.
  • The job is using more resources than it requested and is interfering with other jobs running on the same node.
  • The job is causing excessive load on some part of the system, typically a network file server.
  • The job is still running at the start of a scheduled downtime.

Some reasons a queued job might be deleted:

  • The job requests non-existent resources.
  • A job apparently intended for Oakley (ppn=12) was submitted on Glenn.
  • The job can never run because it requests combinations of resources that are disallowed by policy.
  • The user’s credentials are blocked on the system the job was submitted on.

To turn off the emails

There is no way to turn them off at this time.

To prevent these problems

See the Supercomputing FAQ for suggestions on dealing with specific problems.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.

Emails exceeded the expected volume

Purpose

Notify you that we have placed a hold on emails sent to you from the HPC system.

Sample subject line

Emails sent to email address student@buckeyemail.osu.edu in the last hour exceeded the expected volume

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

When a job fails or is deleted by an administrator, the system sends you an email. If this happens with a large number of jobs, it generates a volume of email that may be viewed as spam by your email provider. To avoid having OSC blacklisted, and to avoid overloading your email account, we hold your emails from OSC.

Please note that these held emails will eventually be deleted if you do not contact us.

Sent under these circumstances

These emails are sent automatically when your email usage from OSC is deferred.

To turn off the emails

Turn off emails related to your batch jobs to reduce your overall email volume from OSC. See the -m option on the PBS Directives Summary page.

Notes

To re-enable email you must contact OSC Help.

For assistance

Contact OSC Help. See our Support Services page for more contact information.


File system load problem

Purpose

Notify you that one or more of your jobs caused excessive load on one of the network file system directory servers.

Sample subject line

Your jobs on Oakley are causing excessive load on fs14

Apparent sender

OSC Help <OSCHelp@osc.edu> or an individual staff member

Explanation

Your jobs are causing problems with one of the network file servers. This is usually caused by submitting a large number of jobs that start at the same time and execute in lockstep.

Sent under these circumstances

These emails are sent by a staff member when the high load is traced to your jobs. Often the jobs have to be stopped or deleted.

To turn off the emails

You cannot turn off these emails. Please don’t ignore them because they report a problem that you must correct.

To prevent these problems

See the Knowledge Base article (coming soon) for suggestions on dealing with file system load problems.

For information on the different file systems available at OSC, see Available File Systems.

Notes

If you continue to submit jobs that cause these problems, your HPC account may be blocked.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.

Job failure due to a system hardware problem

Purpose

Notify you that one or more of your jobs was running on a compute node that crashed due to a hardware problem.

Sample subject line

Failure of job(s) 919137 due to a hardware problem at OSC

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

Your job failed and was not at fault. You should resubmit the job.

Sent under these circumstances

These emails are sent by a systems administrator after a node crashes.

To turn off the emails

We don’t have a mechanism to turn off these emails. If they really bother you, contact OSC Help and we’ll try to accommodate you.

To prevent these problems

Hardware crashes are quite rare and in most cases there’s nothing you can do to prevent them. Certain types of bus errors on Glenn correlate strongly with certain applications (suggesting that they’re not really hardware errors). If you encounter this type of error you may be advised to use Oakley rather than Glenn.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Job failure due to a system software problem

Purpose

Notify you that one or more of your jobs was running on a compute node that crashed due to a system software problem.

Sample subject line

Failure of job(s) 919137 due to a system software problem at OSC

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

Your job failed and was not at fault. You should resubmit the job. Usually the problems are caused by another job running on the node.

Sent under these circumstances

These emails are sent by a systems administrator as part of the node cleanup process.

To turn off the emails

We don’t have a mechanism to turn off these emails. If they really bother you, contact OSC Help and we’ll try to accommodate you.

To prevent these problems

If you request a whole node (nodes=1:ppn=12 on Oakley or nodes=1:ppn=8 on Glenn) your jobs will be less susceptible to problems caused by other jobs. Other than that, be assured that we work hard to keep jobs from interfering with each other.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Job failure due to exhaustion of physical memory

Purpose

Notify you that one or more of your jobs caused compute nodes to crash with an out-of-memory error.

Sample subject line

Failure of job(s) 933014,933174 at OSC due to exhaustion of physical memory

Apparent sender

OSC Help <oschelp@osc.edu>

Explanation

Your job(s) exhausted both physical memory and swap space during job execution. This failure caused the compute node(s) used by the job(s) to crash, requiring a reboot.

Sent under these circumstances

These emails are sent by a systems administrator as part of the node cleanup process.

To turn off the emails

You cannot turn off these emails. Please don’t ignore them because they report a problem that you must correct.

To prevent these problems

See the Knowledge Base article "Out-of-Memory (OOM) or Excessive Memory Usage" for suggestions on dealing with out-of-memory problems.

For information on the memory available on the various systems, see our Supercomputing page.

Notes

If you continue to submit jobs that cause these problems, your HPC account may be blocked.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.