Out-of-Memory (OOM) or Excessive Memory Usage

Problem description

A common problem on our systems is for a user job to run a node out of memory or to use more than its allocated share of memory if the node is shared with other jobs.

If a job exhausts both the physical memory and the swap space on a node, it causes the node to crash. With a parallel job, there may be many nodes that crash. When a node crashes, the systems staff has to manually reboot and clean up the node. If other jobs were running on the same node, the users have to be notified that their jobs failed.

If your job requests less than a full node, for example, -l nodes=1:ppn=1, it may be scheduled on a node with other running jobs. In this case, your job is entitled to a memory allocation proportional to the number of cores requested. For example, if a system has 4GB per core and you request one core, it is your responsibility to make sure your job uses no more than 4GB. Otherwise your job will interfere with the execution of other jobs.

The memory limit you set in PBS does not work the way one might expect it to. The only thing the -l mem=xxx flag is good for is requesting a large-memory node. It does not cause your job to be allocated the requested amount of memory, nor does it limit your job’s memory usage.
Note that even if your job isn’t causing problems, swapping is extremely inefficient. Your job will run orders of magnitude slower than it would with effective memory management.

Background

Each node has a fixed amount of physical memory and a fixed amount of disk space designated as swap space. If your program and data don’t fit in physical memory, the virtual memory system writes pages from physical memory to disk as necessary and reads in the pages it needs. This is called swapping. If you use up all the memory and all the swap space, the node crashes with an out-of-memory error.

This explanation really applies to the total memory usage of all programs running on the system. If someone else’s program is using too much memory, it may be pages from your program that get swapped out, and vice versa. This is the reason we aggressively terminate programs using more than their share of memory when there are other jobs on the node.

In the world of high performance computing, swapping is almost always undesirable. If your program does a lot of swapping, it will spend most of its time doing disk I/O and won’t get much computation done. You should consider the suggestions below.

You can find the amount of memory on our systems by following the links on our Supercomputers page. You can see the memory and swap values for a node by running the Linux command free on the node. As shown below, a standard node on Pitzer has 187GB physical memory and 47GB swap space.

[p0123]$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           187G        8.9G        176G        626M        2.4G        176G
Swap:           47G          0B         47G​

Suggested solutions

Here are some suggestions for fixing jobs that use too much memory. Feel free to contact OSC Help for assistance with any of these options.

Some of these remedies involve requesting more processors (cores) for your job. As a general rule, we require you to request a number of processors proportional to the amount of memory you require. You need to think in terms of using some fraction of a node rather than treating processors and memory separately. If some of the processors remain idle, that’s not a problem. Memory is just as valuable a resource as processors.

Request whole node or more processors

Jobs requesting less than a whole node are those that have nodes=1 with ppn<40 on Pitzer, for example nodes=1:ppn=1. These jobs can be problematic for two reasons. First, they are entitled to use an amount of memory proportional to the ppn value requested; if they use more they interfere with other jobs. Second, if they cause a node to crash, it typically affects multiple jobs and multiple users.

If you’re sure about your memory usage, it’s fine to request just the number of processors you need, as long as it’s enough to cover the amount of memory you need. If you’re not sure, play it safe and request all the processors on the node.

Standard Pitzer nodes have 4761MB per core.

Reduce memory usage

Consider whether your job’s memory usage is reasonable in light of the work it’s doing. The code itself typically doesn’t require much memory, so you need to look mostly at the data size.

If you’re developing the code yourself, look for memory leaks. In MATLAB look for large arrays that can be cleared.

An out-of-core algorithm will typically use disk more efficiently than an in-memory algorithm that relies on swapping. Some third-party software gives you a choice of algorithms or allows you to set a limit on the memory the algorithm will use.

Use more nodes for a parallel job

If you have a parallel job you can get more total memory by requesting more nodes. Depending on the characteristics of your code you may also need to run fewer processes per node.

Here’s an example. Suppose your job on Pitzer includes the following lines:

#PBS -l nodes=2:ppn=40
…
mpiexec mycode

This job uses 2 nodes, so it has 2*183=366GB total memory available to it. The mpiexec command by default runs one process per core, which in this case is 2*40=80 copies of mycode.

If this job uses too much memory you can spread those 80 processes over more nodes. The following lines request 4 nodes, giving you a total of 4*183=732GB total memory. The -ppn 20 option on the mpiexec command says to run 20 processes per node instead of 40, for a total of 80 as before.

#PBS -l nodes=4:ppn=40
…
mpiexec -ppn 20 mycode

Since parallel jobs are always assigned whole nodes, the following lines will also run 20 processes per node on 4 nodes.

#PBS -l nodes=4:ppn=20
…
mpiexec mycode

Request large-memory nodes

Pitzer has four huge memory nodes with 3TB of memory and with 80 cores. Owens has sixteen huge memory nodes with 1.5 TB of memory and with 48 cores.

Since there are so few of these nodes, compared to hundreds of standard nodes, jobs requesting them will often have a long wait in the queue. The wait will be worthwhile, though, If these nodes solve your memory problem.

To use the huge memory nodes on Pitzer, request the whole node (ppn=80). 

Example:

#PBS -l nodes=1:ppn=80
…

To use the huge-memory node on Owens you request the whole node (ppn=48) as well.

#PBS -l nodes=1:ppn=48
…

Put a virtual memory limit on your job

The sections above are intended to help you get your job running correctly. This section is about forcing your job to fail gracefully if it consumes too much memory. If your memory usage is unpredictable, it is preferable to terminate the job when it exceeds a memory usage limit rather than allow it to crowd other jobs or crash a node.

The memory limit enforced by PBS is ineffective because it only limits physical memory usage (resident set size or RSS). When your job reaches its memory limit it simply starts using virtual memory, or swap. PBS allows you to put a limit on virtual memory, but that has problems also.

We will use Linux terminology. Each process has several virtual memory values associated with it. VmSize is virtual memory size; VmRSS is resident set size, or physical memory used; VmSwap is swap space used. The number we care about is the total memory used by the process, which is VmRSS + VmSwap. What PBS allows a job to limit is VmRSS (using -l mem=xxx) or VmSize (using -l vmem=xxx).

The relationship among VmSize, VmRSS, and VmSwap is:  VmSize >= VmRSS+VmSwap. For many programs this bound is fairly tight; for others VmSize can be much larger than the memory actually used.

If the bound is reasonably tight, -l vmem=4gb provides an effective mechanism for limiting memory usage to 4gb (for example). If the bound is not tight, VmSize may prevent the program from starting even if VmRSS+VmSwap would have been perfectly reasonable. Java and some FORTRAN 77 programs in particular have this problem.

The vmem limit in PBS is for the entire job, not just one node, so it isn’t useful with parallel (multimode) jobs. PBS also has a per-process virtual memory limit, pvmem. This limit is trickier to use, but it can be useful in some cases.

Here are suggestions for some specific cases.

Serial (single-node) job using program written in C/C++

This case applies to programs written in any language if VmSize is not much larger than VmRSS+VmSwap. If your program doesn’t use any swap space, this means that vmem as reported by qstat -f or the ja command (see below) is not much larger mem as reported by the same tools.

Set the vmem limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Pitzer:

#PBS -l nodes=1:ppn=1
#PBS -l vmem=4761mb
Parallel (multinode) job using program written in C/C++

This suggestion applies if your processes use approximately equal amounts of memory. See also the comments about other languages under the previous case.

Set the pvmem limit equal to, or slightly larger than, the amount of physical memory on the node divided by the number of processes per node. Example for Pitzer, running 40 processes per node:

#PBS -l nodes=2:ppn=40
#PBS -l pvmem=4761mb
…
mpiexec mycode
Serial (single-node) job using program written in Java

I’ve only slightly tested this suggestion so far, so please provide feedback to oschelp@osc.edu.

Start Java with a virtual memory limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Pitzer:

#PBS -l nodes=1:ppn=1
#PBS -l vmem=4761mb
…
java -Xms4761m -Xmx4761m MyJavaCode
Other situations

If you have other situations that aren’t covered here, please share them. Contact oschelp@osc.edu.

How to monitor your memory usage

qstat -f

While your job is running the command qstat -f jobid will tell you the peak physical and virtual memory usage of the job so far. For a parallel job, these numbers are the aggregate usage across all nodes of the job. The values reported by qstat may lag the true values by a couple of minutes.

free

For parallel (multinode) jobs you can check your per-node memory usage while your job is running by using pdsh -j jobid free -m.

ja

You can put the command ja (job accounting) at the end of your batch script to capture the resource usage reported by qstat -f. The information will be written to your job output log, job_name.o123456.

OnDemand

You can also view node status graphically the OSC OnDemand Portal (ondemand.osc.edu).  Under "Jobs" select "Active Jobs". Click on "Job Status" and scroll down to see memory usage. This shows the total memory usage for the node; if your job is not the only one running there, it may be hard to interpret.

Below is a typical graph for jobs using too much memory. It shows two jobs that ran back-to-back on the same node. The first peak is a job that used all the available physical memory (blue) and a large amount of swap (purple). It completed successfully without crashing the node. The second job followed the same pattern but actually crashed the node.

Notes

If it appears that your job is close to crashing a node, we may preemptively delete the job.

If your job is interfering with other jobs by using more memory than it should be, we may delete the job.

In extreme cases OSC staff may restrict your ability to submit jobs. If you crash a large number of nodes or continue to submit problem jobs after we have notified you of the situation, this may be the only way to protect the system and our other users. If this happens, we will restore your privileges as soon as you demonstrate that you have resolved the problem.

For details on retrieving files from unexpectedly terminated jobs see this FAQ.

For assistance

OSC has staff available to help you resolve your memory issues. See our Support Services page for contact information.