A common problem on our systems is for a user job to run a node out of memory or to use more than its allocated share of memory if the node is shared with other jobs.
If a job exhausts both the physical memory and the swap space on a node, it causes the node to crash. With a parallel job, there may be many nodes that crash. When a node crashes, the systems staff has to manually reboot and clean up the node. If other jobs were running on the same node, the users have to be notified that their jobs failed.
If your job requests less than a full node, for example, -l nodes=1:ppn=1, it may be scheduled on a node with other running jobs. In this case, your job is entitled to a memory allocation proportional to the number of cores requested. For example, if a system has 4GB per core and you request one core, it is your responsibility to make sure your job uses no more than 4GB. Otherwise your job will interfere with the execution of other jobs.
The -l mem=xxx flag is useful only for requesting a large-memory node. It does not cause your job to be allocated the requested amount of memory, nor does it limit your job’s memory usage.
Each node has a fixed amount of physical memory and a fixed amount of disk space designated as swap space. If your program and data don’t fit in physical memory, the virtual memory system writes pages from physical memory to disk as necessary and reads in the pages it needs. This is called swapping. If you use up all the memory and all the swap space, the node crashes with an out-of-memory error.
This explanation really applies to the total memory usage of all programs running on the system. If someone else’s program is using too much memory, it may be pages from your program that get swapped out, and vice versa. This is the reason we aggressively terminate programs using more than their share of memory when there are other jobs on the node.
In the world of high performance computing, swapping is almost always undesirable. If your program does a lot of swapping, it will spend most of its time doing disk I/O and won’t get much computation done. You should consider the suggestions below.
You can find the amount of memory on our systems by following the links on our Supercomputers page. You can see the memory and swap values for a node by running the Linux command free on the node. As shown below, a standard node on Oakley has 48GB physical memory and 46GB swap space.
[n0123]$ free -mo
             total       used       free     shared    buffers     cached
Mem:         48386       2782      45603          0        161       1395
Swap:        46874          0      46874
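The same totals are available in /proc/meminfo, which is where free gets its numbers on Linux. The sketch below is only an illustration, not an OSC-provided tool; the function name mem_totals_mb is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print physical memory and swap totals in MB,
# given /proc/meminfo-formatted text on standard input.
mem_totals_mb() {
    awk '
        /^MemTotal:/  { mem  = int($2 / 1024) }   # the file reports kB
        /^SwapTotal:/ { swap = int($2 / 1024) }
        END { printf "Mem: %d MB  Swap: %d MB\n", mem, swap }
    '
}

# Run this on the node you are interested in (Linux only).
if [ -r /proc/meminfo ]; then
    mem_totals_mb < /proc/meminfo
fi
```

Because the function reads standard input, you can also point it at a saved copy of /proc/meminfo from a compute node.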
Here are some suggestions for fixing jobs that use too much memory. Feel free to contact OSC Help for assistance with any of these options.
Some of these remedies involve requesting more processors (cores) for your job. As a general rule we require you to request a number of processors proportional to the amount of memory you require. You need to think in terms of using some fraction of a node rather than treating processors and memory separately. If some of the processors remain idle, that’s not a problem. Memory is just as valuable a resource as processors.
Request whole node or more processors
Jobs requesting less than a whole node are those that have nodes=1 with ppn<12 on Oakley, for example nodes=1:ppn=1. These jobs can be problematic for two reasons. First, they are entitled to use an amount of memory proportional to the ppn value requested; if they use more they interfere with other jobs. Second, if they cause a node to crash, it typically affects multiple jobs and multiple users.
If you’re sure about your memory usage, it’s fine to request just the number of processors you need, as long as it’s enough to cover the amount of memory you need. If you’re not sure, play it safe and request all the processors on the node.
Standard Oakley nodes have 4GB per core.
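As a rough sketch of the proportionality rule, the arithmetic below works out the smallest ppn whose memory share covers a given requirement. The function name ppn_for_mem and its arguments are hypothetical; the 4GB-per-core and 12-core figures are the standard Oakley values given above.

```shell
#!/bin/bash
# Hypothetical helper: smallest ppn whose memory share covers MEM_GB.
# Usage: ppn_for_mem MEM_GB GB_PER_CORE CORES_PER_NODE
ppn_for_mem() {
    local mem_gb=$1 gb_per_core=$2 cores=$3
    # Integer ceiling division: round up to a whole core.
    local ppn=$(( (mem_gb + gb_per_core - 1) / gb_per_core ))
    if [ "$ppn" -lt 1 ]; then ppn=1; fi
    if [ "$ppn" -gt "$cores" ]; then
        echo "more memory than one standard node provides" >&2
        return 1
    fi
    echo "$ppn"
}

# Standard Oakley node: 4GB per core, 12 cores.
ppn_for_mem 10 4 12    # a 10GB job needs ppn=3 (a 12GB share)
```

If the function reports that the requirement exceeds a standard node, the large-memory options described below are the ones to look at.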
Reduce memory usage
Consider whether your job’s memory usage is reasonable in light of the work it’s doing. The code itself typically doesn’t require much memory, so you need to look mostly at the data size.
If you’re developing the code yourself, look for memory leaks. In MATLAB look for large arrays that can be cleared.
An out-of-core algorithm will typically use disk more efficiently than an in-memory algorithm that relies on swapping. Some third-party software gives you a choice of algorithms or allows you to set a limit on the memory the algorithm will use.
Use more nodes for a parallel job
If you have a parallel job you can get more total memory by requesting more nodes. Depending on the characteristics of your code you may also need to run fewer processes per node.
Here’s an example. Suppose your job on Oakley includes the following lines:
#PBS -l nodes=5:ppn=12
…
mpiexec mycode
This job uses 5 nodes, so it has 5*48=240GB total memory available to it. The mpiexec command by default runs one process per core, which in this case is 5*12=60 copies of mycode.
If this job uses too much memory you can spread those 60 processes over more nodes. The following lines request 10 nodes, giving you 10*48=480GB total memory. The -ppn 6 option on the mpiexec command says to run 6 processes per node instead of 12, for a total of 60 as before.
#PBS -l nodes=10:ppn=12
…
mpiexec -ppn 6 mycode
Since parallel jobs are always assigned whole nodes, the following lines will also run 6 processes per node on 10 nodes.
#PBS -l nodes=10:ppn=6
…
mpiexec mycode
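The arithmetic in this example can be checked with a few lines of shell. This is only a sketch; the function name job_memory is made up, and the 48GB-per-node figure is the standard Oakley value used above.

```shell
#!/bin/bash
# Hypothetical helper: total memory and process count for a job layout.
# Usage: job_memory NODES PROCS_PER_NODE GB_PER_NODE
job_memory() {
    local nodes=$1 procs_per_node=$2 gb_per_node=$3
    echo "total memory: $(( nodes * gb_per_node ))GB"
    echo "processes: $(( nodes * procs_per_node ))"
}

job_memory 5 12 48     # original layout: 240GB, 60 processes
job_memory 10 6 48     # same 60 processes spread over 10 nodes: 480GB
```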
Request large-memory nodes
Oakley has eight nodes with 192GB each, four times the memory of a standard node. Oakley also has one huge-memory node with 1TB of memory; it has 32 cores.
Since there are so few of these nodes, compared to hundreds of standard nodes, jobs requesting them will often have a long wait in the queue. The wait will be worthwhile, though, if these nodes solve your memory problem.
To use the large-memory nodes on Oakley, request between 48gb and 192gb memory and 1 to 12 processors per node. Remember to request a number of processors per node proportional to your memory requirements. In most cases you’ll want to request the whole node (ppn=12). You can request up to 8 nodes but the more you request the longer your queue wait is likely to be.
#PBS -l nodes=1:ppn=12
#PBS -l mem=192gb
…
To use the huge-memory node on Oakley you must request the whole node (ppn=32). Let the memory default.
#PBS -l nodes=1:ppn=32
…
Put a virtual memory limit on your job
The sections above are intended to help you get your job running correctly. This section is about forcing your job to fail gracefully if it consumes too much memory. If your memory usage is unpredictable, it is preferable to terminate the job when it exceeds a memory usage limit rather than allow it to crowd other jobs or crash a node.
The memory limit enforced by PBS is ineffective because it only limits physical memory usage (resident set size or RSS). When your job reaches its memory limit it simply starts using virtual memory, or swap. PBS allows you to put a limit on virtual memory, but that has problems also.
We will use Linux terminology. Each process has several virtual memory values associated with it. VmSize is virtual memory size; VmRSS is resident set size, or physical memory used; VmSwap is swap space used. The number we care about is the total memory used by the process, which is VmRSS + VmSwap. What PBS allows a job to limit is VmRSS (using -l mem=xxx) or VmSize (using -l vmem=xxx).
The relationship among VmSize, VmRSS, and VmSwap is: VmSize >= VmRSS+VmSwap. For many programs this bound is fairly tight; for others VmSize can be much larger than the memory actually used.
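To see how tight the bound is for a particular program, you can compare these values in /proc/PID/status while the process is running. The sketch below is an illustration only; the function name vm_summary is made up, and it assumes a Linux kernel that reports the VmSwap field.

```shell
#!/bin/bash
# Hypothetical helper: report VmSize and VmRSS+VmSwap (in kB) from
# /proc/PID/status-formatted text on standard input.
vm_summary() {
    awk '
        /^VmSize:/ { size = $2 }
        /^VmRSS:/  { rss  = $2 }
        /^VmSwap:/ { swap = $2 }
        END { printf "VmSize=%d kB  VmRSS+VmSwap=%d kB\n", size, rss + swap }
    '
}

# Inspect the current shell process as a demonstration (Linux only).
if [ -r "/proc/$$/status" ]; then
    vm_summary < "/proc/$$/status"
fi
```

To inspect your own program, replace $$ with its process ID on the node where it is running. If the two numbers are close, a VmSize limit is a reasonable stand-in for a limit on memory actually used.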
If the bound is reasonably tight, -l vmem=4gb provides an effective mechanism for limiting memory usage to 4gb (for example). If the bound is not tight, a limit on VmSize may prevent the program from starting even if VmRSS+VmSwap would have been perfectly reasonable. Java and some FORTRAN 77 programs in particular have this problem.
The vmem limit in PBS is for the entire job, not just one node, so it isn’t useful with parallel (multinode) jobs. PBS also has a per-process virtual memory limit, pvmem. This limit is trickier to use, but it can be useful in some cases.
Here are suggestions for some specific cases.
Serial (single-node) job using program written in C/C++
This case applies to programs written in any language if VmSize is not much larger than VmRSS+VmSwap. If your program doesn’t use any swap space, this means that vmem as reported by qstat -f or the ja command (see below) is not much larger than mem as reported by the same tools.
Set a vmem limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Oakley:
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
Parallel (multinode) job using program written in C/C++
This suggestion applies if your processes use approximately equal amounts of memory. See also the comments about other languages under the previous case.
Set a pvmem limit equal to, or slightly larger than, the amount of physical memory on the node divided by the number of processes per node. Example for Oakley, running 12 processes per node:
#PBS -l nodes=5:ppn=12
#PBS -l pvmem=4gb
…
mpiexec mycode
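If you run a different number of processes per node, the pvmem value changes accordingly. The following sketch does the division for you; the function name pvmem_gb is made up, and 48GB is the standard Oakley node size given earlier.

```shell
#!/bin/bash
# Hypothetical helper: per-process virtual memory limit in GB.
# Usage: pvmem_gb NODE_MEM_GB PROCS_PER_NODE
pvmem_gb() {
    local node_mem_gb=$1 procs_per_node=$2
    # Round up so the limit is equal to, or slightly larger than, the share.
    echo $(( (node_mem_gb + procs_per_node - 1) / procs_per_node ))
}

echo "#PBS -l pvmem=$(pvmem_gb 48 12)gb"   # 12 processes per node
echo "#PBS -l pvmem=$(pvmem_gb 48 6)gb"    # 6 processes per node
```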
Serial (single-node) job using program written in Java
I’ve only slightly tested this suggestion so far, so please provide feedback to firstname.lastname@example.org.
Start Java with a virtual memory limit equal to, or slightly larger than, the number of processors requested (ppn) times the memory available per processor. Example for Oakley:
#PBS -l nodes=1:ppn=1
#PBS -l vmem=4gb
…
java -Xms4096m -Xmx4096m MyJavaCode
If you have other situations that aren’t covered here, please share them. Contact email@example.com.
How to monitor your memory usage
While your job is running, the command qstat -f jobid will tell you the peak physical and virtual memory usage of the job so far. For a parallel job, these numbers are the aggregate usage across all nodes of the job. The values reported by qstat may lag the true values by a couple of minutes.
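The output of qstat -f is long, so it can help to filter out just the memory lines. The sketch below assumes the usage fields are named resources_used.mem and resources_used.vmem, as in Torque/PBS; the function name job_mem_usage is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: pull the memory usage fields out of `qstat -f`
# output supplied on standard input.
job_mem_usage() {
    awk -F' = ' '
        $1 ~ /resources_used\.(mem|vmem)$/ {
            sub(/^[ \t]+/, "", $1)      # strip leading indentation
            print $1 ": " $2
        }
    '
}

# On a login node (jobid is a placeholder for your job ID):
#   qstat -f jobid | job_mem_usage
```

If the filter prints nothing, check the raw qstat -f output; the field names on your system may differ from the ones assumed here.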
For parallel (multinode) jobs you can check your per-node memory usage while your job is running by using
pdsh -j jobid free -mo
You can put the command ja (job accounting) at the end of your batch script to capture the resource usage reported by qstat -f. The information will be written to your job output log.
You can also view node status graphically in the OSC OnDemand Portal (ondemand.osc.edu). Under "Jobs" select "Active Jobs". Click on "Job Status" and scroll down to see memory usage. This shows the total memory usage for the node; if your job is not the only one running there, it may be hard to interpret.
Below is a typical graph for jobs using too much memory. It shows two jobs that ran back-to-back on the same node. The first peak is a job that used all the available physical memory (blue) and a large amount of swap (purple). It completed successfully without crashing the node. The second job followed the same pattern but actually crashed the node.
If it appears that your job is close to crashing a node, we may preemptively delete the job.
If your job is interfering with other jobs by using more than its share of memory, we may delete the job.
In extreme cases OSC staff may restrict your ability to submit jobs. If you crash a large number of nodes or continue to submit problem jobs after we have notified you of the situation, this may be the only way to protect the system and our other users. If this happens, we will restore your privileges as soon as you demonstrate that you have resolved the problem.
For details on retrieving files from unexpectedly terminated jobs see this FAQ.
OSC has staff available to help you resolve your memory issues. See our Support Services page for contact information.