A common problem on our systems is for a user job to run a node out of memory or to use more than its allocated share of memory if the node is shared with other jobs.
If a job exhausts both the physical memory and the swap space on a node, it causes the node to crash. With a parallel job, there may be many nodes that crash. When a node crashes, the systems staff has to manually reboot and clean up the node. If other jobs were running on the same node, the users have to be notified that their jobs failed.
If your job requests less than a full node, for example, -l nodes=1:ppn=1, it may be scheduled on a node with other running jobs. In this case, your job is entitled to a memory allocation proportional to the number of cores requested. For example, if a system has 4GB per core and you request one core, it is your responsibility to make sure your job uses no more than 4GB. Otherwise your job will interfere with the execution of other jobs.
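The proportional rule above is simple arithmetic, sketched here as a small shell function. The function name mem_share_gb is hypothetical, and the 4GB-per-core figure is only the example value used above; check the real per-core memory for the system you run on.

```shell
# Memory share on a shared node: cores requested x memory per core.
# Usage: mem_share_gb CORES GB_PER_CORE   (both positive integers)
# mem_share_gb is a hypothetical helper name, not an OSC or PBS command.
mem_share_gb() {
    local cores=$1 gb_per_core=$2
    echo $(( cores * gb_per_core ))
}

mem_share_gb 1 4   # the example above: one core at 4GB per core, prints 4
mem_share_gb 4 4   # a four-core request on the same system, prints 16
```

If your job's peak usage is above the printed value, request more cores.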
The only thing the -l mem=xxx flag is good for is requesting a large-memory node. It does not cause your job to be allocated the requested amount of memory, nor does it limit your job’s memory usage.
Each node has a fixed amount of physical memory and a fixed amount of disk space designated as swap space. If your program and data don’t fit in physical memory, the virtual memory system writes pages from physical memory to disk as necessary and reads in the pages it needs. This is called swapping.
You can find the amount of memory on our systems by following the links on our Supercomputers page. You can see the memory and swap values for a node by running the Linux command free on the node. As shown below, a standard node on Pitzer has 187GB physical memory and 47GB swap space.
[p0123]$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           187G        8.9G        176G        626M        2.4G        176G
Swap:           47G          0B         47G
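If you want these two totals in a script rather than reading them by eye, a minimal sketch is to filter the output of free -m with awk. The function name mem_and_swap_mb is hypothetical; the sketch assumes the usual layout in which the Mem: and Swap: rows carry the total in the second column.

```shell
# Print total physical memory and total swap (in MB) from `free -m` output on stdin.
# mem_and_swap_mb is a hypothetical helper name.
mem_and_swap_mb() {
    awk '/^Mem:/  { mem = $2 }
         /^Swap:/ { swap = $2 }
         END      { printf "mem_mb=%s swap_mb=%s\n", mem, swap }'
}

# On a compute node you would run:
#   free -m | mem_and_swap_mb
```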
In the world of high-performance computing, swapping is almost always undesirable. If your program does a lot of swapping, it will spend most of its time doing disk I/O and won’t get much computation done. Therefore, swapping is not supported at OSC. You should consider the suggestions below.
Here are some suggestions for fixing jobs that use too much memory. Feel free to contact OSC Help for assistance with any of these options.
Some of these remedies involve requesting more processors (cores) for your job. As a general rule, we require you to request a number of processors proportional to the amount of memory you require. You need to think in terms of using some fraction of a node rather than treating processors and memory separately. If some of the processors remain idle, that’s not a problem. Memory is just as valuable a resource as processors.
Request whole node or more processors
Jobs that request less than a whole node are those with nodes=1 and ppn<40 on Pitzer, for example nodes=1:ppn=1. These jobs can be problematic for two reasons. First, they are entitled to use an amount of memory proportional to the ppn value requested; if they use more, they interfere with other jobs. Second, if they cause a node to crash, it typically affects multiple jobs and multiple users.
If you’re sure about your memory usage, it’s fine to request just the number of processors you need, as long as it’s enough to cover the amount of memory you need. If you’re not sure, play it safe and request all the processors on the node.
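To turn a memory requirement into a ppn value, round up the ratio of the memory you need to the memory per core. The sketch below does this with integer arithmetic. The function name cores_for_mem is hypothetical, and the node figures (40 cores, 183GB usable, written as 187392MB) are taken from the Pitzer examples on this page; treat them as assumptions and substitute the values for your system.

```shell
# Smallest ppn whose proportional memory share covers NEED_MB.
# Usage: cores_for_mem NEED_MB NODE_MEM_MB NODE_CORES
# cores_for_mem is a hypothetical helper name.
cores_for_mem() {
    local need_mb=$1 node_mem_mb=$2 node_cores=$3
    # ceil(need_mb * node_cores / node_mem_mb) using integer arithmetic
    local ppn=$(( (need_mb * node_cores + node_mem_mb - 1) / node_mem_mb ))
    # A result above the core count means the job does not fit on one node.
    if [ "$ppn" -gt "$node_cores" ]; then
        echo "memory need exceeds one node" >&2
        return 1
    fi
    echo "$ppn"
}

# Assumed figures: 40 cores and 183GB (187392MB) usable per node.
cores_for_mem 30720 187392 40   # a 30GB job, prints 7
```

With these assumed figures, a job that needs 30GB would request nodes=1:ppn=7 even if it only runs one process.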
Reduce memory usage
Consider whether your job’s memory usage is reasonable in light of the work it’s doing. The code itself typically doesn’t require much memory, so you need to look mostly at the data size.
If you’re developing the code yourself, look for memory leaks. In MATLAB look for large arrays that can be cleared.
An out-of-core algorithm will typically use disk more efficiently than an in-memory algorithm that relies on swapping. Some third-party software gives you a choice of algorithms or allows you to set a limit on the memory the algorithm will use.
Use more nodes for a parallel job
If you have a parallel job you can get more total memory by requesting more nodes. Depending on the characteristics of your code you may also need to run fewer processes per node.
Here’s an example. Suppose your job on Pitzer includes the following lines:
#PBS -l nodes=2:ppn=40
…
mpiexec mycode
This job uses 2 nodes, so it has 2*183=366GB of total memory available to it. The mpiexec command by default runs one process per core, which in this case is 2*40=80 copies of mycode.
If this job uses too much memory, you can spread those 80 processes over more nodes. The following lines request 4 nodes, giving you a total of 4*183=732GB of memory. The -ppn 20 option on the mpiexec command says to run 20 processes per node instead of 40, for a total of 80 as before.
#PBS -l nodes=4:ppn=40
…
mpiexec -ppn 20 mycode
Since parallel jobs are always assigned whole nodes, the following lines will also run 20 processes per node on 4 nodes.
#PBS -l nodes=4:ppn=20
…
mpiexec mycode
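The node count in the example above can be worked out from the total memory the job needs. The following is a rough sketch that assumes memory use is spread evenly across processes. The function name spread_job is hypothetical, and 183GB per node is the Pitzer figure used in the example.

```shell
# Usage: spread_job TOTAL_PROCS NEED_GB GB_PER_NODE
# Prints a node count that covers NEED_GB and the resulting processes per node.
# spread_job is a hypothetical helper name; it assumes evenly distributed memory use.
spread_job() {
    local procs=$1 need_gb=$2 gb_per_node=$3
    # Round up so the nodes together hold at least NEED_GB.
    local nodes=$(( (need_gb + gb_per_node - 1) / gb_per_node ))
    # Round up so every process is placed on some node.
    local per_node=$(( (procs + nodes - 1) / nodes ))
    echo "nodes=$nodes procs_per_node=$per_node"
}

spread_job 80 700 183   # prints nodes=4 procs_per_node=20
```

The sketch does not check that the processes per node fit within the cores on a node; verify that before using the result in a nodes=...:ppn=... request.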
Request large-memory nodes
Pitzer has four huge-memory nodes, each with 3TB of memory and 80 cores. Owens has sixteen huge-memory nodes, each with 1.5TB of memory and 48 cores.
Since there are so few of these nodes, compared to hundreds of standard nodes, jobs requesting them will often have a long wait in the queue. The wait will be worthwhile, though, if these nodes solve your memory problem.
To use the huge memory nodes on Pitzer, request the whole node (ppn=80).
#PBS -l nodes=1:ppn=80,mem=3000GB
To use the huge-memory nodes on Owens, request the whole node (ppn=48) as well.
#PBS -l nodes=1:ppn=48 …
Some knowledge about virtual memory
The sections above are intended to help you get your job running correctly. This section provides some general background on virtual memory.
We will use Linux terminology. Each process has several virtual memory values associated with it. VmSize is virtual memory size; VmRSS is resident set size, or physical memory used; VmSwap is swap space used. The number we care about is the total memory used by the process, which is VmRSS + VmSwap. The relationship among VmSize, VmRSS, and VmSwap is: VmSize >= VmRSS + VmSwap. For many programs, this bound is fairly tight; for others VmSize can be much larger than the memory actually used. If the bound is reasonably tight, -l vmem=4gb provides an effective mechanism for limiting memory usage to 4gb (for example). If the bound is not tight, VmSize may prevent the program from starting even if VmRSS + VmSwap would have been perfectly reasonable. Java and some FORTRAN 77 programs in particular have this problem.
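On Linux, these three values appear in /proc/<pid>/status, reported in kB. As an illustration, the sketch below extracts them and computes VmRSS + VmSwap so you can see how tight the bound is for a given process. The function name vm_summary is hypothetical.

```shell
# Print VmSize, VmRSS, VmSwap and VmRSS+VmSwap (all in kB) from
# /proc/<pid>/status-style text on stdin.
# vm_summary is a hypothetical helper name.
vm_summary() {
    awk '/^VmSize:/ { size = $2 }
         /^VmRSS:/  { rss  = $2 }
         /^VmSwap:/ { swap = $2 }
         END { printf "VmSize=%d VmRSS=%d VmSwap=%d used=%d\n", size, rss, swap, rss + swap }'
}

# For the current shell:      vm_summary < /proc/self/status
# For another process (pid):  vm_summary < /proc/<pid>/status
```

If used is far below VmSize, a vmem limit based on the memory the program really needs is likely to stop it from starting.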
PBS allows a job to limit either VmRSS (using -l mem=xxx) or VmSize (using -l vmem=xxx). The vmem limit in PBS is for the entire job, not just one node, so it isn’t useful with parallel (multinode) jobs.
The vmem limit option is not supported anymore.
Feel free to contact OSC Help for assistance if you would like to use the vmem option or have a situation that isn’t covered here.
How to monitor your memory usage
While your job is running, the command qstat -f jobid reports the peak physical and virtual memory usage of the job so far. For a parallel job, these numbers are the aggregate usage across all nodes of the job. The values reported by qstat may lag the true values by a couple of minutes.
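The full qstat -f listing is long, so it can help to filter it down to the memory lines. The sketch below assumes the Torque-style field names resources_used.mem and resources_used.vmem, written as "name = value"; if your output uses different names, adjust the pattern. The function name job_mem_usage is hypothetical.

```shell
# Keep only the peak physical and virtual memory lines from `qstat -f` output on stdin.
# Assumes Torque-style "resources_used.mem = <value>" lines.
# job_mem_usage is a hypothetical helper name.
job_mem_usage() {
    awk -F' = ' '/resources_used\.(mem|vmem) =/ {
        gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
        print $1 "=" $2
    }'
}

# While the job is running:
#   qstat -f jobid | job_mem_usage
```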
For parallel (multinode) jobs, you can check your per-node memory usage while your job is running by using pdsh -j jobid free -m.
You can put the command ja (job accounting) at the end of your batch script to capture the resource usage reported by qstat -f. The information will be written to your job output log.
You can also view node status graphically in the OSC OnDemand Portal (ondemand.osc.edu). Under "Jobs" select "Active Jobs". Click on "Job Status" and scroll down to see memory usage. This shows the total memory usage for the node; if your job is not the only one running there, it may be hard to interpret.
Below is a typical graph for jobs using too much memory. It shows two jobs that ran back-to-back on the same node. The first peak is a job that used all the available physical memory (blue) and a large amount of swap (purple). It completed successfully without crashing the node. The second job followed the same pattern but actually crashed the node.
If it appears that your job is close to crashing a node, we may preemptively delete the job.
If your job is interfering with other jobs by using more memory than it should, we may delete the job.
In extreme cases OSC staff may restrict your ability to submit jobs. If you crash a large number of nodes or continue to submit problem jobs after we have notified you of the situation, this may be the only way to protect the system and our other users. If this happens, we will restore your privileges as soon as you demonstrate that you have resolved the problem.
For details on retrieving files from unexpectedly terminated jobs see this FAQ.
OSC has staff available to help you resolve your memory issues. See our Support Services page for contact information.