A common problem on our systems is for a user job to run a node out of memory or to use more than its allocated share of memory if the node is shared with other jobs.
If a job exhausts both the physical memory and the swap space on a node, it causes the node to crash. With a parallel job, there may be many nodes that crash. When a node crashes, the systems staff has to manually reboot and clean up the node. If other jobs were running on the same node, the users have to be notified that their jobs failed.