GPU Memory Not Released Causing OOM in Subsequent Jobs
We have noticed that GPU memory is not always released when a job finishes, which can cause subsequent jobs scheduled on the same nodes to fail with out-of-memory (OOM) errors. We are currently working on a resolution.
If you encounter an OOM error in a GPU job, you can use the job-dashboard-link.py script to generate a Grafana dashboard link for your job’s resource usage. For example:
job-dashboard-link.py -M cardinal 2502244
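If you do not have your job ID handy, the scheduler can list your recent jobs. Assuming the cluster runs Slurm (as the -M cluster flag suggests), a query such as the following should work:

sacct -M cardinal -u $USER -X --format=JobID,JobName,State,End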
Open the generated URL and navigate to GPU Metrics, then check the GPU Memory Usage panel to see how much GPU memory was in use over the course of your job.
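If you want to confirm the state of the GPUs directly from inside a job, and assuming the nodes provide NVIDIA's nvidia-smi tool, a quick check near the top of your job script might look like:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

A large memory.used value reported before your application starts would suggest that memory from a previous job has not been released.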