Ascend

Some MKL environment variables have incorrect paths

MKL module files define some helper environment variables with incorrect paths, which can cause link-time errors. All three clusters are affected, and we are working to correct the module files. As a workaround, users can redefine the affected variable with the correct path; this requires some familiarity with the module and build environment, so we recommend contacting oschelp@osc.edu for assistance. For example, on Cardinal the module intel-oneapi-mkl/2023.2.0 defines the environment variable MKL_LIBS_INT64 with an incorrect path.
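
One quick way to spot a bad path is to check whether the directories a helper variable references actually exist. The sketch below is a hypothetical diagnostic, not an OSC-provided tool; the variable name and the -L/-I prefix handling are illustrative assumptions.

    # Hypothetical diagnostic: print the value of an MKL helper variable and
    # flag any referenced directories that do not exist on the system.
    # The variable name and -L/-I prefix handling are assumptions for illustration.
    import os

    var = "MKL_LIBS_INT64"
    value = os.environ.get(var, "")
    print(f"{var} = {value!r}")

    for token in value.split():
        # Strip common compiler/linker prefixes so the bare path can be checked.
        path = token[2:] if token[:2] in ("-L", "-I") else token
        if path.startswith("/") and not os.path.exists(path):
            print(f"  missing path: {path}")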

PyTorch hangs on dual-GPU nodes on Ascend

Through internal testing, we have confirmed that the hang only occurs on Ascend dual-GPU (nextgen) nodes. We are still unsure why setting NCCL_P2P_DISABLE is necessary even for single-node runs, but the performance impact should be minimal: on dual-GPU nodes the GPUs are connected via the SMP interconnect across NUMA nodes, and since there is no NVLink on these nodes, GPU communication already occurs through shared memory.

Workaround

To work around this, set the environment variable NCCL_P2P_DISABLE=1 before launching the job.
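
The variable must be set before NCCL initializes. Below is a minimal sketch for a PyTorch job; it assumes a standard torchrun launch with the NCCL backend, and the init_process_group call is shown only to illustrate where the setting must take effect.

    # Minimal sketch: disable NCCL peer-to-peer transfers before any NCCL
    # communicator is created. Equivalent to exporting NCCL_P2P_DISABLE=1
    # in the job script before launching the program.
    import os
    os.environ["NCCL_P2P_DISABLE"] = "1"  # must be set before NCCL initializes

    import torch.distributed as dist

    # Assumes launch via torchrun, which supplies RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT for the default env:// rendezvous.
    dist.init_process_group(backend="nccl")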