cp2k/2023.2 can produce huge output containing MKL messages
On all clusters, the cp2k executables from the module cp2k/2023.2 can produce very large output files due to many repeated error messages from MKL, e.g.:
MKL module files define some helper environment variables with incorrect paths, which can lead to link-time errors. All three clusters are affected, and we are working to correct the module files. As a workaround, users can redefine the affected environment variable with the correct path; this requires some experience, so we recommend contacting oschelp@osc.edu for assistance. An example error from Cardinal, where the module intel-oneapi-mkl/2023.2.0 defined the environment variable MKL_LIBS_INT64, follows:
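A minimal sketch of that workaround, assuming MKL_LIBS_INT64 is meant to hold ILP64 link flags; the value shown is a hypothetical placeholder, so check module show intel-oneapi-mkl/2023.2.0 or ask oschelp@osc.edu for the correct definition on your cluster:

    module load intel-oneapi-mkl/2023.2.0

    # Hypothetical correction: point the helper variable at the library
    # directory of the loaded MKL installation instead of the broken path.
    export MKL_LIBS_INT64="-L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl"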
PyTorch can hang on Ascend dual-GPU nodes
When running multi-GPU training, users may encounter the following message and experience NCCL hangs if the first operation is a barrier. We have identified that this issue occurs only on a single Ascend Next Gen (dual-GPU) node, where the GPUs are connected via the SMP interconnect across NUMA nodes rather than through NVLink.
Through internal testing, we have confirmed that the hang occurs only on Ascend dual-GPU (nextgen) nodes. We are still unsure why setting NCCL_P2P_DISABLE is necessary even for single-node setups, but the performance impact should be minimal: on dual-GPU nodes the GPUs are connected via the SMP interconnect across NUMA nodes, and since there is no NVLink on these nodes, GPU communication already occurs through shared memory.
To work around this, set the environment variable NCCL_P2P_DISABLE=1.
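For example, in a batch script for a multi-GPU PyTorch job (the script name train.py and the launch options are illustrative placeholders, not OSC-provided settings), the variable can be exported before the training run starts:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2

    # Disable NCCL peer-to-peer transport; communication then falls back to
    # shared memory, which these nodes already use since they have no NVLink.
    export NCCL_P2P_DISABLE=1

    # Illustrative launch of a 2-GPU PyTorch distributed training script.
    torchrun --nproc_per_node=2 train.py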
A pure MPI application launched with mpirun or mpiexec using more ranks than the number of NUMA nodes may encounter an error similar to the following:
When running a full-node MPI job with MVAPICH 3.0, you may encounter the following warning message:
When running MPI+OpenMP hybrid code with the Intel Classic Compiler and MVAPICH 3.0, you may encounter the following warning message from hwloc: