OpenMPI-HPCX 4.1.x hangs when writing files to a shared file system

Category: 
Resolution: Resolved
Affected Software: 

Jobs that use openmpi/4.1.x-hpcx (or 4.1.x on Ascend) might hang while writing files to a shared file system. The hang is caused by a bug involving the default OMPIO I/O module and the UCX library. ORCA is known to be affected by this problem. If you experience this issue, try one of the following solutions:

  • Change the I/O module to ROMIO by adding export OMPI_MCA_io=romio321 to your job script.
  • Switch to OpenMPI 5. You can check for available OpenMPI 5 modules with module spider openmpi/5.
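
The first workaround can be sketched as the fragment below. It is a minimal illustration, not a complete job script: the export line is the one given above, while the launch command and the executable name my_mpi_app are placeholders assumed for this example. The export should appear before the MPI launch line so that the launched processes inherit the setting.

```shell
#!/bin/bash
# Workaround from this page: select the ROMIO I/O module instead of the
# default OMPIO module. Set it before launching the MPI program.
export OMPI_MCA_io=romio321

# Print the value so the job log records which I/O module was requested.
echo "OMPI_MCA_io=${OMPI_MCA_io}"

# Hypothetical launch line (placeholder executable; adapt to your own job):
# srun ./my_mpi_app
```

If the job still hangs, check the job log for the echoed line to confirm that the variable was set in the environment where the MPI program was launched.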