Problems with MVAPICH2


Some MVAPICH2 MPI installations on Oakley, Ruby, and Owens, including the default module mvapich2/2.2 and mvapich2/2.1, appear to have a bug that is triggered by certain programs.  The symptoms are that 1) the program hangs or 2) the program fails with an error related to Allreduce or Bcast.

To test whether a failure is caused by this issue rather than by an error in the application software, set the following environment variable in the batch job:  MV2_USE_SLOT_SHMEM_COLL=0  (this option disables optimizations).  If the program then runs correctly, the failure is in the MVAPICH2 library and not in the application.
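As a minimal sketch of that test, assuming a bash batch script, the variable can be exported before the program is launched. The launch line is commented out because the launcher and program name (mpiexec and ./my_program here) are hypothetical placeholders, not taken from this page.

```shell
# Disable the optimization for this job only (variable name from the text above).
export MV2_USE_SLOT_SHMEM_COLL=0

# Confirm the variable is exported, i.e. visible to child processes.
env | grep '^MV2_USE_SLOT_SHMEM_COLL='

# Hypothetical launch line; replace with the job's usual launcher and program.
# mpiexec ./my_program
```

If the job completes with the variable set and hangs or fails without it, that points to the library issue described here.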

This issue may affect system installed software, such as lammps/31Mar17, but the occurrence seems to be rare.

There are several workarounds to choose from.

1)  Keep the MV2_USE_SLOT_SHMEM_COLL=0 setting in your batch jobs.  This may slow down your code, but it is the easiest option.

2)  Switch to mvapich2/1.9 and rebuild your code.  You'll also have to move to an older compiler.  The easiest way to make the change is with "module load modules/au2014".

3)  If you're using Intel compilers, switch to IntelMPI with "module load intelmpi".  If you use fftw3 and/or scalapack, you should use the MKL versions of these libraries, not the separately loaded modules.  Contact us for assistance.  Some other libraries may not be available.

4)  Switch to OpenMPI.  We have OpenMPI 1.10 installed on Oakley.
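For reference, the setup commands for workarounds 1 to 3 can be collected in one place. The sketch below only prints each command rather than running it, so it is safe to try anywhere. The function name workaround_cmd and its numeric argument are illustrative assumptions. Workaround 4 is left out because the OpenMPI module name is not given here.

```shell
# Hypothetical helper: print the setup command for workarounds 1 to 3.
# The variable and module names are taken from the list above.
workaround_cmd() {
  case "$1" in
    1) echo 'export MV2_USE_SLOT_SHMEM_COLL=0' ;;  # keep the setting in the job
    2) echo 'module load modules/au2014' ;;         # then rebuild your code
    3) echo 'module load intelmpi' ;;               # for Intel compiler users
    *) echo "no command listed for workaround $1" >&2; return 1 ;;
  esac
}

# Example: show the command for workaround 3.
workaround_cmd 3
```

After workaround 2 or 3, the code has to be rebuilt against the newly loaded MPI library before it is run again.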

The mvapich2 versions with the bug are outdated and are no longer available on our clusters.