Some MVAPICH2 MPI installations on Oakley, Ruby, and Owens, including the default module mvapich2/2.2 and mvapich2/2.1, appear to have a bug that is triggered by certain programs. There are two symptoms: 1) the program hangs, or 2) the program fails with an error related to Allreduce or Bcast.
To test whether a failure is related to this issue, as opposed to an error in the application software, set the following environment variable in the batch job before launching the program: MV2_USE_SLOT_SHMEM_COLL=0 (this setting disables a shared-memory optimization for collective operations). If the program then runs correctly, the failure is in the MVAPICH2 library rather than in the application.
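As a minimal sketch of the test described above, the variable can be exported near the top of the batch script, before the line that launches the MPI program. The launcher line in the final comment is only a placeholder (the program name is hypothetical); use whatever launch command your job already contains.

```shell
# Disable the MVAPICH2 optimization for this job only.
# The variable name and value are the ones given in the text above.
export MV2_USE_SLOT_SHMEM_COLL=0

# Print the value so the job output records that the setting was active.
echo "MV2_USE_SLOT_SHMEM_COLL=${MV2_USE_SLOT_SHMEM_COLL}"

# Launch the MPI program as usual after this point, for example
# (hypothetical program name): mpiexec ./my_program
```

Because the variable is exported inside the job script, it affects only that job; removing the line restores the default behavior.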
This issue may affect system-installed software, such as lammps/31Mar17, but the occurrence seems to be rare.
There are several workarounds to choose from.
1) Keep the MV2_USE_SLOT_SHMEM_COLL=0 setting in your batch job. This may slow down your code, but it is the easiest option.
2) Switch to mvapich2/1.9 and rebuild your code. You'll also have to move to an older compiler. The easiest way to make the change is with "module load modules/au2014".
3) If you're using Intel compilers, switch to IntelMPI with "module load intelmpi". If you use fftw3 and/or scalapack, you should use the MKL versions of these libraries, not the separately loaded modules. Some other libraries may not be available. Contact email@example.com for assistance.
4) Switch to OpenMPI. We have OpenMPI 1.10 installed on Oakley.
The mvapich2 versions with the bug are outdated and are no longer available on our clusters.