On September 22nd, OSC will switch to Slurm for job scheduling and resource management on the Pitzer Cluster, along with the deployment of the new Pitzer hardware. We are in the process of updating the example job scripts for each software package. If a Slurm example is not yet available, please consult our general Slurm information page or contact OSC Help.


"Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use. The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster."

Quote from the Horovod GitHub documentation


Please follow the link for general instructions on installing Horovod for use with GPUs.

Step 1: Install NCCL 2

Please download NCCL 2 from https://developer.nvidia.com/nccl (select the O/S agnostic local installer).

Add the NCCL library path to the LD_LIBRARY_PATH environment variable

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:Path_to_nccl/nccl-<version>/lib
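
Because the same export is repeated later in the job script, the path can end up in LD_LIBRARY_PATH more than once. The following is a minimal sketch of a helper that appends a directory only if it is not already present. The function name append_ld_path is hypothetical and is not part of NCCL, Horovod, or the OSC environment.

```shell
# Hypothetical helper: append a directory to LD_LIBRARY_PATH only once.
append_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            # Already present, nothing to do.
            ;;
        *)
            # Add a separating colon only when the variable is non-empty.
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Example call (replace the placeholder with your actual NCCL location):
# append_ld_path "Path_to_nccl/nccl-<version>/lib"
```

Calling the helper twice with the same directory leaves the variable unchanged after the first call.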

Step 2: Install the Horovod Python package

module load python/3.6-conda5.2

Create a local Python environment for the Horovod installation with NCCL and activate it

conda create -n horovod-withnccl python=3.6 anaconda
source activate horovod-withnccl

Install the GPU version of TensorFlow or PyTorch (the example below installs TensorFlow 1.10.0)

pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

Load the MVAPICH2 and CUDA modules

module load mvapich2/2.3rc2-gpu

module load cuda/9.1.85

Set the Horovod environment variables

export HOROVOD_CUDA_HOME=/usr/local/cuda/9.1.85

export HOROVOD_CUDA_INCLUDE=/usr/local/cuda/9.1.85/include/

export HOROVOD_CUDA_LIB=/usr/local/cuda/9.1.85/lib
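
A wrong CUDA path usually only shows up as a build error partway through the pip install, so it can be worth checking the three variables first. The sketch below assumes a POSIX-compatible shell. The function name check_cuda_paths is hypothetical and is not provided by Horovod or CUDA.

```shell
# Hypothetical helper: report any HOROVOD_CUDA_* variable that is unset
# or that points to a directory that does not exist.
check_cuda_paths() {
    status=0
    for name in HOROVOD_CUDA_HOME HOROVOD_CUDA_INCLUDE HOROVOD_CUDA_LIB; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        fi
    done
    return $status
}

# Example call: returns 0 and prints nothing when all three paths exist.
# check_cuda_paths
```

If the function reports a missing directory, correct the corresponding export before installing Horovod.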

Install the Horovod Python package

HOROVOD_NCCL_HOME=/path_to_nccl_home/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod


Please see the benchmark script here. The following PBS job script runs it on two nodes with one GPU each:

#PBS -N TensorFlow
#PBS -l nodes=2:ppn=28:gpus=1:default
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -A projectID
#PBS -S /bin/bash

# Load Python and CUDA, then activate the environment created in Step 2
module load python/3.6-conda5.2
module load cuda/9.1.85
source activate horovod-withnccl

# Make the NCCL library visible at run time
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path_to_nccl_home/lib
module load mvapich2/2.3rc2-gpu

# Launch one process per node; NCCL_DEBUG=INFO prints NCCL diagnostics
mpiexec -ppn 1 -binding none -env NCCL_DEBUG=INFO python tf_cnn_benchmarks.py

Feel free to contact OSC Help if you have any issues with installation.

Publisher/Vendor/Repository and License Type

https://eng.uber.com/horovod/, Open source

Further Reading

TensorFlow homepage
