Horovod

  • "Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use. The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster."

Quote from the Horovod GitHub documentation

Installation

Please follow the link for general instructions on installing Horovod for use with GPUs.

Step 1: Install NCCL 2

Please download NCCL 2 from https://developer.nvidia.com/nccl.

Add the NCCL library path to the LD_LIBRARY_PATH environment variable

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:Path_to_nccl/nccl-<version>/lib

Step 2: Install the Horovod Python package

Load the Python module

module load python/3.6-conda5.2

Create a local Python environment for the Horovod installation with NCCL and activate it

conda create -n horovod-withnccl python=3.6 anaconda
source activate horovod-withnccl

Install the GPU version of TensorFlow or PyTorch (TensorFlow is shown here)

pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

Load the MVAPICH2 and CUDA modules

module load mvapich2/2.3rc2-gpu 

module load cuda/9.1.85

Set the Horovod environment variables

export HOROVOD_CUDA_HOME=/usr/local/cuda/9.1.85

export HOROVOD_CUDA_INCLUDE=/usr/local/cuda/9.1.85/include/

export HOROVOD_CUDA_LIB=/usr/local/cuda/9.1.85/lib
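
The three variables above all point into the same CUDA installation, so they can be derived from a single root directory. The following is a minimal sketch, assuming the layout used above (include/ and lib/ directly under the CUDA root); the function name set_horovod_cuda_env is hypothetical and not part of Horovod.

```shell
# Hypothetical helper (not part of Horovod): derive the three
# HOROVOD_CUDA_* variables from a single CUDA installation root.
set_horovod_cuda_env() {
    cuda_root="${1%/}"    # strip one trailing slash, if present
    export HOROVOD_CUDA_HOME="$cuda_root"
    export HOROVOD_CUDA_INCLUDE="$cuda_root/include/"
    export HOROVOD_CUDA_LIB="$cuda_root/lib"
}

# Produces the same values as the three exports above
set_horovod_cuda_env /usr/local/cuda/9.1.85
```

If a different CUDA module is loaded, only the argument needs to change.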

Install the Horovod Python package

HOROVOD_NCCL_HOME=/path_to_nccl_home/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
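
Once pip finishes, it can help to confirm that the package imports from the active environment before submitting a job. This is a minimal sketch, assuming that python on the PATH is the interpreter from the horovod-withnccl environment; the function name python_has_module is hypothetical.

```shell
# Hypothetical helper: succeed only if the given Python module can be imported.
# Set PYTHON to choose the interpreter; it defaults to "python".
python_has_module() {
    "${PYTHON:-python}" -c "import $1" >/dev/null 2>&1
}

if python_has_module horovod.tensorflow; then
    echo "horovod.tensorflow is importable"
else
    echo "horovod.tensorflow could not be imported; check the pip build output"
fi
```

A failed import here usually points back to the build step, so rerunning the pip command and reading its output is a reasonable next move.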

Testing

Please see the benchmark script here. An example batch script that runs it on two nodes with one process per node:

#PBS -N TensorFlow
#PBS -l nodes=2:ppn=28:gpus=1:default
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -A projectID 
#PBS -S /bin/bash

# Load the same modules and environment used for the installation
module load python/3.6-conda5.2
module load cuda/9.1.85
source activate horovod-withnccl
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path_to_nccl_home/lib
module load mvapich2/2.3rc2-gpu
# Start one process per node and print NCCL debug information
mpiexec -ppn 1 -binding none -env NCCL_DEBUG=INFO python tf_cnn_benchmarks.py
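
With -ppn 1, mpiexec starts one process per node, so the two-node request above yields two processes. The following is a minimal sketch for confirming the node count from inside a job, assuming the scheduler sets the standard PBS_NODEFILE variable (a file that lists each host once per requested core); the function name count_nodes is hypothetical.

```shell
# Hypothetical helper: count the distinct hosts listed in a PBS node file.
# Each host appears once per requested core (ppn), so duplicates are removed first.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a job, PBS_NODEFILE points at the node file; outside a job this is skipped
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Running on $(count_nodes "$PBS_NODEFILE") node(s)"
fi
```

Placing the check just before the mpiexec line makes the node count show up in the job output next to the NCCL debug messages.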

Feel free to contact OSC Help if you have any issues with installation.

Publisher/Vendor/Repository and License Type

https://eng.uber.com/horovod/, Open source

Further Reading

TensorFlow homepage
