On September 22nd, OSC switched to Slurm for job scheduling and resource management on the Pitzer Cluster, along with the deployment of the new Pitzer hardware. We are in the process of updating the example job scripts for each software package. If a Slurm example is not yet available, please consult our general Slurm information page or contact OSC Help.

Horovod

"Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use. The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster."

Quote from the Horovod GitHub documentation

Installation

Please follow the link for general instructions on installing Horovod for use with GPUs. The commands below assume a Bourne-type shell. If you are using a C-type shell, the "source activate" command may not work; in that case, load all the modules, define any environment variables, then type "bash" and run the remaining commands.

Step 1: Install NCCL 2

Please download NCCL 2 from https://developer.nvidia.com/nccl (select the OS-agnostic local installer; the download labeled "Download NCCL 2.7.8, for CUDA 10.2, July 24, 2020" was used in the latest test of this recipe).

Add the NCCL library path to the LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:Path_to_nccl/nccl-<version>/lib
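
Appending with a bare ":" leaves an empty entry in LD_LIBRARY_PATH when the variable was previously unset, and repeating the export adds duplicate entries. A small helper along the lines of the sketch below avoids both. The function name append_ld_path is hypothetical and not part of the OSC recipe; the path in the usage comment is the same placeholder used elsewhere on this page.

```shell
# Hypothetical helper: append a directory to LD_LIBRARY_PATH only if it is
# not already listed, and without a leading ":" when the variable is empty.
append_ld_path() {
  dir="$1"
  case ":${LD_LIBRARY_PATH:-}:" in
    *":${dir}:"*)
      # Already present; nothing to do.
      ;;
    *)
      LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${dir}"
      ;;
  esac
  export LD_LIBRARY_PATH
}

# Usage (placeholder path, replace with your NCCL location):
#   append_ld_path /path_to_nccl_home/lib
```

Calling the helper twice with the same directory leaves LD_LIBRARY_PATH unchanged the second time.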

Step 2: Install the Horovod Python package

module load python/3.6-conda5.2

Create a local Python environment for a Horovod installation with NCCL and activate it:

conda create -n horovod-withnccl python=3.6 anaconda
source activate horovod-withnccl

Install a GPU version of TensorFlow or PyTorch (TensorFlow is shown here):

pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl

Load the MVAPICH2 and CUDA modules:

module load gnu/7.3.0 mvapich2-gdr/2.3.4

module load cuda/10.2.89

Install the Horovod Python package:

HOROVOD_NCCL_HOME=/path_to_nccl_home/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
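
After the install finishes, it can be worth confirming that Horovod was actually built with NCCL and MPI support before submitting a job. The sketch below assumes the installed Horovod release is recent enough to provide the "horovodrun --check-build" option; the wrapper function name check_horovod_build is hypothetical.

```shell
# Hypothetical wrapper: report how Horovod was built, or explain why it
# cannot be checked. Assumes the horovod-withnccl environment is active.
check_horovod_build() {
  if command -v horovodrun >/dev/null 2>&1; then
    # Lists the frameworks, controllers, and tensor operations
    # (for example NCCL and MPI) that this Horovod build supports.
    horovodrun --check-build
  else
    echo "horovodrun not found on PATH; is the horovod-withnccl environment active?" >&2
    return 1
  fi
}
```

Run check_horovod_build after "source activate horovod-withnccl" and look for NCCL among the available tensor operations in the output.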

Testing

Please get the benchmark script (tf_cnn_benchmarks.py) here. The following PBS batch script runs it on two nodes:

#!/bin/bash
#PBS -N tensorflow
#PBS -l nodes=2:ppn=40:gpus=2
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -A projectID 

module load python/3.6-conda5.2
module load cuda/10.2.89
module load gnu/7.3.0
module load mvapich2-gdr/2.3.4

source activate horovod-withnccl
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/path_to_nccl_home/lib
mpiexec -ppn 1 -binding none -env NCCL_DEBUG=INFO python tf_cnn_benchmarks.py
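
Since Pitzer now uses Slurm, the PBS directives above can be translated roughly as in the sketch below. This is an untested translation, assuming the standard Slurm options (--nodes, --ntasks-per-node, --gpus-per-node, --time, --account) map onto the PBS resource requests in the obvious way. The module names, environment name, placeholder paths, and launch line are copied unchanged from the PBS script; whether mpiexec or srun is the preferred launcher under Slurm should be confirmed against the OSC Slurm information page.

```shell
#!/bin/bash
#SBATCH --job-name=tensorflow
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --gpus-per-node=2
#SBATCH --time=00:30:00
#SBATCH --account=projectID
# Slurm writes stdout and stderr to a single file by default,
# so the PBS "-j oe" option needs no equivalent here.

module load python/3.6-conda5.2
module load cuda/10.2.89
module load gnu/7.3.0
module load mvapich2-gdr/2.3.4

source activate horovod-withnccl
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/path_to_nccl_home/lib

# Launch line kept from the PBS script (assumption: mpiexec behaves the
# same way under Slurm; this has not been verified).
mpiexec -ppn 1 -binding none -env NCCL_DEBUG=INFO python tf_cnn_benchmarks.py
```

Such a script would be submitted with sbatch rather than qsub.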

Feel free to contact OSC Help if you have any issues with installation.

Publisher/Vendor/Repository and License Type

https://eng.uber.com/horovod/, Open source

Further Reading

TensorFlow homepage
