PyTorch

PyTorch is an open source machine learning framework with GPU acceleration and deep neural networks that is based on the automatic differentiation in the Torch library of tensors.

If you installed PyTorch-nightly on Linux via pip between December 25, 2022 and December 30, 2022, please uninstall it and torchtriton immediately, and use the latest nightly binaries (newer than Dec 30th 2022). See this post page from PyTorch for detailed information.

Publisher/Vendor/Repository and License Type

https://pytorch.org, Open source.

Availability and Restrictions

Versions

Pytorch is available on OSC Clusters. The versions currently available at OSC are:

Version	Cardinal	Ascend	Pitzer
2.4.0	X*
2.5.0		X
2.7.1			X

You can use module spider pytorch to view available modules for a given machine.

Loading PyTorch from Module

A basic conda environment with PyTorch is available through the module system:

module load pytorch/2.4.0
module unload pytorch/2.4.0

The basic environment includes: pytorch, transformers, flash attention, accelerate, lightning, deepspeed, diffusers, and megatron. Examples in this documentation use version 2.4.0 but you can replace that with your target version.

Cloning PyTorch Environment

For extending the basic conda with project- or lab-specific packages, we encourage users to clone the basic environment to their project space:

module load miniconda3/24.1.2-py310
conda create --prefix /fs/project/your_project_code/your_username/your_project_name --clone /apps/pytorch/2.4.0

Then, users can install packages in the new cloned conda environment. See HOWTO: Create and Manage Python Environments.

Installing PyTorch Locally

For alternative versions of PyTorch, users are able to create their own conda environments and install locally. We are also available to assist with the configuration of local individual/research-group installations on all our clusters. If you have any questions, please contact OSC Help.

Here is an example installation that was used in February 2022 to install a GPU enabled version compatible with the CUDA drivers on the clusters at that time:

Load the correct python and cuda modules:

module load miniconda3/24.1.2-py310  cuda/12.3.0
module list

Create a python environment to install pytorch into:

conda create -n pytorch

Activate the conda environment:

source activate pytorch

Install the specific version of pytorch:

pip3 install torch torchvision

PyTorch is now installed into your $HOME/local directory using the local install directory hierarchy described here and can be tested via:

module load miniconda3/24.1.2-py310 cuda/12.3.0 ; module list ; source activate pytorch
python <<EOF
import torch
    
x = torch.rand(5, 3)
print("torch.rand(5, 3) =", x)
    
print( "Is cuda available =", torch.cuda.is_available() )
exit
EOF

If testing for a GPU you will need to submit the above script as a batch job (make sure to request a GPU for the job, see Job Scripts for more info on requesting GPU)

Please refer here if you want a different version of the Pytorch.

Batch Usage

Batch jobs can request multiple nodes/cores and compute time up to the limits of the OSC systems. Refer to Queues and Reservations for Owens, and Scheduling Policies and Limits for more info. In particular, Pytorch should be run on a GPU-enabled compute node.

AN EXAMPLE BATCH SCRIPT TEMPLTE

Below is an example batch script (job.sh) for using PyTorch (Slurm syntax).

Contents of job.sh

#!/bin/bash
#SBATCH --job-name=pytorch
#SBATCH --nodes=1 --ntasks-per-node=28 --gpus_per_node=1 --gpu_cmode=shared
#SBATCH --time=30:00
#SBATCH --account=yourprojectID

cd $SLURM_SUBMIT_DIR

module load miniconda3/24.1.2-py310

source activate your-local-python-environment-name

python your-pytorch-script.py

In order to run it via the batch system, submit the job.sh file with the following command:

sbatch job.sh

GPU Usage

GPU Usage: PyTorch can be ran on a GPU for signifcant performace improvements. See HOWTO: Use GPU with Tensorflow and PyTorch
Distributed Training: torch.distributed includes data and model parallelization to increase training speed and overcome GPU memory limitations. See HOWTO: Pytorch Distributed Data Parallel (DDP) and HOWTO: Fully Sharded Data Parallel (FSDP).
Horovod: If you are using PyTorch with a GPU you may want to also consider using Horovod. Horovod will take single-GPU training scripts and scale it to train across many GPUs in parallel.

Jupyter Usage

PyTorch is available to be loaded as a kernel in a Jupyter notebook when running on Pitzer, Cardinal, and Ascend clusters. See HOWTO: Use Jupyter on OnDemand for details. Be sure to request GPU resources when starting your Jupyter session if you want GPU acceleration.

Search form

PyTorch

Publisher/Vendor/Repository and License Type

Availability and Restrictions

Versions

Loading PyTorch from Module

Cloning PyTorch Environment

Installing PyTorch Locally

Create a python environment to install pytorch into:

Activate the conda environment:

Install the specific version of pytorch:

Batch Usage

AN EXAMPLE BATCH SCRIPT TEMPLTE

GPU Usage

Jupyter Usage

Further Reading

Known Issues for PyTorch

Resolved

Client Resources

Upcoming Events

Recent News

Translate

Ohio Department of Higher Education

State Government Links

Education Links

Search form

You are here

PyTorch

Publisher/Vendor/Repository and License Type

Availability and Restrictions

Versions

Loading PyTorch from Module

Cloning PyTorch Environment

Installing PyTorch Locally

Create a python environment to install pytorch into:

Activate the conda environment:

Install the specific version of pytorch:

Batch Usage

AN EXAMPLE BATCH SCRIPT TEMPLTE

GPU Usage

Jupyter Usage

Further Reading

Known Issues for PyTorch

Resolved

Client Resources

Upcoming Events

Recent News

Translate

Ohio Department of Higher Education

State Government Links

Education Links