
Supercomputing Environments

GPGPU Hardware Information

The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges, universities, and companies.

There are 36 GPU-capable nodes on Glenn, connected to 18 Quadro Plex S4 units for a total of 72 CUDA-enabled graphics devices (18 units x 4 GPUs each). Each node has access to two Quadro FX 5800-level graphics cards.

Each Quadro Plex S4 has these specs:
- 4 Quadro FX 5800 GPUs
- 240 cores per GPU
- 4 GB memory per card

The 36 compute nodes in Glenn contain:
- Dual-socket, quad-core 2.5 GHz Opterons
- 24 GB RAM
- 393 local disk space in '/tmp'
- 20 Gb/s InfiniBand ConnectX host channel adapter (HCA)

IMPORTANT NOTE:


Update: 7/11/2011

- Updated information on how to request GPU resources. Please read the section on batch requests for more information.

Update: 10/25/2010

- Updated information on how to connect to the compute nodes.  Please read the section on batch requests for more information.
- Updated the default cuda module to use 3.1.

INDEX

Please see the hardware section for current system specifications.

Getting started

This page is meant to help users get familiar with CUDA development in the Glenn environment. Many topics are not covered, so some degree of knowledge about our systems is necessary. This is a good starting place if you are not familiar with the Glenn system.

Note: If you are familiar with CUDA development and don't plan to read through the whole page, please take a look at the important section describing exclusive mode under Coding in CUDA. It is specific to our machines and determines how CUDA applications run on the Glenn cluster.

The CUDA toolkit contains the CUDA Runtime API and CUDA Driver API libraries needed to run a CUDA application. The CUDA Runtime API is used in most of the examples. The Driver API is a low-level interface.
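As a rough illustration of what a Runtime API program looks like, here is a minimal vector-addition sketch. The kernel name vecAdd, the helper check, and the array size are illustrative choices and do not come from the SDK examples. It never selects a device explicitly, which matches the exclusive-mode guidance under Coding in CUDA.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each GPU thread adds one pair of elements. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

/* Hypothetical helper: stop with a readable message if a runtime call fails. */
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    float *da = NULL, *db = NULL, *dc = NULL;
    int i, bad = 0;

    if (!ha || !hb || !hc) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    /* Device memory is managed with plain function calls in the Runtime API. */
    check(cudaMalloc((void **)&da, bytes), "cudaMalloc(da)");
    check(cudaMalloc((void **)&db, bytes), "cudaMalloc(db)");
    check(cudaMalloc((void **)&dc, bytes), "cudaMalloc(dc)");
    check(cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice), "copy b");

    /* The <<<blocks, threads>>> launch syntax belongs to the Runtime API. */
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    check(cudaGetLastError(), "kernel launch");

    /* This copy waits for the kernel to finish before reading the result. */
    check(cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost), "copy c");

    for (i = 0; i < n; ++i)
        if (hc[i] != 3.0f) ++bad;
    printf("%d mismatches\n", bad);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return bad ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

Assuming the source is saved as vecadd.cu (a hypothetical file name), it can be built on a GPU node with 'nvcc -o vecadd vecadd.cu' after loading the cuda module.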

The CUDA SDK is a collection of example programs illustrating various aspects of CUDA and GPGPU usage. It includes some utilities that are intended primarily for use with the examples. Use of the SDK in production code is not recommended. Users who want to use the SDK should install it in their home directories. (See instructions below).

Documentation for CUDA can be found here.

Batch requests

GPU nodes can be requested in both interactive and batch sessions. To request a node with GPU capabilities, add the following option to the node options in your PBS submission:

Option                    Meaning
-l nodes=N:ppn=P:gpu      N is the number of nodes; P is the number of processors per node. P MUST be 8.

Here is what an example PBS file will look like to request an entire node:

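#PBS -l walltime=40:00:00
# Request one whole GPU node (ppn must be 8)
#PBS -l nodes=1:ppn=8:gpu
#PBS -N compute
#PBS -j oe
#PBS -S /bin/csh
# Put nvcc and the CUDA libraries on the path
module load cuda
# Copy the executable to node-local scratch and run it there
cd $HOME/cuda
cp mycudaApp $TMPDIR
cd $TMPDIR
./mycudaApp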

For an interactive session, this is the command to use to request an entire node:

 qsub -I -l nodes=1:ppn=8:gpu -l walltime=12:00:00

Programming environment

Glenn currently supports CUDA for GPGPU computation. Please see the CUDA section below for more information on how to load the module.

CUDA

The newest version of the CUDA toolkit is currently supported. Use the command:

module load cuda

to load CUDA v3.1 into your path.

SDK(s)

Users can download the CUDA SDK themselves, or use the version currently available under the module. To set up the SDK from the module, simply execute:

> sh /usr/local/cuda-3.1/gpucomputingsdk_3.1_linux.run

Press Enter for both options. The module has already added the correct path for finding the toolkit. To build the examples:

> cd ~/NVIDIA_GPU_Computing_SDK/C/
> make 

Binaries will be placed in 'bin/linux/release'. Most of the demos will not work because there is no X display running. Examples that will work include deviceQuery, matrixMul, ... Examples that will not work include oceanFFT, simpleGL, ...

Coding in CUDA

To include CUDA directories, you can use a direct path to the include and lib directories from /usr/local/cuda-3.1/cuda, or you can use the environment variable CUDA_INSTALL_PATH. Here is a sample Makefile to help:

CUDA_HOME=${CUDA_INSTALL_PATH}
CUDA_INC=-I${CUDA_HOME}/include
# CUDA_LIB is for the link step
CUDA_LIB=-L${CUDA_HOME}/lib64 -lcudart
CUDA_CC=nvcc
CUDA_FLAGS=

# Compile line; replace the bracketed names with your own files
$(CUDA_CC) $(CUDA_FLAGS) $(CUDA_INC) -o [cuda.cu.obj] [cuda.cu]

Using compilers other than nvcc is not recommended. More options will become available once other compilers (for example, PGI's) provide proper CUDA support.

Important: The devices are configured in exclusive mode. This means that 'cudaSetDevice' should NOT be used when an application requests only one GPU. On the first CUDA call, the system determines which device the application will use.

If a single application uses both cards on a node, please use 'cudaSetDevice'.
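The single-GPU case can be sketched as follows. This is a minimal illustration, not code from this site: it assumes that the first runtime call, here the common cudaFree(0) idiom, makes the runtime attach to an available device, and that cudaGetDevice then reports which device was chosen.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int dev = -1;
    struct cudaDeviceProp prop;
    cudaError_t err;

    /* Deliberately no cudaSetDevice(): cudaFree(0) frees nothing, but it
       forces the runtime to create its context on a device it can get. */
    err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "could not acquire a GPU: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Ask the runtime which device this process ended up with. */
    err = cudaGetDevice(&dev);
    if (err == cudaSuccess)
        err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("using device %d: %s with %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
    return 0;
}
```

If every device on the node is already in use, the first call is expected to fail, which is why its return value is checked rather than ignored.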

Debugging

There are currently no graphical tools to aid in debugging a CUDA kernel. The current method for debugging a CUDA kernel is to use device emulation, in which the CUDA threads are run on the CPU. For application debugging, please take a look here.

cuda-gdb is available to anyone who loads the cuda module. Please see here for information on how to use the debugger.

Performance

cudaprof is available for performance analysis of a CUDA kernel. For more information on CPU performance, please take a look at our Performance Analysis page.

Programming Hints

CUDA development can also be combined with other parallel coding techniques. These examples will run on Glenn and should provide insight into how CUDA can be combined with them. The examples can be downloaded as multiLevel.tar or multiLevel.zip.

In order to compile these examples, you will need to install the CUDA SDK and modify the Makefile to point to the correct SDK path. The variable that needs to be modified is SDKDIR, towards the top of the file. Also, please ensure that the correct MPI module is loaded; use the command 'module switch mpi mvapich2-1.2p1-gnu' to switch the module. Now you can type 'make' to build the examples.

    CUDA + MPI This example shows how to call CUDA from an MPI application. With this setup, users can run multiple MPI processes across multiple machines, or multiple MPI processes on the same machine. For CUDA purposes, please remember that these nodes are running in exclusive mode, so no more than two processes per node should access the video cards at once.

    CUDA + OpenMP This example shows how to call CUDA with OpenMP. With this setup, users run multiple OpenMP threads on a node. Since the GPUs are in exclusive mode, each node can only have two threads accessing a video card at a time.

    CUDA + MPI + OpenMP This example combines MPI, CUDA, and OpenMP in a single application. The recommended pattern for a hybrid approach like this is to use MPI to distribute work across the nodes and OpenMP to thread within each node. As with the other two models, you are limited to two threads per node using the GPUs.
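The hybrid pattern described above might look roughly like the sketch below. It is not the code from the downloadable examples. It assumes one MPI process per node and two visible GPUs per node; GPUS_PER_NODE, N, and the fill kernel are hypothetical names chosen for illustration. Because one application drives both cards, each OpenMP thread calls cudaSetDevice.

```cuda
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

#define GPUS_PER_NODE 2   /* assumption: two GPUs visible on each node */
#define N 1024            /* elements written by each GPU worker */

/* Every GPU thread writes the worker's tag into one array element. */
__global__ void fill(int *out, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main(int argc, char **argv)
{
    int provided = 0, rank = 0, failures = 0, total = 0;

    /* MPI is only called outside the OpenMP region, so FUNNELED is enough. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One OpenMP thread per GPU; each thread binds itself to one device. */
    #pragma omp parallel num_threads(GPUS_PER_NODE) reduction(+:failures)
    {
        int tid = omp_get_thread_num();
        int tag = rank * GPUS_PER_NODE + tid;
        int host[N];
        int *dev = NULL;

        cudaError_t err = cudaSetDevice(tid);
        if (err == cudaSuccess)
            err = cudaMalloc((void **)&dev, N * sizeof(int));
        if (err == cudaSuccess) {
            fill<<<(N + 255) / 256, 256>>>(dev, tag, N);
            err = cudaGetLastError();
        }
        if (err == cudaSuccess)
            err = cudaMemcpy(host, dev, N * sizeof(int), cudaMemcpyDeviceToHost);

        if (err != cudaSuccess) {
            fprintf(stderr, "rank %d thread %d: %s\n",
                    rank, tid, cudaGetErrorString(err));
            failures += 1;
        } else if (host[0] != tag || host[N - 1] != tag) {
            failures += 1;   /* the GPU result did not match the tag */
        } else {
            printf("rank %d thread %d finished on device %d\n", rank, tid, tid);
        }
        if (dev != NULL) cudaFree(dev);
    }

    /* Combine the per-node failure counts across all MPI processes. */
    MPI_Allreduce(&failures, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("%d GPU worker(s) failed\n", total);

    MPI_Finalize();
    return total ? 1 : 0;
}
```

Building this needs OpenMP enabled in the host compiler (with GCC, 'nvcc -Xcompiler -fopenmp') plus the include and library paths of the loaded MPI module, which depend on the installation. Launching one MPI process per node keeps the two OpenMP threads as the only GPU users on that node.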


