Supercomputing Environments

Using Glenn, the IBM Opteron Cluster at OSC

The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges, universities, and companies.

The Ohio Supercomputer Center's IBM Cluster 1350, named "Glenn", combines AMD Opteron multi-core nodes with blade systems based on IBM's Cell Broadband Engine processor. The system offers a peak performance of more than 22 trillion floating point operations per second and a variety of memory and processor configurations. The Cell blades give Ohio researchers and industries access to this hybrid HPC architecture.

Phase One of Glenn was decommissioned on December 14, 2011.

Please follow HPC Notices on Twitter. They are designed to keep OSC clients up to date on system downtimes, outages, maintenance, and software updates. You can also see a full list of current system notices here.

Current Glenn System Specifications:

  • 877 System x3455 compute nodes - Decommissioned December 14, 2011
  • 650 System x3455 compute nodes ("newdual")
    • Dual socket, quad core 2.5 GHz Opterons
    • 24 GB RAM
    • 393 GB local disk space in /tmp
  • 88 System x3755 compute nodes - Decommissioned December 14, 2011
  • 8 System x3755 compute nodes ("newquad")
    • Quad socket, quad core 2.4 GHz Opterons
    • 64 GB RAM
    • 188 GB local disk space in /tmp
  • Voltaire 10 Gbps PCI Express adapter
  • 4 System x3755 login nodes
    • Quad socket, dual core 2.6 GHz Opterons
    • 8 GB RAM
  • All connected together by 10 Gbps or 20 Gbps Infiniband

There are 36 GPU-capable nodes on Glenn, connected to 18 Quadro Plex S4s for a total of 72 CUDA-enabled graphics devices. Each node has access to two Quadro FX 5800-level graphics cards.

  • Each Quadro Plex S4 has these specs:
    • 4 Quadro FX 5800 GPUs
    • 240 cores per GPU
    • 4 GB memory per card

  • The 36 compute nodes in Glenn contain:
    • Dual socket, quad core 2.5 GHz Opterons
    • 24 GB RAM
    • 393 GB local disk space in /tmp
    • 20 Gbps Infiniband ConnectX host channel adapter (HCA)

 

When testing these nodes, it may be beneficial to use the newer version of the PGI compilers. The quad core systems support the SSE4a opcodes, which provide wider SIMD instructions for floating point operations. This doubles the number of floating point operations that can be retired per clock cycle from 2 to 4. To change compiler versions, use the command:

module switch pgi pgi-9.0-1

Please see the hardware section for current system specifications.

Getting started

To login to Glenn at OSC, ssh to the following hostname:

          glenn.osc.edu    

From there, you have access to the compiling systems, performance-analysis tools, and debugging tools. You can run programs interactively or through batch requests. See the following sections for details.

File system

Glenn accesses the user home directories found on the OSC mass storage environment. Therefore, users have the same home directory on Glenn as on the Itanium 2 cluster.

The system also has fast local disk space intended for temporary files. You are encouraged to perform the majority of your work in the temporary space and only store permanent files in your home directory. To ensure fast access to required files, copy the files to the temporary area at the start of your session.

The following example shows how to use /tmp, the temporary directory.

    mkdir /tmp/$USER          # Create your own temporary directory.
    cp files /tmp/$USER       # Copy the necessary files.
    cd /tmp/$USER             # Move to the directory.
    ...                       # Do work (compile, execute, etc.).
    ...
    cp new files $HOME        # Copy important new files back home.
    cd $HOME                  # Return to your home directory.
    rm -rf /tmp/$USER         # Remove your temporary directory.
    exit                      # End the session.

Use this procedure when compiling and executing interactively. The temporary space is not backed up, and old files may be purged when the temporary file system gets full.
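
The interactive steps above can also be collected into a short script. The following is only a sketch under stated assumptions: the file names input.dat and results.dat and the tr command standing in for real work are placeholders, and mktemp -d is used instead of a fixed /tmp/$USER name so that two sessions do not collide (that substitution is not part of the documented procedure). A trap removes the scratch directory even if an intermediate step fails.

```shell
#!/bin/sh
# Sketch: work in a private scratch directory under /tmp and always clean it up.
# input.dat and results.dat are hypothetical file names.
user=${USER:-$(id -un)}
scratch=$(mktemp -d "/tmp/${user}.XXXXXX") || exit 1
trap 'rm -rf "$scratch"' EXIT            # remove the directory when the script ends

start_dir=$(pwd)
echo "sample input" > "$start_dir/input.dat"   # stand-in for your real input file

cp "$start_dir/input.dat" "$scratch"     # copy the necessary files
cd "$scratch" || exit 1                  # move to the scratch directory
tr a-z A-Z < input.dat > results.dat     # placeholder for the real work
cp results.dat "$start_dir"              # copy important new files back
cd "$start_dir" || exit 1                # leave before the directory is removed
```

Because the cleanup is tied to the shell's exit, the scratch directory does not linger if a compile or run step aborts the script early.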

A simpler procedure is available for batch jobs through the TMPDIR environment variable. See "Batch requests" for more information.

Sometimes $TMPDIR does not have enough space for a job. After system needs such as swap space are accounted for, anywhere from 45 GB to 1.8 TB of local temporary disk space is available on each node. Jobs that require significant amounts of temporary disk space (>10 GB) in $TMPDIR should request it with the PBS -l file=amount directive described below. Any job requiring either more than 1.8 TB of temporary space or shared temporary space should use the /fs/pvfs parallel file system, which is a high-performance, high-capacity shared temporary space. For more information on parallel file system usage, please consult the web page for PVFS at OSC.
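
One way to find out whether a job is near the 10 GB threshold is to check the free space in the scratch area before starting heavy I/O. The sketch below is an illustration rather than an OSC-provided tool; it falls back to /tmp when TMPDIR is not set (for example, outside a batch job), and the message text is invented.

```shell
#!/bin/sh
# Sketch: report free space in the scratch area before heavy I/O.
scratch=${TMPDIR:-/tmp}
need_kb=$((10 * 1024 * 1024))   # 10 GB, the threshold mentioned above

# df -P prints one POSIX-format line per file system; column 4 is available KB.
free_kb=$(df -Pk "$scratch" | awk 'NR == 2 { print $4 }')

if [ "$free_kb" -lt "$need_kb" ]; then
    echo "only ${free_kb} KB free in $scratch; consider -l file=amount or /fs/pvfs"
else
    echo "${free_kb} KB free in $scratch"
fi
```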

Executing programs

Commands on Glenn can be executed either interactively or through batch requests. There are fixed usage limits for interactive execution; jobs that take more than the allowed CPU time must be executed using batch requests. Current interactive limits are 2 hours of CPU time and 1 GB of memory. To use the resources of the cluster most efficiently, you are encouraged to use batch requests whenever possible. See "Batch requests" for more information.

For information on how to execute an MPI program, see the "MPI" section.

To execute a non-MPI program, simply enter the name of the executable. Unless otherwise specified, the number of processors used for a non-MPI parallel program is determined by the operating system at runtime. To control the number of processors, set the environment variable OMP_NUM_THREADS. If the number of available processors (four per node) is less than OMP_NUM_THREADS, then at least one processor will run multiple threads.

The following ksh example causes a.out to use 4 processors if they are available.

          export OMP_NUM_THREADS=4
          ./a.out
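
Since asking for more threads than there are processors makes some processors run several threads, a script can cap the thread count at the number of processors actually available. This is a minimal sketch, assuming getconf _NPROCESSORS_ONLN is supported on the system (it is a common extension, not strict POSIX); the requested value of 4 simply mirrors the example above.

```shell
#!/bin/sh
# Sketch: cap OMP_NUM_THREADS at the number of online processors.
requested=4
# _NPROCESSORS_ONLN is a common (glibc, BSD) extension, not strict POSIX;
# fall back to 1 if the query fails.
available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

if [ "$requested" -gt "$available" ]; then
    OMP_NUM_THREADS=$available
else
    OMP_NUM_THREADS=$requested
fi
export OMP_NUM_THREADS
echo "running with $OMP_NUM_THREADS threads"
# ./a.out   # uncomment to launch your OpenMP program
```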

The omp_get_num_threads() function can be called within a parallel region of a program to determine the number of threads assigned to that program.

          integer function omp_get_num_threads()    (Fortran)
          int omp_get_num_threads(void);            (C)

Batch requests

Batch requests are handled by the TORQUE resource manager and Moab Scheduler. Use the qsub command to submit a batch request, qstat to view the status of your requests, and qdel to delete unwanted requests. For more information, see the manual pages for each command.

The following options are often useful when submitting batch requests. The options may appear on the qsub command line or preceded by #PBS at the beginning of the batch request file.

Option Meaning
-N job Name the job.
-S shell Use shell rather than your default login shell to interpret the job script.
-l walltime=time Total wallclock time limit in seconds or hours:minutes:seconds
-l nodes=numnodes:ppn=numprocs Request use of numprocs processors on each of numnodes nodes.

nodes = x:ppn <= 8 # Jobs will run on newdual
nodes = 1:ppn >= 9 # Jobs will run on newquad

You can also explicitly specify which nodes you would like to run on:

-l nodes=x:ppn=y:newdual
-l nodes=x:ppn=y:newquad # y must be >= 9

-l mem=amount (OPTIONAL) Request use of amount of memory per node. Default units are bytes; can also be expressed in megabytes (e.g. mem=1000MB) or gigabytes (e.g. mem=2GB). If you need more than 24 GB of RAM, you must request 9 or more ppn.
-l file=amount (OPTIONAL) Request use of amount of local scratch disk space per node. Default units are bytes; can also be expressed in megabytes (e.g. file=10000MB) or gigabytes (e.g. file=10GB). Only required for jobs using more than 10 GB of local scratch space per node.
-l software=package[+N] (OPTIONAL) Request use of N licenses for package. If omitted, N=1. Only required for jobs using specific software packages with limited numbers of licenses; see software documentation for details.
-j oe Redirect stderr to stdout.
-m ae Send e-mail when the job finishes or aborts.
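
The ppn rule for the nodes option can be expressed as a small helper, which may be handy when generating job scripts. This is a sketch only: node_class is a hypothetical function name, not an OSC or TORQUE command, and the upper bound of 16 is inferred from the quad socket, quad core hardware listed earlier rather than stated by the scheduler documentation.

```shell
#!/bin/sh
# Sketch: which node type a request lands on, following the ppn rule above.
# node_class is a hypothetical helper, not an OSC or TORQUE command.
node_class() {
    ppn=$1
    if [ "$ppn" -ge 1 ] && [ "$ppn" -le 8 ]; then
        echo newdual
    elif [ "$ppn" -ge 9 ] && [ "$ppn" -le 16 ]; then
        echo newquad
    else
        echo "invalid ppn: $ppn" >&2
        return 1
    fi
}

node_class 8    # prints newdual
node_class 12   # prints newquad
```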

By default, your batch jobs begin execution in your home directory. This is true even if you submit the job from another directory.

To facilitate the use of temporary disk space, a unique temporary directory is automatically created at the beginning of each batch job. This directory is also automatically removed at the end of the job. Therefore, it is critical that all files required for further analysis be copied back to permanent storage in your $HOME area prior to the end of your batch script. You access the directory through the $TMPDIR environment variable. Note that in jobs using more than one node, $TMPDIR is not shared -- each node has its own distinct instance of $TMPDIR.

Single-CPU sequential jobs should either set the -l nodes resource limit to 1:ppn=1 or leave it unset entirely. The following is an example of a sequential job which uses $TMPDIR as its working area.

#PBS -l walltime=40:00:00    
#PBS -l nodes=1:ppn=1    
#PBS -N myscience    
#PBS -j oe     
#PBS -S /bin/ksh        
cd $HOME/science    
cp my_program.f mysci.in $TMPDIR    
cd $TMPDIR    
pgf77 -O3 my_program.f -o mysci    
/usr/bin/time ./mysci > mysci.hist    
cp mysci.hist mysci.out $HOME/science    

If you have the above request saved in a file named my_request.job (and my_program.f and mysci.in saved in a subdirectory called science/), the following command will submit the request.

opt-login01:~> qsub my_request.job            
1151787.opt-batch.osc.edu    

You can use the qstat command to monitor the progress of the resulting batch job. In the above example, the number 1151787 is the job identifier, or jobid. When the job finishes, mysci.hist and mysci.out will appear in the science subdirectory, and the standard output generated by the job will appear in a file called myscience.oN, where N is the jobid. The N differentiates multiple submissions of the same job, since each submission generates a different number. This file will appear in the directory where you executed the qsub command. Within a PBS batch script, that directory can be referenced through the environment variable $PBS_O_WORKDIR.
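
The jobid and the name of the standard output file can be derived from the string that qsub prints. The sketch below hard-codes the values from the example so that it can be tried without submitting anything; in real use the first variable would be filled from the qsub command itself, as the comment indicates.

```shell
#!/bin/sh
# Sketch: derive the numeric jobid and the expected output file name
# from the string qsub prints. The values are the ones from the example above.
qsub_output="1151787.opt-batch.osc.edu"   # in practice: qsub_output=$(qsub my_request.job)
job_name="myscience"                      # the value given to #PBS -N

jobid=${qsub_output%%.*}                  # strip everything from the first dot onward
output_file="${job_name}.o${jobid}"

echo "jobid:       $jobid"
echo "output file: $output_file"
```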

All batch jobs must set the -l walltime resource limit, as this allows the Moab Scheduler to backfill small, short running jobs in front of larger, longer running jobs. This in turn helps improve turnaround time for all jobs.
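
Because the walltime limit may be given either in seconds or as hours:minutes:seconds, it can be useful to convert between the two when comparing requests. The following is a minimal sketch; walltime_to_seconds is a hypothetical helper name, and it assumes the three-field hours:minutes:seconds form.

```shell
#!/bin/sh
# Sketch: convert a walltime given as hours:minutes:seconds to plain seconds.
# walltime_to_seconds is a hypothetical helper name.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

walltime_to_seconds 40:00:00   # prints 144000
walltime_to_seconds 1:00:00    # prints 3600
```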

Sample large memory serial job:

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=16gb
#PBS -N cdnz3d
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/Beowulf/cdnz3d
cp cdnz3d cdin.dat acq.dat cdnz3d.in $TMPDIR
cd $TMPDIR
./cdnz3d > cdnz3d.hist
cp cdnz3d.hist cdnz3d.out $HOME/Beowulf/cdnz3d
ja

Single-node jobs that request 16 GB or more of memory will be scheduled on the quad-socket large memory nodes. The maximum amount of memory available on a node is 64 GB.

Sample large disk serial job:

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l file=96gb
#PBS -N cdnz3d
#PBS -j oe
#PBS -S /bin/ksh

cd $HOME/Beowulf/cdnz3d
cp cdnz3d cdin.dat acq.dat cdnz3d.in $TMPDIR
cd $TMPDIR
./cdnz3d > cdnz3d.hist
cp cdnz3d.hist cdnz3d.out $HOME/Beowulf/cdnz3d
ja

Single-node jobs that request more than 45 GB of temporary space will be scheduled on the quad-socket nodes. The maximum amount of local disk space available on a node is 1800 GB; jobs in need of more temporary space than that must use the /fs/pvfs parallel file system instead.

Estimating Queue Time

To get an estimate of how long before a job (identified by jobid) starts, use the following command:

     showstart [jobid]

This will query the Moab scheduler for an estimate of the job's start time. Please keep in mind that this is an estimate and may change over time, depending on system load and other factors.

Programming environment

Glenn supports two programming models of parallel execution: shared memory on exactly one node, through compiler directives and automatic parallelization; and distributed memory across multiple nodes, through message passing. See the sections below for more information.

Compiling systems

FORTRAN 77, Fortran 90, C, and C++ are supported on the IBM Opteron cluster. The IBM Opteron cluster has the Intel and Portland Group suites of optimizing compilers, which tend to generate faster code than that generated by the standard GNU compilers.

The following examples produce the Linux executable a.out for each type of source file for the Portland Group and Intel compilers. Options which have been found to produce good performance with many (though not necessarily all) programs are given under "Recommended Options".

Language    Portland Group    Recommended Options                                                Intel             Recommended Options
C           pgcc sample.c     -Xa -tp x64 -fast -Mvect=assoc,cachesize:1048576                   icc sample.c      -O2 -ansi
C++         pgCC sample.C     -A -fast -tp x64 -Mvect=assoc,cachesize:1048576 --prelink-objects  icpc sample.C     -O2 -ansi
FORTRAN 77  pgf77 sample.f    -fast -Mvect=assoc,cachesize:1048576                               ifort sample.f    -O2
Fortran 90  pgf90 sample.f90  -fast -Mvect=assoc,cachesize:1048576                               ifort sample.f90  -O2

For more information on command-line options for each compiling system, see the manual pages (man pgf77, man icpc, etc.).

Shared memory

Users can automatically optimize single-node sequential programs for shared-memory parallel execution using the Portland Group -Mconcur or Intel -parallel compiler option.

pgf77 -O2 -Mconcur sample.f            
pgf90 -O2 -Mconcur sample.f90            
pgcc -O2 -Mconcur sample.c            
pgCC -O2 -Mconcur sample.C                
ifort -O2 -parallel sample.f            
ifort -O2 -parallel sample.f90            
icc -O2 -parallel sample.c            
icpc -O2 -parallel sample.C            

In addition to automatic parallelization, both the Fortran and C/C++ compilers understand the OpenMP set of directives, which give the programmer a finer control over the parallelization. The -mp (Portland Group) and -openmp (Intel) compiler options activate translation of source-level OpenMP directives and pragmas.

A sample batch script appears below. The request first copies a Fortran file from a subdirectory of the user's home directory to the temporary space. It then compiles the file for OpenMP threaded execution, runs the executable using 4 threads on 1 node, and copies the results back to the previous subdirectory. Notice that the careful use of full file names allows this request to be submitted safely from any subdirectory.

#PBS -l walltime=1:00:00           
#PBS -l nodes=1:ppn=4           
#PBS -N my_job           
#PBS -S /bin/ksh           
#PBS -j oe               
cd $TMPDIR           
cp $HOME/science/my_program.f .           
pgf77 -O2 -mp my_program.f           
export OMP_NUM_THREADS=4           
./a.out > my_results           
cp my_results $HOME/science    

Message Passing Interface (MPI)

The system uses the MPICH implementation of the Message Passing Interface (MPI), optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on MPI, see the Training section of the OSC website.

Each program file using MPI must include the MPI header file. The following statement must appear near the beginning of each C or Fortran source file, respectively.

#include <mpi.h>      (C)
include 'mpif.h'      (Fortran)

To compile an MPI program, use the MPI wrapper scripts which invoke the Portland Group or Intel compilers depending on which module is loaded prior to executing the compilation command. The MPI compilers take the same options as the compiler they wrap. Here are some examples which produce an executable named a.out:

mpif77 sample.f                 
mpif90 sample.f90               
mpicc sample.c                
mpiCC sample.C

Use the mpiexec command to run the resulting executable in a batch job; this command will automatically determine how many processors to use based on your batch request.

mpiexec a.out    

Here is an example of an MPI job which uses 8 of the Infiniband-equipped nodes on the IBM Opteron cluster:

#PBS -l walltime=1:00:00           
#PBS -l nodes=8:ppn=4           
#PBS -N my_job           
#PBS -S /bin/ksh           
#PBS -j oe               
cd $HOME/science           
mpif77 -O3 mpiprogram.f           
pbsdcp a.out $TMPDIR           
cd $TMPDIR           
mpiexec ./a.out > my_results           
cp my_results $HOME/science

Jobs that request a large number of nodes (for instance, more than 100) are very difficult to schedule and may sit in the queue for a very long time. In practice it is best to start out requesting nodes=2:ppn=4 and then increase the number of nodes as you are able to confirm that your code's performance scales up with larger numbers of processors.
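
A scaling study along these lines can be prepared by generating one submission command per node count. The sketch below only prints the commands instead of submitting them; mpi_job.pbs is a hypothetical job script name, and the doubling sequence up to 16 nodes is just one possible choice.

```shell
#!/bin/sh
# Sketch: print (not submit) qsub commands for a small scaling study.
# mpi_job.pbs is a hypothetical job script name.
nodes=2
while [ "$nodes" -le 16 ]; do
    echo "qsub -l nodes=${nodes}:ppn=4 -N scale_${nodes} mpi_job.pbs"
    nodes=$((nodes * 2))
done
```

Reviewing the printed commands before running them makes it easy to stop the series at the node count where performance no longer improves.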

mpiexec will normally spawn one MPI process per CPU requested in a batch job. However, this behavior may be modified with the -pernode command line option, which requests that one MPI process be spawned per node. This option is intended for codes which mix MPI message passing with some form of shared memory programming model, such as OpenMP or POSIX threads.

If you wish to use fewer than the assigned number of processors, set the -n option to mpiexec to the required number. Here is an example:

#PBS -l walltime=1:00:00            
#PBS -l nodes=5:ppn=4            
#PBS -N my_job            
#PBS -S /bin/ksh            
#PBS -j oe                
...            
mpiexec -n 19 a.out            
# running 19 MPI processes    
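
Rather than hard-coding the process count, a job script can compute it from the list of assigned processors. The sketch below assumes the TORQUE convention that $PBS_NODEFILE names a file with one host name per assigned processor. Outside a batch job that variable is not set, so the sketch fabricates a small file with made-up host names purely so that it can be tried anywhere.

```shell
#!/bin/sh
# Sketch: count the nodes and processors assigned to a job.
# Inside a TORQUE job, $PBS_NODEFILE lists one host name per assigned processor.
# Outside a job we fabricate a sample file; the host names in it are made up.
sample=""
if [ -z "${PBS_NODEFILE:-}" ]; then
    sample=$(mktemp)
    for host in nodeA nodeB nodeC nodeD nodeE; do
        for slot in 1 2 3 4; do echo "$host"; done
    done > "$sample"
    PBS_NODEFILE=$sample
fi

nprocs=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
nnodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')
echo "$nnodes nodes, $nprocs processors"
echo "mpiexec -n $((nprocs - 1)) ./a.out"   # the command you would run

if [ -n "$sample" ]; then rm -f "$sample"; fi
```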

If you wish to run one MPI process on each node for benchmarking or multithreading purposes, you need to continue specifying ppn=4, but add the -pernode option to mpiexec. Here is an example:

#PBS -l walltime=1:00:00            
#PBS -l nodes=5:ppn=4            
#PBS -N my_job            
#PBS -S /bin/ksh            
#PBS -j oe	                
...		            
mpiexec -pernode a.out             
# running 5 MPI processes, one on each node    
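
For a code that mixes MPI with OpenMP, the -pernode option is typically combined with a thread count that matches ppn. The sketch below writes such a job script to a file so it can be inspected before submission. The names hybrid.pbs, hybrid_job, and ./hybrid_prog are hypothetical, and whether OMP_NUM_THREADS reaches the processes on the other nodes depends on how the installed mpiexec propagates the environment, which is assumed here rather than confirmed.

```shell
#!/bin/sh
# Sketch: write a hybrid MPI + OpenMP job script to a file for inspection.
# hybrid.pbs, hybrid_job and ./hybrid_prog are hypothetical names.
cat > hybrid.pbs <<'EOF'
#PBS -l walltime=1:00:00
#PBS -l nodes=5:ppn=4
#PBS -N hybrid_job
#PBS -S /bin/ksh
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4          # one thread per requested processor
mpiexec -pernode ./hybrid_prog    # one MPI process per node
EOF

grep -c '^#PBS' hybrid.pbs        # counts the directive lines in the file
```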

The pbsdcp command used in the 8-node MPI example earlier in this section is a distributed copy command; it copies the listed file or files to the specified destination (the last argument) on each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as /tmp or $TMPDIR.

Debugging

The GNU debugger gdb is recommended for interactive or post-mortem analysis of sequential programs. To debug a program with gdb, first compile the program with the -g option.

pgf77 -g program.f            
pgf90 -g program.f90            
pgcc -g program.c            
pgCC -g program.C    

To debug a program interactively, run the debugger on the appropriate executable.

gdb a.out    

To analyze a core file after an unsuccessful execution, run the debugger on the core file and supply the executable that generated the file.

gdb a.out core
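
A core file is only written if the shell's core file size limit allows it, and a backtrace can be printed without entering the interactive debugger. The sketch below is an illustration under the assumption that the executable is named a.out and the core file is named core in the current directory; the gdb step is skipped if either is missing or gdb is not installed.

```shell
#!/bin/sh
# Sketch: check that core files are enabled, then print a backtrace non-interactively.
limit=$(ulimit -c)
echo "core file size limit: $limit"
if [ "$limit" = "0" ]; then
    echo "core dumps are disabled; 'ulimit -c unlimited' enables them for this shell"
fi

# Run gdb in batch mode only when gdb, the executable and a core file are all present.
if command -v gdb >/dev/null 2>&1 && [ -f core ] && [ -x ./a.out ]; then
    gdb -batch -ex bt ./a.out core
fi
```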

A graphical front-end for gdb and other command line debuggers, the Data Display Debugger (DDD), is also available through the ddd command. As with gdb, the program must first be compiled with the -g option as given above. To debug a program interactively, run the debugger on the appropriate executable.

ddd a.out

Further information and documentation on DDD can be found at http://www.gnu.org/software/ddd.

The totalview debugger is designed to run on parallel programs using MPI, OpenMP, or pthreads. The user interacts with totalview via a graphical user interface (GUI). All OSC clusters are designed to run compiled parallel code via the PBS batch system. Using the standard batch submission process, a user cannot interact directly with their running program. However, PBS also permits running in interactive batch mode, which allows the user to use GUI programs such as totalview to run a parallel code. The resource (memory, CPU) limits for an interactive batch job are the same as the standard batch limits for that user. The following is a sample interactive batch script named mybatchfile:

#PBS -j oe            
#PBS -N totalview            
#PBS -S /bin/ksh            
#PBS -l nodes=2:ppn=4            
#PBS -l walltime=1:00:00            
#PBS -v DISPLAY    

There is no script section as this is intended to run interactively. The PBS lines are there to request resources. On the command line use qsub to request an interactive shell:

>> qsub -I mybatchfile            
qsub: waiting for job 0.opt-batch.osc.edu to start            
qsub: job 0.opt-batch.osc.edu ready    

The same request may also be accomplished without a batch file by typing all the resource requests directly on the command line:

>> qsub -I -v DISPLAY -l nodes=2:ppn=4 -l walltime=1:00:00 -j oe -N totalview -S /bin/ksh    

Once you have an interactive shell on one of the compute nodes, you can treat this shell like any other shell, except for all the extended environment variables under PBS, like $PBS_O_WORKDIR, $TMPDIR, etc. To invoke totalview you can run mpiexec with the -tv option on your MPI program:

[optXXXX]% mpiexec -tv myMPIprogram    

For more information on using interactive batch see the manual page for qsub.

Within totalview, you can set breakpoints and examine variables on a per-process basis.
