Supercomputing Environments

Using Glenn, the IBM Opteron Cluster at OSC

The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges, universities, and companies. OSC's IBM Cluster 1350, named "Glenn", includes AMD Opteron multi-core technologies and blade systems based on the IBM Cell Broadband Engine processor. This allows Ohio researchers and industries to easily use this new hybrid HPC architecture. The system offers a peak performance of more than 22 trillion floating point operations per second and a variety of memory and processor configurations.

The Phase One Glenn system was decommissioned on December 14, 2011.

Please follow HPC Notices on Twitter; they are specifically designed to keep OSC clients up to date on system downtimes, outages, maintenance, and software updates. A full list of current system notices is also available.

Current Glenn System Specifications:
There are 36 GPU-capable nodes on Glenn, connected to 18 Quadro Plex S4 units for a total of 72 CUDA-enabled graphics devices. Each node has access to two Quadro FX 5800-level graphics cards.
When testing these nodes, it may be beneficial to use the newer version of the PGI compilers. The quad-core systems support the SSE4a opcodes, which support wider SIMD instructions for floating point operations. This doubles the number of floating point operations that can be retired per clock cycle from 2 to 4. To change compiler versions, use the command:

    module switch pgi pgi-9.0-1

Please see the hardware section for current system specifications.

Getting started

To log in to Glenn at OSC, ssh to the following hostname:

    glenn.osc.edu

From there, you have access to the compiling systems, performance-analysis tools, and debugging tools. You can run programs interactively or through batch requests. See the following sections for details.

File system

Glenn accesses the user home directories found on the OSC mass storage environment. Therefore, users have the same home directory on Glenn as on the Itanium 2 cluster.

The system also has fast local disk space intended for temporary files. You are encouraged to perform the majority of your work in the temporary space and to store only permanent files in your home directory. To ensure fast access to required files, copy the files to the temporary area at the start of your session. The following example shows how to use /tmp, the temporary directory.
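As a minimal sketch of that pattern, assuming a POSIX-compatible shell: the helper name run_in_tmp is hypothetical (it is not an OSC-provided command). It creates a private directory under /tmp, copies the working files in, runs a command there, copies everything back to permanent storage, and removes the scratch directory.

```shell
# run_in_tmp SRC_DIR COMMAND [ARGS...]
# Hypothetical helper (not an OSC command): stage SRC_DIR into a private
# directory under /tmp, run COMMAND there, then copy everything back.
run_in_tmp() {
    src_dir=$1
    shift
    # Create a unique scratch directory so concurrent sessions do not collide.
    scratch=$(mktemp -d "/tmp/${USER:-scratch}.XXXXXX") || return 1
    # Copy the required files to the temporary area before starting work.
    cp -R "$src_dir"/. "$scratch"/ || { rm -rf "$scratch"; return 1; }
    # Run the command inside the scratch directory; the subshell keeps our cwd.
    ( cd "$scratch" && "$@" )
    status=$?
    # /tmp is not backed up: save results to permanent storage, then clean up.
    cp -R "$scratch"/. "$src_dir"/
    rm -rf "$scratch"
    return $status
}
```

For instance, run_in_tmp $HOME/science pgf77 -O3 my_program.f -o mysci would compile in the temporary area and leave the executable in $HOME/science; the directory and file names here are placeholders.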
Use this procedure when compiling and executing interactively. The temporary space is not backed up, and old files may be purged when the temporary file system gets full. A simpler procedure is available for batch jobs through the TMPDIR environment variable. See "Batch requests" for more information.

There are times when $TMPDIR has insufficient resources. After system needs such as swap space are accounted for, between 45 GB and 1.8 TB of local temporary disk space is available on each node. Jobs that require significant amounts of temporary disk space (more than 10 GB) in $TMPDIR should request it with the PBS -l disk=amount directive described below. Any job requiring either more than 1.8 TB of temporary space or shared temporary space should use the /fs/pvfs parallel file system, a high-performance, high-capacity shared temporary space. For more information on parallel file system usage, please consult the web page for PVFS at OSC.

Executing programs

Commands on Glenn can be executed either interactively or through batch requests. There are fixed usage limits for interactive execution; jobs that take more than the allowed CPU time must be executed using batch requests. Current interactive limits are 2 hours of CPU time and 1 GB of memory. To use the resources of the cluster most efficiently, you are encouraged to use batch requests whenever possible. See "Batch requests" for more information.

For information on how to execute an MPI program, see the "MPI" section. To execute a non-MPI program, simply enter the name of the executable. Unless otherwise specified, the number of processors used for a non-MPI parallel program is determined by the operating system at runtime. To control the number of processors, set the environment variable OMP_NUM_THREADS. If the number of available processors (four per node) is less than OMP_NUM_THREADS, then at least one processor will run multiple threads.
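To avoid running more threads than there are processors, the thread count can be capped at the number of processors the operating system reports. The following is only a sketch: it assumes a system where getconf _NPROCESSORS_ONLN is available, and the variable names wanted and avail are illustrative.

```shell
# Desired thread count; defaults to the four processors per node noted above.
wanted=${OMP_NUM_THREADS:-4}
# Number of processors currently online (assumes getconf supports this name).
avail=$(getconf _NPROCESSORS_ONLN)
# Use the smaller of the two so no processor has to run more than one thread.
if [ "$wanted" -gt "$avail" ]; then
    wanted=$avail
fi
export OMP_NUM_THREADS=$wanted
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Note that getconf reports the processors on the whole node, so inside a batch job the ppn value you requested is the more appropriate upper limit.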
The following ksh example causes a.out to use 4 processors if they are available.

    export OMP_NUM_THREADS=4
    ./a.out

The omp_get_num_threads() function can be called within a parallel region of a program to determine the number of threads assigned to that program.

    integer function omp_get_num_threads()    (Fortran)
    int omp_get_num_threads(void);            (C)

Batch requests

Batch requests are handled by the TORQUE resource manager and Moab Scheduler. Use the qsub command to submit a batch request, qstat to view the status of your requests, and qdel to delete unwanted requests. For more information, see the manual pages for each command.

The following options are often useful when submitting batch requests. The options may appear on the qsub command line or preceded by #PBS at the beginning of the batch request file.
By default, your batch jobs begin execution in your home directory. This is true even if you submit the job from another directory.

To facilitate the use of temporary disk space, a unique temporary directory is automatically created at the beginning of each batch job. This directory is also automatically removed at the end of the job. Therefore, it is critical that all files required for further analysis be copied back to permanent storage in your $HOME area prior to the end of your batch script. You access the directory through the $TMPDIR environment variable. Note that in jobs using more than one node, $TMPDIR is not shared: each node has its own distinct instance of $TMPDIR.

Single-CPU sequential jobs should either set the -l nodes resource limit to 1:ppn=1 or leave it unset entirely. The following is an example of a sequential job which uses $TMPDIR as its working area.

    #PBS -l walltime=40:00:00
    #PBS -l nodes=1:ppn=1
    #PBS -N myscience
    #PBS -j oe
    #PBS -S /bin/ksh
    # Copy the source and input files to the job's temporary directory.
    cd $HOME/science
    cp my_program.f mysci.in $TMPDIR
    cd $TMPDIR
    # Compile and run, timing the run and capturing its output.
    pgf77 -O3 my_program.f -o mysci
    /usr/bin/time ./mysci > mysci.hist
    # Copy the results back before $TMPDIR is removed.
    cp mysci.hist mysci.out $HOME/science

If you have the above request saved in a file named my_request.job (and my_program.f and mysci.in saved in a subdirectory called science/), the following command will submit the request.

    opt-login01:~> qsub my_request.job
    1151787.opt-batch.osc.edu

You can use the qstat command to monitor the progress of the resulting batch job. In the above example, the number 1151787 is the job identifier, or jobid. When the job finishes, mysci.hist and mysci.out will appear in the science subdirectory, and the standard output generated by the job will appear in a file called myscience.oN, where N is the jobid. The N differentiates multiple submissions of the same job, since each submission generates a different number. This file will appear in the directory where you executed the qsub command.
The directory from which you execute the qsub command is available in the environment variable $PBS_O_WORKDIR, but only from within a PBS batch script.

All batch jobs must set the -l walltime resource limit, as this allows the Moab Scheduler to backfill small, short-running jobs in front of larger, longer-running jobs. This in turn helps improve turnaround time for all jobs.

Sample large memory serial job:
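A sketch of such a request, written as a shell snippet that saves the request to a file. The -l mem resource is standard TORQUE syntax and the 16gb value matches the large-memory threshold described in this section; the file name bigmem.job, the job name, the walltime, and the program name bigmem_prog are all hypothetical.

```shell
# Write a hypothetical large-memory request file; all names are placeholders.
cat > bigmem.job <<'EOF'
#PBS -l walltime=10:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=16gb
#PBS -N bigmem
#PBS -j oe
#PBS -S /bin/ksh
# Start in the directory from which qsub was run.
cd $PBS_O_WORKDIR
./bigmem_prog > bigmem.out
EOF
# On a login node the request would then be submitted with: qsub bigmem.job
echo "wrote bigmem.job"
```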
Single-node jobs that request 16 GB or more of memory will be scheduled on the quad-socket large memory nodes. The maximum amount of memory available on a node is 64 GB.

Sample large disk serial job:
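A sketch of such a request, assuming the -l disk=amount form named in the "File system" section; check the current OSC documentation for the exact resource name before relying on it. The 100gb amount, the file name bigdisk.job, the walltime, and the program name bigdisk_prog are hypothetical.

```shell
# Write a hypothetical large-disk request file; all names are placeholders.
cat > bigdisk.job <<'EOF'
#PBS -l walltime=10:00:00
#PBS -l nodes=1:ppn=1
#PBS -l disk=100gb
#PBS -N bigdisk
#PBS -j oe
#PBS -S /bin/ksh
# Work in the node-local scratch directory created for this job.
cd $TMPDIR
cp $HOME/science/bigdisk_prog .
./bigdisk_prog > bigdisk.out
# $TMPDIR is removed when the job ends, so save the results first.
cp bigdisk.out $HOME/science
EOF
echo "wrote bigdisk.job"
```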
Single-node jobs that request more than 45 GB of temporary space will be scheduled on the quad-socket nodes. The maximum amount of local disk space available on a node is 1800 GB; jobs in need of more temporary space than that must use the /fs/pvfs parallel file system instead.

Estimating Queue Time

To get an estimate of how long before a job (identified by jobid) starts, use the following command:

    showstart [jobid]

This will query the Moab scheduler for an estimate of the job's start time. Please keep in mind that this is an estimate and may change over time, depending on system load and other factors.

Programming environment

Glenn supports two programming models of parallel execution: shared memory on exactly one node, through compiler directives and automatic parallelization; and distributed memory across multiple nodes, through message passing. See the sections below for more information.

Compiling systems

FORTRAN 77, Fortran 90, C, and C++ are supported on the IBM Opteron cluster. The IBM Opteron cluster has the Intel and Portland Group suites of optimizing compilers, which tend to generate faster code than that generated by the standard GNU compilers.

The following examples produce the Linux executable a.out for each type of source file for the Portland Group and Intel compilers. Options which have been found to produce good performance with many (though not necessarily all) programs are given under "Recommended Options".
For more information on command-line options for each compiling system, see the manual pages (man pgf77, man icpc, etc.).

Shared memory

Users can automatically optimize single-node sequential programs for shared-memory parallel execution using the Portland Group -Mconcur or Intel -parallel compiler option.

    pgf77 -O2 -Mconcur sample.f
    pgf90 -O2 -Mconcur sample.f90
    pgcc -O2 -Mconcur sample.c
    pgCC -O2 -Mconcur sample.C
    ifort -O2 -parallel sample.f
    ifort -O2 -parallel sample.f90
    icc -O2 -parallel sample.c
    icpc -O2 -parallel sample.C

In addition to automatic parallelization, both the Fortran and C/C++ compilers understand the OpenMP set of directives, which give the programmer finer control over the parallelization. The -mp (Portland Group) and -openmp (Intel) compiler options activate translation of source-level OpenMP directives and pragmas.

A sample batch script appears below. The request first copies a Fortran file from a subdirectory of the user's home directory to the temporary space. It then compiles the file for OpenMP threaded execution, runs the executable using 4 threads on 1 node, and copies the results back to the previous subdirectory. Notice that the careful use of full file names allows this request to be submitted safely from any subdirectory.

    #PBS -l walltime=1:00:00
    #PBS -l nodes=1:ppn=4
    #PBS -N my_job
    #PBS -S /bin/ksh
    #PBS -j oe
    cd $TMPDIR
    cp $HOME/science/my_program.f .
    pgf77 -O2 -mp my_program.f
    # Run with 4 OpenMP threads, matching ppn=4 above.
    export OMP_NUM_THREADS=4
    ./a.out > my_results
    cp my_results $HOME/science

Message Passing Interface (MPI)

The system uses the MPICH implementation of the Message Passing Interface (MPI), optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on MPI, see the Training section of the OSC website.

Each program file using MPI must include the MPI header file.
The following statement must appear near the beginning of each C or Fortran source file, respectively.

    #include <mpi.h>
    include 'mpif.h'

To compile an MPI program, use the MPI wrapper scripts, which invoke the Portland Group or Intel compilers depending on which module is loaded prior to executing the compilation command. The MPI compilers take the same options as the compiler they wrap. Here are some examples which produce an executable named a.out:

    mpif77 sample.f
    mpif90 sample.f90
    mpicc sample.c
    mpiCC sample.C

Use the mpiexec command to run the resulting executable in a batch job; this command will automatically determine how many processors to use based on your batch request.

    mpiexec a.out

Here is an example of an MPI job which uses 8 of the Infiniband-equipped nodes on the IBM Opteron cluster:

    #PBS -l walltime=1:00:00
    #PBS -l nodes=8:ppn=4
    #PBS -N my_job
    #PBS -S /bin/ksh
    #PBS -j oe
    cd $HOME/science
    mpif77 -O3 mpiprogram.f
    pbsdcp a.out $TMPDIR
    cd $TMPDIR
    mpiexec ./a.out > my_results
    cp my_results $HOME/science

Jobs that request a large number of nodes (for instance, more than 100) are very difficult to schedule and may sit in the queue for a very long time. In practice it is best to start out requesting nodes=2:ppn=4 and then increase the number of nodes as you are able to confirm that your code's performance scales up with larger numbers of processors.

mpiexec will normally spawn one MPI process per CPU requested in a batch job. However, this behavior may be modified with the -pernode command line option, which requests that one MPI process be spawned per node. This option is intended for codes which mix MPI message passing with some form of shared memory programming model, such as OpenMP or POSIX threads.

If you wish to use fewer than the assigned number of processors, set the -n option to mpiexec to the required number. Here is an example:

    #PBS -l walltime=1:00:00
    #PBS -l nodes=5:ppn=4
    #PBS -N my_job
    #PBS -S /bin/ksh
    #PBS -j oe
    ...
    mpiexec -n 19 a.out    # running 19 MPI processes

If you wish to run one MPI process on each node for benchmarking or multithreading purposes, you need to continue specifying ppn=4, but add the -pernode option to mpiexec. Here is an example:

    #PBS -l walltime=1:00:00
    #PBS -l nodes=5:ppn=4
    #PBS -N my_job
    #PBS -S /bin/ksh
    #PBS -j oe
    ...
    mpiexec -pernode a.out    # running 5 MPI processes, one on each node

The pbsdcp command used in the 8-node example above is a distributed copy command; it copies the listed file or files to the specified destination (the last argument) on each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as /tmp or $TMPDIR.

Debugging

The GNU debugger gdb is recommended for interactive or post-mortem analysis of sequential programs. To debug a program with gdb, first compile the program with the -g option.

    pgf77 -g program.f
    pgf90 -g program.f90
    pgcc -g program.c
    pgCC -g program.C

To debug a program interactively, run the debugger on the appropriate executable.

    gdb a.out

To analyze a core file after an unsuccessful execution, run the debugger on the core file and supply the executable that generated the file.

    gdb a.out core

Data Display Debugger (DDD) is a graphical front-end for command line debuggers such as gdb, and is available as the ddd command. As with gdb, the program must first be compiled with the -g option as given above. To debug a program interactively, run the debugger on the appropriate executable.

    ddd a.out

Further information and documentation on DDD can be found at http://www.gnu.org/software/ddd.

The totalview debugger is designed to run on parallel programs using MPI, OpenMP, or pthreads. The user interacts with totalview via a graphical user interface (GUI). All OSC clusters are designed to run compiled parallel code via the PBS batch system.
Using the standard batch submission process, a user cannot interact directly with their running program. However, PBS also permits running in interactive batch mode, which allows the user to run a parallel code under GUI programs such as totalview. The resource (memory, CPU) limits for an interactive batch job are the same as the standard batch limits for that user. The following is a sample interactive batch script named mybatchfile:

    #PBS -j oe
    #PBS -N totalview
    #PBS -S /bin/ksh
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=1:00:00
    #PBS -v DISPLAY

There is no script section, as this is intended to run interactively. The PBS lines are there to request resources. On the command line, use qsub to request an interactive shell:

    >> qsub -I mybatchfile
    qsub: waiting for job 0.opt-batch.osc.edu to start
    qsub: job 0.opt-batch.osc.edu ready

The same request may also be accomplished without a batch file by typing all the resource requests directly on the command line:

    >> qsub -I -v DISPLAY -l nodes=2:ppn=4 -l walltime=1:00:00 -j oe -N totalview -S /bin/ksh

Once you have an interactive shell on one of the compute nodes, you can treat it like any other shell, with the addition of the PBS environment variables such as $PBS_O_WORKDIR and $TMPDIR. To invoke totalview, run mpiexec with the -tv option on your MPI program:

    [optXXXX]% mpiexec -tv myMPIprogram

For more information on using interactive batch, see the manual page for qsub. Within totalview, you can set breakpoints and examine variables on a per-process basis.
