Supercomputing Environments

Using the BALE Cluster at OSC

The Ohio Supercomputer Center (OSC) provides supercomputing services to Ohio colleges, universities, and companies. One of the high-performance computing systems available at OSC is the BALE Cluster. This cluster is not considered a production system because it is used primarily for workshops and special programs. Researchers are welcome to use the cluster, but computations are limited to three days, or, in batch terms, seventy-two hours.

The BALE Cluster is a cluster of commodity servers with a high-speed network interconnect. The current configuration is one front-end node and fifty-five compute nodes. Each node has an AMD Athlon 64 X2 dual-core processor running at 2.2 GHz, 4 GB of memory, and a 250 GB hard drive with 16MB scratch space. All of the nodes are connected using Infiniband, a switched 16 Gbit/s network.

Getting started

To log in to the BALE cluster at OSC, ssh to the following hostname:

bale-login.ovl.osc.edu

From there, you have access to the compiling systems, performance-analysis tools, and debugging tools. You can run programs interactively or through batch requests. See the following sections for details.

File system

The BALE cluster accesses the user home directories found on the OSC mass storage environment. Therefore, users have the same home directory on the BALE cluster as on the IBM Opteron and Itanium 2 clusters.

The BALE cluster also has fast local disk space intended for temporary files. You are encouraged to perform the majority of your work in the temporary space and to store only permanent files in your home directory. To ensure fast access to required files, copy the files to the temporary area at the start of your session. The following example shows how to use /tmp, the temporary directory.
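As a minimal sketch of staging work through /tmp, assuming a source directory that stands in for a directory under your home: SRC, WORK, and the file name mysci.in are hypothetical placeholders, and the tr command stands in for the real compile-and-run step. A demo input file is created so that the sketch runs anywhere.

```shell
#!/bin/sh
# SRC stands in for a directory under $HOME (hypothetical placeholder).
SRC=${SRC:-$(mktemp -d)}
echo "sample input" > "$SRC/mysci.in"

# Create a private working directory under /tmp and stage the input there.
WORK=$(mktemp -d /tmp/work.XXXXXX)
cp "$SRC/mysci.in" "$WORK"
cd "$WORK"

# Placeholder for the real work (compile and execute your program here).
tr 'a-z' 'A-Z' < mysci.in > mysci.out

# Copy permanent results back, then clean up the temporary space.
cp mysci.out "$SRC"
cd "$SRC"
rm -rf "$WORK"
```

Removing the working directory at the end matters because the temporary space is shared and is not backed up.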
Use this procedure when compiling and executing interactively. The temporary space is not backed up, and old files may be purged when the temporary file system gets full. A simpler procedure is available for batch jobs through the TMPDIR environment variable. See "Batch requests" for more information.

There are times when $TMPDIR has insufficient space. After system requirements such as swap space take their share of the hard drive, approximately 45 GB of scratch disk space is available on each node. Any output requiring more space should use the /pvfs parallel file system, which is a high-performance, high-capacity temporary space. For more information, consult the OSC manual for pvfs.

Executing programs

Commands on the BALE cluster can be executed either interactively or through batch requests. The BALE cluster has fixed usage limits for interactive execution; jobs that take more than the allowed CPU time must be executed using batch requests. To use the resources of the cluster most efficiently, you are encouraged to use batch requests whenever possible. See "Batch requests" for more information.

For information on how to execute an MPI program, see the "MPI" section. To execute a non-MPI program, simply enter the name of the executable.

Unless otherwise specified, the number of processors used for a non-MPI parallel program is determined by the operating system at runtime. To control the number of threads, set the environment variable OMP_NUM_THREADS. If the number of available processors (two per node) is less than OMP_NUM_THREADS, then at least one processor will run multiple threads. The following ksh example causes a.out to run with 4 threads, one per processor if that many processors are available.

export OMP_NUM_THREADS=4
./a.out
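The omp_get_num_threads() function can be called within a program to determine the number of threads assigned to that program. Called inside a parallel region, it returns the number of threads in the current team; outside a parallel region it returns 1.

integer function omp_get_num_threads() (Fortran)
int omp_get_num_threads(void); (C)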
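To avoid running several threads on one processor, the thread count can be capped at the number of processors on the node. The following is a sketch only: pick_threads is a hypothetical helper, not an OSC tool, and the processor count is read with getconf, which is an assumption about the node's operating system.

```shell
#!/bin/sh
# pick_threads WANTED AVAILABLE: print the smaller of the two numbers.
# (Hypothetical helper for illustration.)
pick_threads() {
  wanted=$1
  available=$2
  if [ "$wanted" -gt "$available" ]; then
    echo "$available"
  else
    echo "$wanted"
  fi
}

# Number of processors currently online on this node.
available=$(getconf _NPROCESSORS_ONLN)

# Ask for 4 threads, but never more than the node can run at once.
OMP_NUM_THREADS=$(pick_threads 4 "$available")
export OMP_NUM_THREADS
echo "running with $OMP_NUM_THREADS thread(s)"
# ./a.out   # uncomment to launch the OpenMP executable
```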

Batch requests

Batch requests are handled by the TORQUE resource manager and Moab Scheduler. Use the qsub command to submit a batch request, qstat to view the status of your requests, and qdel to delete unwanted requests. For more information, see the manual pages for each command.

The following options are often useful when submitting batch requests. The options may appear on the qsub command line or preceded by #PBS at the beginning of the batch-request file.
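The options used by the examples in this guide can be summarized as a batch-request header. This is a sketch rather than a complete list, and the values shown (time limit, node count, job name) are placeholders.

```shell
#PBS -l walltime=1:00:00   # wall-clock time limit, hours:minutes:seconds
#PBS -l nodes=1:ppn=2      # number of nodes and processors per node
#PBS -N my_job             # job name, used to name the output file
#PBS -j oe                 # join standard error into standard output
#PBS -S /bin/ksh           # shell used to interpret the request
```

The same options can be given on the command line instead, for example qsub -l walltime=1:00:00 -l nodes=1:ppn=2 -N my_job -j oe -S /bin/ksh my_request.job.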
By default, your batch jobs begin execution in your home directory. This is true even if you submit the job from another directory. To facilitate the use of temporary disk space, a unique temporary directory is automatically created at the beginning of each batch job and automatically removed at the end of the job. You access the directory through the TMPDIR environment variable. Note that in jobs using more than one node, $TMPDIR is not shared: each node has its own distinct instance of $TMPDIR.

Single-CPU sequential jobs should either set the -l nodes resource limit to 1:ppn=1 or leave it unset entirely. The following is an example of a sequential job which uses $TMPDIR as its working area.

#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -N myscience
#PBS -j oe
#PBS -S /bin/ksh
cd $HOME/science
cp my_program.f mysci.in $TMPDIR
cd $TMPDIR
pgf77 -O3 my_program.f -o mysci
/usr/bin/time ./mysci > mysci.hist
cp mysci.hist mysci.out $HOME/science

If you have the above request saved in a file named my_request.job (and my_program.f and mysci.in saved in a subdirectory called science/), the following command will submit the request.

qsub my_request.job

You can use the qstat command to monitor the progress of the resulting batch job. When the job finishes, mysci.hist and mysci.out will appear in the science subdirectory, and the standard output generated by the job will appear in a file called myscience.oN, where N is a numeric jobid assigned by the batch system. Because each submission generates a different number, N distinguishes multiple submissions of the same job. This file will appear in the directory where you executed the qsub command.

All batch jobs must set the -l walltime resource limit, as this allows the Moab Scheduler to backfill small, short-running jobs in front of larger, longer-running jobs. This in turn helps improve turnaround time for all jobs.

Estimating Queue Time

To get an estimate of how long it will be before a job (identified by jobid) starts, use the following command:

showstart [jobid]

This will query the Moab scheduler for an estimate of the job's start time. Please keep in mind that this is an estimate and may change over time, depending on system load and other factors.

Programming environment

The BALE cluster supports two programming models of parallel execution: shared memory within a single node, through compiler directives and automatic parallelization; and distributed memory across multiple nodes, through message passing. See the sections below for more information.

Compiling systems

FORTRAN 77, Fortran 90, C, and C++ are supported on the BALE cluster. The BALE cluster has the Portland Group suite of optimizing compilers, which tend to generate faster code than that generated by the standard GNU compilers. The following examples produce the Linux executable a.out for each type of source file.
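As a sketch of which compiler driver goes with which source type, using the driver names and sample file names that appear elsewhere in this guide: compiler_for is a hypothetical helper written for illustration, and the loop only prints each compile command rather than running it.

```shell
#!/bin/sh
# compiler_for FILE: print the Portland Group driver for a source file,
# chosen by file extension. (Hypothetical helper for illustration.)
compiler_for() {
  case "$1" in
    *.f)   echo pgf77 ;;   # FORTRAN 77
    *.f90) echo pgf90 ;;   # Fortran 90
    *.c)   echo pgcc ;;    # C
    *.C)   echo pgCC ;;    # C++
    *)     echo "unknown source type: $1" >&2; return 1 ;;
  esac
}

# Print the compile command for each sample file; with no -o option,
# each command writes its executable to a.out.
for src in sample.f sample.f90 sample.c sample.C; do
  echo "$(compiler_for "$src") $src"
done
```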
For more information on command-line options for each compiling system, see the manual pages (man pgf77, man pgcc, etc.).

Shared memory

The Portland Group compilers on the BALE cluster can automatically parallelize single-node sequential programs for shared-memory execution using the -Mconcur compiler option.

pgf77 -O2 -Mconcur sample.f
pgf90 -O2 -Mconcur sample.f90
pgcc -O2 -Mconcur sample.c
pgCC -O2 -Mconcur sample.C
In addition to automatic parallelization, both the Fortran and C/C++ compilers understand the OpenMP set of directives, which give the programmer finer control over the parallelization. The -mp compiler option enables OpenMP support.

A sample ksh batch script appears below. The request first copies a Fortran file from a subdirectory of the user's home to the temporary space. It then compiles the file for OpenMP threaded execution, runs the executable using 4 threads on 1 node, and copies the results back to the same subdirectory. Notice that the careful use of full file names allows this request to be submitted safely from any subdirectory.

#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe
cd $TMPDIR
cp $HOME/science/my_program.f .
pgf77 -O2 -mp my_program.f
export OMP_NUM_THREADS=4
./a.out > my_results
cp my_results $HOME/science

Message Passing Interface (MPI)

By default, the BALE cluster at OSC uses the MVAPICH implementation of the Message Passing Interface (MPI), built for the Portland Group compilers and optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on MPI, see the Training section of the OSC website.

Each program file using MPI must include the MPI header file. One of the following statements must appear near the beginning of each C or Fortran source file, respectively.

#include <mpi.h>
include 'mpif.h'
To compile an MPI program, use the MPI wrappers around the Portland Group compilers. Here are some examples:

mpif77 sample.f
mpif90 sample.f90
mpicc sample.c
mpiCC sample.C
Use the mpiexec command to run the resulting executable in a batch job; this command automatically determines how many processors to run on based on your batch request.

mpiexec a.out

Here is an example of an MPI job which uses 8 of the Infiniband-equipped nodes on the BALE cluster:

#PBS -l walltime=1:00:00
#PBS -l nodes=8:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe
cd $HOME/science
mpif77 -O3 mpiprogram.f
pbsdcp a.out $TMPDIR
cd $TMPDIR
mpiexec ./a.out > my_results
cp my_results $HOME/science

Jobs that request a large number of nodes (for instance, more than 10) are very difficult to schedule and may sit in the queue for a very long time. In practice it is best to start out requesting nodes=2:ppn=2 and then increase the number of nodes as you are able to confirm that your code's performance scales up with larger numbers of processors.

mpiexec will normally spawn one MPI process per CPU requested in a batch job. However, this behavior may be modified with the -pernode command-line option, which requests that one MPI process be spawned per node. This option is intended for codes which mix MPI message passing with some form of shared-memory programming model, such as OpenMP or POSIX threads.

If you wish to use fewer than the assigned number of processors, set the -n option to mpiexec to the required number. Here is an example:

#PBS -l walltime=1:00:00
#PBS -l nodes=5:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe
...
mpiexec -n 9 a.out # running 9 MPI processes

If you wish to run one MPI process on each node for benchmarking or multithreading purposes, you need to continue specifying ppn=2, but add the -pernode option to mpiexec. Here is an example:

#PBS -l walltime=1:00:00
#PBS -l nodes=5:ppn=2
#PBS -N my_job
#PBS -S /bin/ksh
#PBS -j oe
...
mpiexec -pernode a.out # running 5 MPI processes, one on each node

The pbsdcp command used in the 8-node example above is a distributed copy command; it copies the listed file or files to the specified destination (the last argument) on each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as /tmp or $TMPDIR.

Debugging

The GNU debugger gdb is recommended for interactive or post-mortem analysis of sequential programs. To debug a program with gdb, first compile the program with the -g option.

pgf77 -g program.f

To debug a program interactively, run the debugger on the appropriate executable.

gdb a.out

To analyze a core file after an unsuccessful execution, run the debugger on the core file and supply the executable that generated the file.

gdb a.out core

A graphical interface called ddd is also available for gdb. Data Display Debugger (DDD) is a graphical front-end for command-line debuggers such as gdb. As with gdb, the program must first be compiled with the -g option as given above. To debug a program interactively, run the debugger on the appropriate executable.

ddd a.out

Further information and documentation on DDD can be found at http://www.gnu.org/software/ddd.

The totalview debugger is designed to run on parallel programs using MPI, OpenMP, or pthreads. The user interacts with totalview via a graphical user interface (GUI).

All OSC clusters are designed to run compiled parallel code via the PBS batch system. Using the standard batch submission process, a user cannot interact directly with their running program. However, PBS also permits running in interactive batch mode. This allows the user to use GUI programs such as totalview to run a parallel code. The resource (memory, CPU) limits for an interactive batch job are the same as the standard batch limits for that user.

The following is a sample interactive batch script named mybatchfile:

#PBS -j oe
#PBS -N totalview
#PBS -S /bin/ksh
#PBS -l nodes=2:ppn=2
#PBS -l walltime=1:00:00
#PBS -v DISPLAY

There is no script section as this is intended to run interactively. The PBS lines are there to request resources. On the command line use qsub to request an interactive shell:

>> qsub -I mybatchfile
qsub: waiting for job xxxxx.bale-storage.ovl.osc.edu to start
qsub: job xxxxx.bale-storage.ovl.osc.edu ready

The same request may also be accomplished without a batch file by typing all the resource requests directly on the command line:

>> qsub -I -v DISPLAY -l nodes=2:ppn=2 -l walltime=1:00:00 -j oe -N totalview -S /bin/tcsh

Once you have an interactive shell on one of the compute nodes, you can treat it like any other shell; it additionally has the extended environment variables set by PBS, such as $PBS_O_WORKDIR and $TMPDIR. To invoke totalview, run mpiexec with the -tv option on your MPI program:

[baleXX]% mpiexec -tv myMPIprogram

For more information on using interactive batch, see the manual page for qsub. Within totalview, you can set breakpoints and examine variables on a per-process basis.
