Using the Parallel File System on the OSC Clusters

One of the back-end services provided by OSC's mass storage environment is a parallel file system, intended for use as high-performance, high-capacity shared temporary space. The parallel file system is supported by sixteen dedicated storage nodes, each with two 2.4 GHz Pentium 4 processors, 4 GB of memory, two Fibre Channel interfaces, and Gigabit Ethernet and InfiniBand network interfaces. The aggregate capacity of this parallel file system is approximately 128 terabytes. The software used to create the parallel file system is PVFS2 from Clemson University and Argonne National Laboratory.

Getting Started

The parallel file system is currently accessible from selected nodes on the following systems:

  • Glenn IBM 1350 Opteron cluster (glenn.osc.edu), on 965 nodes
  • Pentium 4 cluster (oscbw.osc.edu), on 112 nodes
  • Itanium 2 cluster (ia64.osc.edu), on 110 nodes
  • BALE cluster (bale-login.ovl.osc.edu), on 70 nodes

On nodes where the parallel file system is accessible, it is mounted at /fs/pvfs. The PBS batch system identifies these nodes by the node attribute pvfs. Files and directories on the parallel file system can be manipulated as on any other UNIX-style file system, so commands such as cd, mkdir, cp, and ls work as usual.
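As a quick check from a login or compute node, a script can test whether /fs/pvfs is actually mounted before relying on it. The following is a minimal sketch, not an OSC-provided tool: the function name pvfs_mounted is hypothetical, and it assumes a Linux node where /proc/mounts lists the mount point in the second field.

```shell
#!/bin/sh
# Hypothetical helper (not an OSC tool): succeed if a mount point is listed
# in a mounts table.
#   $1 = mount point to look for (default: /fs/pvfs, as stated above)
#   $2 = mounts table to read    (default: /proc/mounts, a Linux assumption)
pvfs_mounted() {
    mount_point=${1:-/fs/pvfs}
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line in /proc/mounts is the mount point.
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' \
        "$mounts_file" 2>/dev/null
}

if pvfs_mounted; then
    echo "parallel file system is mounted at /fs/pvfs"
else
    echo "parallel file system is not available on this node"
fi
```

Inside a batch job, requesting the pvfs node attribute remains the supported way to get nodes with the file system mounted; a check like this only helps interactive scripts fail early.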

To access the parallel file system from a batch job, you'll need to tell the batch system you intend to use it by adding a pvfs attribute to your job's nodes= request:

#PBS -l nodes=2:ppn=2:pvfs

A batch job that requests the pvfs node attribute has an additional environment variable set, $PFSDIR. Like $TMPDIR, it names a directory that exists only for the duration of the job. Unlike $TMPDIR, which is private to each node, $PFSDIR resides on the parallel file system and is accessible from all the nodes in your job.
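A job script that may run both with and without the pvfs attribute can select its scratch directory at run time. The sketch below is one way to do that; the function name choose_scratch and the final fallback to /tmp are assumptions made for illustration, not OSC conventions.

```shell
#!/bin/sh
# Hypothetical helper: print the scratch directory a job should use.
# Prefers $PFSDIR (set only when the job requested the pvfs attribute),
# then $TMPDIR (node-private), then /tmp (an assumed last resort).
choose_scratch() {
    if [ -n "${PFSDIR:-}" ]; then
        printf '%s\n' "$PFSDIR"
    elif [ -n "${TMPDIR:-}" ]; then
        printf '%s\n' "$TMPDIR"
    else
        printf '%s\n' /tmp
    fi
}

scratch=$(choose_scratch)
echo "using scratch directory: $scratch"
```

Keep in mind that only $PFSDIR is shared across nodes, so a multi-node job that falls back to $TMPDIR will see a separate directory on each node.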

Using the Parallel File System for Serial Jobs

For serial jobs requiring large amounts (more than 50 GB) of scratch space, the parallel file system should be used in place of locally attached temporary space. In these cases, the job should use $PFSDIR instead of $TMPDIR as its working directory. Here is an example:

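#PBS -N bigfile
#PBS -j oe
#PBS -l nodes=1:ppn=2:pvfs
#PBS -l walltime=10:00:00
# Stage the input file onto the parallel file system.
cd $HOME/myscience
cp input.dat $PFSDIR
# Run from $PFSDIR so that scratch files land on the parallel file system.
cd $PFSDIR
$HOME/myscience/bigfileapp
# Copy the results home before the job ends and $PFSDIR goes away.
cp output.dat $HOME/myscience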

For serial programs doing block (binary or unformatted) I/O to the parallel file system, transfer rates of up to 60 MB/s have been observed. For character I/O (e.g., Fortran formatted I/O or C printf()), transfer rates should be approximately 10-15 MB/s.
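These rates make it easy to estimate how long a job will spend on I/O, which helps when choosing a walltime. The sketch below is a back-of-the-envelope calculation only: the helper name transfer_seconds is hypothetical, 50 GB is taken as 50 x 1024 MB, and the rates are the best-case and low-end figures quoted above.

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to move SIZE_MB megabytes at
# RATE megabytes per second, rounded up to a whole second.
transfer_seconds() {
    size_mb=$1
    rate=$2
    echo $(( (size_mb + rate - 1) / rate ))
}

size_mb=$((50 * 1024))   # a 50 GB scratch file, assumed for illustration
echo "block I/O at 60 MB/s:     $(transfer_seconds "$size_mb" 60) s"
echo "character I/O at 10 MB/s: $(transfer_seconds "$size_mb" 10) s"
```

Under these assumptions, writing 50 GB takes roughly 14 minutes with block I/O and about 85 minutes with character I/O, so switching large outputs to unformatted I/O is usually worth the effort.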

Using the Parallel File System for MPI Parallel Jobs

The MPI-2 specification includes a section on parallel I/O, and most MPI implementations (including the MPICH/ch_gm implementation used on OSC's clusters) implement that interface. As a result, MPI programs on OSC's clusters can use the MPI parallel I/O interface (MPI_File_*()) to achieve higher I/O performance. The parallel file system is specifically tuned for this type of use. Here is an example of a parallel job using the parallel file system:

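#PBS -N mpi-io
#PBS -j oe
#PBS -l nodes=8:ppn=2:pvfs
#PBS -l walltime=24:00:00
cd $HOME/myscience
# Copy the executable to $TMPDIR on each node in the job.
pbsdcp parallel-io-app $TMPDIR
# Stage the input file onto the shared parallel file system.
cp input.dat $PFSDIR
# Run with $PFSDIR as the working directory.
cd $PFSDIR
mpiexec $TMPDIR/parallel-io-app
# Copy the results home before the job ends and $PFSDIR goes away.
cp output.dat $HOME/myscience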

Note that in this example, the executable run by the job is stored in $TMPDIR on each node, but the working directory for the program is $PFSDIR. Executables should not be stored on the parallel file system.
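A job script can guard against accidentally launching a binary from the parallel file system. The following is a minimal sketch: the function name is_on_pvfs is hypothetical, and the check is a plain prefix match on /fs/pvfs, so it assumes an absolute path and does not resolve symbolic links.

```shell
#!/bin/sh
# Hypothetical guard: succeed if the given absolute path lies on the
# parallel file system, judged only by its /fs/pvfs prefix.
is_on_pvfs() {
    case $1 in
        /fs/pvfs|/fs/pvfs/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Illustrative executable location, matching the example job above.
exe=${TMPDIR:-/tmp}/parallel-io-app
if is_on_pvfs "$exe"; then
    echo "refusing to run $exe from the parallel file system" >&2
else
    echo "$exe is outside /fs/pvfs"
fi
```

In a real job script, the refusing branch would typically exit with a nonzero status before mpiexec is reached.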

Caveats for Using the Parallel File System

Here are a few things to keep in mind when using the parallel file system:

  • The parallel file system is NOT backed up! It is to be used only for temporary storage. Do not keep files on it over the long term without also storing them in your home directory.
  • Do not store executables on the parallel file system! The mmap() support in the PVFS2 file system driver for the Linux kernel has some bugs that show up only for executables, and as a result executables stored on the parallel file system will work inconsistently or not at all. Keep program executables in your home directory or $TMPDIR.
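Because the parallel file system is not backed up, it can help to copy results home automatically when a job script exits rather than relying on a final cp being reached. The sketch below shows one way to do this; the function name save_results is hypothetical, and the file and directory names are the illustrative ones used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: copy the named files from a scratch directory to a
# destination directory and print how many were copied.
#   $1 = source directory, $2 = destination directory, remaining = file names
save_results() {
    from=$1
    to=$2
    shift 2
    copied=0
    mkdir -p "$to" || return 1
    for f in "$@"; do
        if [ -f "$from/$f" ]; then
            cp -p "$from/$f" "$to/" && copied=$((copied + 1))
        else
            echo "warning: $from/$f not found" >&2
        fi
    done
    echo "$copied"
}

# Only install the handler when the job actually has a $PFSDIR.
if [ -n "${PFSDIR:-}" ]; then
    trap 'save_results "$PFSDIR" "$HOME/myscience" output.dat' EXIT
fi
```

The EXIT trap runs when the script finishes or calls exit. Whether it also runs when the batch system terminates a job at its walltime limit depends on how the job is signaled, so treat it as a convenience rather than a guarantee.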

Links to More Information

OSC's Science and Technology Support Group has developed a workshop on parallel I/O techniques, including the use of the MPI-2 parallel I/O interface. The MPI-2 parallel I/O interface is also discussed in the PACS Intermediate MPI asynchronous course.

The PVFS2 website has links to several articles about the file system software.