HOWTO: IME, Cache & Burst Buffer for Scratch File System

The Owens and Pitzer clusters have access to the DDN Infinite Memory Engine (IME), a fast data tier between the compute nodes and the /fs/scratch file system. IME is a solid-state disk (SSD) based layer that can act as a cache and burst buffer to improve the performance of the scratch file system.

While some jobs will benefit from using IME, others will not. You should understand the IME workflow well because it is not always intuitive: files must be explicitly imported, synchronized, and/or purged.

When to Use IME

Benefits of Using IME

The most obvious reason for using IME is to speed up the I/O in your job. Jobs with heavy I/O may benefit from IME.

Another reason to use IME is to reduce the I/O load on the scratch file system. A job that writes a large amount of data to /fs/scratch in a short period of time may overload the servers and degrade performance for other users. Even if IME doesn’t improve performance for that job, it may have a system-wide benefit.

Limitations of IME

IME should not be used if many small files are being written or if files are opened and closed frequently. Metadata-intensive operations such as these are likely to be slower on IME than on the underlying file system.

We recommend testing thoroughly before using IME in production. If data is not managed properly (see below), it is possible to lose data or end up with corrupted files.

The IME system at OSC is still somewhat new and may not be as stable as the standard file systems.

IME Concepts

IME is a user-managed cache available on the Owens and Pitzer scratch file systems. Data in /fs/scratch is not automatically loaded into IME. Similarly, data written to IME is not automatically written to /fs/scratch. Here we define some terminology as a means of introducing important concepts.

resident

File data is resident in IME if it is present in the IME layer. It may or may not also exist in /fs/scratch. Data is made resident either by explicitly importing it with the IME command-line utility ime-ctl or by directly writing data to the IME.

clean

Data resident in IME is clean if it has not been modified since it was last imported or synchronized. Clean data matches the corresponding file data in /fs/scratch unless the file was modified in /fs/scratch. A file typically consists of many fragments, or chunks, some of which may be clean and others dirty.

dirty

Data resident in IME is dirty if it has been written or modified and not synchronized. Dirty data is not visible on /fs/scratch and may also not be visible from other IME clients.

import

A file in /fs/scratch may be imported into IME using the command-line utility ime-ctl. The file is then resident in IME; the file in /fs/scratch is unchanged. Reading a file through the IME interface without explicitly importing it does not make it resident in IME and is typically slower than reading it directly from /fs/scratch.

synchronize

Synchronization is the process of writing dirty file data from IME to /fs/scratch. The user must explicitly synchronize an IME-resident file, using the command-line utility ime-ctl, to make it visible on /fs/scratch. There is no automatic synchronization, with one exception: if IME gets too full, the system will synchronize and purge older resident data to make space for new data.

purge

File data resident in IME remains resident until it is purged. Files should be purged when they are no longer needed in IME, typically at the end of a job. Because IME is small (40 TB) compared to the size of /fs/scratch, all users are asked to purge their data from IME promptly unless there is a reason for keeping it resident. Purging a file from IME does not affect the corresponding file in /fs/scratch; however, unsynchronized data will be lost if it is purged. The command-line utility ime-ctl is used to purge data. The system may purge resident data if IME gets too full; dirty data is synchronized before purging.

FUSE

The FUSE (Filesystem in Userspace) interface is a POSIX mount point at /ime/scratch. The /ime/scratch directory is available only through batch jobs that explicitly request it (see below); it is not mounted on the login nodes. All files in /fs/scratch are visible in /ime/scratch, even when they are not resident in IME.

Relationship between IME and /fs/scratch

The interaction between IME and /fs/scratch is complex and sometimes not intuitive. Here are a few points to keep in mind.

  • All files and data on /fs/scratch are visible from IME even if they are not resident in IME.
  • All files in IME are visible from /fs/scratch, although the data is not available unless it is synchronized. The file size may not be correct, showing as 0 for unsynchronized files created and written in IME.
  • File permissions are the same in /ime/scratch and /fs/scratch. If a read-only file is created in IME, it can't be synchronized because the file in scratch can't be written. If a read-only file is imported into IME, it can't be purged from IME.
  • If a file is changed in /fs/scratch while it is resident in IME, results are unpredictable; data loss or corruption is likely. Purge the file from IME before modifying it in /fs/scratch.
  • Although files are not automatically synchronized or purged, the system may take these actions without warning if it becomes necessary to free up space in IME. Dirty data is synchronized before it is purged in this situation.
  • Recall that IME, with a capacity of 40 TB, is much smaller than /fs/scratch.
Warning: Never directly modify a file in /fs/scratch while it is resident in IME.
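
For a file that must be modified in /fs/scratch while it is resident in IME, the safe sequence is purge, modify, re-import. A minimal sketch, using a hypothetical file path and the ime-ctl options described later in this page:

# Run inside a job with the :ime specification; the path is hypothetical
ime-ctl -p /ime/scratch/MYGROUP/myname/input.dat      # 1. purge the resident data first
# 2. ... modify the file in /fs/scratch ...
ime-ctl -b -i /ime/scratch/MYGROUP/myname/input.dat   # 3. re-import (blocking) if still needed in IME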

How to Access IME

IME is available at OSC only through batch jobs, not on the login nodes.

There are three ways to access IME. We consider them in order of increasing programming effort and performance.

/ime/scratch FUSE filesystem

The easiest way to access IME is to use the directory /ime/scratch, a POSIX mount point known as the FUSE (Filesystem in Userspace) interface.

The /ime/scratch directory is mounted on a compute node only when a user job explicitly requests it by adding the :ime specification to a node request:

#PBS -l nodes=1:ppn=40:ime

All files in /fs/scratch will appear under /ime/scratch, but that doesn't mean the data is resident in IME. Accessing files on /fs/scratch through the /ime/scratch directory is slower than accessing them directly.
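
You can check whether a file's data is actually resident with ime-ctl -s (described below). A minimal check, with a hypothetical path:

# A file that was never imported or written through IME should show no resident bytes
ime-ctl -s /ime/scratch/MYGROUP/myname/somefile.dat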

See the Use Cases below for examples on how to access IME through the FUSE interface.

$IMEDIR environment variable

Every job that requests IME has an environment variable $IMEDIR pointing to the IME counterpart of the $PFSDIR job-temporary scratch directory. That is, if $PFSDIR is /fs/scratch/MYGROUP/myname/jobid, then $IMEDIR is /ime/scratch/MYGROUP/myname/jobid. Since $PFSDIR is deleted at the end of the job, $IMEDIR goes away as well.
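
For example, inside a job that requested the :ime specification (the input file is hypothetical):

# $IMEDIR mirrors $PFSDIR under the IME FUSE mount
echo $PFSDIR    # /fs/scratch/MYGROUP/myname/jobid
echo $IMEDIR    # /ime/scratch/MYGROUP/myname/jobid
cp input.dat $IMEDIR    # writing through the FUSE mount makes the data resident in IME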

MPI-IO API (or libraries built on this API)

Some of our MPI installations are built using an IME-specific version of ROMIO to provide high performance MPI-IO operations on IME. You must build and run your application using one of the special MPI modules to take advantage of this feature. This is not currently available at OSC.

IME Command Line Utility

The ime-ctl command is used to import, synchronize, and purge IME file data, as well as to check status. It works through the FUSE interface, /ime/scratch. You can include these commands in a batch job with the :ime specification on the nodes= line.

To manage your IME file data outside of your regular batch jobs, use an interactive batch job:

qsub -I -l nodes=1:ppn=1:ime

Hint: Optionally add -q debug to the qsub line to use the debug queue (jobs with walltime of 1 hour or less).

Following are some useful ime-ctl options. In all cases the file or directory name may be specified as either an absolute or a relative path and must be located in the /ime/scratch directory.

Add -R to make an operation recursive: ime-ctl -p -R mydirectory

help

ime-ctl -h

Displays help and usage message.

import

ime-ctl -i /ime/scratch/filename

Import the entire file, first purging any clean data already resident.

ime-ctl -i -K /ime/scratch/filename

Import the file data but keep existing clean data during the import operation.

Note: The import operation by default is nonblocking. Add -b to block on import completion.

ime-ctl -i -b /ime/scratch/filename

synchronize

ime-ctl -r /ime/scratch/filename

Synchronize the file data from IME to /fs/scratch , keeping the data in IME.

Note: The synchronize operation by default is nonblocking. Add -b to block on synchronize completion.

ime-ctl -r -b /ime/scratch/filename

purge

ime-ctl -p /ime/scratch/filename

Purge the file data from IME. Unsynchronized data will be lost. The file on /fs/scratch is not affected.

show fragment status

ime-ctl -s /ime/scratch/filename

Show fragment status of file data resident in IME. Fragments can be in one of four states: dirty, pending (in the process of synchronizing), clean, deletable (in the process of being purged).

Use Cases

This section gives detailed command sequences for some common situations where IME may be used. The examples use ppn=40 on Pitzer or ppn=28 on Owens; adjust the ppn value for the cluster you are using. How well these examples work depends strongly on the software being run, so you need to understand the I/O structure of your software.

Temporary files

Temporary files are those that are written and read back in but discarded at the end of a job. In this example we use the FUSE interface with the job-specific $IMEDIR directory. Files are written and read but not synchronized. At the end of the job $IMEDIR is automatically removed. This example was tested on Owens.

# Serial job that uses $IMEDIR for temporary files (read/write)
#PBS -N temporary_to_ime
#PBS -l nodes=1:ppn=28:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT} # put your primary account group, e.g. PAS1234
module load bwa

# Change to job-specific IME temporary directory
# $IMEDIR is automatically deleted at end of job
cd $PBS_O_WORKDIR
cp upstream1000.fa $IMEDIR
cd $IMEDIR

# Run program, assuming temporary files will be written to current directory
bwa index upstream1000.fa

# check status
ls -l
ime-ctl -s $IMEDIR/upstream1000.fa.bwt
# Note: No need to purge temporary files because directory will be deleted at end of job

ime-ctl -s will display the file's status, for example:

File: `/ime/scratch/{your file location}/4239XXX.owens-batch.ten.osc.edu/upstream1000.fa.bwt'
Number of bytes:
Dirty: 43886080
Clean: 0
Syncing: 0

As you can see, the file is entirely "Dirty", so you know that the data exists only on the IME side.

Output files to be kept

In this example, an output file is written to IME. At the end of the job, the file is synchronized to /fs/scratch and then purged from IME, so the output remains on the scratch file system.

# Serial job that writes output files to IME
#PBS -N output_to_ime
#PBS -l nodes=1:ppn=28:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT} # put your primary account group, e.g. PAS1234

module load bwa
cd $PBS_O_WORKDIR

# Create a working directory under IME and copy the input files there.
# (The input files could also be read from the regular file system.)
export IME_WORKDIR=/ime/scratch/{your file location}
mkdir $IME_WORKDIR
cp upstream1000.fa $IME_WORKDIR
cd $IME_WORKDIR

# Run program, assuming output will be written to current directory (IME directory)
bwa index upstream1000.fa

# Wait for synchronization to complete (blocking)
ime-ctl -b -R -r $IME_WORKDIR

# check status
ls -l
ime-ctl -s $IME_WORKDIR/upstream1000.fa.bwt

# Purge the file data from IME; the synchronized files remain on /fs/scratch
ime-ctl -p -R $IME_WORKDIR

Large read-only files used by multiple jobs

Some workflows involve large input files that are used by many jobs but never modified. This example keeps the input file resident in IME, re-importing it as necessary. If the file is ever changed, it must be manually purged and re-imported; this is not part of the job workflow (see the sketch after the script).

# Job with large read-only input file used by multiple jobs
#PBS -N read_only_to_ime
#PBS -l nodes=1:ppn=28:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT} # put your primary account group, e.g. PAS1234

module load bwa
cd $PBS_O_WORKDIR

# We assume the input file is located in /fs/scratch; we then import it into IME
export IME_WORKDIR=/ime/scratch/{your file location}
export INPUTFILE=/fs/scratch/{your file location}/upstream1000.fa

# Get IME FUSE path for input file (changes /fs to /ime)
export IME_INPUTFILE=$(echo $INPUTFILE | sed 's/^\/fs/\/ime/')

# import to IME
ime-ctl -b -i $IME_INPUTFILE

cd $IME_WORKDIR

# Run program
bwa index upstream1000.fa

# check status
ls -l
ime-ctl -s $IME_WORKDIR/upstream1000.fa.bwt
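
If the input file is ever changed in /fs/scratch, refresh IME manually before the next job runs. A minimal sketch, run in an interactive job with the :ime specification and assuming the same $IME_INPUTFILE path as above:

# Purge the stale resident data first
ime-ctl -p $IME_INPUTFILE
# ... update the file in /fs/scratch ...
# Re-import the updated file (blocking)
ime-ctl -b -i $IME_INPUTFILE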

Checkpoint files

Checkpoint files are written by a program to allow restart in case the program is terminated before completion. If the program completes, the checkpoint files are discarded. Checkpoint files should not be written to $IMEDIR because they need to persist beyond the end of the job. In this example, if the script does not complete, the checkpoint files are left in IME; they must be manually recovered (synchronized) or purged, or used directly from IME by a subsequent job.

# Job that writes checkpoint files to IME
#PBS -N checkpoint_to_ime
#PBS -l nodes=1:ppn=40:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT} # put your primary account group, e.g. PAS1234

module load qchem/5.1.1-openmp
module list

export CKPTDIR=/fs/scratch/{your file location}/ckptdir
mkdir $CKPTDIR

# Get IME FUSE path for checkpoint directory (changes /fs to /ime)
export IME_CKPTDIR=$(echo $CKPTDIR | sed 's/^\/fs/\/ime/')

# Set checkpoint path in Q-Chem
export QCSCRATCH=$IME_CKPTDIR

# Run program, writing checkpoint files to $IME_CKPTDIR
cd $PBS_O_WORKDIR
qchem -save -nt $PBS_NP  HF_water.in HF_water.out HF_water

# If the program completed successfully, purge the checkpoint data from IME
# before deleting the checkpoint directory from /fs/scratch
retVal=$?
if [ $retVal -eq 0 ]; then
    ime-ctl -p -R $IME_CKPTDIR
    rm -r $CKPTDIR
fi
exit $retVal

# Note: If the program did not complete successfully or the job was killed, the
#   checkpoint files will remain in IME; they can be synchronized and purged
#   manually (see the sketch below) or used directly from IME by a subsequent job.
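
A minimal sketch of the manual recovery, run in an interactive job (qsub -I -l nodes=1:ppn=1:ime) and assuming the same checkpoint location as above:

# Recreate the FUSE path for the checkpoint directory
export IME_CKPTDIR=/ime/scratch/{your file location}/ckptdir
# Synchronize the dirty checkpoint data to /fs/scratch (blocking, recursive)
ime-ctl -b -R -r $IME_CKPTDIR
# Then purge the checkpoint data from IME
ime-ctl -p -R $IME_CKPTDIR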

 
