The Owens and Pitzer clusters have access to the DDN Infinite Memory Engine (IME), a fast data tier between the compute nodes and the /fs/scratch
file system. IME is a solid-state disk (SSD) based layer that acts as a cache and burst buffer to improve the performance of the scratch file system.
While some jobs will benefit from using IME, others will not. Because the IME workflow is not always intuitive, you should understand it well before relying on it: files must be explicitly imported, synchronized, and/or purged.
When to Use IME
Benefits of Using IME
The most obvious reason for using IME is to speed up the I/O in your job. Jobs with heavy I/O may benefit from IME.
Another reason to use IME is to reduce the I/O load on the scratch file system. A job that writes a large amount of data to /fs/scratch
in a short period of time may overload the servers and degrade performance for other users. Even if IME doesn’t improve performance for that job, it may have a system-wide benefit.
Limitations of IME
IME should not be used if many small files are being written or if files are opened and closed frequently. Metadata-intensive operations such as these are likely to be slower on IME than on the underlying file system.
We recommend testing thoroughly before using IME for production. If data is not managed properly (see below), it is possible to lose data or end up with corrupted files.
The IME system at OSC is still somewhat new and may not be as stable as the standard file systems.
IME Concepts
IME is a user-managed cache available on the Owens and Pitzer scratch file systems. Data in /fs/scratch is not automatically loaded into IME; similarly, data written to IME is not automatically written to /fs/scratch. Here we define some terminology as a means of introducing important concepts.
resident
File data is resident in IME if it is present in the IME layer. It may or may not also exist in /fs/scratch. Data is made resident either by explicitly importing it with the IME command-line utility ime-ctl or by writing data directly to IME.
clean
Data resident in IME is clean if it has not been modified since it was last imported or synchronized. Clean data matches the corresponding file data in /fs/scratch unless the file was modified in /fs/scratch. A file typically consists of many fragments, or chunks, some of which may be clean and others dirty.
dirty
Data resident in IME is dirty if it has been written or modified and not synchronized. Dirty data is not visible on /fs/scratch and may also not be visible from other IME clients.
import
A file in /fs/scratch may be imported into IME using the command-line utility ime-ctl. The file is then resident in IME; the file in /fs/scratch is unchanged. Reading a file through the IME interface without explicitly importing it does not import it or make it resident in IME, and is typically slower than reading it directly from /fs/scratch.
synchronize
Synchronization is the process of writing dirty file data from IME to /fs/scratch. The user must explicitly synchronize an IME-resident file, using the command-line utility ime-ctl, to make it visible on /fs/scratch. There is no automatic synchronization, with one exception: if IME gets too full, the system will synchronize and purge older resident data to make space for new data.
purge
File data resident in IME remains resident until it is purged. Files should be purged when they are no longer needed in IME, typically at the end of a job. Because IME is small (40 TB) compared to the size of /fs/scratch, all users are asked to purge their data from IME promptly unless there is a reason for keeping it resident. Purging a file from IME does not affect the corresponding file in /fs/scratch, but unsynchronized data will be lost if it is purged. The command-line utility ime-ctl is used to purge data. The system may purge resident data if IME gets too full; dirty data is synchronized before purging.
FUSE
The FUSE (Filesystem in Userspace) interface is a POSIX mount point at /ime/scratch. The /ime/scratch directory is available only through batch jobs that explicitly request it (see below); it is not mounted on the login nodes. All files in /fs/scratch are visible in /ime/scratch, even when they are not resident in IME.
Relationship between IME and /fs/scratch
The interaction between IME and /fs/scratch is complex and sometimes not intuitive. Here are a few points to keep in mind.
- All files and data on /fs/scratch are visible from IME even if they are not resident in IME.
- All files in IME are visible from /fs/scratch, although data is not available unless it is synchronized. File size may not be correct, showing as 0 for unsynchronized files created and written in IME.
- File permissions are the same in /ime/scratch and /fs/scratch. If a read-only file is created in IME, it can't be synchronized because the file in scratch can't be written. If a read-only file is imported into IME, it can't be purged from IME.
- If a file is changed in /fs/scratch while it is resident in IME, results are unpredictable. Data loss or corruption is likely. Purge the file from IME before modifying it in /fs/scratch.
- Although files are not automatically synchronized or purged, the system may take these actions without warning if it becomes necessary to free up space in IME. Dirty data is synchronized before it is purged in this situation.
- Recall that IME is much smaller than /fs/scratch, with a capacity of 40 TB.
How to Access IME
IME is available at OSC only through batch jobs, not on the login nodes.
There are three ways to access IME. We consider them in order of increasing programming effort and performance.
/ime/scratch FUSE filesystem
The easiest way to access IME is to use the directory /ime/scratch, a POSIX mount point known as the FUSE (Filesystem in Userspace) interface.
The /ime/scratch directory is mounted on a compute node only when a user job explicitly requests it by adding the :ime specification to a node request:
#PBS -l nodes=1:ppn=40:ime
All files in /fs/scratch will appear under /ime/scratch, but that does not mean the data is resident in IME. Accessing files on /fs/scratch through the /ime/scratch directory is slower than accessing them directly.
See the Use Cases below for examples on how to access IME through the FUSE interface.
$IMEDIR environment variable
Every job that requests IME has an environment variable $IMEDIR pointing to the IME counterpart of the $PFSDIR job-temporary scratch directory. That is, if $PFSDIR is /fs/scratch/MYGROUP/myname/jobid, then $IMEDIR is /ime/scratch/MYGROUP/myname/jobid. Since $PFSDIR is deleted at the end of the job, $IMEDIR goes away as well.
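The mapping between the two directories is a simple prefix substitution, the same one the use-case scripts below apply with sed. Here is a minimal sketch in plain shell; the MYGROUP/myname/jobid path is illustrative, not a real job directory:

```shell
# Derive the IME counterpart of a scratch path by replacing the leading
# /fs with /ime -- the same mapping that turns $PFSDIR into $IMEDIR.
# The job directory shown is illustrative.
PFSDIR=/fs/scratch/MYGROUP/myname/1234567
IMEDIR=$(echo "$PFSDIR" | sed 's|^/fs|/ime|')
echo "$IMEDIR"   # /ime/scratch/MYGROUP/myname/1234567
```

The `|` delimiter in the sed expression avoids having to escape the slashes in the path.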
MPI-IO API (or libraries built on this API)
High-performance MPI-IO operations on IME require an MPI build that uses an IME-specific version of ROMIO; you must build and run your application with such an MPI module to take advantage of this feature. This is not currently available at OSC.
IME Command Line Utility
The ime-ctl command is used to import, synchronize, and purge IME file data, as well as to check status. It works through the FUSE interface, /ime/scratch. You can include these commands in a batch job with the :ime specification on the nodes= line.
To manage your IME file data outside of a regular batch job, use an interactive batch job:
qsub -I -l nodes=1:ppn=1:ime
Hint: Optionally add -q debug to the qsub line to use the debug queue (jobs with walltime of 1 hour or less).
Following are some useful ime-ctl options. In all cases the file or directory name may be specified as either an absolute or a relative path and must be located in the /ime/scratch directory.
Add -R to make an operation recursive: ime-ctl -p -R mydirectory
help
ime-ctl -h
Displays help and usage message.
import
ime-ctl -i /ime/scratch/filename
Import the entire file, first purging any clean data already resident.
ime-ctl -i -K /ime/scratch/filename
Import the file data but keep existing clean data during the import operation.
Note: The import operation is nonblocking by default. Add -b to block on import completion.
ime-ctl -i -b /ime/scratch/filename
synchronize
ime-ctl -r /ime/scratch/filename
Synchronize the file data from IME to /fs/scratch, keeping the data in IME.
Note: The synchronize operation is nonblocking by default. Add -b to block on synchronize completion.
ime-ctl -r -b /ime/scratch/filename
purge
ime-ctl -p /ime/scratch/filename
Purge the file data from IME. Unsynchronized data will be lost. The file on /fs/scratch is not affected.
show fragment status
ime-ctl -s /ime/scratch/filename
Show fragment status of file data resident in IME. Fragments can be in one of four states: dirty, pending (in the process of synchronizing), clean, deletable (in the process of being purged).
Use Cases
This section gives detailed command sequences for some common situations where IME may be used. Examples are for Pitzer but are easily adapted to Owens by changing ppn=40 to ppn=28. The examples depend strongly on the software being used, so you need to understand the I/O structure of your application.
Temporary files
Temporary files are those that are written and read back in but discarded at the end of a job. In this example we use the FUSE interface with the job-specific $IMEDIR directory. Files are written and read but not synchronized. At the end of the job $IMEDIR is automatically removed. This example was tested on Owens.
# Serial job that uses $IMEDIR for temporary files (read/write)
#PBS -N temporary_to_ime
#PBS -l nodes=1:ppn=40:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT}    # put your primary account group, like PAS1234

module load bwa

# Change to job-specific IME temporary directory
# $IMEDIR is automatically deleted at end of job
cd $PBS_O_WORKDIR
cp upstream1000.fa $IMEDIR
cd $IMEDIR

# Run program, assuming temporary files will be written to current directory
bwa index upstream1000.fa

# Check status
ls -l
ime-ctl -s $IMEDIR/upstream1000.fa.bwt

# Note: No need to purge temporary files because directory will be deleted at end of job
ime-ctl -s will display the file's status, for example:

File: `/ime/scratch/{your file location}/4239XXX.owens-batch.ten.osc.edu/upstream1000.fa.bwt'
Number of bytes:
  Dirty: 43886080
  Clean: 0
  Syncing: 0

As you can see, the file is completely "Dirty", so you know its data exists only on the IME side.
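Since this status listing is plain text, a job script can act on it automatically. The following is a hedged sketch that parses the byte counters with awk; the sample status text is embedded so the snippet runs anywhere, but on the cluster you would capture the output of ime-ctl -s instead:

```shell
# Parse ime-ctl -s style output and report whether the file is fully dirty,
# i.e. its data exists only in IME and must be synchronized before purging.
# The sample status text mirrors the example output shown above; in a real
# job you would use: status=$(ime-ctl -s $IMEDIR/upstream1000.fa.bwt)
status='Number of bytes:
  Dirty: 43886080
  Clean: 0
  Syncing: 0'

dirty=$(echo "$status"   | awk '/Dirty:/   {print $2}')
clean=$(echo "$status"   | awk '/Clean:/   {print $2}')
syncing=$(echo "$status" | awk '/Syncing:/ {print $2}')

if [ "$dirty" -gt 0 ] && [ "$clean" -eq 0 ] && [ "$syncing" -eq 0 ]; then
    echo "fully dirty"   # data is only in IME; synchronize (ime-ctl -r) before any purge
fi
```

This kind of check can guard a purge step so that unsynchronized data is never discarded by accident.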
Output files to be kept
In this example an output file is written to IME. At the end of the job the file is synchronized and then purged from IME, remaining on the scratch file system.
# Serial job that writes output files to IME
#PBS -N output_to_ime
#PBS -l nodes=1:ppn=28:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT}    # put your primary account group, like PAS1234

module load bwa

cd $PBS_O_WORKDIR

# Create working directory under IME, and copy input files.
# The input files can be read from the regular file system as well.
export IME_WORKDIR=/ime/scratch/{your file location}
mkdir $IME_WORKDIR
cp upstream1000.fa $IME_WORKDIR
cd $IME_WORKDIR

# Run program, assuming output will be written to current directory (IME directory)
bwa index upstream1000.fa

# Wait for synchronization to complete (blocking)
ime-ctl -b -R -r $IME_WORKDIR

# Check status
ls -l
ime-ctl -s $IME_WORKDIR/upstream1000.fa.bwt

# Purge the output from IME; the synchronized copy remains on /fs/scratch
ime-ctl -p -R $IME_WORKDIR
Large read-only files used by multiple jobs
Some workflows involve large input files that are used by many jobs but never modified. This example keeps the input file resident in IME, re-importing it as necessary. If the file is ever changed it must be manually purged or completely reloaded; this is not part of the job workflow.
# Job with large read-only input file used by multiple jobs
#PBS -N read_only_to_ime
#PBS -l nodes=1:ppn=28:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT}    # put your primary account group, like PAS1234

module load bwa

cd $PBS_O_WORKDIR

# We assume the input file is located in /fs/scratch, then we import to IME
export IME_WORKDIR=/ime/scratch/{your file location}
export INPUTFILE=/fs/scratch/{your file location}/upstream1000.fa

# Get IME FUSE path for input file (changes /fs to /ime)
export IME_INPUTFILE=$(echo $INPUTFILE | sed 's/^\/fs/\/ime/')

# Import to IME (blocking)
ime-ctl -b -i $IME_INPUTFILE

cd $IME_WORKDIR

# Run program
bwa index upstream1000.fa

# Check status
ls -l
ime-ctl -s $IME_WORKDIR/upstream1000.fa.bwt
Checkpoint files
Checkpoint files are written by a program to allow restart in case the program is terminated before completion. If the program completes, the checkpoint files are discarded. Checkpoint files should not be written to $IMEDIR because they need to persist beyond the end of the job. In this example the checkpoint files are left in IME if the script does not complete; they must then be manually recovered or purged, or used directly from IME by a subsequent job.
# Job that writes checkpoint files to IME
#PBS -N checkpoint_to_ime
#PBS -l nodes=1:ppn=40:ime
#PBS -l walltime=1:00:00
#PBS -A {MYACCT}    # put your primary account group, like PAS1234

module load qchem/5.1.1-openmp
module list

export CKPTDIR=/fs/scratch/{your file location}/ckptdir
mkdir $CKPTDIR

# Get IME FUSE path for checkpoint directory (changes /fs to /ime)
export IME_CKPTDIR=$(echo $CKPTDIR | sed 's/^\/fs/\/ime/')

# Set checkpoint path in Q-Chem
export QCSCRATCH=$IME_CKPTDIR

# Run program, writing checkpoint files to $IME_CKPTDIR
cd $PBS_O_WORKDIR
qchem -save -nt $PBS_NP HF_water.in HF_water.out HF_water

# If program completed successfully, delete checkpoint files
retVal=$?
if [ $retVal -eq 0 ]; then
    rm -r $CKPTDIR
fi
exit $retVal

# Note: If program did not complete successfully or job was killed, checkpoint
# files will remain in IME and can be synchronized and purged manually.
# Or they can be used directly from IME by a subsequent job.