Storage Environment at OSC

OSC has over two petabytes (PB) of disk storage capacity distributed over several file systems, plus almost 2PB of backup tape storage. (A petabyte is 10^15, or a quadrillion, bytes.) This guide describes the various storage environments, their characteristics, and their uses.

Storage Hardware

The storage at OSC consists of servers, data storage subsystems, and networks providing a number of storage services to OSC HPC systems. The current configuration consists of:

  • NetApp CE5400 storage server
  • Hitachi AMS1000 storage
  • Two DataDirect Networks 9900 storage systems
  • local disk storage on each compute node
  • One IBM 3584 tape robot:
    • 16 LTO tape drives
    • 1900 TB (raw capacity) of LTO tapes
  • 18 home directory servers with a total capacity of 360 TB
  • 16 project directory servers with a total capacity of 660 TB
  • 10 GPFS servers with total usable space of 400 TB
  • DDN EXAScaler/SFA10K Lustre file system with 569 TB of usable space

 

File System Usage

OSC has several different file systems where you can create files and directories. The characteristics of those systems and the policies associated with them determine their suitability for any particular purpose. This section describes the characteristics and policies that you should take into consideration in selecting a file system to use.

The various file systems are described in subsequent sections.

Visibility

Most of our file systems are shared. Directories and files on the shared file systems are accessible from all OSC HPC systems. By contrast, local storage is visible only on the node it is located on. Each compute node has a local disk with scratch file space.

Permanence

Some of our storage environments are intended for long-term storage; files are never deleted by the system or OSC staff. Some are intended as scratch space, with files deleted as soon as the associated job exits. Others fall somewhere in between, with expected data lifetimes of a few months to a couple of years.

Backup policies

Some of the file systems are backed up to tape; some are considered temporary storage and are not backed up. Backup schedules differ for different systems.

In no case do we make an absolute guarantee about our ability to recover data. Please read the official OSC data management policies for details. That said, we have never lost backed-up data and have rarely had an accidental loss of non-backed-up data.

Size/Quota

The permanent (backed-up) file systems all have quotas limiting the amount of file space and the number of files that each user or group can use. Your usage and quota information are displayed every time you log in to one of our HPC systems. You can also check them using the quota command. We encourage you to pay attention to these numbers because your file operations, and probably your compute jobs, will fail if you exceed them.
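
For example, from any login node you can check your current usage and limits with the standard quota command (the exact output format may vary between file servers; the -s flag simply prints sizes in human-readable units):

$ quota -s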

Scratch space on local disks doesn’t have a quota, but it is limited in size. If you have extremely large files, you will have to pay attention to the amount of local file space available on different compute nodes.

Performance

File systems have different performance characteristics including read/write speeds and behavior under heavy load. Performance matters a lot if you have I/O-intensive jobs. Choosing the right file system can have a significant impact on the speed and efficiency of your computations. You should never do heavy I/O in your home or project directories, for example.

Available File Systems

Home Directories

Each user ID has a home directory on one of the NFS shared file systems. You have the same home directory regardless of what system you’re on, including all login nodes and all compute nodes, so your files are accessible everywhere. Most of your work in the login environment will be done in your home directory.

OSC currently has 18 home directory file servers. The absolute path to the home directory for user ID usr1234 will have the form /nfs/nn/usr1234, where nn is a 2-digit number. The environment variable $HOME is the absolute path to your home directory.
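
For example, printing $HOME shows which file server holds your home directory (the server number 17 below is purely illustrative):

$ echo $HOME
/nfs/17/usr1234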

The default permissions on home directories for academic projects allow anyone with an OSC HPC account to read your files, although only you have write permission. You can change the permissions if you want to restrict access. Home directories for accounts on commercial projects are slightly more restrictive, and only allow the owning account and the project group to see the files by default.

Each user has a quota of 500 gigabytes (GB) of storage and 1,000,000 files. This quota cannot be increased. If you have many small files, you may reach the file limit before you reach the storage limit. In this case we encourage you to tar or zip your files or directories, creating an archive. If you approach your storage limit, you should delete any unneeded files and consider compressing your files using bzip2 or gzip. You can archive/unarchive/compress/uncompress your files inside a batch script, using scratch storage that is not subject to quotas, so your files remain conveniently usable. As always, contact OSC Help if you need assistance.
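
As a minimal sketch, the following batch script fragment builds a compressed archive of a home-directory folder in scratch space and then copies the finished archive back to permanent storage (the directory name my_results is only a placeholder; $TMPDIR is the per-job scratch directory described under Local Disk below):

cd $TMPDIR
# build the compressed archive in scratch space, outside the home directory quota
tar -czf my_results.tar.gz -C $HOME my_results
# copy the finished archive back to permanent storage
cp my_results.tar.gz $HOME/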

Home directories are considered permanent storage. Accounts that have been inactive for 18 months may be archived, but otherwise there is no automatic deletion of files.

All files in the home directories are backed up daily. Two copies of files in the home directories are written to tape in the tape library.

Access to home directories is relatively slow compared to local or parallel file systems. Batch jobs should not perform heavy I/O in the home directory tree because 1) it will slow down your job and 2) the home directory file servers don’t handle heavy loads gracefully. Instead you should copy your files to fast local storage and run your program there.

Project Directories

For projects that require more than 500GB storage and/or more than 1,000,000 files, additional storage space is available. Principal Investigators should contact OSC Help to request additional storage in the "project" space outside the home directory. Allocations of one to five terabytes are typical. Small allocations can be granted by OSC staff; for large allocations you will have to submit a proposal to the Statewide Users’ Group (SUG).

Project directories are created on the GPFS filesystem. The absolute path to the project directory for project PRJ0123 will have the following form: /nfs/gpfs/PRJ0123.

Default permissions on a project directory allow read and write access by all members of the group, with deletion restricted to the file owner. (OSC projects correspond to Linux groups.)

The quota on the project space is shared by all members of the project and corresponds to the allocation that was granted.  It is typically 1-5TB with a limit of 1,000,000 files.

Project space is allocated for a specific period of time, usually one to three years. At the end of that time you may apply for an extension.

All files in the project directories are backed up daily, with a single copy written to tape.

The recommendations for archiving and compressing files are the same for project directories as for home directories.

Comments about access speed and file server load for home directories apply also to project directories. Batch jobs should not perform heavy I/O in a project directory.

Local Disk

Each compute node has a local disk used for scratch storage. This space is not shared with any other system or node.

The batch system creates a temporary directory for each job on each node assigned to the job. The absolute path to this directory is in the environment variable $TMPDIR. The directory exists only for the duration of the job; it is automatically deleted by the batch system when the job ends. Temporary directories are not backed up.

$TMPDIR is a large area where users may execute codes that produce large intermediate files. Local storage has the highest performance of any of the file systems because data does not have to be sent across the network and handled by a file server. Typical usage is to copy input files, and possibly executable files, to $TMPDIR at the beginning of the job and copy output files to permanent storage at the end of the job. See the batch processing documentation for more information.
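A minimal sketch of this pattern inside a batch script follows (the program name a.out and the file names are placeholders; see the batch processing documentation for complete job script examples):

# stage the executable and input to fast local disk
cp $HOME/a.out $HOME/input.dat $TMPDIR/
cd $TMPDIR
# run from local scratch
./a.out input.dat output.dat
# copy results back to permanent storage before the job ends
cp output.dat $HOME/
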
The size of the temporary file space on each Oakley node is 812GB; on Glenn it is 392GB. This area is used for spool space for stdout and stderr from batch jobs as well as for $TMPDIR. If your job requests less than the entire node, you will be sharing this space with other jobs, although each job has a unique directory in $TMPDIR.

Please use $TMPDIR and not /tmp on the compute nodes to ensure proper cleanup.

The login nodes have local scratch space in /tmp. This area is not backed up, and the system removes files last accessed more than 24 hours previously.

Parallel File System

OSC provides a Lustre parallel file system for use as high-performance, high-capacity, shared temporary space. The current capacity of the parallel file system is about 600TB.

The parallel file system is visible from all OSC HPC systems and all compute nodes at /fs/lustre. It can be used as either batch-managed scratch space or as user-managed temporary space. There is no quota on this system.

The Lustre system replaces the PVFS2 system that was previously available at OSC. There is no need for a special flag such as the :pvfs feature that was used in the past.

The batch system creates a scratch directory for each job on the parallel file system. The absolute path to this directory is in the environment variable $PFSDIR. This directory is shared across nodes. It exists only for the duration of the job and is automatically deleted by the batch system when the job ends.
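
A minimal sketch of using $PFSDIR for a job whose processes on several nodes must read and write shared scratch files (the program and file names are placeholders, and an MPI program launched with mpiexec is assumed):

# stage input to the job's shared parallel scratch directory
cp $HOME/big_input.dat $PFSDIR/
cd $PFSDIR
# every node in the job sees the same $PFSDIR directory
mpiexec $HOME/parallel_app big_input.dat
# copy results back to permanent storage before the job ends
cp results.dat $HOME/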

Users may also create their own directories under /fs/lustre. Please name the directory with either your user name or your project ID, for example, /fs/lustre/usr1234 or /fs/lustre/PRJ0123. This is a good place to store large amounts of temporary data that you need to keep for up to a few months. Files that have not been accessed for some period of time, currently six months, may be deleted. Check OSC’s data management policy for the official deletion schedule. While this system has been extremely reliable, it should be used only for data that you can regenerate or that you have another copy of. It is not backed up.
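
For example, to create and populate a user-managed directory (substitute your own user name or project ID):

$ mkdir /fs/lustre/usr1234
$ cp large_dataset.tar /fs/lustre/usr1234/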

The parallel file system is a high-performance file system that can handle high loads. It should be used by parallel jobs that perform heavy I/O and require a directory that is shared across all nodes. It is also suitable for jobs that require more scratch space than is available locally. Note, however, that local disk access is faster than any shared file system, so use local disk whenever possible.

The Lustre file system is optimized for reads and writes done in large blocks, preferably at least 4MB. Performing many small operations, or using large numbers of very small files, will result in poor performance.
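
As a rough illustration, copying a file with a large block size keeps each operation in Lustre-friendly chunks (the 4MB block size below is only an example, and the paths are placeholders):

$ dd if=$HOME/input.dat of=/fs/lustre/usr1234/input.dat bs=4M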

You should not store executables on the parallel file system. Keep program executables in your home or project directory or in $TMPDIR.

Those interested in striping should consult our Parallel Scratch Space Striping Guide.

File Deletion Policy

The parallel file system is temporary storage, and it is not backed up. Data stored on this system is not recoverable if it is lost for any reason, including user error or hardware failure. Data that have not been accessed for more than 180 days will be removed from the system every Wednesday.  

If you need an exemption to the deletion policy, please contact OSC Help in a timely manner and include the following information:

  1. Your OSC HPC username
  2. Path of directories/files that need exemption to file deletion
  3. Duration: from MM/DD/YY to MM/DD/YY (The max exemption duration is 180 days)
  4. Detailed justification

Parallel Scratch Space Striping Guide

Lustre is a file system known for its capabilities in high-performance environments. A central factor in its ability to achieve such high performance is striping, where individual files are transparently split and stored on multiple underlying targets known as object storage targets (OSTs). The distributed nature of the file system allows for faster read and write performance than is possible with traditional file systems.

The Parallel Scratch file system, located at /fs/lustre, uses Lustre.

This guide provides a high-level overview of Lustre's striping capabilities and basic commands. Users who have questions or need assistance with striping should contact OSC Help. Advanced users may want to consult the official Lustre striping documentation, on which parts of this guide are based.

Why Stripe?

Advantages:

  • Increased bandwidth - By distributing a single file amongst multiple OSTs, a higher aggregate I/O bandwidth is possible. This is particularly important for situations where large numbers of processes will be doing I/O simultaneously to a single file, as is seen in some parallel programs (MPI I/O).
  • Required for large files - Files unable to fit on a single OST must be striped so they can be distributed amongst multiple OSTs. The file size limit for a file with no striping depends on the available space on each OST. A file would need to be multiple TBs in size in order to hit this limit at OSC.

Disadvantages:

  • Increased overhead - Managing the increased complexity of a distributed file requires additional overhead. This is especially noticeable with smaller files, which can see worse performance due to striping.
  • Increased risk - Spreading a file amongst more OSTs increases the chance that a hardware failure could corrupt your file. A file striped across 20 OSTs will be partially corrupted by a hardware failure on any of those OSTs. Contrast that with a file that is not striped, which is only vulnerable to a hardware failure on one OST.

Do I Need to Stripe?

Users who wish to stripe their files are encouraged to contact OSC Help to discuss their needs.

Large Files Should Be Striped

Large files should have higher stripe counts to increase the aggregate bandwidth possible and to allow for concurrent I/O from multiple processes.  

Files larger than 500GB must be striped to prevent load imbalances between OSTs.

Small Files Should Not Be Striped

The associated overhead of striping makes striping a bad choice for small files.

Generally speaking, large numbers of small files should be kept off the parallel file system, as they degrade the performance of the file system. In particular, full software installations should be kept off the parallel scratch file system.

Still Not Sure?

Those who are unsure whether they need to stripe their data should contact OSC Help for assistance.  

Striping Basics

There are three basic parameters of the striping process that users can modify:

  • Stripe size - Size of each stripe written on an OST, in bytes
  • Stripe count - Number of OSTs to stripe across
  • Stripe offset - Index of OST for first stripe
    • A stripe offset of -1 will allow the system to automatically choose the first OST to write to.  Do not change this without consulting OSC Help.

Each file on a Lustre file system has a value for each parameter. If a file does not have these parameters explicitly set before it is written to, it will inherit those of its parent directory.

Striping parameters cannot be modified after the file is written to!  Double check that striping parameters are set properly before writing to or moving large files on the parallel scratch file system.

OSC's Default Striping Settings

/fs/lustre has the following default striping parameters set:

  • Stripe size: 1048576 (1MB)
  • Stripe count: 1
  • Stripe offset: -1

The stripe count of 1 effectively turns off striping by default.  All sub-directories and files of /fs/lustre inherit these striping parameters unless set otherwise.
The stripe offset of -1 allows the system to best choose which OST to write the first stripe to.  Do not change it without discussing your needs with OSC Help.

Lustre Commands

List Striping Settings

The lfs getstripe command returns information on the stripe settings for a file or directory.

lfs getstripe has the following syntax:

lfs getstripe file/directory

Calling lfs getstripe on a directory will return the stripe settings for the directory, as well as for any files and sub-directories it contains.

$ lfs getstripe /fs/lustre
/fs/lustre
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
/fs/lustre/user123
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1 pool:
[...]

Calling lfs getstripe on a file returns the striping information for that file.

$ lfs getstripe /fs/lustre/example_striping
/fs/lustre/example_striping
lmm_stripe_count:   40
lmm_stripe_size:    1073741824
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  22
obdidx           objid           objid           group
22       637584644     0x2600c504                0
2       637178540     0x25fa92ac                0
[...]

Adding -d returns the stripe information for the specified directory only, without listing its contents.

$ lfs getstripe -d /fs/lustre
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1

Set Striping Settings

The lfs setstripe command sets striping parameters for a file or directory.

lfs setstripe has the following syntax:

lfs setstripe -s stripe_size -c stripe_count -o stripe_offset file/directory

Parameters

lfs setstripe takes the following parameters:

  • stripe_size - Size of each chunk in bytes. Setting it to 0 uses the default stripe size. Units can be specified with k, m, or g.
  • stripe_count - Number of OSTs to stripe a file across. Setting it to 0 uses the default stripe count; setting it to -1 stripes across all available OSTs.
  • stripe_offset - OST index to write the first stripe to. The default value of -1 allows the system to choose the starting index. This should only be modified for advanced use cases.

Calling lfs setstripe on a directory will set the stripe settings for the directory.

$ lfs setstripe -s 1g -c -1 /fs/lustre/example_striped_dir/
$ lfs getstripe /fs/lustre/example_striped_dir
/fs/lustre/example_striped_dir
stripe_count: -1 stripe_size: 1073741824 stripe_offset: -1

Files and sub-directories without set striping settings will inherit those of their parent directory.  Striping settings will not change for files already in the directory.

Calling lfs setstripe on a file will set the stripe settings for that file. The file can already exist but must not have any data in it. Attempting to set striping settings for a file that already contains data will result in an error.

$ lfs setstripe -s 64m -c -1 /fs/lustre/example_striped_file
$ lfs getstripe /fs/lustre/example_striped_file
/fs/lustre/example_striped_file
lmm_stripe_count:   40
lmm_stripe_size:    67108864
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  10
obdidx           objid           objid           group
10       640111560     0x262753c8                0
26       636851628     0x25f595ac                0
[...]

 

Changing Striping Settings for Existing File with Data

The following demonstrates how to change the striping settings for a file that already exists and already has data written to it. To change the striping settings for a file that exists but is still empty, follow the lfs setstripe instructions above.

The conclusive way to determine whether a file has data written to it is to run the command:

du <file>

If you see 0 listed, use the lfs setstripe instructions above. If the size is larger than 0, you will need to follow the instructions below to set new striping settings.
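
For example (the output below is illustrative; du reports the space used followed by the file name):

$ du /fs/lustre/example_file
0       /fs/lustre/example_file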

Introduction

Striping a non-empty file requires copying the file to a new file that has the desired striping parameters set and then deleting the old file. This is because striping settings cannot be changed for files that are non-empty.

The example used to illustrate this shows how to move a non-striped file at /fs/lustre/non_striped_file to a striped file at /fs/lustre/striped_file. The newly striped file will have a stripe count of -1 (stripe across all OSTs).

Set Up Striping Settings

First, set up the striping parameters that you want for the existing file on a different, empty file.

$ lfs setstripe -c -1 /fs/lustre/striped_file

Optionally, one can then check that the striping settings are correct before moving forward.

$ lfs getstripe /fs/lustre/striped_file
stripe_count: -1 stripe_size: 1073741824 stripe_offset: -1

Copy Data to Striped File

Now we can copy data from the previously non-striped file to the newly set up striped file.  

After the new file has data written to it you will be unable to change the striping settings without copying the data once again. It is especially wise to double-check the stripe settings for large files before proceeding.

$ cp /fs/lustre/non_striped_file /fs/lustre/striped_file

Delete Old File

There are no backups for the parallel file system -- ensure your data is as it should be before deleting the old file.  You can verify the two files are the same by using diff or comparing md5 hashes using the command md5sum.
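
For example, comparing checksums before removing the original (identical hashes indicate identical contents):

$ md5sum /fs/lustre/non_striped_file /fs/lustre/striped_file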

$ rm /fs/lustre/non_striped_file

Rename New File as Old File

If desired, you can rename the new, striped file back to its original name.

In this example, renaming the file to non_striped_file would not make logical sense, so we will rename the file to new_file instead.

$ mv /fs/lustre/striped_file /fs/lustre/new_file