Ruby

Ruby is unavailable for general access.
TIP: Remember to check the menu to the right of the page for related pages with more information about Ruby's specifics.

Ruby, named after the Ohio native actress Ruby Dee, is The Ohio Supercomputer Center's newest cluster.  An HP-built, Intel® Xeon® processor-based supercomputer, Ruby provides almost the same total computing power (~144 TF) as our former flagship system Oakley on less than half the number of nodes (240 nodes).  Ruby also features two distinct sets of hardware accelerators: 20 nodes are outfitted with NVIDIA® Tesla K40 GPUs and another 20 nodes feature Intel® Xeon® Phi coprocessors.

Hardware

Detailed system specifications:

  • 4800 total cores
    • 20 cores/node  & 64 gigabytes of memory/node
  • Intel Xeon E5-2670 V2 (Ivy Bridge) CPUs
  • HP SL250 Nodes
  • 20 Intel Xeon Phi 5110p coprocessors
  • 20 NVIDIA Tesla K40 GPUs
  • 2 NVIDIA Tesla K20X GPUs 
    • Both equipped on a single "debug" queue node
  • 1 TB of local disk space in '/tmp'
  • FDR IB Interconnect
    • Low latency
    • High throughput
    • High quality-of-service.
  • Theoretical system peak performance
    • 96 teraflops
  • NVIDIA GPU performance
    • 28.6 additional teraflops
  • Intel Xeon Phi performance
    • 20 additional teraflops
  • Total peak performance
    • ~144 teraflops

Ruby has one huge memory node.

  • 32 cores (Intel Xeon E5 4640 CPUs)
  • 1 TB of memory
  • 483 GB of local disk space in '/tmp'

Ruby is configured with two login nodes.

  • Intel Xeon E5-2670 (Sandy Bridge) CPUs
  • 16 cores/node & 128 gigabytes of memory/node

Connecting

To login to Ruby at OSC, ssh to the following hostname:

         ruby.osc.edu 

You can either use an ssh client application or execute ssh on the command line in a terminal window as follows:

         ssh <username>@ruby.osc.edu

From there, you have access to the compilers and other software development tools. You can run programs interactively or through batch requests. See the following sections for details.

File Systems

Ruby accesses the same OSC mass storage environment as our other clusters. Therefore, users have the same home directory as on the Oakley and Glenn clusters. Full details of the storage environment are available in our storage environment guide.

Software Environment

The module system on Ruby is the same as on the Oakley system. Use module load <package> to add a software package to your environment. Use module list to see what modules are currently loaded and module avail to see the modules that are available to load. To search for modules that may not be visible due to dependencies or conflicts, use module spider. By default, you will have the batch scheduling software modules, the Intel compiler and an appropriate version of mvapich2 loaded.
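A typical session might look like the following (package names and versions are illustrative):

module load intel/15.0.0   # load a specific package/version
module list                # show currently loaded modules
module avail               # list modules available to load
module spider mvapich2     # search for modules hidden by dependencies or conflicts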

You can keep up to date on the software packages that have been made available on Ruby by viewing the Software by System page and selecting the Ruby system.

Understanding the Xeon Phi

Guidance on what the Phis are, how they can be utilized, and other general information can be found on our Ruby Phi FAQ.

Compiling for the Xeon Phis

For information on compiling for and running software on our Phi coprocessors, see our Phi Compiling Guide.

Batch Specifics

Refer to the documentation for our batch environment to understand how to use PBS on OSC hardware. Some specifics you will need to know to create well-formed batch scripts:

  • Compute nodes on Ruby have 20 cores/processors per node (ppn).  
  • If you need more than 64 GB of RAM per node you may run on Ruby's huge memory node ("hugemem").  This node has four Intel Xeon E5-4640 CPUs (8 cores/CPU) for a total of 32 cores.  The node also has 1TB of RAM.  You can schedule this node by adding the following directive to your batch script: #PBS -l nodes=1:ppn=32.  This node is only for serial jobs, and can only have one job running on it at a time, so you must request the entire node to be scheduled on it.  In addition, there is a walltime limit of 48 hours for jobs on this node.
  • 20 nodes on Ruby are equipped with a single NVIDIA Tesla K40 GPU.  These nodes can be requested by adding gpus=1 to your nodes request, like so: #PBS -l nodes=1:ppn=20:gpus=1.  (A complete example script is sketched after this list.)
    • By default a GPU is set to the Exclusive Process and Thread compute mode at the beginning of each job.  To request the GPU be set to Default compute mode, add default to your nodes request, like so: #PBS -l nodes=1:ppn=20:gpus=1:default.
  • Ruby has 6 debug nodes which are specifically configured for short (< 1 hour) debugging type work.  These nodes have a walltime limit of 1 hour.  They are equipped with E5-2670 V1 CPUs with 16 cores per node.  One node is equipped with 2 NVIDIA K20X GPUs.
    • To schedule a debug node:
      • #PBS -l nodes=1:ppn=16 -q debug
    • To schedule the debug node equiped with 2 NVIDIA K20X GPUs:
      • #PBS -l nodes=1:ppn=16:gpus=2 -q debug
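As a minimal sketch, a batch script requesting one of the K40 GPU nodes might look like the following (the program name is hypothetical):

  #PBS -l walltime=1:00:00
  #PBS -l nodes=1:ppn=20:gpus=1
  #PBS -N gpu_job
  #PBS -j oe

  cd $TMPDIR
  cp $HOME/science/my_gpu_program .
  ./my_gpu_program > my_results
  cp my_results $HOME/science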

Using OSC Resources

For more information about how to use OSC resources, please see our guide on batch processing at OSC. For specific information about modules and file storage, please see the Batch Execution Environment page.

 


Technical Specifications

The following are technical specifications for Ruby.  We hope these may be of use to the advanced user.

  Ruby System (2014)
Number of Nodes: 240
Number of CPU Sockets: 480 (2 sockets/node)
Number of CPU Cores: 4800 (20 cores/node)
Cores per Node: 20
Local Disk Space per Node: ~800 GB in /tmp (SATA)

Compute CPU Specifications: Intel Xeon E5-2670 V2

  • 2.5 GHz
  • 10 cores per processor

Compute Server Specifications:

  • 200 HP SL230
  • 40 HP SL250 (for accelerator nodes)

Accelerator Specifications:

20 NVIDIA Tesla K40

  • 1.43 TF peak double-precision performance
  • 1 GK110B GPU
  • 2880 CUDA cores
  • 12 GB memory

2 NVIDIA Tesla K20X (single node)

  • 1.17 TF peak double-precision performance
  • 1 GK110 GPU
  • 2688 CUDA cores
  • 6 GB memory
  • Only available through the debug queue

20 Intel Xeon Phi 5110p

  • 1.011 TF peak performance
  • 60 cores
  • 1.053 GHz
  • 8 GB memory

Number of Accelerator Nodes: 41 total

  • 20 Xeon Phi equipped nodes
  • 20 NVIDIA Tesla K40 equipped nodes
  • 1 dual NVIDIA Tesla K20X equipped node

Total Memory: ~16 TB
Memory per Node: 64 GB
Memory per Core: 3.2 GB
Interconnect: FDR/EN InfiniBand (56 Gbps)

Login Specifications: 2 Intel Xeon E5-2670

  • 2.6 GHz
  • 16 cores
  • 132 GB memory

Special Nodes:

Huge Memory (1)

  • Dell PowerEdge R820 Server
  • 4 Intel Xeon E5-4640 CPUs
    • 2.4 GHz
  • 32 cores (8 cores/CPU)
  • 1 TB Memory

 


Programming Environment

Compilers

C, C++ and Fortran are supported on the Ruby cluster. Intel, PGI and GNU compiler suites are available. The Intel development tool chain is loaded by default. Compiler commands and recommended options for serial programs are listed in the table below. See also our compilation guide.

LANGUAGE     INTEL EXAMPLE                 PGI EXAMPLE             GNU EXAMPLE
C            icc -O2 -xHost hello.c        pgcc -fast hello.c      gcc -O2 -march=native hello.c
Fortran 90   ifort -O2 -xHost hello.f90    pgf90 -fast hello.f90   gfortran -O2 -march=native hello.f90

Parallel Programming

MPI

The system uses the MVAPICH2 implementation of the Message Passing Interface (MPI), optimized for the high-speed Infiniband interconnect. MPI is a standard library for performing parallel processing using a distributed-memory model. For more information on building your MPI codes, please visit the MPI Library documentation.

Ruby uses a different version of mpiexec than Oakley or Glenn. This is necessary because of changes in Torque. All OSC systems use the mpiexec command, but the underlying code on Ruby is mpiexec.hydra while the code on Oakley and Glenn was developed at OSC. They are largely compatible, but a few differences should be noted.

Caution: There are many variations on mpiexec and mpiexec.hydra. Information found on non-OSC websites may not be applicable to our installation.

The table below shows some commonly used options. Use mpiexec -help for more information.

OAKLEY                           RUBY                             COMMENT
mpiexec                          mpiexec                          Same command on both systems
mpiexec a.out                    mpiexec ./a.out                  Program must be in path on Ruby, not necessary on Oakley.
-pernode                         -ppn 1                           One process per node
-npernode procs                  -ppn procs                       procs processes per node
-n totalprocs / -np totalprocs   -n totalprocs / -np totalprocs   At most totalprocs processes per node (same on both systems)
-comm none                       (none)                           Omit for simple cases. If using $MPIEXEC_RANK, consider using pbsdsh with $PBS_VNODENUM.
-comm anything_else              (none)                           Omit. Ignored on Oakley, will fail on Ruby.
(none)                           -prepend-rank                    Prepend rank to output
-help                            -help                            Get a list of available options

mpiexec will normally spawn one MPI process per CPU core requested in a batch job. The -pernode option is not supported by mpiexec on Ruby; instead use -ppn 1 as shown in the table above.
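For example, a quick sketch of process placement on Ruby (a.out stands in for your MPI executable):

mpiexec ./a.out             # default: one process per requested core
mpiexec -ppn 1 ./a.out      # one process per node (replaces Oakley's -pernode)
mpiexec -ppn 2 ./a.out      # two processes per node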

OpenMP

The Intel, PGI and GNU compilers understand the OpenMP set of directives, which give the programmer finer control over parallelization. For more information on building OpenMP codes on OSC systems, please visit the OpenMP documentation.
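As a brief sketch, an OpenMP code could be compiled and run on a Ruby compute node like this (the source file name is hypothetical):

icc -O2 -xHost -qopenmp my_omp_prog.c -o my_omp_prog
export OMP_NUM_THREADS=20   # one thread per core on a Ruby compute node
./my_omp_prog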

GPU and Phi Programming

To request a GPU node on Ruby, use nodes=1:ppn=20:gpus=1. For GPU programming with CUDA, please refer to the CUDA documentation. Also refer to the page for each software package to check whether it is GPU enabled.

To request a Xeon Phi (MIC) node on Ruby, use nodes=1:ppn=20:mics=1. For Phi programming, please refer to the Ruby Phi FAQ and the Phi Compiling Guide.


Phi Compiling Guide

This document was created to guide users through compiling and executing programs for Ruby's Phi coprocessors.  It is not intended to help determine which of the Phi usage models to use.  No special actions are needed for programs running exclusively on the host.  For more general information on Ruby and its Phi coprocessors, see our Ruby Phi FAQ page.  Only Fortran, C, and C++ code can be compiled to run on the Phi coprocessors.  Code to be run on Ruby or the Xeon Phi coprocessors should be compiled on Ruby.

The Intel Xeon Phi accelerators are referred to as "Phis", and the Intel Xeon CPU as "Host", throughout this guide.

All Usage Models

Users compiling for Ruby should ensure they have the newest version of the Intel Compilers loaded.

Intel compiler suite version 15.0.0 can be loaded with the command:

module load intel/15.0.0

A list of the Intel compiler suite versions available can be seen with:

module spider intel

General Performance Considerations

  • Code should be parallelized.  Due to the simplified architecture of the Phi, serial code run on the Phi will usually be slower than the same serial code run on the host Xeon CPU.  Only through parallel computation can the Phi's power be fully utilized.  
  • Code should be vectorized.  Vectorization is the unrolling of a loop so that one operation can be performed on multiple pairs of operands at once.  The Phi has extra-wide vector units compared to a CPU, increasing the importance of vectorization for performance.
[Figure: chart showing the performance increases due to vectorization and multi-threading (Image courtesy Intel)]

 

Native Mode

This is the simplest usage model for running code on the Xeon Phi coprocessors.  Code is compiled on the host to be run exclusively on the Phi coprocessor.

To compile an application for the native usage model, use the -mmic compiler flag:

icc -O3 -mmic helloWorld.c -o helloWorld.out

Home directories (rooted at /nfs) are mounted on the Phis, so as long as your application resides there you do not need to copy it over to the Phi.

You can start your application on the Phi remotely from the host using the following syntax:

ssh mic0-r0007 ~/helloWorld.out
Hello World

Make sure to replace the Phi hostname and application path and name with your own.

If your application requires any shared libraries, make sure they are in a location accessible from the Phi and that their location is specified in the Phi's environment.  Shared locations include all home directories located on /nfs.  You can also copy any necessary libraries over to the Phi's /tmp folder.
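For instance, a shared library could be staged to the Phi's /tmp with scp (the library name and Phi hostname are hypothetical; this assumes the Phi accepts scp just as it accepts ssh):

scp /path/to/libexample.so mic0-r0007:/tmp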

The Phis have a minimal environment to start with.  If you require an LD_LIBRARY_PATH (or any other environment variable) for your application, you will need to set it manually on the Phis.  If you copied your necessary library files to /tmp, you could do the following from the Phi:

export LD_LIBRARY_PATH=/tmp

To check what environment variables the Phi comes with, run the following from the host:

ssh mic0-r0007 env

MPI Usage

MVAPICH2 can be used within natively compiled code to spawn MPI tasks exclusively on the Phi.  The only additional steps required are the setting of the environment variable I_MPI_MIC to 1 at runtime and making sure your processes are launched on the Phi.

Setting I_MPI_MIC to 1 at runtime enables the MPI library on the host to recognize and work with the Phi:

export I_MPI_MIC=1 

Making sure your processes are executed on the Phi is as simple as specifying to mpiexec to launch on the Phi.  Note the use of mpiexec.hydra, not mpiexec:

mpiexec.hydra -host mic0 -n 16 /tmp/MPI_prog.out

An alternative is to ssh to the Phi and launch mpiexec from there:

mpiexec.hydra -n 16 /tmp/MPI_prog.out

Important performance considerations:

  • Data should be aligned to 64 Bytes (512 bits)
  • Due to the large SIMD width of 64 Bytes, vectorization is crucial
  • Use the -vec-report2 compiler flag to generate vectorization reports to see whether loops have been vectorized for the Phi architecture (see the example after this list)
    • If vectorized, messages will read "*MIC* Loop was vectorized" or similar
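A sketch of a native-mode compile that also emits a vectorization report, reusing the example source from above:

icc -O3 -mmic -vec-report2 helloWorld.c -o helloWorld.out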

 

Intel MKL Automatic Offload (AO)

Some Intel MKL functions are Automatic Offload capable; if the library call is made after automatic offloading has been enabled, MKL will automatically decide at runtime whether or not to offload some or all of the calls to the Phi.  This decision is based upon the problem size, load on the processors, and other metrics.  This offloading is completely transparent to the user, and no special compiler options are needed.  If the Phi is not available for any reason, MKL functions will fall back to executing on the host.

Automatic Offload enabled functions

The following Level-3 BLAS functions and LAPACK functions are AO-enabled as of the latest MKL version 11.1, available on Ruby:

  • *GEMM, *SYMM, *TRMM, and *TRSM
  • LU, QR, Cholesky factorizations

* (asterisk) is a wildcard specifying all data types (S, D, C, and Z).

Enabling and Disabling Automatic Offload

Automatic Offload can be both enabled and disabled through the setting of an environment variable or the call of a support function.  Compiler pragmas are not needed -- users can compile and link code the usual way.

To enable AO in FORTRAN or C code:

rc = mkl_mic_enable()

Alternatively, to enable AO through an environment variable:

export MKL_MIC_ENABLE=1

 

To disable AO in FORTRAN or C code:

rc = mkl_mic_disable()

Alternatively, to disable AO through an environment variable:

export MKL_MIC_ENABLE=0

Using Automatic Offload and Compiler Assisted Offload in the same program

The Intel MKL library supports the use of both Automatic Offload and Compiler Assisted Offload in the same program.  When doing so, users need to explicitly specify the work division for AO-aware functions using support functions or environment variables.  By default, if the work division is not specified, all work will be done on the host.
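As a sketch, the division can be set with MKL's work-division environment variables (variable names from Intel's MKL documentation; the 0.5/0.5 split is illustrative):

export MKL_MIC_ENABLE=1
export MKL_HOST_WORKDIVISION=0.5   # fraction of AO-enabled work kept on the host
export MKL_MIC_WORKDIVISION=0.5    # fraction of AO-enabled work offloaded to the Phi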

Force execution failure if offload not available

Intel MKL will default to running computations on the host if the Phi is not available for any reason.  Whether or not computations were offloaded to the Phi will not be apparent to the user.  To force execution to fail if the offload fails, use the following command to set the proper environment variable:

export MKL_MIC_DISABLE_HOST_FALLBACK=1

Setting this will cause programs to exit with the error message "Could not enable Automatic Offload" if an offload attempt fails.

Generate offload report

By default, automatic offload operations are transparent to the user; whether or not work was offloaded and how much of that work was offloaded will not be apparent to the user.  To allow users to examine these details, MKL can generate an offload report at runtime.  The environment variable OFFLOAD_REPORT needs to be set to 1 or 2 before runtime to do this.

export OFFLOAD_REPORT=1

Setting OFFLOAD_REPORT to 0 (or not setting it) results in no offload report.

Setting OFFLOAD_REPORT to 1 results in a report including:

  • Name of function called
  • Effective Work Division
  • Time spent on Host during call
  • Time spent on each available Phi coprocessor during call

Setting OFFLOAD_REPORT to 2 results in a report including everything from 1, and in addition:

  • Amount of data transferred to and from each Phi during call

Important performance considerations:

  • Automatic offload performs best on large, square matrices

For more information on using the Intel MKL automatic offload feature, refer to Intel's guide on the subject.

 

Compiler Assisted Offload (CAO)

In Compiler Assisted Offload, pragmas, also known as directives, are added to the code specifying sections of that code to offload their execution to the Phis.  These offload regions do not require any special coding considerations, and can utilize the OpenMP and Intel Cilk programming models.  When the compiler reaches an offload pragma, it generates code for both the host and the Phi.  The resulting executable consists of code for both the host and the Phi.

Currently, the Intel compiler supports Intel's Language Extensions for Offload (LEO) for markup.  It is expected that version 4.0 of the OpenMP standard will include offload directives for the Phi coprocessors as well.

Adding offload directives

The primary step in preparing code for CAO is to add directives specifying when and how to offload code to the Phi.  Here is a basic example of what these offload directives look like in C:

int main(){
...
    //offload code
    #pragma offload target(mic)
    {
        //parallelism via OpenMP on the MIC
        #pragma omp parallel for
        for( i = 0; i < k; i++ ){
            for( j = 0; j < k; j++ ){
                a[i] = tan(b[j]) + cos(c[j]);
            }
        } //end OpenMP section
    } //end offload section
...
}

..and the same example in Fortran:

program main
...
!dir$ offload begin target(mic)
!$omp parallel do
do i = 1,K
    do j = 1,K
        a(i) = tan(b(j)) + cos(c(j))
    end do
end do
!dir$ end offload
...
end program
   

Specifiers can be added to select the target Phi (useful when multiple Phis are available) and to control the flow of data to and from the Phi.  An example of these specifiers in C:

#pragma offload target(mic:0) inout(a) in(b,c)

This directive specifies:

  • This section of code is offloaded to a specific Phi coprocessor, in this case 0.
  • The inout specifier defines a variable that is both copied to the Phi and copied back to the host.
  • The in specifier defines a variable as strictly input to the coprocessor.  The value is not copied back to the host.

For more information on directives and additional specifiers refer to Intel's Effective Use of the Intel Compiler's Offload Features.

Compilation

No additional steps are required at the compile or link stage.  

Execution

No special steps are required at runtime; offload of specified sections of code and data transfers are automatically handled.
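Putting the two steps together, a minimal sketch of the CAO workflow (the source file, which contains offload pragmas, is hypothetical):

icc -qopenmp offload_prog.c -o offload_prog.out
./offload_prog.out          # offload and data transfers happen automatically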

Controlling Offload with Environment Variables

Environment variables can be used to affect the way the offload runtime library operates.  These environment variables are prefixed with either "MIC_" or "OFFLOAD_".  Listed below are some commonly used environment variables:

MIC_LD_LIBRARY_PATH

Sets the path where shared libraries needed by the MIC offloaded code reside.

OFFLOAD_REPORT

When set to 1 or 2, offload details are printed to standard out, with 2 including details of data transfers.

OFFLOAD_DEVICES

Restricts the process to only use the specified Phis.  Multiple Phis can be specified using commas.

MIC_ENV_PREFIX

By default, all environment variables defined on the host are replicated to the coprocessor's execution environment when an offload occurs.  This behavior can be modified by defining this environment variable.  When defined, only environment variables on the host prefixed with MIC_ENV_PREFIX's value are passed on to the Phi.  The passed environment variables are set on the Phi with the prefix stripped.  This is particularly valuable for controlling OpenMP, MPI, and Intel Cilk environment variables.

Setting MIC_ENV_PREFIX has no effect on the fixed MIC_* environment variables such as MIC_LD_LIBRARY_PATH.
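Combining a few of these variables, a sketch of an offload run environment (the device number and library location are illustrative, and the executable is the hypothetical one from the CAO example above):

export OFFLOAD_DEVICES=0                               # use only the first Phi
export OFFLOAD_REPORT=2                                # include data-transfer details
export MIC_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:/tmp   # add a Phi-side library location
./offload_prog.out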

MPI

While calling MPI functions within offload regions is not supported, offloading within an MPI program is supported by the Intel MPI library.  When offloading, however, no attempt is made to coordinate the Phi's resource usage amongst the MPI ranks.  If 12 MPI ranks running on the host all offload 8 threads to the Phi, all of these threads will be spawned on the first 8 cores of the Phi.  This can quickly lead to resource conflicts.  A performance penalty is also incurred when multiple ranks offload simultaneously to a single Phi.

Mitigating these issues is beyond the scope of this guide; please refer to Using MPI and Xeon Phi Offload Together for more information.

For more detailed information on programming for the CAO model please refer to Intel's Effective Use of the Intel Compiler's Offload Features.

 

Symmetric/Heterogeneous Offload

Called both Symmetric and Heterogeneous offloading, this programming model treats the Phi as simply another node in a heterogeneous cluster.  MPI ranks are spawned on both the host and the Phi.  Because the Phi cannot run an executable compiled for the host, two separate executables need to be prepared.  Getting these separate executables to run from the same mpiexec.hydra call requires adding a prefix or postfix to the Phi executable's name and setting the respective environment variable.

Setup

Remember to source both the compilervars and mpivars files before starting, as outlined in the All Usage Models section above.

Make sure to have your desired MPI implementation loaded before compilation.  We recommend using the MVAPICH2 MPI implementation that is loaded by default.

Compilation

Executables must be compiled for both the host and Phi separately. You must use the Intel compiler and the Intel MPI implementation.  To compile the phi executable, include the -mmic flag at compilation.  No special considerations are required for the host executable.

# Create host executable
mpicc helloworld.c -o helloworld.out

# Create Phi executable
mpicc -mmic helloworld.c -o helloworld.out.mic

Execution

Make sure to have your chosen MPI implementation module loaded at runtime.

Once in a job, first create an MPI hosts file listing the hosts to run on, one per line:

-Bash-4.1$ cat mpi_hosts
r0001
mic0-r0001

Notice that mic# goes before the Xeon hostname, separated by a hyphen.  In this case we will target both one Xeon CPU and one Phi coprocessor.

Then set I_MPI_MIC to 1 so the MPI library on the host recognizes and works with the Phi:

export I_MPI_MIC=1 

Next, let MPI know how you identify your Phi executable in comparison to your host executable.  In our case we used the postfix .mic to identify our Phi executable, and thus we will need to set I_MPI_MIC_POSTFIX.

export I_MPI_MIC_POSTFIX=.mic

Alternatively, a prefix can be used by setting I_MPI_MIC_PREFIX instead.  The prefix option enables Phi-specific executables to be stored in a separate directory.

Finally, from the host start the program up. Note the use of mpiexec.hydra, not mpiexec.

mpiexec.hydra -f mpi_hosts -perhost 1 -n 2 helloworld.out

 

MPI

Coming soon.

 

 


Ruby Phi FAQ

The Ruby cluster is composed of both standard Intel Xeon CPUs as well as new Xeon Phi coprocessors.  Special considerations must be taken both when compiling for and running software on the Phi coprocessors.  This guide provides general information on the Phi coprocessors and breaks down the different programming models available for them.  For detailed information on compiling software for our Phis, please refer to our Phi Compiling Guide.

 

The Intel Xeon Phi coprocessors are referred to as "Phis", and the Intel Xeon CPU as "Host". 

What are the Xeon Phi coprocessors?

The Xeon Phi coprocessors (commonly referred to as both accelerators and MICs) can be thought of as complementary add-ons to Ruby's standard Xeon Host CPUs.  Much like the GPU accelerators available on Oakley and Glenn, they are used to increase performance by providing a specialized computational environment optimized for certain operations.  For those operations, the Phis can be orders of magnitude faster than the same operation run on the CPU.

How are the Phi coprocessors different from GPUs?

Both serve the same goal of accelerating computation, but they achieve it differently.  Xeon Phis run Intel assembly code similar to that of the Xeon Host CPUs, so by simply recompiling source code for the Phi, most programs can benefit greatly.  GPUs, on the other hand, traditionally run their own proprietary code, requiring programs to be tediously tailored for these specific GPUs before they can be compiled and run.

 

Should I use the Phis?

Short answer: If you are going to run your code on Ruby, yes.  Getting code to run on the Phis can be such a simple process and result in significant gains that there is little reason not to.  

For more help on determining whether to use the Phi, refer to Intel's guide on the subject.

You do not have to take advantage of the Phis to run code on Ruby. 

 

How can I use the Phis?

There are three main ways of taking advantage of the Phi's computing power:

  1. Native Execution

    • Compile binary for Phi ONLY
    • Done with -mmic compiler flag
    • SSH to Phi and then run
    • Good for getting familiar with Phi characteristics
    • Host sits idly during code execution
  2. Symmetric/Heterogeneous Execution (Using MPI)

    • Compile and run code on both Host and Phi
    • Both the Host and the Phi operate "symmetrically" as MPI targets
    • Requires careful load balancing between MPI tasks due to differences between Host and Phi (as well as OpenMP threads if taking the hybrid approach)
  3. Offload Execution

    • Automatic Offload (AO) with Intel Math Kernel Library (MKL)
      • Some MKL routines are automatically offloaded to Phi when code is run on the Host
      • Does not require any changes to code -- completely transparent to the user
      • MKL determines if computation will benefit from offloading
    • Compiler Assisted Offloading (CAO)
      • Add offload directives/pragmas to parts of code you want offloaded
      • If the Phi is unavailable for any reason, code defaults back to running on the Host
      • Two programming sub-models differing on data movement
        1. Explicit - Code directs data movement to/from Phi 
          • Only supports arrays of scalar or bitwise copyable structure or class.  For more complex C/C++ data types, use implicit model
          • Uses #pragma offload construct
        2. Implicit - Code establishes virtual "shared memory" model, data is synchronized automatically between Host and Phi at established points
          • Only available for C/C++
          • Appropriate for complex, pointer-based data structures (linked lists, binary trees, etc.)
          • Uses _Cilk_shared and _Cilk_offload constructs
          • Not appropriate for very large data

What programming languages can you use for the Phis?

Only code compiled from C/C++ and Fortran can be run on the Phis.

Can I still use X for parallel programming?

The Phis support most of the parallel programming options available on the Host.  Specifically, the Phis are known to support the following:

  • MVAPICH2 (OSC's recommended MPI library)
  • OpenMP
  • Intel Cilk Plus
  • pthreads
  • Intel Threading Building Blocks (Intel TBB)

What sections of code should I offload? (CAO only)

Highly-parallel sections of code are good candidates for offload.  Offloaded serial code will run much slower than it would on the Host.

Data transfers between the Phi and Host must also be taken into consideration when choosing sections of code to offload. Data transfers are slow and should be minimized.  If two offloaded parallel sections of code have a serial section between them and they all act on the same data, it may be more efficient to offload the serial section as well.  This eliminates the need to transfer the data back to the Host, run the serial section, and then transfer this data back to the Phi.

 

How do I run code on the Phis?

MKL, OpenMP, or MPI based programs

To run an MKL, OpenMP, or MPI based program on the Phis, some libraries may need to be copied over.

 

How do I set up environment variables on the Phi?

By default, all environment variables set on the Host are passed to the Phi.  This behavior can be overridden by setting MIC_ENV_PREFIX to a string.  Then, only environment variables prefixed by this string will be passed to the Phi's environment.

For example, setting MIC_ENV_PREFIX to PHI would cause only environment variables prefixed with PHI to be passed (PHI_PATH, PHI_LIBRARY, etc.).
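A brief sketch of that example (the thread count is illustrative):

export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=120   # appears on the Phi as OMP_NUM_THREADS=120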

MIC_LD_LIBRARY_PATH is not stripped and passed to the Phi, and thus MIC_ENV_PREFIX=MIC will not work to change the Phi's LD_LIBRARY_PATH.

See Intel's Setting Environment Variables on the CPU to Modify the Coprocessor's Execution Environment for more information on passing and setting environment variables.

 


Using the Intel Xeon Phi on Ruby

Introduction

Twenty of the new Ruby nodes have an Intel Xeon Phi coprocessor.  Some of the older debug nodes do as well.  This guide explains how to build and run code for the Phi on Ruby.  It does not discuss programming techniques or performance issues.

For background information on the Xeon Phi and techniques for using the Phi efficiently, the references listed in the References section below may be useful.

The Xeon Phis are somewhat difficult to configure and use. Information that you find in documents from Intel or other sources, particularly TACC, may need modification to work at OSC. Here’s a guide to what you can use directly and what needs to be adapted for Ruby.

  • Information that’s the same everywhere: Xeon Phi architecture, capabilities, programming advice, compiling and linking, overviews of usage modes, environment variables
  • Information that’s overridden by this guide for Ruby: Batch system usage, accessing the Phi, setting up the host environment, running programs on the Phi, file management, some MPI information

Several examples have been created to illustrate various Phi usage models. Code for the examples is available on Ruby. Detailed instructions are included below for building and running them.

Accessing the Xeon Phi Nodes on Ruby

The Phis are accessed through the batch system with either an interactive or a regular job. You must add “:mics=1” to your “nodes” request. A few of the debug nodes have 2 mics.

Some examples:

#PBS -l nodes=4:ppn=20:mics=1
qsub -l nodes=1:ppn=20:mics=1
qsub -I -l nodes=1:ppn=16:mics=1 -q debug
qsub -I -l nodes=1:ppn=16:mics=2 -q debug

Your job executes on the Xeon processor. You can use the Xeon Phi through one of the programming models described below.

Note: Use mvapich2 for MPI usage on the Xeon Phis on Ruby.

Programming Models for the Xeon Phi

The Intel Xeon Phi is a coprocessor, or accelerator, that can be attached to an Intel Xeon processor, which is referred to as the host. The Phi is also known as a Many-Integrated-Core processor, or MIC, pronounced “Mike”.

Programming for the Phi is done with the Intel version 15 (or higher) compiler in C/C++ or Fortran. The Intel module on Ruby, which is loaded automatically at login, sets up the host environment.  

In addition to the Intel module, the mic module is necessary to set up the environment:

module load mic

Note: As of the Ruby software update on September 15th, the mic module has been integrated with the Intel module and no longer needs to be loaded separately.

There are three ways to use the Phi, with variations on each. These programming models are described next. Examples for each are given in the next section.

Compiler Notes

Fortran note: To activate preprocessing during compilation, either use the capitalized “.F90” suffix or add the  -fpp  flag.

Note: -qopenmp replaces -openmp , which has been deprecated in the newest Intel compilers.

Optimization flags are mostly the same as for the host.

Native Computing

Most programs that run on the Xeon host can be recompiled to run natively on the Xeon Phi coprocessor. Codes that vectorize and parallelize well may run faster on the Phi than on the host; other codes may perform poorly.

You can log in to the Phi and run your mic-built program the way you would on the host. You can also run a native mic program directly from the host. The Phi runs a stripped-down version of Linux whose userland is provided by BusyBox. The only shell available is sh.

Note: The env command does not show the LD_LIBRARY_PATH variable on the Phi; you can see its value with “echo $LD_LIBRARY_PATH”.

On Ruby the host node (Xeon) has a name of the form r0221. The associated Xeon Phi in this case is named mic0-r0221. If you are logged into a node that has a Phi you can log into the Phi with:

ssh mic0-$(hostname)

Home directories are mounted on the Phis, but the gpfs and lustre file systems are not. Most of the environment you’re used to seeing on the host is not available on the Phi.

All compilation for the Phi is done on the host. Simply add the flag -mmic  to the compilation line (and remove any other -m or -x flag). The .mic extension is optional; it is often used to distinguish coprocessor native executables.

icc -O3 -openmp -mmic mysource.c -o myapp.mic
ifort -O3 -openmp -mmic mysource.f90 -o myapp.mic

To run a native program from the host, use the micnativeloadex command. This command uses the SINK_LD_LIBRARY_PATH environment variable to search for shared library dependencies on the coprocessor. Note that MIC_LD_LIBRARY_PATH (used for the offload model) is initialized by the Intel compiler module but SINK_LD_LIBRARY_PATH is not.

export SINK_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH
micnativeloadex myapp.mic

You can also log in to the coprocessor as described above and run your program normally.

ssh mic0-$(hostname)
./myapp.mic

It is possible, and quite common, to run MPI programs on one or multiple Xeon Phi coprocessors. For the Xeon Phis on Ruby, mvapich2 is the supported MPI implementation.

Reference: https://software.intel.com/en-us/articles/building-a-native-application-for-intel-xeon-phi-coprocessors

Offload Computing

Offload computing represents a true coprocessor model. A program runs on the host and offloads portions of its work to the coprocessors. There are several ways to use the offload programming model.

Automatic Offload - MKL

Automatic offload is a feature of the Intel Math Kernel Library (MKL). Certain computationally intensive MKL functions are automatic-offload capable. (See references below.) If automatic offload is enabled, through either an environment variable or a function call, MKL will automatically divide the work of these functions between the host and the coprocessor if the problem size warrants it. Automatic offload applies only to host MKL calls made outside of offload sections.

These environment variables should be set when using automatic offload:

export MKL_MIC_ENABLE=1
export OMP_NUM_THREADS=20
export MIC_OMP_NUM_THREADS=240

MKL_MIC_ENABLE enables automatic offload. OMP_NUM_THREADS specifies the number of OpenMP threads on the host. MIC_OMP_NUM_THREADS specifies the number of OpenMP threads on the coprocessor for offload code.

MKL can be used in other modes as well. MKL calls can be made in offload code; it is also available in native mode.

Compiler-assisted offload – explicit model

With the explicit offload model the programmer controls data movement and code execution through the use of compiler directives (Fortran) or pragmas (C/C++).

Compiler-assisted offload – implicit model

Implicit offload uses a virtual shared memory model. It uses Cilk Plus, an extension to the C and C++ languages, and is available for C/C++ only. 

Detecting and monitoring offload

At compile time you can get offload information with the option -opt-report-phase=offload.

At run time the environment variable OFFLOAD_REPORT can be set on the host to provide offload information. Valid values are 0, 1, 2, and 3, with 3 providing the most information and 0 disabling the report. It is also possible to ssh to mic0-$(hostname) and run “top” to see offload processes running.

Here’s sample output from the Fortran offload example with OFFLOAD_REPORT set to 1.

[r0222]$ export OFFLOAD_REPORT=1
[r0222]$ ./tbo_sort.out
Fortran Tutorial: Offload Demonstration
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed:      1
[Offload] [MIC 0] [File]                    tbo_sort.F90
[Offload] [MIC 0] [Line]                    183
[Offload] [MIC 0] [Tag]                     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        0.737751(seconds)
[Offload] [MIC 0] [Tag 0] [MIC Time]        0.024227(seconds)
[Offload] [MIC 0] [File]                    tbo_sort.F90
[Offload] [MIC 0] [Line]                    209
[Offload] [MIC 0] [Tag]                     Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        0.003644(seconds)
[Offload] [MIC 0] [Tag 1] [MIC Time]        0.000086(seconds)
[Offload] [MIC 0] [File]                    tbo_sort.F90
[Offload] [MIC 0] [Line]                    252
[Offload] [MIC 0] [Tag]                     Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        0.004511(seconds)
[Offload] [MIC 0] [Tag 2] [MIC Time]        0.000108(seconds)
Unsorted original values...first twenty (20) values:
Evens and Odds:
1    2    3    4    5    6    7    8    9   10
11   12   13   14   15   16   17   18   19   20
Sorted results...first ten (10) values each:
Evens:     2    4    6    8   10   12   14   16   18   20
Odds :     1    3    5    7    9   11   13   15   17   19
Primes:     2    3    5    7   11   13   17   19   23

References

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

https://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors

https://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor

Symmetric Computing

Symmetric computing, also known as  heterogeneous computing, involves running an MPI program with ranks on both the host and the coprocessor. It is “symmetric” in the sense that the same program executes on both the host and the coprocessor. It is “heterogeneous” because different processor types are involved.

The symmetric computing model does not yet work on Ruby.

Examples

The sole purpose of these examples is to illustrate how to build and run code for the Intel Xeon Phis on Ruby. They do not necessarily follow good programming techniques, nor are they suitable for performance comparisons.

The examples are taken from the Intel compiler distribution on Ruby. Some were written as mic examples; others were adapted from sample code written for a normal CPU. A path is provided for the original version of all sample code. Paths into the Intel compiler directory are given relative to $MKLROOT because that’s the only environment variable available. Paths are also provided for the working copies.

Program Written for Host Run Natively on MIC – Fortran Example

This code is an OpenMP program written for the CPU. We build and run it also as a native MIC application.

Original path:  $MKLROOT/../Samples/en_US/Fortran/openmp_samples/openmp_sample.f90

Working directory:  /nfs/14/judithg/ruby/mic/Samples/Fortran/openmp_samples

Description: This code finds all primes in the first 10,000,000 integers, the number of 4n+1 primes, and the number of 4n-1 primes in the same range.

 

Start batch job:


[ruby02]$ qsub -I -l nodes=1:ppn=20:mics=1
qsub: waiting for job 40901 to start
qsub: job 40901 ready
[r0221]$ cd ~/ruby/mic/Samples/Fortran/openmp_samples

Build and run on the host:


[r0221]$ ifort -o openmp_sample -xHost -qopenmp -fpp openmp_sample.f90
[r0221]$ ./openmp_sample
Range to check for Primes:           1    10000000
We are using          20  thread(s)
Number of primes found:      664579
Number of 4n+1 primes found:      332181
Number of 4n-1 primes found:      332398

Build native MIC application on the host:


[r0221]$ ifort -o openmp_sample.mic -mmic -qopenmp -fpp openmp_sample.f90

Run by logging into MIC:


[r0221]$ ssh mic0-$(hostname)
-sh-4.2$ cd ~/ruby/mic/Samples/Fortran/openmp_samples
-sh-4.2$ ./openmp_sample.mic
Range to check for Primes:           1    10000000
We are using         240  thread(s)
Number of primes found:      664604
Number of 4n+1 primes found:      332206
Number of 4n-1 primes found:      332398
-sh-4.2$  exit
logout
Connection to mic0-r0221 closed.

Run from the host:


[r0221]$ export SINK_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH
[r0221]$ micnativeloadex ./openmp_sample.mic
Range to check for Primes:           1    10000000
We are using         236  thread(s)
Number of primes found:      664604
Number of 4n+1 primes found:      332206
Number of 4n-1 primes found:      332398

Notes

The application gave different results on the MIC than it did on the host, suggesting a bug in the software stack.

Program Written for Host Run Natively on MIC – C Example

This code is an OpenMP program written for the CPU. We build and run it also as a native MIC application.

Original path:  $MKLROOT/../Samples/en_US/C/openmp_samples/openmp_sample.c

Working directory:  /nfs/14/judithg/ruby/mic/Samples/C++/openmp_samples

Description: Matrix multiplication

Start batch job:


[ruby02]$ qsub -I -l nodes=1:ppn=20:mics=1
qsub: waiting for job 40901 to start
qsub: job 40901 ready
[r0221]$ cd ~/ruby/mic/Samples/C++/openmp_samples

Build and run on the host:


[r0221]$ icc -o openmp_sample -xHost -qopenmp openmp_sample.c
[r0221]$ ./openmp_sample
Using time() for wall clock time
Problem size: c(600,2400) = a(600,1200) * b(1200,2400)
Calculating product 5 time(s)
We are using 20 thread(s)
Finished calculations.
Matmul kernel wall clock time = 1.00 sec
Wall clock time/thread = 0.05 sec
MFlops = 17280.000000

Build native MIC application on the host:

Note: This can also be done on a login node.


[r0221]$ icc -o openmp_sample.mic -mmic -qopenmp openmp_sample.c

Run by logging into MIC:


[r0221]$ ssh mic0-r0221
-sh-4.2$ cd ~/ruby/mic/Samples/C++/openmp_samples
-sh-4.2$ ulimit -s unlimited
-sh-4.2$ ./openmp_sample.mic
-sh-4.2$  exit
logout
Connection to mic0-r0221 closed.

Note

The stack size on the MIC by default is 8192 kbytes. This application requires a larger stack, so we have to change it. Otherwise it fails with a segmentation fault.

MKL Automatic Offload – Fortran

This code was provided by Intel as an example of automatic offload.

Original path:  $MKLROOT/examples/examples_mic.tgz

Working directory:  /nfs/14/judithg/ruby/mic/Samples/mkl/mic_ao/blasf/work

Description: Runs SGEMM – matrix multiplication

Build the example:

Build on the host, on either a login node or a compute node. The makefile hides all the interesting details, so individual commands are given here.


[r0222]$ ifort -c -O3 -openmp sgemm.f90
[r0222]$ ifort -o sgemm.out sgemm.o -mkl

Run the example:


[r0222]$ export MKL_MIC_ENABLE=1
[r0222]$ export MIC_OMP_NUM_THREADS=240
[r0222]$ ./sgemm.out
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 1 MIC devices present
Computing SGEMM with automatic workdivision
Setting workdivision for device MIC: 0 to 1.0
Resulting workdivision configuration:
workdivision[HOST: 0] = -1.0
workdivision[MIC: 0] =  1.0
Computing SGEMM on device  0
Done

Inside the code:

This is unmodified host code.

MKL Automatic Offload – C

This code was provided by Intel as an example of automatic offload.

Original path:  $MKLROOT/examples/examples_mic.tgz

Working directory:  /nfs/14/judithg/ruby/mic/Samples/mkl/mic_ao/blasc/work

Description: Runs SGEMM – matrix multiplication

Build the example:

Build on the host, on either a login node or a compute node. The makefile hides all the interesting details, so individual commands are given here.


[r0222]$ icc -c -O3 -openmp sgemm.c
[r0222]$ icc -o sgemm.out sgemm.o -mkl

Run the example:


[r0222]$ export MKL_MIC_ENABLE=1
[r0222]$ export MIC_OMP_NUM_THREADS=240
[r0222]$ ./sgemm.out
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 1 MIC devices present
Computing SGEMM with automatic workdivision
Setting workdivision for device MIC:00 to 1.0
Resulting workdivision configuration:
workdivision[HOST] = -1.00
workdivision[MIC:00] = +1.00
Computing SGEMM on device 00
Done

Inside the code:

This is unmodified host code.

Explicit Offload Example – Fortran

This code was provided by Intel as an example of explicit offload. It was originally part of a tutorial.

Original path:  $MKLROOT/../Samples/en_US/Fortran/mic_samples/LEO_tutorial

Working directory:  /nfs/14/judithg/ruby/mic/Samples/Fortran/LEO_tutorial

Description: Sorts a list of numbers, identifies evens, odds and primes

Build the example:

Build on the host, on either a login node or a compute node.


make mic

Compile and link commands:


ifort -qopenmp  -c tbo_sort.F90 -o tbo_sort.o
ifort -V tbo_sort.o -qopenmp   -o tbo_sort.out

Run the example:


[r0222]$ ./tbo_sort.out
Fortran Tutorial: Offload Demonstration
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed:      1
Unsorted original values...first twenty (20) values:
Evens and Odds:
1    2    3    4    5    6    7    8    9   10
11   12   13   14   15   16   17   18   19   20
Sorted results...first ten (10) values each:
Evens:     2    4    6    8   10   12   14   16   18   20
Odds :     1    3    5    7    9   11   13   15   17   19
Primes:     2    3    5    7   11   13   17   19   23

Inside the code:

This programming model uses compiler directives. See the source code for details, but here is a sample line of code.


!DIR$ OFFLOAD BEGIN target(mic : target_id) mandatory &
inout(numEs) in(all_Vals) out(E_vals)

Explicit Offload Example – C

This code was provided by Intel as an example of explicit offload.

Original path:  $MKLROOT/../Samples/en_US/C++/mic_samples/intro_sampleC

Working directory:  /nfs/14/judithg/ruby/mic/Samples/C++/intro_sampleC

Description: Runs several test functions

Build the example:

Build on the host, on either a login node or a compute node.


make mic

Typical compile command:


icc -qopenmp  -c sampleC00.c -o sampleC00.o

Simplified link command:


icc *.o -qopenmp   -o intro_sampleC.out

Run the example:


[r0222]$ ./intro_sampleC.out
Samples started
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed: 1
Offload sections will execute on: Target CPU (offload mode)
PASS Sample01
PASS Sample02
PASS Sample03
PASS Sample04
PASS Sample05
PASS Sample06
PASS Sample07
PASS Sample08
PASS Sample09
PASS Sample10
PASS Sample11
PASS Sample12
PASS Sample13
PASS Sample14
Samples complete

Inside the code:

This programming model uses compiler pragmas. See the source code for details, but here is a sample line of code.


#pragma offload target(mic) optional inout(myglob)

Implicit Offload Example – C++

This example is almost identical to the C example of implicit offload.

Original path:  $MKLROOT/../Samples/en_US/C++/mic_samples/shrd_sampleCPP

Working directory:  /nfs/14/judithg/ruby/mic/Samples/C++/shrd_sampleCPP

Description: Runs several test functions

Build the example:

Build on the host, on either a login node or a compute node.


make mic

Typical compile command:


icpc -qopenmp  -c shrd_ofld00.cpp -o shrd_ofld00.o

Simplified link command:


icpc *.o -qopenmp   -o shrd_sampleCPP.out

Run the example:


[r0222]$ ./shrd_sampleCPP.out
Samples started
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed: 1
Offload sections will execute on: Target CPU (offload mode)
PASS shrd_ofld01
PASS shrd_ofld02
PASS shrd_ofld03
PASS shrd_ofld04
PASS shrd_ofld05
PASS shrd_ofld06
PASS shrd_ofld07
PASS shrd_ofld08
PASS shrd_ofld09
PASS shrd_ofld10
PASS shrd_ofld11
PASS shrd_ofld_vt01
PASS shrd_ofld_vt02
PASS shrd_ofld_vt03
PASS shrd_ofld_link
Samples complete

Inside the code:

This programming model uses Cilk. See the source code for details, but here is a sample line of code.


_Cilk_shared int chk_target00();

 

 

 

Executing Programs

Batch Requests

Batch requests are handled by the TORQUE resource manager and Moab Scheduler as on the Oakley and Glenn systems. Use the qsub command to submit a batch request, qstat to view the status of your requests, and qdel to delete unwanted requests. For more information, see the manual pages for each command.

There are some changes for Ruby; they are listed here:

  • Ruby nodes have 20 cores per node, and 64 GB of memory per node. This is less memory per core than on Oakley.
  • Ruby will be allocated on the basis of whole nodes even for jobs using less than 20 cores.
  • The amount of local disk space available on a node is approximately 800 GB.
  • MPI Parallel Programs should be run with mpiexec, as on Oakley, but the underlying program is mpiexec.hydra instead of OSC's mpiexec. Type mpiexec --help for information on the command line options.

Example Serial Job

This particular example uses OpenMP.

  #PBS -l walltime=1:00:00
  #PBS -l nodes=1:ppn=20
  #PBS -N my_job
  #PBS -j oe

  cd $TMPDIR
  cp $HOME/science/my_program.f .
  ifort -O2 -openmp my_program.f
  export OMP_NUM_THREADS=20
  ./a.out > my_results
  cp my_results $HOME/science

Please remember that jobs on Ruby must use a complete node.

Example Parallel Job

    #PBS -l walltime=1:00:00
    #PBS -l nodes=4:ppn=20
    #PBS -N my_job
    #PBS -j oe

    cd $HOME/science
    mpif90 -O3 mpiprogram.f
    cp a.out $TMPDIR
    cd $TMPDIR
    mpiexec ./a.out > my_results
    cp my_results $HOME/science

For more information about how to use OSC resources, please see our guide on batch processing at OSC. For specific information about modules and file storage, please see the Batch Execution Environment page.


Queues and Reservations

Here are the queues available on Ruby. Please note that you will be routed to the appropriate queue based on your walltime and job size request.

NAME      NODES AVAILABLE                MAX WALLTIME   MAX JOB SIZE   NOTES
Serial    Available minus reservations   168 hours      1 node
Parallel  Available minus reservations   96 hours       40 nodes
Hugemem   1                              48 hours       1 node         32 cores with 1 TB RAM
Debug     6                              1 hour         2 nodes        16 cores with 128 GB RAM

The debug queue is for small interactive and test jobs.  Use "-q debug" to request it.  One debug node is equipped with 2 NVIDIA K20X GPUs; use -l nodes=1:ppn=16:gpus=2 -q debug to schedule it.

"Available minus reservations" means all nodes in the cluster currently operational (this will fluctuate slightly), less the reservations. To access one of the restricted queues, please contact OSC Help. Generally, access will only be granted to these queues if performance of the job cannot be improved, and job size cannot be reduced by splitting or checkpointing the job.

Occasionally, reservations will be created for specific projects.

Approximately half of the Ruby nodes are a part of client condo reservations. Only jobs of short duration are eligible to run on these nodes, and only when they are not in use by the condo clients. As a result, your job(s) may have to wait for eligible resources to come available while it appears that much of the cluster is idle.

Batch Limit Rules

Full Node Charging Policy

On Ruby, we always allocate whole nodes to jobs and charge for the whole node. If a job requests less than a full node (nodes=1:ppn<20), the job execution environment is what is requested (the job only has access to the number of cores according to the ppn request) with 64 GB of RAM; however, the job will be allocated a whole node and charged for the whole node. A job that requests nodes>1 will be assigned entire nodes with 64 GB/node and charged for the entire nodes regardless of the ppn request.  A job that requests the huge-memory node (nodes=1:ppn=32) will be allocated the entire huge-memory node with 1 TB of RAM and charged for the whole node (32 cores worth of RU).

To manage and monitor your memory usage, please refer to Out-of-Memory (OOM) or Excessive Memory Usage.

Queue Default

Please keep in mind that if you submit a job with no node specification, the default is nodes=1:ppn=20, while if you submit a job with no ppn specified, the default is nodes=N:ppn=1.
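A quick sketch of both defaults (job.sh is a hypothetical batch script):

qsub job.sh               # no node specification: defaults to nodes=1:ppn=20
qsub -l nodes=2 job.sh    # no ppn specified: defaults to nodes=2:ppn=1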

Debug Node

Ruby has 6 debug nodes which are specifically configured for short (< 1 hour) debugging type work. These nodes have a walltime limit of 1 hour. These nodes are equipped with E5-2670 V1 CPUs with 16 cores per node. One node is equipped with 2 NVIDIA K20X GPUs. To schedule a debug node, use nodes=1:ppn=16 -q debug.  To schedule the debug node equipped with 2 NVIDIA K20X GPUs, use nodes=1:ppn=16:gpus=2 -q debug.

GPU and Intel Xeon Phi (MIC) Node

On Ruby, 20 nodes are equipped with NVIDIA Tesla K40 GPUs (one GPU per node).  These nodes can be requested by adding gpus=1 to your nodes request (nodes=1:ppn=20:gpus=1). 20 nodes are equipped with Intel Xeon Phi (MIC) accelerators (one MIC per node). These nodes can be requested by adding mics=1 to your nodes request (nodes=1:ppn=20:mics=1).

Walltime Limit

Here are the queues available on Ruby:

NAME      MAX WALLTIME   MAX JOB SIZE   NOTES
Serial    168 hours      1 node
Parallel  96 hours       40 nodes
Hugemem   48 hours       1 node         32 cores with 1 TB RAM
Debug     1 hour         6 nodes        16 cores with 128 GB RAM

Job Limit

An individual user can have up to 40 concurrently running jobs and/or up to 800 processors/cores in use. All the users in a particular group/project can among them have up to 80 concurrently running jobs and/or up to 1600 processors/cores in use if the system is busy. The debug queue is limited to 1 job at a time per user. For Condo users, please contact OSC Help for more instructions.


Citation

To cite Ruby, please use the following Archival Resource Key:

ark:/19495/hpc93fc8

Here is the citation in BibTeX format:

@article{Ruby2015,
ark = {ark:/19495/hpc93fc8},
url = {http://osc.edu/ark:/19495/hpc93fc8},
year  = {2015},
author = {Ohio Supercomputer Center},
title = {Ruby supercomputer}
}

And in EndNote format:

%0 Generic
%T Ruby supercomputer
%A Ohio Supercomputer Center
%R ark:/19495/hpc93fc8
%U http://osc.edu/ark:/19495/hpc93fc8
%D 2015

Request Access

Projects that would like to use the Ruby cluster will need to request access.  This is because of the particulars of the Ruby environment, which include its size, MICs, GPUs, and scheduling policies.

Motivation

Access to Ruby is done on a case by case basis because:

  • It is a smaller machine than Oakley or Glenn, and thus has limited space for users
    • Oakley has 694 nodes, while Ruby only has 240 nodes.
  • Its CPUs are less general-purpose, and therefore more consideration is required to get optimal performance
  • Scheduling is done on a per-node basis, and therefore jobs must scale to this level at a bare minimum
  • Additional consideration is required to get full performance out of its MICs and GPUs

Good Ruby Workload Characteristics

Those interested in using Ruby should check that their work is well suited for it by using the following list.  Ideal workloads will exhibit one or more of the following characteristics:

  • Work scales well to large core counts
    • No single core jobs
    • Scales well past 2 nodes on Oakley
  • Needs access to Ruby specific hardware (MICs or GPUs)
  • Memory bound work
  • Software:
    • Supports MICs or GPUs
    • Takes advantage of:
      • Long vector length
      • Higher core count
      • Improved Memory Bandwidth

Applying for Access

Those who would like to be considered for Ruby access should send the following in an email to OSC Help:

  • Name
  • Project ID
  • Plan for using Ruby
  • Evidence of workload being well suited for Ruby