R and RStudio

R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either on-screen or on hardcopy, and
  • a well-developed, simple, and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities.

More information can be found here.

Availability and Restrictions

Versions

The following versions of R are available on OSC systems: 

Version                  Owens   Pitzer
3.3.2                    X
3.4.0                    X
3.4.2                    X
3.5.0#                   X*
3.5.1                    X
3.5.2                            X*
3.6.0 or 3.6.0-gnu7.3    X       X
3.6.1 or 3.6.1-gnu9.1    X
3.6.3 or 3.6.3-gnu9.1    X       X
4.0.2 or 4.0.2-gnu9.1    X       X

 

* Current default version. R/3.5.0 is available for both intel/16 and intel/18, but the R packages installed under each may differ. R/3.6.0 and later versions are compiled with gnu and mkl. Loading an R/3.6.X module requires its dependencies to be loaded first, whereas R/3.6.X-gnuY modules automatically load the required dependencies.

You can use module avail R to view available modules and module spider R/version to show how to load the module for a given machine. Feel free to contact OSC Help if you need other versions for your work.

Access

R is available to all OSC users. If you have any questions, please contact OSC Help.

Publisher/Vendor/Repository and License Type

R Foundation, Open source

Usage

R can be launched in two ways: through RStudio on OSC OnDemand or through the terminal.

RStudio

To access RStudio and the OSC R workshop materials, please visit here.

Terminal Access

In order to configure your environment for R, run the following command:

module load R/version
#for example,
module load R/3.6.3-gnu9.1

R/3.6.0 and later versions use the gnu compiler and Intel MKL libraries for better performance. Loading an R/3.6.X module requires its dependencies to be loaded first (module spider R/version lists them), whereas R/3.6.X-gnuY modules automatically load the required dependencies.

Using R

Once your environment is configured, R can be started simply by entering the following command:

R

For a listing of command line options, run:

R --help

Running R interactively on a login node for extended computations is not recommended and may violate OSC usage policy. Users can either request compute nodes to run R interactively or run R in batch.

Running R interactively in a terminal:

Request a compute node (or several nodes, if running parallel R) as follows:

qsub -I -l nodes=1:ppn=28 -A your_project_id -l walltime=01:00:00

When the compute node is ready, load the module and launch R:

module load R/3.6.3-gnu9.1
R

Batch Usage

Refer to the example batch script below. This script requests one full node on the Owens cluster for 1 hour of wall time.

#PBS -N R_ExampleJob
#PBS -l nodes=1:ppn=28
#PBS -l walltime=01:00:00
#PBS -A your_project_id

module load R/3.6.3-gnu9.1

cd $PBS_O_WORKDIR
cp in.dat test.R $TMPDIR
cd $TMPDIR

R CMD BATCH test.R test.Rout

cp test.Rout $PBS_O_WORKDIR

HOWTO: Install Local R Packages

R comes with a single library, $R_HOME/library, which contains the standard and recommended packages. This is usually in a system location. On Owens, $R_HOME is /usr/local/R/gnu/9.1/3.6.3/lib64/R for R/3.6.3. OSC also installs popular R packages into a site library located at /usr/local/R/gnu/9.1/3.6.3/site/pkgs for R/3.6.3 on Owens.

Users can check the library path as follows after launching an R session:

> .libPaths()
[1] "/users/PZS0680/soottikkal/R/x86_64-pc-linux-gnu-library/3.6"
[2] "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"                       
[3] "/usr/local/R/gnu/9.1/3.6.3/lib64/R/library"

Users can check the list of available packages as follows:

>installed.packages()
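The matrix returned by installed.packages() can also be filtered to see which library a given package comes from. The following is a minimal sketch using only base R; the choice of the stats package (which ships with R) is purely illustrative.

```r
# installed.packages() returns a character matrix with one row per installed package.
pkgs <- installed.packages()

# Count how many packages each library on the search path provides.
print(table(pkgs[, "LibPath"]))

# Look up the version and location of a single package.
stats_info <- pkgs[pkgs[, "Package"] == "stats", c("Version", "LibPath"), drop = FALSE]
print(stats_info)
```

Replacing "stats" with another package name shows whether that package is already provided by the base or site library before you install a local copy.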

To install local R packages, use the install.packages() command. For example:

>install.packages("lattice")

The first time you install a package locally, R gives a warning such as the following:

Installing package into ‘/usr/local/R/gnu/9.1/3.6.3/site/pkgs’
(as ‘lib’ is unspecified)
Warning in install.packages("lattice") :
 'lib = "/usr/local/R/gnu/9.1/3.6.3/site/pkgs"' is not writable
Would you like to use a personal library instead? (yes/No/cancel)

Answer yes, and R will create the personal library directory and install the package there.
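In a batch job there is no prompt to answer, so it can help to create the personal library ahead of time. The sketch below assumes the R_LIBS_USER environment variable holds a single personal library path, which is R's usual default; the temporary-directory fallback is only there for illustration, and the install.packages() call is left commented out because it needs network access.

```r
# Personal library path; R_LIBS_USER is R's standard setting for it.
user_lib <- Sys.getenv("R_LIBS_USER")
if (!nzchar(user_lib)) {
  # Fallback used only for illustration when R_LIBS_USER is unset.
  user_lib <- file.path(tempdir(), "Rlib")
}

# Create the directory if it does not exist yet and put it first on the library path.
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(user_lib, .libPaths()))

# With the directory in place, an explicit lib argument avoids the interactive question:
# install.packages("lattice", lib = user_lib)
print(.libPaths()[1])
```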

Installing Packages from GitHub

Users can install R packages directly from GitHub using the devtools package as follows:

>install.packages("devtools")
>devtools::install_github("author/package")

Installing Packages from Bioconductor

Users can install R packages directly from Bioconductor using BiocManager.

>install.packages("BiocManager")
>BiocManager::install(c("GenomicRanges", "Organism.dplyr"))

renv: Package Manager

If you are using R for multiple projects, OSC recommends renv, an R dependency manager, for R package management. Please see more information here.

The renv package helps you create reproducible environments for your R projects. Use renv to make your R projects more:

  • Isolated: Each project gets its own library of R packages, so you can feel free to upgrade and change package versions in one project without worrying about breaking your other projects.

  • Portable: Because renv captures the state of your R packages within a lockfile, you can more easily share and collaborate on projects with others, and ensure that everyone is working from a common base.

  • Reproducible: Use renv::snapshot() to save the state of your R library to the lockfile renv.lock. You can later use renv::restore() to restore your R library exactly as specified in the lockfile.

Users can install the renv package as follows:

>install.packages("renv")

The core essence of the renv workflow is fairly simple:

1. After launching R, go to your project directory using the R command setwd() and initialize renv:

>setwd("your/project/path")
>renv::init()

This function forks the state of your default R libraries into a project-local library. A project-local .Rprofile is created (or amended), which is then used by new R sessions to automatically initialize renv and ensure the project-local library is used. 

Work in your project as usual, installing and upgrading R packages as required as your project evolves.

2. Use renv::snapshot() to save the state of your project library. The project state will be serialized into a file called renv.lock under your project path.

3. Use renv::restore() to restore your project library from the state of your previously-created lockfile renv.lock.

In short: use renv::init() to initialize your project library, and use renv::snapshot() / renv::restore() to save and load the state of your library.

After your project has been initialized, you can work within the project as before, but without fear that installing or upgrading packages could affect other projects on your system.

Global Cache

One of renv’s primary features is the use of a global package cache, which is shared across all projects using renv. When using renv, the packages from various projects are installed to the global cache. The individual project library is instead formed as a directory of symlinks into the renv global package cache. Hence, while each renv project is isolated from other projects on your system, they can still re-use the same installed packages as required. By default, the global cache of renv is located at ~/.local/share/renv. Users can change the global cache location using the RENV_PATHS_CACHE variable. Please see more information here.
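As a minimal sketch of relocating the cache, the variable can be set from within R before renv is loaded. The cache directory used here is a hypothetical placeholder under the session's temporary directory; in practice you would point it at a directory in your own project space, and the query at the end only runs if renv is installed.

```r
# Hypothetical cache location; replace with a directory of your choice.
cache_dir <- file.path(tempdir(), "renv-cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Set the variable before renv is loaded so that renv picks it up.
Sys.setenv(RENV_PATHS_CACHE = cache_dir)

# If renv is installed, ask it which cache directory it will use.
if (requireNamespace("renv", quietly = TRUE)) {
  print(renv::paths$cache())
}
```

A setting made with Sys.setenv() lasts only for the current session; placing the same variable in an R startup file such as .Renviron makes it persistent.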

Please note that renv does not load packages from the site location (add-on packages installed by OSC) into the R session. Users will have access to the base R packages only when using renv. All other packages required for the project should be installed by the user.

Version Control with renv

If you would like to version control your project, you can track the renv.lock file with git. First, initialize git in your project directory from a terminal:

git init

Continue working on your R project by launching R, installing packages, and saving a snapshot using the renv::snapshot() command. Please note that renv::snapshot() will only save packages that are used in the current project. To capture all packages within the active R libraries in the lockfile, please see the type option.

>renv::snapshot(type="simple") 

If you’re using a version control system with your project, then as you call renv::snapshot() and later commit new lockfiles to your repository, you may find it necessary later to recover older versions of your lockfiles. renv provides the functions renv::history() to list previous revisions of your lockfile, and renv::revert() to recover these older lockfiles.

If you are using the renv package for the first time, it is recommended that you check the R startup files in your $HOME, such as .Rprofile and .Renviron, and remove any project-specific settings from these files. Please also make sure you do not have any project-specific settings in ~/.R/Makevars.

A Simple Example

First, load the module for R and start an R session:

module load R/3.6.3-gnu9.1
R

Then set the working directory and initiate renv

>setwd("your/project/path")
>renv::init()

Let's install a package called lattice and save the snapshot to renv.lock:

> renv::install("lattice")
> renv::snapshot(type="simple") 

The lattice package will be installed in the global cache of renv, and a symlink to it will be saved in the renv directory under the project path.

Restore a Project

Use renv::restore() to restore a project's dependencies from a lockfile, as previously generated by snapshot(). Let's remove the lattice package.

> renv::remove("lattice")

Now let's restore the project from the previously saved snapshot so that the lattice package is restored.

> renv::restore()
>library(lattice)

Collaborating with renv

When using renv, the packages used in your project will be recorded into a lockfile, renv.lock. Because renv.lock records the exact versions of R packages used within a project, if you share that file with your collaborators, they will be able to use renv::restore() to install exactly the same R packages as recorded in the lockfile. Please find more information here.

Parallel R

R provides a number of methods for parallel processing of the code. Multiple cores and nodes available on OSC clusters can be effectively deployed to run many computations in R faster through parallelism.

Consider this example, where we use a function that will generate values sampled from a normal distribution and sum the vector of those results; every call to the function is a separate simulation.

myProc <- function(size=1000000) {
  # Load a large vector
  vec <- rnorm(size)
  # Now sum the vec values
  return(sum(vec))
}

Serial execution with loop

Let’s first create a serial version of the R code that runs myProc() 100 times on Owens:

tick <- proc.time()
for(i in 1:100) {
  myProc()
}
tock <- proc.time() - tick
tock
##    user  system elapsed 
##   6.437   0.199   6.637

Here, we execute each trial sequentially, utilizing only one of the 28 processors on this machine. To apply parallelism, we need to create multiple tasks that can be dispatched to different cores. The apply() family of R functions lets us create such tasks. We can rewrite the above code to use lapply(), which applies a function to each member of a list (in this case, the trials we want to run):

tick <- proc.time()
result <- lapply(1:100, function(i) myProc())
tock <-proc.time() - tick
tock
##    user  system elapsed 
##   6.346   0.152   6.498
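To confirm that the loop and the apply-style rewrite compute the same thing, both can be run from the same random seed. This is a minimal sketch, not part of the original timing runs: it uses a smaller vector size and fewer trials to keep the check fast, the seed value is arbitrary, and vapply() stands in for lapply() so that the result is a plain numeric vector.

```r
# Same simulation as in the text, with a smaller default size for a quick check.
myProc <- function(size = 1000) {
  sum(rnorm(size))
}
n_trials <- 10

# Loop version, storing each result.
set.seed(2020)
loop_result <- numeric(n_trials)
for (i in seq_len(n_trials)) {
  loop_result[i] <- myProc()
}

# Apply-style version, started from the same seed.
set.seed(2020)
apply_result <- vapply(seq_len(n_trials), function(i) myProc(), numeric(1))

# Both versions draw the same random numbers in the same order.
same_results <- identical(loop_result, apply_result)
print(same_results)
```

Because both versions are serial and consume the random number stream in the same order, the comparison should print TRUE; the rewrite changes how the tasks are expressed, not what they compute.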

parallel package

The parallel library can be used to dispatch tasks to different cores. The parallel::mclapply function distributes the tasks across multiple processors.

library(parallel)
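# system("nproc") alone returns the exit status; intern = TRUE captures the core count
cores <- as.integer(system("nproc", intern = TRUE))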
tick <- proc.time()
result <- mclapply(1:100, function(i) myProc(), mc.cores=cores)
tock <- proc.time() - tick
tock
##    user  system elapsed 
##   8.653   0.457   0.382
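The elapsed times reported above can be turned into a rough speedup and parallel efficiency figure. The sketch below simply does the arithmetic on the two elapsed values shown in this section and assumes that the mclapply() run used all 28 cores of an Owens node; the variable names are illustrative.

```r
# Elapsed times (seconds) taken from the serial loop and mclapply runs in the text.
serial_elapsed <- 6.637
parallel_elapsed <- 0.382
cores <- 28  # assumption: the parallel run used a full 28-core node

# Speedup is the ratio of serial to parallel time; efficiency is speedup per core.
speedup <- serial_elapsed / parallel_elapsed
efficiency <- speedup / cores

print(round(c(speedup = speedup, efficiency = efficiency), 2))
```

With these numbers the speedup is a little over 17, or roughly 62% efficiency on 28 cores. Note also that the user time in the parallel run is larger than the elapsed time because it adds up the CPU time of all worker processes.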

foreach package

The foreach package provides a looping construct for executing R code repeatedly. It uses the sequential %do% operator to indicate an expression to run.

library(foreach)
tick <- proc.time()
result <-foreach(i=1:100) %do% {
   myProc()
}  
tock <- proc.time() - tick
tock
##    user  system elapsed 
##   6.420   0.018   6.439

doParallel package

foreach also provides a parallel operator, %dopar%, which runs on the backend registered by the doParallel package. This allows each iteration through the loop to use different cores.

library(doParallel, quietly = TRUE)
library(foreach)
cl <- makeCluster(28)
registerDoParallel(cl)
 
tick <- proc.time()
result <- foreach(i=1:100, .combine=c) %dopar% {
    myProc()
}
tock <- proc.time() - tick
tock
invisible(stopCluster(cl))
registerDoSEQ()  # re-register foreach's sequential backend once the cluster is stopped
##    user  system elapsed 
##   0.085   0.013   0.446
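Because every call to myProc() draws random numbers, results from a cluster of workers differ from run to run unless the workers' random number streams are seeded. The following is a minimal sketch of one way to do that. It uses the parallel package's own parLapply() and clusterSetRNGStream() rather than foreach, only two workers, and a small problem size; run_once and the seed value are hypothetical names and choices for illustration.

```r
library(parallel)

# Same simulation as in the text; size is passed explicitly by parLapply below.
myProc <- function(size = 1000) {
  sum(rnorm(size))
}

# Start a small cluster, seed the workers' RNG streams, and run eight trials.
run_once <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, rep(1000, 8), myProc))
}

first_run <- run_once(123)
second_run <- run_once(123)

# With the same seed and the same number of workers, the results repeat.
reproducible <- identical(first_run, second_run)
print(reproducible)
```

The results repeat only when the seed, the number of workers, and the way tasks are split across them stay the same.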

Using Rmpi package

The Rmpi package allows you to parallelize R code across multiple nodes. Rmpi provides the interface necessary to use MPI for parallel computing in R. This allows each iteration through the loop to use different cores on different nodes. Rmpi jobs currently cannot be run with RStudio at OSC; instead, users can submit Rmpi jobs through the terminal. R/3.6.3 uses openmpi as its MPI interface.

The example code above can be rewritten to utilize multiple nodes with Rmpi as follows:

library(Rmpi)
library(snow)
workers <- as.numeric(Sys.getenv(c("PBS_NP")))-1
cl <- makeCluster(workers, type="MPI") # MPI tasks to use 
clusterExport(cl, list('myProc'))
tick <- proc.time()
result <- clusterApply(cl, 1:100, function(i) myProc())
write.table(result, file = "foo.csv", sep = ",")
tock <- proc.time() - tick
tock

The batch script for job submission is as follows:

#!/bin/bash
#PBS -l walltime=10:00
#PBS -l nodes=2:ppn=28
#PBS -j oe

cd $PBS_O_WORKDIR

module load R/3.6.3-gnu9.1 openmpi/1.10.7

# parallel R: submit job with one MPI master
mpirun -np 1 R --slave < Rmpi.R

Profiling R code

Profiling R code helps you identify bottlenecks and improve performance. There are a number of tools that can be used to profile R code.

Grafana: 

OSC jobs can be monitored for CPU and memory usage using Grafana. If your job is running, you can get Grafana metrics as follows. After logging in to OSC OnDemand, select Jobs from the top tabs, then select Active Jobs, and then select the job that you want to profile. You will see Grafana metrics at the bottom of the page, and you can click on detailed metrics to access more information about your job in Grafana.


Rprof:

R’s built-in Rprof function can be used to profile R expressions, and the summaryRprof function summarizes the result. More information can be found here.

Here is an example of profiling R code with Rprof for data analysis on the faithful dataset.

Rprof("Rprof-out.prof",memory.profiling=TRUE, line.profiling=TRUE)
  data(faithful)
  summary(faithful)
  plot(faithful)
Rprof(NULL)

To analyze the profiled data, run summaryRprof on Rprof-out.prof:

summaryRprof("Rprof-out.prof")
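The object returned by summaryRprof is a list whose by.self and by.total tables rank functions by time spent. The sketch below profiles a deliberately heavier toy workload than the faithful example, so that the sampling profiler collects enough samples; the workload and the temporary output file are illustrative assumptions.

```r
# Write the profile to a temporary file for this illustration.
prof_file <- tempfile(fileext = ".prof")

Rprof(prof_file)
# Toy workload: repeatedly generate, sort, and sum random vectors.
x <- replicate(100, sum(sort(rnorm(1e5))))
Rprof(NULL)

prof <- summaryRprof(prof_file)

# Functions ranked by time spent in their own code (excluding callees).
print(head(prof$by.self, 5))

# Total time covered by the profile, in seconds.
print(prof$sampling.time)
```

Very short-running code may finish before the profiler takes a sample, in which case summaryRprof has nothing to summarize; profiling a longer run avoids this.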

You can read more about summaryRprof here.

Profvis:

profvis provides an interactive graphical interface for visualizing data from Rprof.

library(profvis)
profvis({
    data(faithful)
    summary(faithful)
    plot(faithful)
},prof_output="profvis-out.prof")

If you are running the R code in RStudio, it will automatically open the visualization for the profiled data. More info can be found here.

Using RStudio for classroom

OSC provides an isolated and custom R environment for each classroom project that requires RStudio. More information can be found here.
