Spark

Apache Spark is an open-source cluster-computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop's disk-based analytics paradigm, Spark uses multi-stage in-memory analytics. Spark can run programs up to 100x faster than Hadoop's MapReduce when data fits in memory, or 10x faster on disk. Spark supports applications written in Python, Java, Scala, and R.
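The in-memory model is easiest to see with caching: once a dataset is cached, later actions reuse it from memory instead of recomputing it. Below is a minimal PySpark sketch, not part of any OSC example, illustrating the idea; with real data the RDD would typically come from sc.textFile instead of parallelize.

from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Build an RDD (with real data this would be sc.textFile(...)).
nums = sc.parallelize(range(1000000))

# Mark the RDD for in-memory caching; it is materialized on first use.
nums.cache()

# The first action computes the RDD and populates the cache...
print(nums.count())
# ...later actions reuse the in-memory copy instead of recomputing.
print(nums.sum())

sc.stop()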

Availability and Restrictions

Versions

The following versions of Spark are available on OSC systems: 

VERSION    OAKLEY    OWENS
1.5.2      X
1.6.1      X
2.0.0      X*        X*
2.1.0                X
2.3.0                X

* Current default version

You can use module spider spark to view available modules for a given machine. Feel free to contact OSC Help if you need other versions for your work.

Access

Spark is available to all OSC users without restriction.

Publisher/Vendor/Repository and License Type

The Apache Software Foundation, open source (Apache License 2.0)

Usage

Set-up

To configure your environment for Spark, run the following command:

module load spark

A particular version of Spark can be loaded as follows:

module load spark/2.3.0

Using Spark

In order to run Spark in batch, reference the example batch script below. This script requests 6 nodes on the Owens cluster for 1 hour of walltime, then runs the PySpark script called test.py using the pbs-spark-submit command.

#PBS -N Spark-example
#PBS -l nodes=6:ppn=28
#PBS -l walltime=01:00:00

# Load the Spark environment
module load spark

# Copy the PySpark script to local scratch space and run from there
cd $PBS_O_WORKDIR
cp test.py $TMPDIR
cd $TMPDIR

# Launch the Spark cluster on the allocated nodes and run the script
pbs-spark-submit test.py > test.log

# Copy results back to the submission directory
cp * $PBS_O_WORKDIR
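The contents of test.py are not shown on this page; a minimal PySpark script along the lines of the following sketch would work with the batch script above (the word list is a made-up example, and output printed here ends up in test.log):

from pyspark import SparkContext

sc = SparkContext(appName="test")

# Build a small RDD and compute word counts in parallel.
words = sc.parallelize(["spark", "runs", "on", "osc", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# collect() gathers the results back to the driver for printing.
for word, count in counts.collect():
    print(word, count)

sc.stop()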

The pbs-spark-submit script is used for submitting Spark jobs into the PBS queue. For more options, please run:

pbs-spark-submit --help

Running Spark interactively in batch

To run Spark interactively in a batch session on Owens, request an interactive job with the following command:

qsub -I -l nodes=4:ppn=28 -l walltime=01:00:00

When your interactive shell is ready, launch the Spark cluster using the pbs-spark-submit script:

pbs-spark-submit

You can then launch pyspark by connecting to the Spark master node as follows:

pyspark --master spark://nodename.ten.osc.edu:7070
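The nodename portion above is a placeholder for the actual master node reported by pbs-spark-submit. Inside the pyspark shell a SparkContext is already available as sc, so a quick sanity check like the following sketch confirms the cluster is working:

# sc is pre-created by the pyspark shell.
rdd = sc.parallelize(range(1000))

# Should print 499500 if the cluster is healthy.
print(rdd.sum())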

Launching Jupyter+Spark on OSC OnDemand

 

Instructions on how to launch Spark on the OSC OnDemand web interface are available here: https://www.osc.edu/content/launching_jupyter_spark_app

Example Jobs

Please check the /usr/local/src/spark/2.0.0/test.osc folder for more examples of PySpark scripts, job submission scripts, and output files.
