Spark

Apache Spark is an open source cluster-computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop's disk-based analytics paradigm, Spark supports multi-stage in-memory analytics. Spark can run programs up to 100x faster than Hadoop's MapReduce in memory, or 10x faster on disk. Spark supports applications written in Python, Java, Scala, and R.

Availability & Restrictions

Spark is available to all OSC users without restriction.

The following versions of Spark are available on OSC systems: 

VERSION    OAKLEY    OWENS
1.5.2      X
1.6.1      X
2.0.0*     X         X

NOTE: * means it is the default version.

Set-up

To configure your environment for Spark, run the following command:

module load spark

To access a particular version of Spark, run the following command:

module load spark/2.0.0

Using Spark

To run Spark in batch, use the example batch script below as a reference. This script requests 6 nodes on the Owens cluster for 1 hour of walltime. It submits the PySpark script test.py to the PBS queue using the pbs-spark-submit command.

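#PBS -N Spark-example
#PBS -l nodes=6:ppn=28
#PBS -l walltime=01:00:00

# Set up the Spark environment (default version)
module load spark

# Stage the PySpark script into the job's temporary directory
cd $PBS_O_WORKDIR
cp test.py $TMPDIR
cd $TMPDIR

# Run the PySpark script and capture its output in test.log
pbs-spark-submit test.py > test.log

# Copy everything back to the directory the job was submitted from
cp * $PBS_O_WORKDIR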

The pbs-spark-submit script is used to submit Spark jobs to the PBS queue. For more options, run:

pbs-spark-submit --help
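
The batch script above expects a file named test.py, which is not shown here. As a minimal sketch of what such a script could look like, the hypothetical example below estimates pi by Monte Carlo sampling. The function names, application name, partition count, and sample size are illustrative assumptions, not taken from the OSC examples. It uses only the basic SparkContext API, so it should work with each of the Spark versions listed above.

```python
"""test.py: hypothetical PySpark job that estimates pi by Monte Carlo sampling."""
import random


def count_hits(seed, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # separate, reproducible random stream per task
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(sc, partitions=24, samples=100000):
    """Run one sampling task per partition and add up the counts."""
    hits = (sc.parallelize(range(partitions), partitions)
              .map(lambda seed: count_hits(seed, samples))
              .sum())
    return 4.0 * hits / (partitions * samples)


if __name__ == "__main__":
    try:
        # pyspark is only importable once the spark module has been loaded
        from pyspark import SparkContext
    except ImportError:
        print("pyspark not found: load the spark module and use pbs-spark-submit")
    else:
        sc = SparkContext(appName="pi-estimate")  # application name is arbitrary
        print("Pi is roughly %f" % estimate_pi(sc))
        sc.stop()
```

With the batch script above, the printed estimate would end up in test.log, which is then copied back to the submission directory.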

Running Spark interactively in batch

To run Spark interactively within a batch job on Owens, run the following command:

qsub -I -l nodes=4:ppn=28 -l walltime=01:00:00

When your interactive shell is ready, launch the Spark cluster using the pbs-spark-submit script:

pbs-spark-submit

You can then launch the PySpark interface as follows, where nodename stands for the name of the node hosting the Spark master:

pyspark --master spark://nodename.ten.osc.edu:7070
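
Once the pyspark prompt appears, a quick way to check that the cluster responds is to run a small distributed computation. The sketch below is one such check. Note that sum_of_squares is a hypothetical helper, not something Spark or OSC provides. It relies on the SparkContext that the pyspark shell creates automatically under the name sc.

```python
def sum_of_squares(sc, n):
    """Distribute the integers 1..n across the cluster and add up their squares."""
    numbers = sc.parallelize(range(1, n + 1))
    return numbers.map(lambda x: x * x).sum()


# At the pyspark prompt, where `sc` already exists:
#   >>> sum_of_squares(sc, 1000)
#   333833500
```

The result can be checked against the closed form n(n+1)(2n+1)/6, which gives 333833500 for n = 1000.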

Example Jobs

See the /usr/local/src/spark/2.0.0/test.osc folder for more examples of PySpark scripts, job submission scripts, and output files.
