Apache Spark is an open-source cluster-computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, where it remains today. In contrast to Hadoop's disk-based analytics paradigm, Spark uses multi-stage in-memory analytics. Spark can run programs up to 100x faster than Hadoop's MapReduce when data fits in memory, or 10x faster on disk. Spark supports applications written in Python, Java, Scala, and R.
Availability and Restrictions
Versions
The following versions of Spark are available on OSC systems:
Version | Owens | Pitzer | Note
---|---|---|---
2.0.0 | X* | | Only supports Python 3.5
2.1.0 | X | | Only supports Python 3.5
2.3.0 | X | |
2.4.0 | X | X* |
2.4.5 | X | X |

\* Current default version
You can use module spider spark
to view available modules for a given machine. Feel free to contact OSC Help if you need other versions for your work.
Access
Spark is available to all OSC users. If you have any questions, please contact OSC Help.
Publisher/Vendor/Repository and License Type
The Apache Software Foundation, Open source
Usage
Set-up
In order to configure your environment for the usage of Spark, run the following command:
module load spark
A particular version of Spark can be loaded as follows
module load spark/2.3.0
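After loading the module, you can confirm which version is active. This assumes the module places Spark's command-line tools on your PATH; --version is a standard flag of the spark-submit tool that ships with Apache Spark.

```bash
spark-submit --version
```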
Using Spark
In order to run Spark in batch, reference the example batch script below. This script requests 2 nodes on the Owens cluster for 1 hour of walltime. The script submits the PySpark script called test.py using the pbs-spark-submit command.
```bash
#!/bin/bash
#SBATCH --job-name ExampleJob
#SBATCH --nodes=2 --ntasks-per-node=48
#SBATCH --time=01:00:00
#SBATCH --account your_project_id

module load spark

# Copy the PySpark script to the node-local temporary directory
cp test.py $TMPDIR
cd $TMPDIR

# Launch the Spark cluster and run the script, capturing its output
pbs-spark-submit test.py > test.log

# Copy results back to the directory the job was submitted from
cp * $SLURM_SUBMIT_DIR
```
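The batch script above assumes a PySpark script named test.py in the submission directory. As a minimal sketch of what such a script might contain (illustrative only; the actual test.py in the examples directory may differ):

```python
# test.py -- a minimal, hypothetical PySpark example.
from pyspark import SparkContext

sc = SparkContext(appName="ExampleJob")

# Distribute the numbers 0..99 across the cluster and sum their squares.
rdd = sc.parallelize(range(100))
total = rdd.map(lambda x: x * x).sum()

print("Sum of squares of 0..99:", total)

sc.stop()
```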
The pbs-spark-submit script is used for submitting Spark jobs. For more options, please run:
pbs-spark-submit --help
Running Spark interactively in batch
To run Spark interactively within a batch job on Owens, request an interactive session with the following command:
sinteractive -N 2 -n 28 -t 01:00:00
When your interactive shell is ready, launch the Spark cluster using the pbs-spark-submit script:
pbs-spark-submit
You can then launch the PySpark shell by connecting to the Spark master node as follows, replacing nodename with the name of the node running the Spark master:
pyspark --master spark://nodename.ten.osc.edu:7070
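Once the shell connects, the SparkContext is available as sc. As a quick sanity check (a sketch, not part of the original documentation) that work is being distributed across the cluster:

```python
>>> sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()  # count the even numbers in 0..999
500
```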
Launching Jupyter+Spark on OSC OnDemand
Instructions on how to launch the Jupyter+Spark app on the OSC OnDemand web interface are available here: https://www.osc.edu/content/launching_jupyter_spark_app
Example Jobs
Please check the /usr/local/src/spark/2.0.0/test.osc directory for more examples of PySpark scripts, job submission scripts, and output files.