Hadoop

A Hadoop cluster can be launched within the HPC environment, managed by the PBS job scheduler, using the MyHadoop framework developed by the San Diego Supercomputer Center. (Please see http://www.sdsc.edu/~allans/MyHadoop.pdf)

Availability and Restrictions

Versions

The following versions of Hadoop are available on OSC systems: 

VERSION        OWENS
3.0.0-alpha1   X*

* Current default version

You can use module spider hadoop to view available modules for a given machine. Feel free to contact OSC Help if you need other versions for your work.

Access

Hadoop is available to all OSC users. If you have any questions, please contact OSC Help.

Publisher/Vendor/Repository and License Type

Apache software foundation, Open source

Usage

Set-up

To configure your environment for Hadoop, run the following command:

module load hadoop

To access a particular version of Hadoop, run the following command:

module load hadoop/3.0.0-alpha1
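After loading a module, you can confirm what is on your path — `hadoop version` and `which` are standard commands; the module name below follows the version table above:

```shell
# Load a specific Hadoop module and confirm the environment
module load hadoop/3.0.0-alpha1

# Print the Hadoop release that is now active, and where its launcher lives
hadoop version
which hadoop
```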

Using Hadoop

To run Hadoop in batch, reference the example batch script below. This script requests 6 nodes on the Owens cluster for 1 hour of walltime.

#PBS -N hadoop-example
#PBS -l nodes=6:ppn=12
#PBS -l walltime=01:00:00

setenv WORK $PBS_O_WORKDIR
module load hadoop/3.0.0-alpha1
module load myhadoop/v0.40
setenv HADOOP_CONF_DIR $TMPDIR/mycluster-conf-$PBS_JOBID

cd $TMPDIR
myhadoop-configure.sh -c $HADOOP_CONF_DIR -s $TMPDIR

$HADOOP_HOME/sbin/start-dfs.sh
hdfs dfsadmin -report

hdfs dfs -mkdir data
hdfs dfs -put $HADOOP_HOME/README.txt data/
hdfs dfs -ls data

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-alpha1.jar wordcount data/README.txt wordcount-out

hdfs dfs -ls wordcount-out
hdfs dfs -copyToLocal -f wordcount-out $WORK

$HADOOP_HOME/sbin/stop-dfs.sh
myhadoop-cleanup.sh
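The wordcount step in the script writes one word and its count per line to wordcount-out. As a purely local illustration of that output shape (no Hadoop involved; the file path is made up), standard shell tools produce the same result:

```shell
# Create a tiny sample input file (hypothetical path, for illustration only)
printf 'hello world\nhello hadoop\n' > /tmp/wc-sample.txt

# Split into one word per line, count occurrences of each word,
# and print "word<TAB>count" in the same shape as the wordcount output
tr -s ' ' '\n' < /tmp/wc-sample.txt | sort | uniq -c | awk '{printf "%s\t%d\n", $2, $1}'
```

For the sample above this prints one line each for hadoop, hello, and world, with hello counted twice.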

Example Jobs

Please check the /usr/local/src/hadoop/3.0.0-alpha1/test.osc directory for more examples of Hadoop jobs.
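A batch script like the one above is submitted to the PBS scheduler with qsub; the script file name here is hypothetical:

```shell
# Submit the batch script to PBS (hypothetical file name)
qsub hadoop-example.pbs

# Check the status of your queued and running jobs
qstat -u $USER
```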
