TensorFlow

"TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code."

Quote from the TensorFlow GitHub documentation

Availability and Restrictions

Versions

The following versions of TensorFlow are available on OSC clusters:

Version  Owens  Pitzer  Note                 CUDA version compatibility
1.3.0    X              python/3.6           8 or later
1.9.0    X*     X*      python/3.6-conda5.2  9 or later
2.0.0    X      X       python/3.7-2019.10   10.0 or later

TensorFlow is a Python package and therefore requires loading the corresponding python module (see the Note column above). The installed version of TensorFlow may change with updates to Anaconda Python on Owens; you can check the current version with conda list tensorflow. The available versions of TensorFlow on Owens and Pitzer require CUDA for GPU calculations. You can find and load a compatible cuda module via

module load python/3.6-conda5.2
module spider cuda
module load cuda/9.2.88
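
As a quick sanity check (assuming the python and cuda modules above are loaded and you are on a GPU-enabled node), you can print the TensorFlow version and confirm that a GPU is visible:

python -c "import tensorflow as tf; print(tf.__version__); print(tf.test.is_gpu_available())"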

If you would like to use a different version of TensorFlow, please follow the installation guide below, which describes how to install Python packages locally.

https://www.osc.edu/resources/getting_started/howto/howto_install_tensorflow_locally

Newer versions of TensorFlow may require a newer version of CUDA. Please refer to https://www.tensorflow.org/install/source#gpu for an up-to-date compatibility chart.

Feel free to contact OSC Help if you have any issues with installation.

Access 

TensorFlow is available to all OSC users. If you have any questions, please contact OSC Help.

Publisher/Vendor/Repository and License Type

https://www.tensorflow.org, Open source

Usage on Owens

Setup on Owens

The TensorFlow package is installed as part of the Anaconda Python distribution. To configure the Owens cluster for the use of TensorFlow, use the following commands:

module load python/3.6 cuda/8.0.44

Batch Usage on Owens

Batch jobs can request multiple nodes/cores and compute time up to the limits of the OSC systems. Refer to Queues and Reservations for Owens, and Scheduling Policies and Limits for more info.  In particular, TensorFlow should be run on a GPU-enabled compute node.

An Example of Using TensorFlow with the MNIST Dataset and Logistic Regression

Below is an example batch script (job.txt and logistic_regression_on_mnist.py) for using TensorFlow.

Contents of job.txt

#!/bin/bash
#SBATCH --job-name ExampleJob
#SBATCH --nodes=1 --ntasks-per-node=28 --gpus-per-node=1
#SBATCH --time=01:00:00
 

cd $SLURM_SUBMIT_DIR

module load python/3.6 cuda/8.0.44
python logistic_regression_on_mnist.py

Contents of logistic_regression_on_mnist.py

# logistic_regression_on_mnist.py Python script based on:
# https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/0_Prerequisite/mnist_dataset_intro.ipynb
# https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/2_BasicModels/logistic_regression.ipynb

import tensorflow as tf

# Import MNIST
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("data/", one_hot=True)

# Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1

# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784]) # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 digits recognition => 10 classes

# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax

# Minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), axis=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Fit training using batch data
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs,
                                                          y: batch_ys})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print ("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost))

    print ("Optimization Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy for 3000 examples
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print ("Accuracy:", accuracy.eval({x: mnist.test.images[:3000], y: mnist.test.labels[:3000]}))

To run it via the batch system, submit the job.txt file with the following command:

sbatch job.txt
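
If you want to verify that the computation is actually placed on the GPU, TensorFlow 1.x can log the device chosen for each operation. Below is a minimal, self-contained sketch (a hypothetical check_gpu_placement.py, not part of the example above) that you could run the same way:

# check_gpu_placement.py -- hypothetical helper script, not part of the example above
import tensorflow as tf

# log_device_placement=True makes TensorFlow print the device
# (e.g. /device:GPU:0) selected for every operation at run time.
config = tf.ConfigProto(log_device_placement=True)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    print(sess.run(c))

The device placement messages appear in the job's output file; if only CPU devices are listed, check that the job requested a GPU and that a compatible cuda module is loaded.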

 

Distributed TensorFlow

TensorFlow can be configured to run in parallel across multiple GPUs and nodes using the Horovod package from Uber.
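
A minimal sketch of the Horovod setup for the TensorFlow 1.x API is shown below; it assumes the horovod package is installed in your Python environment, and names such as opt are placeholders rather than part of OSC's installation:

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod: one process is started per GPU
hvd.init()

# Pin each process to a single local GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so that gradients are averaged across all processes
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 to all other processes
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

Such a script is typically launched with one MPI rank per GPU (for example, mpiexec python train.py inside the batch script); see the Horovod documentation for complete examples.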

Further Reading

TensorFlow homepage

 