checkgpu Command

Introduction

checkgpu is a command developed at OSC for use on OSC’s systems. It reports the usage and availability of OSC’s GPU nodes.

Availability

Owens   Pitzer
X       X

Usage

checkgpu takes the following options and parameters (viewable by passing -h/--help):

$ checkgpu -h
usage: checkgpu.py [-h] [-j] [-r] [-q] [-n] [-u] [-a] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -j, --jobs     Check status of gpu jobs
  -r, --run      Check status of running gpu jobs (use with --jobs)
  -q, --queued   Check status of queued gpu jobs (use with --jobs)
  -n, --node     Check status of gpu nodes
  -u, --used     Check status of used gpu nodes (use with --node)
  -a, --avail    Check status of available gpu nodes (use with --node)
  -v, --verbose  Display full job report
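
The help text notes that some flags are meant to be combined with others (-r and -q with --jobs, -u and -a with --node). The sketch below mirrors the documented options with argparse and flags combinations that the help text marks as incomplete. It is a hypothetical reconstruction for use in a wrapper script, not the actual checkgpu source; the function names build_parser and check_combination are assumptions, and how checkgpu itself reacts to such combinations is not documented here.

```python
import argparse


def build_parser():
    # Hypothetical mirror of the options listed in the help text above;
    # this is NOT the real checkgpu implementation.
    parser = argparse.ArgumentParser(prog="checkgpu.py")
    parser.add_argument("-j", "--jobs", action="store_true",
                        help="Check status of gpu jobs")
    parser.add_argument("-r", "--run", action="store_true",
                        help="Check status of running gpu jobs (use with --jobs)")
    parser.add_argument("-q", "--queued", action="store_true",
                        help="Check status of queued gpu jobs (use with --jobs)")
    parser.add_argument("-n", "--node", action="store_true",
                        help="Check status of gpu nodes")
    parser.add_argument("-u", "--used", action="store_true",
                        help="Check status of used gpu nodes (use with --node)")
    parser.add_argument("-a", "--avail", action="store_true",
                        help="Check status of available gpu nodes (use with --node)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Display full job report")
    return parser


def check_combination(args):
    """Return warnings for flags given without the flag they are documented to accompany."""
    warnings = []
    if (args.run or args.queued) and not args.jobs:
        warnings.append("--run/--queued are documented for use with --jobs")
    if (args.used or args.avail) and not args.node:
        warnings.append("--used/--avail are documented for use with --node")
    return warnings
```

For example, check_combination(build_parser().parse_args(["-a"])) returns one warning, because -a was given without -n.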

View information on GPU jobs

With the -j/--jobs flag, checkgpu reports summary counts of running and queued GPU jobs, along with the number of GPU nodes they use or request:

$ checkgpu -j
*** SUMMARY ***
===================================================
Summary of Running GPU Jobs:

152 GPU jobs running using 152 GPU nodes. Among them:
 * 146 GPU jobs running using 146 GPU nodes from non-condo and non-debugging jobs
 * 0 GPU jobs running using 0 GPU nodes from debugging jobs
 * 6 GPU Jobs running using 6 GPU nodes from PCONXXXX
===================================================
Summary of Queued GPU Jobs:

485 GPU Jobs queued requesting 492 GPU nodes. Among them:
 * 382 GPU Jobs queued requesting 389 GPU nodes from non-condo and non-debugging jobs
 * 0 GPU Jobs queued requesting 0 debug GPU nodes
 * 103 GPU Jobs queued requesting 103 GPU nodes from PCONXXXX
===================================================
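
For scripting, the two total lines of this report can be pulled out with a regular expression. The following is a minimal sketch that assumes the output keeps the wording shown in the sample above; the name parse_job_totals and the SAMPLE text are illustrative and not part of checkgpu.

```python
import re

# Matches the unindented total lines, for example
# "152 GPU jobs running using 152 GPU nodes. Among them:" and
# "485 GPU Jobs queued requesting 492 GPU nodes. Among them:".
# The indented " * ..." breakdown lines do not match because of the ^ anchor.
TOTAL_RE = re.compile(
    r"^(\d+) GPU jobs (running using|queued requesting) (\d+) GPU nodes\. Among them:",
    re.IGNORECASE | re.MULTILINE,
)


def parse_job_totals(report):
    """Return {"running": (jobs, nodes), "queued": (jobs, nodes)} for the totals found."""
    totals = {}
    for jobs, verb, nodes in TOTAL_RE.findall(report):
        state = "running" if verb.lower().startswith("running") else "queued"
        totals[state] = (int(jobs), int(nodes))
    return totals


# Excerpt of the sample report above, used only to exercise the parser.
SAMPLE = """\
Summary of Running GPU Jobs:

152 GPU jobs running using 152 GPU nodes. Among them:
 * 146 GPU jobs running using 146 GPU nodes from non-condo and non-debugging jobs
===================================================
Summary of Queued GPU Jobs:

485 GPU Jobs queued requesting 492 GPU nodes. Among them:
 * 382 GPU Jobs queued requesting 389 GPU nodes from non-condo and non-debugging jobs
"""

totals = parse_job_totals(SAMPLE)
```

On a system where checkgpu is installed, the report text could be captured with subprocess.run(["checkgpu", "-j"], capture_output=True, text=True).stdout and passed to the same function.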

View information on GPU jobs (run-only)

With the -r/--run flag in tandem with -j, checkgpu reports the summary for running GPU jobs only:

$ checkgpu -j --run
===================================================
Summary of Running GPU Jobs:
152 GPU jobs running using 152 GPU nodes. Among them:
 * 146 GPU jobs running using 146 GPU nodes from non-condo and non-debugging jobs
 * 0 GPU jobs running using 0 GPU nodes from debugging jobs
 * 6 GPU Jobs running using 6 GPU nodes from PCONXXXX

View information on GPU jobs (queue-only)

With the -q/--queued flag in tandem with -j, checkgpu reports the summary for queued GPU jobs only:

$ checkgpu -j --queued
===================================================
Summary of Queued GPU jobs:
484 GPU Jobs queued requesting 491 GPU nodes. Among them:
 * 382 GPU Jobs queued requesting 389 GPU nodes from non-condo and non-debugging jobs
 * 0 GPU Jobs queued requesting 0 debug GPU nodes
 * 102 GPU Jobs queued requesting 102 GPU nodes from PCONXXXX

View information on GPU nodes

With the -n/--node flag, checkgpu reports summaries of both the used and the available GPU nodes:

$ checkgpu --node
***USED NODES***
================================================================
================================================================
Summary of the current GPU nodes:
153 total jobs on gpu nodes
 * 153 GPU jobs using 153 GPU nodes. Among them:
  ** 147 GPU nodes used by non-condo and non-debug jobs
  ** 0 GPU nodes used by debug jobs
  ** 6 GPU nodes used by condo group PCONXXXX
 * 0 non-GPU jobs using 0 GPU nodes. Among them:
  ** 0 GPU nodes used by non-condo and non-debug non-gpu jobs
  ** 0 GPU nodes used by debug non-gpu jobs
  ** 0 GPU nodes used by non-gpu condo jobs

***AVAILABLE NODES***
================================================================
================================================================
Summary of the available GPU nodes:
144 available gpu nodes. Among them: 
 * 4 fully available gpu nodes.
 * 140 partly available gpu nodes.
 * 0 open gpus on partly available nodes.

View information on GPU nodes (used-only)

With the -u/--used flag in tandem with -n, checkgpu reports the summary for used GPU nodes only:

$ checkgpu -n -u
================================================================
Summary of the current GPU nodes:
152 total jobs on gpu nodes
 * 152 GPU jobs using 152 GPU nodes. Among them:
  ** 147 GPU nodes used by non-condo and non-debug jobs
  ** 0 GPU nodes used by debug jobs
  ** 5 GPU nodes used by condo group PCONXXXX
 * 0 non-GPU jobs using 0 GPU nodes. Among them:
  ** 0 GPU nodes used by non-condo and non-debug non-gpu jobs
  ** 0 GPU nodes used by debug non-gpu jobs
  ** 0 GPU nodes used by non-gpu condo jobs

View information on GPU nodes (available-only)

With the -a/--avail flag in tandem with -n, checkgpu reports the summary for available GPU nodes only:

$ checkgpu -n -a
================================================================
Summary of the available GPU nodes:
144 available gpu nodes. Among them: 
 * 4 fully available gpu nodes.
 * 140 partly available gpu nodes.
 * 0 open gpus on partly available nodes.
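
A script that decides whether to submit a GPU job now could read the counts from this report. The sketch below assumes the line wording shown in the sample above; parse_available and its key names are hypothetical helpers, not part of checkgpu.

```python
import re

# One pattern per line of the available-node summary. The wording is taken
# from the sample output above and may differ between checkgpu versions.
PATTERNS = {
    "available": r"(\d+) available gpu nodes\. Among them:",
    "fully_available": r"(\d+) fully available gpu nodes",
    "partly_available": r"(\d+) partly available gpu nodes",
    "open_gpus": r"(\d+) open gpus on partly available nodes",
}


def parse_available(report):
    """Return the counts found in an available-node summary as a dict of ints."""
    counts = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, report)
        if match:
            counts[key] = int(match.group(1))
    return counts


# The sample report above, used only to exercise the parser.
SAMPLE = """\
Summary of the available GPU nodes:
144 available gpu nodes. Among them:
 * 4 fully available gpu nodes.
 * 140 partly available gpu nodes.
 * 0 open gpus on partly available nodes.
"""

counts = parse_available(SAMPLE)
has_free_node = counts.get("fully_available", 0) > 0
```

In the sample, the fully and partly available counts (4 and 140) add up to the 144 available nodes, which gives a simple consistency check on the parsed values.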

View full job summary

With the -v/--verbose flag in tandem with any of the above options, checkgpu prints the full report: the individual jobs and the GPU nodes they run on, followed by the summary:

$ checkgpu --jobs --run -v
============================================
Job Details:
List of non-condo and non-debug GPU jobs
XXXXXXX.owens-batch.ten.osc.edu
o0678-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0673-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0695-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0805-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0763-gpu/0
...

List of debug GPU jobs

List of GPU jobs from PCON0005
XXXXXXX.owens-batch.ten.osc.edu
o0732-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0779-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0745-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0712-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0663-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0750-gpu/0

===================================================
Summary of Running GPU Jobs:
153 GPU jobs running using 153 GPU nodes. Among them:
 * 147 GPU jobs running using 147 GPU nodes from non-condo and non-debugging jobs
 * 0 GPU jobs running using 0 GPU nodes from debugging jobs
 * 6 GPU Jobs running using 6 GPU nodes from PCON0005
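
In the verbose listing above, each job ID line is followed by a line naming its node, grouped under "List of ..." headings. The sketch below collects these pairs per heading. It assumes that exact alternation and treats a line of "=" characters after the lists as the start of the summary; parse_job_details is a hypothetical helper, and the job IDs stay masked as in the sample.

```python
def parse_job_details(report):
    """Group (job_id, node) pairs under the "List of ..." heading they follow."""
    groups = {}
    current = None      # heading currently being filled
    pending_job = None  # job ID waiting for its node line
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("List of "):
            current = line
            groups[current] = []
            pending_job = None
        elif current is None or not line or line == "...":
            continue  # preamble, blank lines, and the elision marker
        elif line.startswith("="):
            break  # the summary section follows the job lists
        elif pending_job is None:
            pending_job = line
        else:
            groups[current].append((pending_job, line))
            pending_job = None
    return groups


# Shortened excerpt of the verbose sample above, with job IDs still masked.
SAMPLE = """\
============================================
Job Details:
List of non-condo and non-debug GPU jobs
XXXXXXX.owens-batch.ten.osc.edu
o0678-gpu/0
XXXXXXX.owens-batch.ten.osc.edu
o0673-gpu/0
...

List of debug GPU jobs

List of GPU jobs from PCON0005
XXXXXXX.owens-batch.ten.osc.edu
o0732-gpu/0

===================================================
Summary of Running GPU Jobs:
"""

details = parse_job_details(SAMPLE)
```

The result maps each heading to a list of pairs, so an empty group such as the debug jobs in the sample appears as an empty list.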