Our HOWTO collection contains short tutorials that help you step through some of the common (but potentially confusing) tasks users may need to accomplish, that do not quite rise to the level of requiring more structured training materials. Items here may explain a procedure to follow, or present a "best practices" formula that we think may be helpful.
The XDMoD tool at xdmod.osc.edu can be used to get an overview of how closely the requested time of your jobs matches their actual elapsed time.
One way of specifying a time request is:
#SBATCH --time=xx:xx:xx
The elapsed time is how long the job ran before completing. It can be obtained using the sacct command:
$ sacct -u <username> --format=jobid,account,elapsed
It is important to understand that the requested time is used when scheduling a submitted job. If a job requests a time that is much more than the expected elapsed time, then it may take longer to start because the resources need to be allocated for the time that the job requests even if the job only uses a small portion of that requested time.
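For a completed job, you can compare the two directly, since sacct can report both the requested time limit and the elapsed time; for example (the job ID is a placeholder):

sacct -j <jobid> --format=jobid,timelimit,elapsed

If the Elapsed column is consistently far below the Timelimit column, consider lowering the requested time in future submissions.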
This allows one to view the requested time accuracy for an individual job, but XDMoD can be used to do this for all jobs submitted over a time range.
First, log in to xdmod.osc.edu; see this page for more instructions:
https://www.osc.edu/supercomputing/knowledge-base/xdmod_tool
Then, navigate to the Metric Explorer tab.
Look for the Metric Catalog on the left side of the page and expand the SUPREMM options. Select Wall Hours: Requested: Per Job and group by None.
This will now show the average time requested.
The actual time data can be added by navigating to Add Data -> SUPREMM -> Wall Hours: Per Job.
This will open a new window titled Data Series Definition, where you can change some parameters before showing the new data. To easily distinguish between elapsed and requested time, change the Display Type to Bar, then click Add to view the new data.
Now there is a line showing the average requested time of jobs, and bars depicting the average elapsed time of jobs. Essentially, the closer the bar is to the line, without intersecting it, the more accurate the time prediction. If the bar intersects the line, it may indicate that not enough time was requested for a job to complete, but remember that these values are averages.
One can also view more detailed information about these jobs by clicking a data point and using the Show raw data option.
This page outlines ways to generate and view performance data for your program using tools available at OSC.
This section describes how to use performance tools from Intel. Make sure that you have an Intel module loaded to use these tools.
Intel VTune is a tool to generate profile data for your application. Generating profile data with Intel VTune typically involves three steps:
You need executables with debugging information to view source code line detail: re-compile your code with a -g option added among the other appropriate compiler options. For example:
mpicc wave.c -o wave -g -O3
Profiles are normally generated in a batch job. To generate a VTune profile for an MPI program:
mpiexec <mpi args> amplxe-cl <vtune args> <program> <program args>
where <mpi args> represents arguments to be passed to mpiexec, <program> is the executable to be run, <vtune args> represents arguments to be passed to the VTune executable amplxe-cl, and <program args> represents arguments passed to your program.
For example, if you normally run your program with mpiexec -n 12 wave_c, you would use
mpiexec -n 12 amplxe-cl -collect hotspots -result-dir r001hs wave_c
To profile a non-MPI program:
amplxe-cl <vtune args> <program> <program args>
As a result of this step, a subdirectory that contains the profile data files is created in your current directory. The subdirectory name is based on the -result-dir argument and the node id, for example, r001hs.o0674.ten.osc.edu.
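As a reference, a minimal batch script sketch for generating the profile in a job might look like the following (the project ID, walltime, and module name are placeholders; load whichever module provides amplxe-cl on your cluster):

#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --nodes=1
#SBATCH --time=00:30:00
module load intel   # assumption: an Intel module that provides amplxe-cl
mpiexec -n 12 amplxe-cl -collect hotspots -result-dir r001hs wave_c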
3. Analyze your profile data.
You can open the profile data using the VTune GUI in interactive mode. For example:
amplxe-gui r001hs.o0674.ten.osc.edu
One should use an OnDemand VDI (Virtual Desktop Interface) or have X11 forwarding enabled (see Setting up X Windows). Note that X11 forwarding can be distractingly slow for interactive applications.
Intel Trace Analyzer and Collector (ITAC) is a tool to generate trace data for your application. Generating trace data with Intel ITAC typically involves three steps:
You need to compile your executable with the -tcollect option added among the other appropriate compiler options to insert instrumentation probes calling the ITAC API. For example:
mpicc wave.c -o wave -tcollect -O3
mpiexec -trace <mpi args> <program> <program args>
For example, if you normally run your program with mpiexec -n 12 wave_c, you would use
mpiexec -trace -n 12 wave_c
As a result of this step, .anc, .f, .msg, .dcl, .stf, and .proc files will be generated in your current directory.
You will need to use traceanalyzer to view the trace data. To open Trace Analyzer:
traceanalyzer /path/to/<stf file>
where the base name of the .stf file will be the name of your executable.
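For example, if your executable was wave_c and the trace files were written to your current directory, the command might be:

traceanalyzer ./wave_c.stf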
One should use an OnDemand VDI (Virtual Desktop Interface) or have X11 forwarding enabled (see Setting up X Windows) to view the trace data. Note that X11 forwarding can be distractingly slow for interactive applications.
Intel's Application Performance Snapshot (APS) is a tool that provides a summary of your application's performance. Profiling HPC software with Intel APS typically involves four steps:
Regular executables can be profiled with Intel APS, but source code line detail will not be available. You need executables with debugging information to view source code line detail: re-compile your code with a -g option added among the other appropriate compiler options. For example:
mpicc wave.c -o wave -g -O3
Profiles are normally generated in a batch job. To generate profile data for an MPI program:
mpiexec <mpi args> aps <program> <program args>
where <mpi args> represents arguments to be passed to mpiexec, <program> is the executable to be run, and <program args> represents arguments passed to your program.
For example, if you normally run your program with mpiexec -n 12 wave_c, you would use
mpiexec -n 12 aps wave_c
To profile a non-MPI program:
aps <program> <program args>
The profile data is saved in a subdirectory in your current directory. The directory name is based on the date and time, for example, aps_result_YYYYMMDD/.
To generate the html profile file from the result subdirectory:
aps --report=./aps_result_YYYYMMDD
to create the file aps_report_YYYYMMDD_HHMMSS.html.
You can open the profile data file using a web browser on your local desktop computer. This option typically offers the best performance.
This section describes how to use performance tools from ARM.
Instructions for how to use MAP are available here.
Instructions for how to use DDT are available here.
Instructions for how to use Performance Reports are available here.
This section describes how to use other performance tools.
Rice University's HPC Toolkit is a collection of performance tools. Instructions for how to use it at OSC are available here.
TAU Commander is a user interface for the University of Oregon's TAU Performance System. Instructions for how to use it at OSC are available here.
While our Python installations come with many popular packages installed, you may come upon a case in which you need an additional package that is not installed. If the specific package you are looking for is available from anaconda.org (formerly binstar.org), you can easily install it and its required dependencies by using the conda package manager.
The following steps are an example of how to set up a Python environment and install packages to a local directory using conda. We use the name local for the environment, but you may use any other name.
We provide python and miniconda3 modules, both based on the Conda package manager. The python modules are typically recommended when you use Python in a standard environment that we provide. However, if you want to create your own Python environment, we recommend the miniconda3 module, since you can start with a minimal configuration.
module load miniconda3
Three alternative create commands are listed. These cover the most common cases.
The following will create a minimal Python installation without any extraneous packages:
conda create -n local
If you want to clone the full base Python environment from the system, you may use the following create command:
conda create -n local --clone base
You can augment the command above by listing specific packages you would like installed into the environment. For example, the following will create a minimal Python installation with only the specified packages (in this case, numpy and babel):
conda create -n local numpy babel
By default, conda will install the newest versions of the packages it can find. Specific versions can be specified by adding =<version> after the package name. For example, the following will create a Python installation with Python version 2.7 and NumPy version 1.16:
conda create -n local python=2.7 numpy=1.16
By default, conda will create the environment in your home location $HOME. To specify a location where the local environment is created, for example, in the project space /fs/ess/ProjectID, you can use the following command:
conda create --prefix /fs/ess/ProjectID/local
To activate the environment, use the command:
source activate /fs/ess/ProjectID/local
To verify that a clone has been created, use the command
conda info -e
For additional conda command documentation see https://docs.conda.io/projects/conda/en/latest/commands.html#conda-general-commands
Before the created environment can be used, it must be activated.
For the bash shell:
source activate local
At the end of the conda create step, you may see a message from the installer suggesting that you use the conda activate command to activate the environment. Please don't use the conda activate command, because it will try to update your shell configuration file and may cause other issues. Instead, please use the source activate command as we suggest above.
If you previously ran conda init to enable the conda activate command, your shell configuration file, such as .bashrc, will have been altered with conda-specific lines. Upon activation of your environment using source activate, you may notice that the source activate/deactivate commands cease to function. However, we will be updating the miniconda3 modules by May 15th 2024 to ensure that conda activate no longer alters the .bashrc file. Consequently, you can safely remove the conda-related lines between # >>> conda initialize >>> and # <<< conda initialize <<< from your .bashrc file and continue using the conda activate command.

On newer versions of Anaconda on the Owens cluster you may also need to remove the following packages before trying to install your specific packages:
conda remove conda-build
conda remove conda-env
To install additional packages, use the conda install command. For example, to install the yt package:
conda install yt
By default, conda will install the newest version of the package that it can find. Specific versions can be specified by adding =<version> after the package name. For example, to install version 1.16 of the NumPy package:
conda install numpy=1.16
If you need to install packages with pip, then you can install pip in your virtual environment by
conda install pip
Then, you can install packages with pip as
pip install PACKAGE
Please make sure that you use the pip installed in your environment, not the one from the miniconda module. The pip from the miniconda module will give your environment access to the packages from the module, which may or may not be desired. Also, set export PYTHONNOUSERSITE=True to prevent packages from the user's .local path from being picked up.
Now we will test our installed Python package by loading it in Python and checking its location to ensure we are using the correct version. For example, to test that NumPy is installed correctly, run
python -c "from __future__ import print_function; import numpy; print(numpy.__file__)"
and verify that the output generally matches
$HOME/.conda/envs/local/lib/python3.6/site-packages/numpy/__init__.py
To test installations of other packages, replace all instances of numpy with the name of the package you installed.
Remember, you will need to load the proper version of Python before you go to use your newly installed package. Packages are only installed to one version of Python.
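For example, a minimal batch script sketch that uses the environment created above (the project ID and script name are placeholders):

#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --nodes=1
#SBATCH --time=00:10:00
module load miniconda3
source activate local
python my_script.py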
If the method using conda above is not working, or if you prefer, you can consider installing Python packages from the source. Please read HOWTO: install your own Python packages.
See the comparison of these package management tools here:
https://docs.conda.io/projects/conda/en/latest/commands.html#conda-vs-pip-vs-virtualenv-commands
pip installations are supported:
module load python
module list                            # check which python you just loaded
pip install --user --upgrade PACKAGE   # where PACKAGE is a valid package name
Note the default installation prefix is set to the system path, where OSC users cannot install packages. With the option --user, the prefix is set to $HOME/.local, where lib, bin, and other top-level folders for the installed packages are placed. Finally, the option --upgrade will upgrade the existing packages to the newest available version.
The one issue with this approach is portability with multiple Python modules. If you plan to stick with a single Python module, then this should not be an issue. However, if you commonly switch between different Python versions, then be aware of the potential trouble in using the same installation location for all Python versions.
Typically, you can install packages with the methods shown in the Install packages section above, but in some cases where the conda package installations have no source from conda channels or have dependency issues, you may consider using pip in an isolated Python virtual environment.
To create an isolated virtual environment:
module reset
python3 -m venv --without-pip $HOME/venv/mytest --prompt "local"
source $HOME/venv/mytest/bin/activate
(local) curl https://bootstrap.pypa.io/get-pip.py | python   # get the newest version of pip
(local) deactivate
where we use the path $HOME/venv/mytest and the name local for the environment, but you may use any other path and name.
To activate and deactivate the virtual environment:
source $HOME/venv/mytest/bin/activate
(local) deactivate
To install packages:
source $HOME/venv/mytest/bin/activate
(local) pip install PACKAGE
You don't need the --user option within the virtual environment.
Conda Test Drive: https://conda.io/docs/test-drive.html
This documentation describes how to install the tensorflow package locally in your $HOME space. For more details on TensorFlow, see the software page.
Load python module
module load miniconda3/4.10.3-py37
If you need to install tensorflow versions not already provided or would like to use tensorflow in a conda environment proceed with the tutorial below.
First we will create a conda environment into which we will later install tensorflow. See HOWTO: Create and Manage Python Environments for details on how to create and set up your environment.
Make sure you activate your environment before proceeding:
source activate MY_ENV
Install the latest version of tensorflow.
conda install tensorflow
You can see all available versions for download on conda with conda search tensorflow
There is also a GPU-compatible version called tensorflow-gpu
If there are errors on this step you will need to resolve them before continuing.
Now we will test the tensorflow package by loading it in Python and checking its location to ensure we are using the correct version.
python -c "import tensorflow;print (tensorflow.__file__)"
Output:
$HOME/.conda/envs/MY_ENV/lib/python3.9/site-packages/tensorflow/__init__.py
Remember, you will need to load the proper version of python before you go to use your newly installed package. Packages are only installed to one version of python.
Please refer to HOWTO: Use GPU with Tensorflow and PyTorch if you would like to use TensorFlow with GPUs.
While we provide a number of Python packages, you may need a package we do not provide. If it is a commonly used package or one that is particularly difficult to compile, you can contact OSC Help for assistance. We also have provided an example below showing how to build and install your own Python packages and make them available inside of Python. These instructions use "bash" shell syntax, which is our default shell. If you are using something else (csh, tcsh, etc), some of the syntax may be different.
Please consider using conda Python package manager before you try to build Python using the method explained here. We have instructions on conda here.
First, you need to collect what you need in order to perform the installation. We will do all of our work in $HOME/local/src. You should make this directory now.
mkdir -p $HOME/local/src
Next, we will need to download the source code for the package we want to install. In our example, we will use NumExpr. (NumExpr is already available through conda, so it is recommended you use conda to install it: tutorial here. The following steps are simply an example of the procedure you would follow to perform an installation of software unavailable in conda or pip.) You can either download the file to your desktop and then upload it to OSC, or directly download it using the wget utility (if you know the URL for the file).
cd ~/local/src
wget https://github.com/pydata/numexpr/releases/download/v2.8.4/numexpr-2.8.4.tar.gz
Next, extract the downloaded file. In this case, since it's a "tar.gz" format, we can use tar to decompress and extract the contents.
tar xvfz numexpr-2.8.4.tar.gz
You can delete the downloaded archive now or keep it should you want to start the installation from scratch.
To build the package, we will want to first create a temporary environment variable to aid in installation. We'll call it INSTALL_DIR.
export INSTALL_DIR=${HOME}/local/numexpr/2.8.4
We are roughly following the convention we use at the system level. This allows us to easily install new versions of software without risking breaking anything that uses older versions. We have specified a folder for the program (numexpr), and for the version (2.8.4). To be consistent with Python installations, we will create a second temporary environment variable that will contain the actual installation location.
export TREE=${INSTALL_DIR}/lib/python3.6/site-packages
Next, make the directory tree.
mkdir -p $TREE
To compile the package, we should switch to the GNU compilers. The system installation of Python was compiled with the GNU compilers, and this will help avoid any unnecessary complications. We will also load the Python package, if it hasn't already been loaded.
module swap intel gnu
module load python/3.6-conda5.2
Next, build it. This step may vary a bit, depending on the package you are compiling. You can execute python setup.py --help to see what options are available. Since we are overriding the install path to one that we can write to and that fits our management plan, we need to use the --prefix option.
The NumExpr build also requires us to set the PYTHONPATH variable before building:
export PYTHONPATH=$PYTHONPATH:~/local/numexpr/2.8.4/lib/python3.6/site-packages
Find the setup.py file:
cd numexpr-2.8.4
Now to build:
python setup.py install --prefix=$INSTALL_DIR
At this point, the package is compiled and installed in ~/local/numexpr/2.8.4/lib/python3.6/site-packages. Occasionally, some files will be installed in ~/local/numexpr/2.8.4/bin as well. To ensure Python can locate these files, we need to modify our environment.
The most immediate way -- but the one that must be repeated every time you wish to use the package -- is to manually modify your environment. If files are installed in the "bin" directory, you'll need to add it to your path. As before, these examples are for bash, and may have to be modified for other shells. Also, you will have to modify the directories to match your install location.
export PATH=$PATH:~/local/numexpr/2.8.4/bin
And for the Python libraries:
export PYTHONPATH=$PYTHONPATH:~/local/numexpr/2.8.4/lib/python3.6/site-packages
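After exporting these variables, you can quickly verify that Python picks up the local install (using the NumExpr example from above):

python -c "import numexpr; print(numexpr.__file__)"

The printed path should point under ~/local/numexpr/2.8.4/lib/python3.6/site-packages.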
We don't recommend this option, as it is less flexible and can cause conflicts with system software. But if you want, you can modify your .bashrc (or similar file, depending on your shell) to set these environment variables automatically. Be extra careful; making a mistake in .bashrc (or similar) can destroy your login environment in a way that will require a system administrator to fix. To do this, you can copy the lines above modifying $PATH and $PYTHONPATH into .bashrc. Remember to test them interactively first. If you destroy your shell interactively, the fix is as simple as logging out and then logging back in. If you break your login environment, you'll have to get our help to fix it.
This is the most complicated option, but it is also the most flexible, as you can have multiple versions of this particular software installed and specify at run-time which one to use. This is incredibly useful if a major feature changes that would break old code, for example. You can see our tutorial on writing modules here, but the important variables to modify are, again, $PATH and $PYTHONPATH. You should specify the complete path to your home directory here and not rely on any shortcuts like ~ or $HOME. Below is a modulefile written in Lua:
If you are following the tutorial on writing modules, you will want to place this file in $HOME/local/share/lmodfiles/numexpr/2.8.4.lua:
-- This is a Lua modulefile; this file 2.8.4.lua can be located anywhere,
-- but if you are following a local modulefile location convention, we place them in
-- $HOME/local/share/lmodfiles/
-- For numexpr we place it in $HOME/local/share/lmodfiles/numexpr/2.8.4.lua

-- This finds your home directory
local homedir = os.getenv("HOME")

prepend_path("PYTHONPATH", pathJoin(homedir, "local/numexpr/2.8.4/lib/python3.6/site-packages"))
prepend_path("PATH", pathJoin(homedir, "local/numexpr/2.8.4/bin"))
Once your module is created (again, see the guide), you can use your Python package simply by loading the software module you created.
module use $HOME/local/share/lmodfiles/
module load numexpr/2.8.4
To begin, you first need to create a new conda environment or use an already existing one. See HOWTO: Create Python Environment for more details. In this example we are using python/3.6-conda5.2.

Once you have a conda environment created and activated, install tensorflow-gpu into the environment (in this example we will be using version 2.4.1 of tensorflow-gpu):
conda install tensorflow-gpu=2.4.1
Now that we have the environment set up, we can check whether tensorflow can access the GPUs.
To test GPU access, we will submit the following job onto a compute node with a GPU:
#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=1
module load python/3.6-conda5.2 cuda/11.8.0
source activate tensorflow_env
# run either of the following commands
python << EOF
import tensorflow as tf
print(tf.test.is_built_with_cuda())
EOF
python << EOF
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
EOF
You will know tensorflow is able to successfully access the GPU if tf.test.is_built_with_cuda() returns True and device_lib.list_local_devices() returns an object with /device:GPU:0 as a listed device.
At this point tensorflow-gpu should be set up to utilize a GPU for its computations.
A GPU can provide significant performance improvements to many machine learning models. Here is an example Python script demonstrating the performance improvements. It was run in the same environment created in the above section.
from timeit import default_timer as timer
import tensorflow as tf
from tensorflow import keras
import numpy as np

(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()

# scaling image values between 0-1
X_train_scaled = X_train/255
X_test_scaled = X_test/255

# one hot encoding labels
y_train_encoded = keras.utils.to_categorical(y_train, num_classes = 10)
y_test_encoded = keras.utils.to_categorical(y_test, num_classes = 10)

def get_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32,32,3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        keras.layers.Dense(10, activation='sigmoid')
    ])
    model.compile(optimizer='SGD',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# GPU
with tf.device('/GPU:0'):
    start = timer()
    model_gpu = get_model()
    model_gpu.fit(X_train_scaled, y_train_encoded, epochs = 1)
    end = timer()
print("GPU time: ", end - start)

# CPU
with tf.device('/CPU:0'):
    start = timer()
    model_cpu = get_model()
    model_cpu.fit(X_train_scaled, y_train_encoded, epochs = 1)
    end = timer()
print("CPU time: ", end - start)
Example code sampled from here
The above code was then submitted in a job with the following script:
#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=1
module load python/3.6-conda5.2 cuda/11.8.0
source activate tensorflow_env
python tensorflow_example.py
As we can see from the output, the GPU provided a significant performance improvement.
GPU time:  3.7491355929996644
CPU time:  78.8043485119997
If you would like to use a GPU for your tensorflow project in a Jupyter notebook, follow the commands below to set up your environment.
To begin, you first need to create a new conda environment or use an already existing one. See HOWTO: Create Python Environment for more details. In this example we are using python/3.6-conda5.2.

Once you have a conda environment created and activated, install tensorflow-gpu into the environment (in this example we will be using version 2.4.1 of tensorflow-gpu):
conda install tensorflow-gpu=2.4.1
Now we will set up a Jupyter kernel. See HOWTO: Use a Conda/Virtual Environment With Jupyter for details on how to create a Jupyter kernel with your conda environment.
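The linked HOWTO has the authoritative steps; a minimal sketch of registering the environment as a kernel typically looks like this (the kernel and display names are placeholders, and it assumes the ipykernel package is installed in the environment):

source activate tensorflow_env
conda install ipykernel
python -m ipykernel install --user --name tensorflow_env --display-name "Python (tensorflow_env)"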
Once you have the kernel created, see the Usage section of the Python page for more details on accessing the Jupyter app from OnDemand.
Now you are all set up to use a GPU with tensorflow in a Jupyter notebook.
To begin, you first need to create a new conda environment or use an already existing one. See HOWTO: Create Python Environment for more details. In this example we are using python/3.6-conda5.2.

Once you have a conda environment created and activated, install pytorch into the environment (in this example we will be using version 1.3.1 of pytorch):
conda install pytorch=1.3.1
Now that we have the environment set up, we can check whether pytorch can access the GPUs.
To test GPU access, we will submit the following job onto a compute node with a GPU:
#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=1
ml python/3.6-conda5.2 cuda/11.8.0
source activate pytorch_env
python << EOF
import torch
print(torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
EOF
You will know pytorch is able to successfully access the GPU if torch.cuda.is_available() returns True and torch.device("cuda:0" if torch.cuda.is_available() else "cpu") returns cuda:0.
At this point PyTorch should be set up to utilize a GPU for its computations.
Here is an example PyTorch script demonstrating the performance improvements from GPUs:
import torch
from timeit import default_timer as timer

# check for cuda availability
print("Cuda: ", torch.cuda.is_available())
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Device: ", device)

# GPU
b = torch.ones(4000,4000).cuda()   # Create matrix on GPU memory
start_time = timer()
for _ in range(1000):
    b += b
elapsed_time = timer() - start_time
print('GPU time = ', elapsed_time)

# CPU
a = torch.ones(4000,4000)          # Create matrix on CPU memory
start_time = timer()
for _ in range(1000):
    a += a
elapsed_time = timer() - start_time
print('CPU time = ', elapsed_time)
The above code was then submitted in a job with the following script:
#!/bin/bash
#SBATCH --account <Project-Id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=1
ml python/3.6-conda5.2 cuda/11.8.0
source activate pytorch_env
python pytorch_example.py
As we can see from the output, the GPU provided a significant performance improvement.
GPU time =  0.0053490259997488465
CPU time =  4.232843188998231
If you would like to use a GPU for your PyTorch project in a Jupyter notebook, follow the commands below to set up your environment.
To begin, you first need to create a new conda environment or use an already existing one. See HOWTO: Create Python Environment for more details. In this example we are using python/3.6-conda5.2.

Once you have a conda environment created and activated, install pytorch into the environment (in this example we will be using version 1.3.1 of pytorch):
conda install pytorch=1.3.1
You may also need to install numba for PyTorch to access a GPU from the Jupyter notebook.
conda install numba=0.54.1
Now we will set up a Jupyter kernel. See HOWTO: Use a Conda/Virtual Environment With Jupyter for details on how to create a Jupyter kernel with your conda environment.
Once you have the kernel created, see the Usage section of the Python page for more details on accessing the Jupyter app from OnDemand.
Now you are all set up to use a GPU with PyTorch in a Jupyter notebook.
If you are using TensorFlow or PyTorch, you may also want to consider Horovod. Horovod will take single-GPU training scripts and scale them to train across many GPUs in parallel.
This article focuses on debugging strategies for C/C++ codes, but many are applicable to other languages as well.
This approach is a great starting point. Say you have written some code, and it does not do what you expect it to do. You have stared at it for a few minutes, but you cannot seem to spot the problem.
Try explaining what the problem is to a rubber duck. Then, walk the rubber duck through your code, line by line, telling it what it does. Don’t have a rubber duck? Any inanimate object will do (or even an animate one if you can grab a friend).
It sounds silly, but rubber duck debugging helps you to get out of your head, and hopefully look at your code from a new perspective. Saying what your code does (or is supposed to do) out loud has a good chance of revealing where your understanding might not be as good as you think it is.
You’ve written a whole bunch of new code. It takes some inputs, chugs along for a while, and then creates some outputs. Somewhere along this process, something goes wrong. You know this because the output is not at all what you expected. Unfortunately, you have no idea where things are going wrong in the code.
This might be a good time to try out printf() debugging. It’s as simple as its name implies: simply add (more) printf() statements to your code. You’ve likely seen this being used. It’s the name given to the infamous ‘printf(“here”);’ calls used to verify that a particular codepath is indeed taken.
Consider printing out arguments and return values to key functions. Or, the results or summary statistics from large calculations. These values can be used as “sanity checks” to ensure that up until that point in the code, everything is going as expected.
Assertion calls, such as "assert(...)", can also be used for a similar purpose. However, the positive feedback you get from print statements is often helpful when you're debugging. Seeing a valid result printed in standard out or a log file tells you positively that at least something is working correctly.
Debuggers are tools that can be used to interactively (or with scripts) debug your code. A fairly common debugger for C and C++ codes is gdb. Many guides exist online for using gdb with your code.
OSC systems also provide the ARM DDT debugger. This debugger is designed for use with HPC codes and is arguably easier to use than gdb. It can be used to debug MPI programs as well.
Debuggers allow you to interact with the program while it is running. You can do things like read and write variable values, or check to see if/when certain functions are called.
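As a quick, hedged sketch of getting started with gdb (the program and input file names are placeholders): compile with debugging symbols, then run the program under the debugger and print a backtrace if it crashes.

gcc -g -O0 myprog.c -o myprog        # build with debug info, no optimization
gdb -ex run -ex backtrace --args ./myprog input.txt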
Okay, this one isn’t exactly a debugging strategy. It’s a method to catch bugs early, and even prevent the addition of bugs. Writing a test suite for your code that’s easy to run (and ideally fast) lets you test new changes to ensure they don’t break existing functionality.
There are lots of different philosophies on testing software. Too many to cover here. Here’s two concepts that are worth looking into: unit testing and system testing.
The idea behind unit testing is writing tests for small “units” of code. These are often functions or classes. If you know that the small pieces that make up your code work, then you’ll have more confidence in the overall assembled program. There’s an added architecture benefit here too. Writing code that is testable in the first place often results in code that’s broken up into separate logical pieces (google “separation of concerns”). This makes your code more modular and less “spaghetti-like”. Your code will be easier to modify and understand.
The second concept – system testing – involves writing tests that run your entire program. These often take longer than unit tests, but have the added benefit that they’ll let you know whether or not your entire program still works after introducing a new change.
When writing tests (both system and unit tests), it’s often helpful to include a couple different inputs. Occasionally a program may work just fine for one input, but fail horribly with another input.
Maybe your code takes a couple hours (or longer…) to run. There’s a bug in it, but every time you try to fix it, you have to wait a few hours to see if the fix worked. This is driving you crazy.
A possible approach to make your life easier is to try to make a Minimal, Reproducible Example (see this stackoverflow page for information).
Try to extract just the code that fails, from your program, and also its inputs. Wrap this up into a separate program. This allows you to run just the code that failed, hopefully greatly reducing the time it takes to test out fixes to the problem.
Once you have this example, can you make it smaller? Maybe take out some code that’s not needed to reproduce the bug, or shrink the input even further? Doing this might help you solve the problem.
In December 2021 OSC updated its firewall to enhance security. As a result, SSH sessions are being closed more quickly than they used to be. It is very easy to modify your SSH options in the client you use to connect to OSC to keep your connection open.
In ~/.ssh/config (use the command touch ~/.ssh/config to create it if there is no existing one), you can set 3 options:
TCPKeepAlive=no
ServerAliveInterval=60
ServerAliveCountMax=5
Please refer to your SSH client documentation for how to set these options in your client.
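For OpenSSH-based clients, the same settings can also be scoped to OSC hosts only; a sketch of an equivalent ~/.ssh/config entry (the host pattern is an assumption, adjust it to the systems you connect to):

Host *.osc.edu
    TCPKeepAlive no
    ServerAliveInterval 60
    ServerAliveCountMax 5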
An eligible principal investigator (PI) heads a project account and can authorize/remove user accounts under the project account (please check our Allocations and Accounts documentation for more details). This document shows you how to identify users on a project account and check the status of each user.
If the project account (projectID) is known, the OSCgetent command will list all users on the project:
$ OSCgetent group projectID
The returned information is in the format of:
projectID:*:gid: list of user IDs
gid is the group identifier number unique for the project account projectID.
For example, the command OSCgetent group PZS0712 lists all users on the project account PZS0712 as below:
$ OSCgetent group PZS0712
PZS0712:*:5513:amarcum,guilfoos,hhamblin,kcahill,xwang
Multiple groups can also be queried at once.
For example, the command OSCgetent group PZS0712 PZS0726 lists all users on both PZS0712 and PZS0726:
PZS0712:*:5513:amarcum,guilfoos,hhamblin,kcahill,xwang
PZS0726:*:6129:amarcum,kkappel
Details on a project can also be obtained, along with the user list, using the OSCfinger command.
$ OSCfinger -g projectID
This returns:
Group: projectID
GID: XXXX
Status: 'active/restricted/etc'
Type: XX
Principal Investigator: 'PI email'
Admins: NA
Members: 'list of users'
Category: NA
Institution: 'affiliated institution'
Description: 'short description'
---
If the project account is not known, but the username is known, use the OSCfinger command to list all of the groups the user belongs to:
OSCfinger username
The returned information is in the format of:
Login: username
Name: First Last
Directory: home directory path
Shell: /bin/bash
E-mail: user's email address
Primary Group: user's primary project
Groups: list of projects and other groups user is in
Password Changed: date password was last changed
Password Expires: date password expires
Login Disabled: TRUE/FALSE
Password Expired: TRUE/FALSE
Current Logins: Displays if user is currently logged in and from where/when
For example, with the username amarcum, the command OSCfinger amarcum returns the information below:
$ OSCfinger amarcum
Login: amarcum
Name: Antonio Marcum
Directory: /users/PZS0712/amarcum
Shell: /bin/bash
E-mail: amarcum@osc.edu
Primary Group: PZS0712
Groups: sts,ruby,l2supprt,oscall,clntstf,oscstaff,clntall,PZS0712,PZS0726
Password Changed: May 12 2019 15:47 (calculated)
Password Expires: Aug 11 2019 12:05 AM
Login Disabled: FALSE
Password Expired: FALSE
Current Logins: On since Mar 07 2019 12:12 on pts/14 from pitzer-login01.hpc.osc.edu
----
If the project account or username is not known, use the OSCfinger command with the -e flag to get the user account based on the user's name.
Use the following command to list all of the user accounts associated with a First and Last name:
$ OSCfinger -e 'First Last'
For example, with the user's first name as Summer and last name as Wang, the command OSCfinger -e 'Summer Wang' returns the information below:
$ OSCfinger -e 'Summer Wang'
Login: xwang
Name: Summer Wang
Directory: /users/oscgen/xwang
Shell: /bin/bash
E-mail: xwang@osc.edu
Primary Group: PZS0712
Groups: amber,abaqus,GaussC,comsol,foampro,sts,awsmdev,awesim,ruby,matlab,aasheats,mars,ansysflu,wrigley,lgfuel,l2supprt,fsl,oscall,clntstf,oscstaff,singadm,clntall,dhgremot,fsurfer,PZS0530,PCON0003,PZS0680,PMIU0149,PZS0712,PAS1448
Password Changed: Jan 08 2019 11:41
Password Expires: Jul 08 2019 12:05 AM
Login Disabled: FALSE
Password Expired: FALSE
---
Once you know the user account username, follow the discussions in the previous section identify users on a project to get all user accounts on the project. Please contact OSC Help if you have any questions.
Use the OSCfinger command to check the status of a user account as below:
OSCfinger username
For example, if the username is xwang, the command OSCfinger xwang will return:
$ OSCfinger xwang
Login: xwang
Name: Summer Wang
Directory: /users/oscgen/xwang
Shell: /bin/bash
E-mail: xwang@osc.edu
Primary Group: PZS0712
Groups: amber,abaqus,GaussC,comsol,foampro,sts,awsmdev,awesim,ruby,matlab,aasheats,mars,ansysflu,wrigley,lgfuel,l2supprt,fsl,oscall,clntstf,oscstaff,singadm,clntall,dhgremot,fsurfer,PZS0530,PCON0003,PZS0680,PMIU0149,PZS0712,PAS1448
Password Changed: Jan 08 2019 11:41
Password Expires: Jul 08 2019 12:05 AM
Login Disabled: FALSE
Password Expired: FALSE
---
From this output you can check, for example:
- The user's home directory (Directory: /users/oscgen/xwang).
- The user's shell (Shell: /bin/bash). If the information is Shell: /access/denied, it means this user account has been either archived or restricted. Please contact OSC Help if you'd like to reactivate this user account.
- The user's email address (Mail forwarded to xwang@osc.edu). Please contact OSC Help if the email address associated with this user account has been changed, to ensure important notifications/messages/reminders from OSC may be received in a timely manner.

All users see their file system usage statistics when logging in, like so:
As of 2018-01-25T04:02:23.749853 userid userID on /users/projectID used XGB of quota 500GB and Y files of quota 1000000 files
The information comes from the file /users/reporting/storage/quota/*_quota.txt, which is updated twice a day. Some users may see multiple lines associated with a username, as well as information on the project space usage and quota of their primary project, if there is one. The usage and quota of a username's home directory are provided by the line that includes the file server your home directory is on (for more information, please visit Home Directories); other lines (generated due to file copies) can be safely ignored.
You can check any user's home directory or a project's project space usage and quota by running:
grep -h '<userID or projectID>' /users/reporting/storage/quota/*_quota.txt
Here is an example of project PZS0712:
$ grep -h PZS0712 /users/reporting/storage/quota/*_quota.txt
As of 2019-03-07T13:55:01.000000 project/group PZS0712 on /fs/project used 262 GiB of quota 2048 GiB and 166987 files of quota 200000 files
As of 2019-03-07T13:55:01.000000 userid xwang on /fs/project/PZS0712 used 0 GiB of quota 0 GiB and 21 files of quota 0 files
As of 2019-03-07T13:55:01.000000 userid dheisterberg on /fs/project/PZS0712 used 262 GiB of quota 0 GiB and 166961 files of quota 0 files
As of 2019-03-07T13:55:01.000000 userid amarcum on /fs/project/PZS0712 used 0 GiB of quota 0 GiB and 2 files of quota 0 files
As of 2019-03-07T13:55:01.000000 userid root on /fs/project/PZS0712 used 0 GiB of quota 0 GiB and 2 files of quota 0 files
As of 2019-03-07T13:55:01.000000 userid guilfoos on /fs/project/PZS0712 used 0 GiB of quota 0 GiB and 1 files of quota 0 files
As of 2019-03-07T13:51:23.000000 userid amarcum on /users/PZS0712 used 399.86 MiB of quota 500 GiB and 8710 files of quota 1000000 files
Here is an example for username amarcum:
$ grep -h amarcum /users/reporting/storage/quota/*_quota.txt
As of 2019-03-07T13:55:01.000000 userid amarcum on /fs/project/PZS0712 used 0 GiB of quota 0 GiB and 2 files of quota 0 files
As of 2019-03-07T13:56:39.000000 userid amarcum on /users/PZS0645 used 4.00 KiB of quota 500 GiB and 1 files of quota 1000000 files
As of 2019-03-07T13:56:39.000000 userid amarcum on /users/PZS0712 used 399.86 MiB of quota 500 GiB and 8710 files of quota 1000000 files
The OSCusage command can provide detailed information about computational usage for a given project and user.
See the OSCusage command page for details.
If you need to use a MATLAB toolbox that is not provided through our installations, you can follow these instructions, and if you have any difficulties you can contact OSC Help for assistance.
First, we recommend making a new directory within your home directory in order to keep everything organized. You can use the Unix command "mkdir" to make a new directory.
Now you can download the toolbox either to your desktop, and then upload it to OSC, or directly download it using the "wget" utility (if you know the URL for the file).
Now you can extract the downloaded file.
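Putting these steps together, a minimal sketch might look like the following (the directory name, URL, and archive name are placeholders):

mkdir ~/matlab_toolboxes
cd ~/matlab_toolboxes
wget https://example.com/mytoolbox.zip   # hypothetical toolbox URL
unzip mytoolbox.zip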
There are two methods on how to add the MATLAB toolbox path.
Method 1: Load the MATLAB GUI and click on "Set Path" and "Add folder".
Method 2: Use the "addpath" function in your script. More information on the function can be found here: https://www.mathworks.com/help/matlab/ref/addpath.html
Please refer to the instructions given alongside the toolbox. They should contain instructions on how to run the toolbox.
While we provide a number of Perl modules, you may need a module we do not provide. If it is a commonly used module, or one that is particularly difficult to compile, you can contact OSC Help for assistance, but we have provided an example below showing how to build and install your own Perl modules. Note, these instructions use "bash" shell syntax; this is our default shell, but if you are using something else (csh, tcsh, etc), some of the syntax may be different.
CPAN, the Comprehensive Perl Archive Network, is the primary source for publishing and fetching the latest modules and libraries for the Perl programming language. The default method for installing Perl modules, the "CPAN Shell", provides users with a great deal of power and flexibility, but at the cost of a complex configuration and an inelegant default setup.
To use CPAN Minus with the system Perl (version 5.16.3), we need to ensure that the "cpanminus" module is loaded, if it hasn't been loaded already.
module load cpanminus
Please note that this step is not required if you have already loaded a version of Perl using the module load command.
Next, in order to use cpanminus, you will need to run the following command only ONCE:
perl -I $CPANMINUS_INC -Mlocal::lib
In most cases, using CPAN Minus to install modules is as simple as issuing a command in the following form:
cpanm [Module::Name]
For example, below are three examples of installing Perl modules:
cpanm Math::CDF
cpanm Set::IntervalTree
cpanm DB_File
To test a Perl module import, here are some examples:
perl -e "require Math::CDF"
perl -e "require Set::IntervalTree"
perl -e "require DB_File"
The modules are installed correctly if no output is printed.
To show the local modules you have installed in your user account:
perldoc perllocal
Resetting Module Collection
If you should ever want to start over with your perl module collection, delete the following folders:
rm -r ~/perl5
rm -r ~/.cpanm
Sometimes the best way to get access to a piece of software on the HPC systems is to install it yourself as a "local install". This document will walk you through the OSC-recommended procedure for maintaining local installs in your home directory or project space. The majority of this document describes the process of "manually" building and installing your software. We also show a partially automated approach through the use of a bash script in the Install Script section near the end.
Before installing your software, you should first prepare a place for it to live. We recommend the following directory structure, which you should create in the top-level of your home directory:
local
|-- src
|-- share
`-- lmodfiles
This structure is analogous to how OSC organizes the software we provide. Each directory serves a specific purpose:
- local - Gathers all the files related to your local installs into one directory, rather than cluttering your home directory. Applications will be installed into this directory with the format "appname/version". This allows you to easily store multiple versions of a particular software install if necessary.
- local/src - Stores the installers -- generally source directories -- for your software. Also stores the compressed archives ("tarballs") of your installers; useful if you want to reinstall later using different build options.
- local/share/lmodfiles - The standard place to store module files, which will allow you to dynamically add or remove locally installed applications from your environment.
mkdir -p $HOME/local/src $HOME/local/share/lmodfiles
(NOTE: $HOME is defined by the shell as the full path of your home directory. You can view it from the command line with the command echo $HOME.)
Now that you have your directory structure created, you can install your software. For demonstration purposes, we will install a local copy of Git.
First, we need to get the source code onto the HPC filesystem. The easiest thing to do is find a download link, copy it, and use the wget tool to download it on the HPC. We'll download this into $HOME/local/src:
cd $HOME/local/src
wget https://github.com/git/git/archive/v2.9.0.tar.gz
Now extract the tar file:
tar zxvf v2.9.0.tar.gz
Next, we'll go into the source directory and build the program. Consult your application's documentation to determine how to install into $HOME/local/"software_name"/"version". Replace "software_name" with the software's name and "version" with the version you are installing, as demonstrated below. In this case, we'll use the configure tool's --prefix option to specify the install location.
You'll also want to specify a few variables to help make your application more compatible with our systems. We recommend specifying that you wish to use the Intel compilers and that you want to link the Intel libraries statically. This will prevent you from having to have the Intel module loaded in order to use your program. To accomplish this, add CC=icc CFLAGS=-static-intel to the end of your invocation of configure. If your application does not use configure, you can generally still set these variables somewhere in its Makefile or build script.
Then, we can build Git using the following commands:
cd git-2.9.0
autoconf # this creates the configure file
./configure --prefix=$HOME/local/git/2.9.0 CC=icc CFLAGS=-static-intel
make && make install
Your application should now be fully installed. However, before you can use it you will need to add the installation's directories to your path. To do this, you will need to create a module.
Modules allow you to dynamically alter your environment to define environment variables and bring executables, libraries, and other features into your shell's search paths.
We can use the mkmod script to create a simple Lua module for the Git installation:
module load mkmod
create_module.sh git 2.9.0 $HOME/local/git/2.9.0
It will create the module $HOME/local/share/lmodfiles/git/2.9.0.lua. Please note that by default our mkmod script only creates module files that define some basic environment variables: PATH, LD_LIBRARY_PATH, MANPATH, and GIT_HOME. These default variables may not cover all paths desired. We can overwrite these defaults in this way:
module load mkmod
TOPDIR_LDPATH_LIST="lib:lib64" \
TOPDIR_PATH_LIST="bin:exe" \
create_module.sh git 2.9.0 $HOME/local/git/2.9.0
This adds $GIT_HOME/bin and $GIT_HOME/exe to PATH, and $GIT_HOME/lib and $GIT_HOME/lib64 to LD_LIBRARY_PATH.
We can also add other variables by using ENV1, ENV2, and more. For example, suppose we want to change the default editor to vim for Git:
module load mkmod
ENV1="GIT_EDITOR=vim" \
create_module.sh git 2.9.0 $HOME/local/git/2.9.0
We will be using the filename 2.9.0.lua ("version".lua). A simple Lua module for our Git installation would be:
-- Local Variables
local name = "git"
local version = "2.9.0"

-- Locate Home Directory
local homedir = os.getenv("HOME")
local root = pathJoin(homedir, "local", name, version)

-- Set Basic Paths
prepend_path("PATH", pathJoin(root, "bin"))
prepend_path("LD_LIBRARY_PATH", root .. "/lib")
prepend_path("LIBRARY_PATH", root .. "/lib")
prepend_path("INCLUDE", root .. "/include")
prepend_path("CPATH", root .. "/include")
prepend_path("PKG_CONFIG_PATH", root .. "/lib/pkgconfig")
prepend_path("MANPATH", root .. "/share/man")
NOTE: For future module files, copy our sample modulefile from ~support/doc/modules/sample_module.lua. This module file follows the recommended design patterns laid out above and includes samples of many common module operations.
Any module file you create should be saved into your local lmodfiles directory ($HOME/local/share/lmodfiles). To prepare for future software installations, create a subdirectory within lmodfiles named after your software and add one module file to that directory for each version of the software installed.
In the case of our Git example, you should create the directory $HOME/local/share/lmodfiles/git and create a module file within that directory named 2.9.0.lua.
To make this module usable, you need to tell lmod where to look for it. You can do this by issuing the command module use $HOME/local/share/lmodfiles in our example. You can see this change by performing module avail. This will allow you to load your software using either module load git or module load git/2.9.0.
module use $HOME/local/share/lmodfiles and module load "software_name" need to be entered into the command line every time you enter a new session on the system.
. When loading Git, lmod will automatically load the newer version. If you need to go back to an older version, you can do so by specifying the version you want: module load git/2.9.0
.
To make sure you have the correct module file loaded, type which git, which should emit "~/local/git/2.9.0/bin/git" (NOTE: ~ is equivalent to $HOME).
To make sure the software was installed correctly and that the module is working, type git --version, which should emit "git version 2.9.0".
Simplified versions of the scripts used to manage the central OSC software installations are provided at ~support/share/install-script. The idea is that you provide the minimal commands needed to obtain, compile, and install the software (usually some variation on wget, tar, ./configure, make, and make install) in a script, which then sources an OSC-maintained template that provides all of the "boilerplate" commands to create and manage a directory structure similar to that outlined in the Getting Started section above. You can copy an example install script from ~support/share/install-script/install-osc_sample.sh and follow the notes in that script, as well as in ~support/share/install-script/README.md, to modify it to install software of your choosing.
The install script places module files in $HOME/osc_apps/lmodfiles, so you will need to run module use $HOME/osc_apps/lmodfiles and module load [software-name] every time you enter a new session on the system and want to use the software that you have installed.
For more information about modules, be sure to read the webpage indicated at the end of module help. If you have any questions about modules or local installations, feel free to contact the OSC Help Desk at oschelp@osc.edu.
An ACL (access control list) is a list of permissions associated with a file or directory. These permissions allow you to restrict access to a certain file or directory by user or group.
OSC supports NFSv4 ACL on our home directory and POSIX ACL on our project and scratch file systems. Please see the how to use NFSv4 ACL for home directory ACL management and how to use POSIX ACL for managing ACLs in project and scratch file systems.
This document shows you how to use the NFSv4 ACL permissions system. An ACL (access control list) is a list of permissions associated with a file or directory. These permissions allow you to restrict access to a certain file or directory by user or group. NFSv4 ACLs provide more specific options than the typical POSIX read/write/execute permissions used in most systems.
These commands are useful for managing ACLs under /users/<project-code>.
This is an example of an NFSv4 ACL
A::user@nfsdomain.org:rxtncy
The following sections will break down this example from left to right and provide more usage options
The 'A' in the example is known as the ACE (access control entry) type. The 'A' denotes "Allow" meaning this ACL is allowing the user or group to perform actions requiring permissions. Anything that is not explicitly allowed is denied by default.
The above example could also include a distinction known as a flag, shown below:
A:d:user@osc.edu:rxtncy
The 'd' used above is called an inheritance flag. This makes it so the ACL set on this directory will be automatically established on any new subdirectories. Inheritance flags only work on directories and not files. Multiple inheritance flags can be used in combination or omitted entirely. Examples of inheritance flags are listed below:
Flag | Name | Function |
---|---|---|
d | directory-inherit | New subdirectories will have the same ACE |
f | file-inherit | New files will have the same ACE minus the inheritance flags |
n | no-propagate-inherit | New subdirectories will inherit the ACE minus the inheritance flags |
i | inherit-only | New files and subdirectories will have this ACE but the ACE for the directory with the flag is null |
The 'user@nfsdomain.org' is a principal. The principal denotes who the ACL is granting access to: a named user (e.g., user@osc.edu), a named group (specified with the 'g' flag, as below), or a special principal such as OWNER@, GROUP@, or EVERYONE@. For example, an ACE for a group:
A:g:group@osc.edu:rxtncy
The 'rxtncy' are the permissions the ACE is allowing. Permissions can be used in combination with each other. A list of permissions and what they do can be found below:
Permission | Function |
---|---|
r | read-data (files) / list-directory (directories) |
w | write-data (files) / create-file (directories) |
a | append-data (files) / create-subdirectory (directories) |
x | execute (files) / change-directory (directories) |
d | delete the file/directory |
D | delete-child : remove a file or subdirectory from the given directory (directories only) |
t | read the attributes of the file/directory |
T | write the attribute of the file/directory |
n | read the named attributes of the file/directory |
N | write the named attributes of the file/directory |
c | read the file/directory ACL |
C | write the file/directory ACL |
o | change ownership of the file/directory |
Note: Aliases such as 'R', 'W', and 'X' can be used as permissions. These work similarly to POSIX Read/Write/Execute. More detail can be found below.
Alias | Name | Expansion |
---|---|---|
R | Read | rntcy |
W | Write | watTNcCy (with D added to directory ACE's) |
X | Execute | xtcy |
This section will show you how to set, modify, and view ACLs
To set an ACE use this command:
nfs4_setfacl [OPTIONS] COMMAND file
To modify an ACE, use this command:
nfs4_editfacl [OPTIONS] file
Where file is the name of your file or directory. More information on Options and Commands can be found below.
Commands are only used when first setting an ACE. Commands and their uses are listed below; a short example follows the table.
COMMAND | FUNCTION |
---|---|
-a acl_spec [index] | add ACL entries in acl_spec at index (DEFAULT: 1) |
-x acl_spec | index | remove ACL entries or entry-at-index from ACL |
-A file [index] | read ACL entries to add from file |
-X file | read ACL entries to remove from file |
-s acl_spec | set ACL to acl_spec (replaces existing ACL) |
-S file | read ACL entries to set from file |
-m from_ace to_ace | modify in-place: replace 'from_ace' with 'to_ace' |
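As a quick sketch of the most common usage, the following adds ACEs granting a user and then a group read and execute access to a directory (the user, group, and directory names are placeholders):
nfs4_setfacl -a A::someuser@osc.edu:RX mydir
nfs4_setfacl -a A:g:somegroup@osc.edu:RX mydir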
Options can be used in combination or omitted entirely. A list of options is shown below:
OPTION | NAME | FUNCTION |
---|---|---|
-R | recursive | Applies ACE to a directory's files and subdirectories |
-L | logical | Used with -R, follows symbolic links |
-P | physical | Used with -R, skips symbolic links |
To view ACLs, use the following command:
nfs4_getfacl file
Where file is your file or directory
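For example, on a directory that only carries the default mode-bit entries, the output looks something like the following (the exact permission letters will vary):
$ nfs4_getfacl ~/data
A::OWNER@:rwaDxtTnNcCy
A:g:GROUP@:rxtncy
A::EVERYONE@:rxtncy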
First, make the top level of your home dir executable by the group.
nfs4_setfacl -a A:g:<group>@osc.edu:X $HOME
Next create a new folder to store shared data
mkdir share_group
Move all data to be shared that already exists to this folder
mv <src> ~/share_group
Apply the acl for all current files and dirs under ~/share_group, and set acl so that new files created there will automatically have proper group permissions
nfs4_setfacl -R -a A:dfg:<group>@osc.edu:RX ~/share_group
One can also specify the acl to be used in a single file, then apply that acl to avoid duplicate entries and keep the acl entries consistent.
$ cat << EOF > ~/group_acl.txt
A:fdg:clntstf@osc.edu:rxtncy
A::OWNER@:rwaDxtTnNcCy
A:g:GROUP@:tcy
A::EVERYONE@:rxtncy
EOF
$ nfs4_setfacl -R -S ~/group_acl.txt ~/share_group
Assume that you want to share a directory (e.g data) and its files and subdirectories, but it is not readable by other users,
> ls -ld /users/PAA1234/john/data
drwxr-x--- 3 john PAA1234 4096 Nov 21 11:59 /users/PAA1234/john/data
Like before, allow the user execute permissions to $HOME.
> nfs4_setfacl -a A::userid@osc.edu:X $HOME
set an ACL to the directory 'data' to allow specific user access:
> cd /users/PAA1234/john
> nfs4_setfacl -R -a A:df:userid@osc.edu:RX data
or to allow a specific group access:
> cd /users/PAA1234/john
> nfs4_setfacl -R -a A:dfg:groupname@osc.edu:RX data
You can repeat the above commands to add more users or groups.
Sometimes one wishes to share their entire home dir with a particular group. Care should be taken to share only folders with data and not any hidden dirs, such as the ~/.ssh dir, which should always have read permissions only for the user that owns it. Use the command below to assign group read permissions to non-hidden dirs only.
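A reasonable sketch of such a command, using find to skip hidden dirs (with <group> as a placeholder), is:
find $HOME -mindepth 1 -maxdepth 1 -type d -not -name '.*' -exec nfs4_setfacl -R -a A:dfg:<group>@osc.edu:RX {} \;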
After sharing an entire home dir with a group, you can still create a single share folder with the previous instructions to share different data with a different group only. So, all non-hidden dirs in your home dir would be readable by group_a, but a new folder named 'group_b_share' can be created and its acl altered to only share its contents with group_b.
Please contact oschelp@osc.edu if there are any questions.
This document shows you how to use the POSIX ACL permissions system. An ACL (access control list) is a list of permissions associated with a file or directory. These permissions allow you to restrict access to a certain file or directory by user or group.
These commands are useful for project and scratch dirs located in /fs/ess.
An example of a basic POSIX ACL would look like this:
# file: foo.txt
# owner: tellison
# group: PZSXXXX
user::rw-
group::r--
other::r--
The first three lines list basic information about the file/directory in question: the file name, the primary owner/creator of the file, and the primary group that has permissions on the file. The following three lines show the file access permissions for the primary user, the primary group, and any other users. POSIX ACLs use the basic rwx permissions, explained in the following table:
Permission | Explanation |
---|---|
r | Read permission |
w | Write permission |
x | Execute permission |
This section will show you how to set and view ACLs using the setfacl and getfacl commands.
The getfacl command displays a file or directory's ACL. This command is used as follows:
$ getfacl [OPTION] file
Where file is the file or directory you are trying to view. Common options include:
Flag | Description |
---|---|
-a/--access | Display file access control list only |
-d/--default | Display default access control list only (only primary access), which determines the default permissions of any files/directories created in this directory |
-R/--recursive | Display ACLs for subdirectories |
-p/--absolute-names | Don't strip leading '/' in pathnames |
A simple getfacl call would look like the following:
$ getfacl foo.txt
# file: foo.txt
# owner: user
# group: PZSXXXX
user::rw-
group::r--
other::r--
A recursive getfacl call through subdirectories will list each subdirectory's ACL separately:
$ getfacl -R foo/
# file: foo/
# owner: user
# group: PZSXXXX
user::rwx
group::r-x
other::r-x

# file: foo//foo.txt
# owner: user
# group: PZSXXXX
user::rwx
group::---
other::---

# file: foo//bar
# owner: user
# group: PZSXXXX
user::rwx
group::---
other::---

# file: foo//bar/foobar.py
# owner: user
# group: PZSXXXX
user::rwx
group::---
other::---
The setfacl command allows you to set a file or directory's ACL. This command is used as follows:
$ setfacl [OPTION] COMMAND file
Where file is the file or directory you are trying to modify.
setfacl takes several commands to modify a file or directory's ACL
Command | Function |
---|---|
-m/--modify=acl | modify the current ACL(s) of files. Use as the following: setfacl -m u/g:user/group:r/w/x file |
-M/--modify-file=file | read ACL entries to modify from a file. Use as the following: setfacl -M file_with_acl_permissions file_to_modify |
-x/--remove=acl | remove entries from the ACL(s) of files. Use as the following: setfacl -x u/g:user/group file |
-X/--remove-file=file | read ACL entries to remove from a file. Use as the following: setfacl -X file_with_acl_permissions file_to_modify |
-b/--remove-all | Remove all extended ACL permissions |
Common option flags for setfacl are as follows:
Option | Function |
---|---|
-R/--recursive | Recurse through subdirectories |
-d/--default | Apply modifications to default ACLs |
--test | test ACL modifications (ACLs are not modified) |
You can set a specific user's access privileges using the following:
setfacl -m u:username:-wx foo.txt
Similarly, a group's access privileges can be set using the following:
setfacl -m g:PZSXXXX:rw- foo.txt
You can remove a specific user's access using the following
setfacl -x user:username foo.txt
Grant a user recursive read access to a dir and all files/dirs under it (notice that the capital 'X' is used to provide execute permissions only to dirs and not files):
setfacl -R -m u:username:r-X shared-dir
Set a dir so that any newly created files or dirs under will inherit the parent dirs facl:
setfacl -d -m u:username:r-X shared-dir
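Similar to the nfs4_setfacl -S example earlier, a set of ACL entries can also be kept in a file and applied with the -M option (a sketch; the user, group, and directory names are placeholders):
$ cat << EOF > ~/posix_acl.txt
u:username:rwX
g:PZSXXXX:r-X
EOF
$ setfacl -R -M ~/posix_acl.txt shared-dir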
This HOWTO will demonstrate how to lower one's disk space usage. The following procedures can be applied to all of OSC's file systems.
We recommend users regularly check their data usage and clean out old data that is no longer needed.
Users who need assistance lowering their data usage can contact OSC Help.
Users should ensure that their jobs are written in such a way that temporary data is not saved to permanent file systems, such as the project space file system or their home directory.
If your job copies data from the scratch file system or its node's local disk ($TMPDIR
) back to a permanent file system, such as the project space file system or a home directory ( /users/PXX####/xxx####/
), you should ensure you are only copying the files you will need later.
The following commands will help you identify old data using the find
command.
find
commands may produce an excessive amount of output. To terminate the command while it is running, press CTRL + C.
This command will recursively search the user's home directory and give a detailed listing of all files not accessed in the past 100 days.
The last access time atime
is updated when a file is opened by any operation, including grep
, cat
, head
, sort
, etc.
find ~ -atime +100 -exec ls -l {} \;
Replace ~ with the path you wish to search. A period . can be used to search the current directory.
Replace 100 with your desired number of days.
To calculate the total size of the files found by find, you can add | awk '{s+=$5} END {print "Total SIZE (bytes): " s}' to the end of the command:
find ~ -atime +100 -exec ls -l {} \;| awk '{s+=$5} END {print "Total SIZE (bytes): " s}'
This command will recursively search the user's home directory and give a detailed listing of all files not modified in the past 100 days.
The last modified time mtime
is updated when a file's contents are updated or saved. Viewing a file will not update the last modified time.
find ~ -mtime +100 -exec ls -l {} \;
Replace ~ with the path you wish to search. A period . can be used to search the current directory.
Replace 100 with your desired number of days.
To calculate the total size of the files found by find, you can add | awk '{s+=$5} END {print "Total SIZE (bytes): " s}' to the end of the command:
find ~ -mtime +100 -exec ls -l {} \;| awk '{s+=$5} END {print "Total SIZE (bytes): " s}'
Adding the -size <size>
option and argument to the find command allows you to only view files larger than a certain size. This option and argument can be added to any other find command.
For example, to view all files in a user's home directory that are larger than 1GB:
find ~ -size +1G -exec ls -l {} \;
Use the following command to list the dirs under <target-dir> and the number of files contained in each dir.
du --inodes -d 1 <target-dir>
If you no longer need the old data, you can delete it using the rm
command.
If you need to delete a whole directory tree (a directory and all of its subcontents, including other directories), you can use the rm -R
command.
For example, the following command will delete the data directory in a user's home directory:
rm -R ~/data
If you would like to be prompted for confirmation before deleting every file, use the -i
option.
rm -Ri ~/data
Enter y
or n
when prompted. Simply pressing the enter button will default to n
.
find
The rm
command can be combined with any find
command to delete the files found. The syntax for doing so is:
find <location> <other find options> -exec rm -i {} \;
Where <other find options>
can include one or more of the options -atime <time>
, -mtime <time>
, and -size <size>
.
The following command would find all files in the ~/data
directory 1G or larger that have not been accessed in the past 100 days, and then prompt for confirmation to delete each file:
find ~/data -atime +100 -size +1G -exec rm -i {} \;
If you are absolutely sure the files identified by find
are okay to delete you can remove the -i
option to rm
and you will not be prompted. Extreme caution should be used when doing so!
If you still need the data but do not plan on needing the data in the immediate future, contact OSC Help to discuss moving the data to an archive file system. Requests for data to be moved to the archive file system should be larger than 1TB.
If you need the data but do not access the data frequently, you should compress the data using tar or gzip.
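For example, a directory tree can be packed into a single compressed archive with tar, and the original removed once the archive has been verified (the paths here are placeholders):
tar -czvf data_archive.tar.gz ~/data
# list the archive contents to verify it before deleting the original directory
tar -tzf data_archive.tar.gz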
If you have the space available locally you can transfer your data there using sftp or Globus.
Globus is recommended for large transfers.
The OnDemand File application should not be used for transfers larger than 1GB.
We can improve the performance of Python calculations by running them in parallel. In this tutorial we will make use of the multiprocessing library to run Python code in parallel.
Multiprocessing is part of the standard Python library distribution on versions python/2.6 and above, so no additional installation is required (Owens and Pitzer both offer 2.7 and above, so this should not be an issue). However, we do recommend you use Python environments when using multiple libraries to avoid version conflicts with different projects you may have. See here for more information.
One way to parallelize is by creating a parallel pool. This can be done by using the Pool class:
p = Pool(10)
This will create a pool of 10 worker processes.
Once you have a pool of worker processes created you can then use the map
method to assign tasks to each worker.
p.map(my_function, something_iterable)
Here is an example python code:
from multiprocessing import Pool
from timeit import default_timer as timer
import time

def sleep_func(x):
    time.sleep(x)

if __name__ == '__main__':
    arr = [1,1,1,1,1]

    # create a pool of 5 worker processes
    p = Pool(5)

    start = timer()
    # assign sleep_func to a worker for each entry in arr.
    # each array entry is passed as an argument to sleep_func
    p.map(sleep_func, arr)
    print("parallel time: ", timer() - start)

    start = timer()
    # run the functions again but in serial
    for a in arr:
        sleep_func(a)
    print("serial time: ", timer() - start)
The above code was then submitted using the below job script:
#!/bin/bash
#SBATCH --account <your-project-id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00

module load python
python example_pool.py
After submitting the above job, the following was the output:
parallel time:  1.003282466903329
serial time:  5.005984931252897
See the documenation for more details and examples on using Pool
.
The multiprocessing library also provides the Process class to run functions asynchronously.
To create a Process object you can simply make a call to:
proc = Process(target=my_function, args=[arguments, go, here])
The target
is set equal to the name of your function which you want to run asynchronously and args
is a list of arguments for your function.
Start running a process asynchronously by:
proc.start()
Doing so will begin running the function in another process and the main parent process will continue in its execution.
You can make the parent process wait for a child process to finish with:
proc.join()
If you use proc.run()
it will run your process and wait for it to finish before continuing on in executing the parent process.
Note: The below code will start proc2
only after proc1
has finished. If you want to start multiple processes and wait for them, use start()
and join()
instead of run.
proc1.run()
proc2.run()
Here is some example code:
from multiprocessing import Process
from timeit import default_timer as timer
import time

def sleep_func(x):
    print(f'Sleeping for {x} sec')
    time.sleep(x)

if __name__ == '__main__':
    # initialize process objects
    proc1 = Process(target=sleep_func, args=[1])
    proc2 = Process(target=sleep_func, args=[1])

    # begin timer
    start = timer()

    # start processes
    proc1.start()
    proc2.start()

    # wait for both processes to finish
    proc1.join()
    proc2.join()

    print('Time: ', timer() - start)
Running this code gives the following output:
Sleeping for 1 sec
Sleeping for 1 sec
Time:  1.0275288447737694
You can also easily create many processes in a loop:
from multiprocessing import Process
from timeit import default_timer as timer
import time

def sleep_func(x):
    print(f'Sleeping for {x} sec')
    time.sleep(x)

if __name__ == '__main__':
    # empty list to later store processes
    processes = []

    # start timer
    start = timer()

    for i in range(10):
        # initialize and start processes
        p = Process(target=sleep_func, args=[5])
        p.start()
        # add the processes to list for later reference
        processes.append(p)

    # wait for processes to finish.
    # we cannot join() them within the same loop above because it would
    # wait for the process to finish before looping and creating the next one.
    # So it would be the same as running them sequentially.
    for p in processes:
        p.join()

    print('Time: ', timer() - start)
Output:
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Sleeping for 5 sec
Time:  5.069192241877317
See documentation for more information and example on using Process
.
When running processes in parallel it is generally best to avoid sharing state between processes. However, if data must be shared, see the documentation for more information and examples on how to safely share data.
This page outlines a way a professor can set up a file submission system at OSC for his/her classroom project.
After connecting to OSC system, professor runs submit_prepare
as
$ /users/PZS0645/support/bin/submit_prepare
Follow the instructions and provide the needed information (name of the assignment, TA username if appropriate, a size limit if not the default 1000MB per student, and whether or not you want email notification of each submission). It will create a designated directory where students submit their assignments, as well as generate a submit script that students use to submit homework to OSC, both of which are located in the directory specified by the professor.
If you want to create multiple directories for different assignments, simply run the following command again, specifying a different assignment name:
$ /users/PZS0645/support/bin/submit_prepare
The PI can also enforce the deadline by simply changing the permission of the submission directory or renaming the submission directory at the deadline.
(Only works on Owens): One way is to use the at command, following the steps below:
Use the at command to specify the deadline:
at [TIME]
where TIME
is formatted HH:MM AM/PM MM/DD/YY. For example:
at 2:30 PM 08/21/2017
$ chmod 700 [DIRECTORY]
where DIRECTORY
is the assignment folder to be closed off.
The permission of DIRECTORY
will be changed to 700 at 2:30 PM, August 21, 2017. After that, the student will get an error message when he/she tries to submit an assignment to this directory.
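Putting these steps together, the interactive at session looks roughly like the following sketch (the chmod command is typed at the at> prompt and Ctrl-D ends the input; [DIRECTORY] is the assignment folder):
$ at 2:30 PM 08/21/2017
at> chmod 700 [DIRECTORY]
at> <EOT>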
A student should create one directory which includes all the files he/she wants to submit before running this script to submit his/her assignment. Also, the previous submission of the same assignment from the student will be replaced by the new submission.
To submit the assignment, the student runs submit
after connecting to OSC system as
$ /path/to/directory/from/professor/submit
Follow the instructions. It will allow students to submit an assignment to the designated directory specified by the professor and send a confirmation email, or return an error message.
Often users want to submit a large number of jobs all at once, each using different parameters. These parameters could be anything, including the path of a data file or different input values for a program. This how-to will show you how you can do this using a simple python script, a CSV file, and a template script. You will need to adapt this advice for your own situation.
Consider the following batch script:
#!/bin/bash
#SBATCH --ntasks-per-node=2
#SBATCH --time=1:00:00
#SBATCH --job-name=week42_data8

# Copy input data to the nodes fast local disk
cp ~/week42/data/source1/data8.in $TMPDIR
cd $TMPDIR

# Run the analysis
full_analysis data8.in data8.out

# Copy results to proper folder
cp data8.out ~/week42/results
Let's say you need to submit 100 of these jobs on a weekly basis. Each job uses a different data file as input. You receive data from two different sources, and so your data is located within two different folders. All of the jobs from one week need to store their results in a single weekly results folder. The output file name is based upon the input file name.
As you can see, this job follows a general template. There are three main parameters that change in each job: the week, the data source, and the data file passed to full_analysis.
If we replace these parameters with variables, prefixed by the dollar sign $
and surrounded by curly braces { }
, we get the following template script:
#!/bin/bash
#SBATCH --ntasks-per-node=2
#SBATCH --time=1:00:00

# Copy input data to the nodes fast local disk
cp ~/${WEEK}/data/${SOURCE}/${DATA}.in $TMPDIR
cd $TMPDIR

# Run the analysis
full_analysis ${DATA}.in ${DATA}.out

# Copy results to proper folder
cp ${DATA}.out ~/${WEEK}/results
We can now use the sbatch --export
option to pass parameters to our template script. The format for passing parameters is:
sbatch --job-name=name --export=var_name=value[,var_name=value...]
Submitting 100 jobs using the sbatch --export
option manually does not make our task much easier than modifying and submitting each job one by one. To complete our task we need to automate the submission of our jobs. We will do this by using a python script that submits our jobs using parameters it reads from a CSV file.
Note that python was chosen for this task for its general ease of use and understandability -- if you feel more comfortable using another scripting language feel free to interpret/translate this python code for your own use.
The script for submitting multiple jobs using parameters can be found at ~support/share/misc/submit_jobs.py
Use the following command to run a test with the examples already created. Replace <your-proj-code> with a project you are a member of to charge jobs to.
~support/share/misc/submit_jobs.py -t ~support/share/misc/submit_jobs_examples/job_template2.sh WEEK,SOURCE,DATA ~support/share/misc/submit_jobs_examples/parameters_example2.csv <your-proj-code>
This script will open the CSV file and step through the file line by line, submitting a job for each line using the line's values. If the submit command returns a non-zero exit code, usually indicating it was not submitted, we will print this out to the display. The jobs will be submitted using the general format (using the example WEEK,SOURCE,DATA environment variables):
sbatch -A <project-account> -o ~/x/job_logs/x_y_z.job_log --job-name=x_y_z --export=WEEK=x,SOURCE=y,DATA=z job.sh
Where x, y and z are determined by the values in the CSV parameter file. Below we relate x to week, y to source and z to data.
We now need to create a CSV file with parameters for each job. This can be done with a regular text editor or using a spreadsheet editor such as Excel. By default you should use commas as your delimiter.
Here is our CSV file with parameters:
week42,source1,data1
week42,source1,data2
week42,source1,data3
...
week42,source2,data98
week42,source2,data99
week42,source2,data100
The submit script would read in the first row of this CSV file and form and execute the command:
sbatch -A <project-account> -o week42/job_logs/week42_source1_data1.job_log --job-name=week42_source1_data1 --export=WEEK=week42,SOURCE=source1,DATA=data1 job.sh
Once all the above is done, all you need to do to submit your jobs is to make sure the CSV file is populated with the proper parameters and run the automatic submission script with the right flags.
Try using submit_jobs.py --help
for an explanation:
$ ~support/share/misc/submit_jobs.py --help
usage: submit_jobs.py [-h] [-t] jobscript parameter_names job_parameters_file account

Automatically submit jobs using a csv file; examples in
~support/share/misc/submit_jobs_examples/

positional arguments:
  jobscript            job script to use
  parameter_names      comma separated list of names for each parameter
  job_parameters_file  csv parameter file to use
  account              project account to charge jobs to

optional arguments:
  -h, --help           show this help message and exit
  -t, --test           test script without submitting jobs
Use the -t flag as well to check the submit commands before actually submitting. It is a good idea to copy the ~support/share/misc/submit_jobs.py file and modify it for unique use cases.
Contact oschelp@osc.edu and OSC staff can assist if there are questions using the default script or adjusting the script for unique use cases.
This tutorial presents techniques to tune the performance of an application. Keep in mind that correctness of results, code readability/maintainability, and portability to future systems are more important than performance. For a big picture view, you can check the status of a node while a job is running by visiting the OSC grafana page and using the "cluster metrics" report, and you can use the online interactive tool XDMoD to look at resource usage information for a job.
Some application software specific factors that can affect performance are
We will be using code based on the HPCCG miniapp from Mantevo. It performs the conjugate gradient (CG) method on a 3D chimney domain. CG is an iterative algorithm to numerically approximate the solution to a system of linear equations.
Run code with:
srun -n <numprocs> ./test_HPCCG nx ny nz
where nx, ny, nz are the number of nodes in the x, y, and z dimension on each processor.
First start an interactive Pitzer Desktop session with OnDemand.
You need to load intel 19.0.5 and mvapich2 2.3.3:
module load intel/19.0.5 mvapich2/2.3.3
Then clone the repository:
git clone https://code.osu.edu/khuvis.1/performance_handson.git
Debuggers let you execute your program one line at a time, inspect variable values, stop your program at a particular line, and open a core file after the program crashes.
For debugging, use the -g flag and remove optimization or set it to -O0. For example:
icc -g -O0 -o mycode mycode.c
gcc -g -O0 -o mycode mycode.c
To see compiler warnings and diagnostic options:
icc -help diag
man gcc
ARM DDT is a commercial debugger produced by ARM. It can be loaded on all OSC clusters:
module load arm-ddt
To run a non-MPI program from the command line:
ddt --offline --no-mpi ./mycode [args]
To run an MPI program from the command line:
ddt --offline -np num_procs ./mycode [args]
Compile and run the code:
make
srun -n 2 ./test_HPCCG 150 150 150
You should have received the following error message at the end of the program output:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 308893 RUNNING AT p0200
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
Set compiler flags -O0 -g to CPP_OPT_FLAGS in Makefile. Then recompile and run with ARM DDT:
make clean; make
module load arm-ddt
ddt -np 2 ./test_HPCCG 150 150 150
When DDT stops on the segmentation fault, the stack is in the YAML_Element::~YAML_Element function of YAML_Element.cpp. Looking at this function, we see that the loop stops at children.size() instead of children.size()-1. So, line 13 should be changed from
for(size_t i=0; i<=children.size(); i++) {
to
for(size_t i=0; i<children.size(); i++) {
On Pitzer, there are 40 cores per node (20 cores per socket and 2 sockets per node). There is support for AVX512, a vector length of 8 double or 16 single precision values, and fused multiply-add. (There is hardware support for 4 threads per core, but it is currently not enabled on OSC systems.)
There are three cache levels on Pitzer, and the statistics are shown in the table below:
Cache level | Size (KB) | Latency (cycles) | Max BW (bytes/cycle) | Sustained BW (bytes/cycle) |
---|---|---|---|---|
L1 DCU | 32 | 4-6 | 192 | 133 |
L2 MLC | 1024 | 14 | 64 | 52 |
L3 LLC | 28160 | 50-70 | 16 | 15 |
Never do heavy I/O in your home directory. Home directories are for long-term storage, not scratch files.
One option for I/O intensive jobs is to use the local disk on a compute node. Stage files to and from your home directory into $TMPDIR using the pbsdcp command (e.g. pbsdcp file1 file2 $TMPDIR), and execute the program in $TMPDIR.
Another option is to use the scratch file system ($PFSDIR). This is faster than other file systems, good for parallel jobs, and may be faster than local disk.
For more information about OSC's file system, click here.
For example batch scripts showing the use of $TMPDIR and $PFSDIR, click here.
For more information about Pitzer, click here.
FLOPS stands for "floating point operations per second." Pitzer has a theoretical maximum of 720 teraflops. With the LINPACK benchmark, which solves a dense system of linear equations, the measured performance is 543 teraflops. With the STREAM benchmark, which measures sustainable memory bandwidth and the corresponding computation rate for vector kernels, the measured rates are copy: 299095.01 MB/s, scale: 298741.01 MB/s, add: 331719.18 MB/s, and triad: 331712.19 MB/s. Application performance is typically much less than peak/sustained performance since applications usually do not take full advantage of all hardware features.
You can time a program using the /usr/bin/time command. It gives results for user time (CPU time spent running your program), system time (CPU time spent by your program in system calls), and elapsed time (wallclock). It also shows % CPU, which is (user + system) / elapsed, as well as memory, pagefault, swap, and I/O statistics.
/usr/bin/time j3
5415.03user 13.75system 1:30:29elapsed 99%CPU \
(0avgtext+0avgdata 0maxresident)k \
0inputs+0outputs (255major+509333minor)pagefaults 0 swaps
You can also time portions of your code:
 | C/C++ | Fortran 77/90 | MPI (C/C++/Fortran) |
---|---|---|---|
Wallclock | time(2), difftime(3), getrusage(2) | SYSTEM_CLOCK(2) | MPI_Wtime(3) |
CPU | times(2) | DTIME(3), ETIME(3) | X |
A profiler can show you whether code is compute-bound, memory-bound, or communication bound. Also, it shows how well the code uses available resources and how much time is spent in different parts of your code. OSC has the following profiling tools: ARM Performance Reports, ARM MAP, Intel VTune, Intel Trace Analyzer and Collector (ITAC), Intel Advisor, TAU Commander, and HPCToolkit.
For profiling, use the -g flag and specify the same optimization level that you would normally use with -On. For example:
icc -g -O3 -o mycode mycode.c
Look for
ARM PR works on precompiled binaries, so the -g flag is not needed. It gives a summary of your code's performance that you can view with a browser.
For a non-MPI program:
module load arm-pr
perf-report --no-mpi ./mycode [args]
For an MPI program:
module load arm-pr
perf-report --np num_procs ./mycode [args]
Interpreting this profile requires some expertise. It gives details about your code's performance. You can view and explore the resulting profile using an ARM client.
For a non-MPI program:
module load arm-map
map --no-mpi ./mycode [args]
For an MPI program:
module load arm-map
map --np num_procs ./mycode [args]
For more information about ARM Tools, view OSC resources or visit ARM's website.
ITAC is a graphical tool for profiling MPI code (Intel MPI).
To use:
module load intelmpi # then compile (-g) code
mpiexec -trace ./mycode
View and explore the results using a GUI with traceanalyzer:
traceanalyzer <mycode>.stf
HPC software is traditionally written in Fortran or C/C++. OSC supports several compiler families. Intel (icc, icpc, ifort) usually gives the fastest code on Intel architecture. Portland Group (PGI - pgcc, pgc++, pgf90) is good for GPU programming and OpenACC. GNU (gcc, g++, gfortran) is open source and universally available.
Compiler options are easy to use and let you control aspects of the optimization. Keep in mind that different compilers have different values for options. For all compilers, any highly optimized builds, such as those employing the options herein, should be thoroughly validated for correctness.
Some examples of optimization include:
Compiler flags to try first are:
Faster operations are sometimes less accurate. For Intel compilers, fast math is default with -O2 and -O3. If you have a problem, use -fp-model precise. For GNU compilers, precise math is default with -O2 and -O3. If you want faster performance, use -ffast-math.
Inlining is replacing a subroutine or function call with the actual body of the subprogram. It eliminates overhead of calling the subprogram and allows for more loop optimizations. Inlining for one source file is typically automatic with -O2 and -O3.
Options for Intel compilers are shown below. Don't use -fast for MPI programs with Intel compilers. Use the same compiler command to link for -ipo with separate compilation. Many other optimization options can be found in the man pages. The recommended options are -O3 -xHost. An example is ifort -O3 program.f90.
-fast | Common optimizations |
-On | Set optimization level (0, 1, 2, 3) |
-ipo | Interprocedural optimization, multiple files |
-O3 | Loop transforms |
-xHost | Use highest instruction set available |
-parallel | Loop auto-parallelization |
Options for PGI compilers are shown below. Use the same compiler command to link for -Mipa with separate compilation. Many other optimization options can be found in the man pages. The recommended option is -fast. An example is pgf90 -fast program.f90.
-fast | Common optimizations |
-On | Set optimization level (0, 1, 2, 3, 4) |
-Mipa | Interprocedural optimization |
-Mconcur | Loop auto-parallelization |
Options for GNU compilers are shown below. Many other optimization options can be found in the man pages. The recommended options are -O3 -ffast-math. An example is gfortran -O3 program.f90.
-On | Set optimization level (0, 1, 2, 3) |
N/A for separate compilation | Interprocedural optimization |
-O3 | Loop transforms |
-ffast-math | Possibly unsafe floating point optimizations |
-march=native | Use highest instruction set available |
Compile and run with different compiler options:
time srun -n 2 ./test_HPCCG 150 150 150
Using the optimal compiler flags, get an overview of the bottlenecks in the code with the ARM performance report:
module load arm-pr
perf-report -np 2 ./test_HPCCG 150 150 150
On Pitzer, sample times were:
Compiler Option | Runtime (seconds) |
---|---|
-g | 129 |
-O0 -g | 129 |
-O1 -g | 74 |
-O2 -g | 74 |
-O3 -g | 74 |
The performance report shows that the code is compute-bound.
Compiler optimization reports let you understand how well the compiler is doing at optimizing your code and what parts of your code need work. They are generated at compile time and describe what optimizations were applied at various points in the source code. The report may tell you why optimizations could not be performed.
For Intel compilers, use -qopt-report; the report is written to a file.
For Portland Group compilers, use -Minfo; the report is written to stderr.
For GNU compilers, use -fopt-info; the report is written to stderr by default.
A sample output is:
LOOP BEGIN at laplace-good.f(10,7)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at laplace-good.f(11,10)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at laplace-good.f(11,10)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at laplace-good.f(11,10)
<Remainder loop for vectorization>
remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at laplace-good.f(11,10)
<Remainder loop for vectorization>
LOOP END
LOOP END
Add the compiler flag -qopt-report=5 and recompile to view an optimization report.
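As a sketch, compiling one source file by hand with the report enabled might look like the following (the file name is illustrative; in this exercise the flag is normally added to the Makefile instead):
icpc -c -O3 -g -qopt-report=5 HPC_sparsemv.cpp
With Intel compilers the report is typically written to a matching .optrpt file.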
Code is structured to operate on arrays of operands. Vector instructions are built into the processor. On Pitzer, the vector length is 16 single or 8 double precision. The following is a vectorizable loop:
do i = 1,N
  a(i) = b(i) + x(1) * c(i)
end do
Some things that can inhibit vectorization are:
Use ARM MAP to identify the most expensive parts of the code.
module load arm-map
map -np 2 ./test_HPCCG 150 150 150
Check the optimization report previously generated by the compiler (with -qopt-report=5) to see if any of the loops in the regions of the code are not being vectorized. Modify the code to enable vectorization and rerun the code.
Map shows that the most expensive segment of the code is lines 83-84 of HPC_sparsemv.cpp:
for (int j=0; j< cur_nnz; j++)
  y[i] += cur_vals[j]*x[cur_inds[j]];
The optimization report confirms that the loop was not vectorized due to a dependence on y.
Incrementing a temporary variable instead of y[i] should enable vectorization:
for (int j=0; j< cur_nnz; j++)
  sum += cur_vals[j]*x[cur_inds[j]];
y[i] = sum;
Recompiling and rerunning with this change reduces the runtime from 74 seconds to 63 seconds.
Memory access is often the most important factor in your code's performance. Loops that work with arrays should use a stride of one whenever possible. C and C++ are row-major (store elements consecutively by row in 2D arrays), so the first array index should be the outermost loop and the last array index should be the innermost loop. Fortran is column-major, so the reverse is true. You can get factor of 3 or 4 speedup just by using unit stride. Avoid using arrays of derived data types, structs, or classes. For example, use structs of arrays instead of arrays of structures.
Efficient cache usage is important. Cache lines are 8 words (64 bytes) of consecutive memory. The entire cache line is loaded when a piece of data is fetched.
The code below is a good example. 2 cache lines are used for every 8 loop iterations, and it is unit stride:
real*8 a(N), b(N)
do i = 1,N
a(i) = a(i) + b(i)
end do
! 2 cache lines:
! a(1), a(2), a(3) ... a(8)
! b(1), b(2), b(3) ... b(8)
The code below is a bad example. 1 cache line is loaded for each loop iteration, and it is not unit stride:
TYPE :: node
real*8 a, b, c, d, w, x, y, z
END TYPE node
TYPE(node) :: s(N)
do i = 1, N
s(i)%a = s(i)%a + s(i)%b
end do
! cache line:
! a(1), b(1), c(1), d(1), w(1), x(1), y(1), z(1)
Look again at the most expensive parts of the code using ARM MAP:
module load arm-map
map -np 2 ./test_HPCCG 150 150 150
Look for any inefficient memory access patterns. Modify the code to improve memory access patterns and rerun the code. Do these changes improve performance?
Lines 110-148 of generate_matrix.cpp are nested loops:
for (int ix=0; ix<nx; ix++) {
  for (int iy=0; iy<ny; iy++) {
    for (int iz=0; iz<nz; iz++) {
      int curlocalrow = iz*nx*ny+iy*nx+ix;
      int currow = start_row+iz*nx*ny+iy*nx+ix;
      int nnzrow = 0;
      (*A)->ptr_to_vals_in_row[curlocalrow] = curvalptr;
      (*A)->ptr_to_inds_in_row[curlocalrow] = curindptr;
      . . .
    }
  }
}
The arrays are accessed in a manner so that consecutive values of ix are accessed in order. However, our loops are ordered so that ix is the outer loop. We can reorder the loops so that ix is iterated in the inner loop:
for (int iz=0; iz<nz; iz++) {
  for (int iy=0; iy<ny; iy++) {
    for (int ix=0; ix<nx; ix++) {
      . . .
    }
  }
}
This reduces the runtime from 63 seconds to 22 seconds.
OpenMP is a shared-memory, threaded parallel programming model. It is a portable standard with a set of compiler directives and a library of support functions. It is supported in compilers by Intel, Portland Group, GNU, and Cray.
The following are parallel loop execution examples in Fortran and C. The inner loop vectorizes while the outer loop executes on multiple threads:
PROGRAM omploop
  INTEGER, PARAMETER :: N = 1000
  INTEGER i, j
  REAL, DIMENSION(N, N) :: a, b, c, x
  ... ! Initialize arrays
  !$OMP PARALLEL DO
  do j = 1, N
    do i = 1, N
      a(i, j) = b(i, j) + x(i, j) * c(i, j)
    end do
  end do
  !$OMP END PARALLEL DO
END PROGRAM omploop
int main() {
  int N = 1000;
  float *a, *b, *c, *x;
  ... // Allocate and initialize arrays
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      a[i*N+j] = b[i*N+j] + x[i*N+j] * c[i*N+j];
    }
  }
}
You can add an option to compile a program with OpenMP.
For Intel compilers, add the -qopenmp option. For example, ifort -qopenmp ompex.f90 -o ompex.
For GNU compilers, add the -fopenmp option. For example, gcc -fopenmp ompex.c -o ompex.
For Portland group compilers, add the -mp option. For example, pgf90 -mp ompex.f90 -o ompex.
To run an OpenMP program, request multiple processors through Slurm (e.g. -N 1 -n 40) and set the OMP_NUM_THREADS environment variable (the default is to use all available cores). For best performance, run at most one thread per core.
An example script is:
#!/bin/bash
#SBATCH -J omploop
#SBATCH -N 1
#SBATCH -n 40
#SBATCH -t 1:00

export OMP_NUM_THREADS=40
/usr/bin/time ./omploop
For more information, visit http://www.openmp.org, the OpenMP Application Program Interface, and self-paced tutorials. OSC will host an XSEDE OpenMP workshop on November 5, 2019.
MPI stands for message passing interface for when multiple processes run on one or more nodes. MPI has functions for point-to-point communication (e.g. MPI_Send, MPI_Recv). It also provides a number of functions for typical collective communication patterns, including MPI_Bcast (broadcasts value from root process to all other processes), MPI_Reduce (reduces values on all processes to a single value on a root process), MPI_Allreduce (reduces value on all processes to a single value and distributes the result back to all processes), MPI_Gather (gathers together values from a group of processes to a root process), and MPI_Alltoall (sends data from all processes to all processes).
A simple MPI program is:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from node %d of %d\n", rank, size);
    MPI_Finalize();
    return(0);
}
MPI implementations available at OSC are mvapich2, Intel MPI (only for Intel compilers), and OpenMPI.
MPI programs can be compiled with MPI compiler wrappers (mpicc, mpicxx, mpif90). They accept the same arguments as the compilers they wrap. For example, mpicc -o hello hello.c.
MPI programs must run in batch only. Debugging runs may be done with interactive batch jobs. srun automatically determines the execution nodes from the Slurm environment:
#!/bin/bash
#SBATCH -J mpi_hello
#SBATCH -N 2
#SBATCH --ntasks-per-node=40
#SBATCH -t 1:00

cd $PBS_O_WORKDIR
srun ./hello
For more information about MPI, visit MPI Forum and MPI: A Message-Passing Interface Standard. OSC will host an XSEDE MPI workshop on September 3-4, 2019. Self-paced tutorials are available here.
Use ITAC to get a timeline of the run of the code.
module load intelmpi
LD_PRELOAD=libVT.so \
mpiexec -trace -np 40 ./test_HPCCG 150 150 150
traceanalyzer <stf_file>
Look at the Event Timeline (under Charts). Do you see any communication patterns that could be replaced by a single MPI command?
Looking at the Event Timeline, we see that a large part of the runtime is spent in the following communication pattern: MPI_Barrier, MPI_Send/MPI_Recv, MPI_Barrier. We also see that during this communication rank 0 is sending data to all other ranks. We should be able to replace all of these MPI calls with a single call to MPI_Bcast.
The relevant code is in lines 82-89 of ddot.cpp:
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 0) {
  for (int dst_rank=1; dst_rank < size; dst_rank++) {
    MPI_Send(&global_result, 1, MPI_DOUBLE, dst_rank, 1, MPI_COMM_WORLD);
  }
}
if (rank != 0)
  MPI_Recv(&global_result, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Barrier(MPI_COMM_WORLD);
and can be replaced with:
MPI_Bcast(&global_result, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
Although many of the tools we already mentioned can also be used with interpreted languages, most interpreted languages such as Python and R have their own profiling tools.
Since they are still running on the same hardware, the performance considerations for interpreted languages are very similar to those for compiled languages:
One of Python's most common profiling tools is cProfile. The simplest way to use cProfile is to add several arguments to your Python call so that an ordered list of the time spent in all functions called during execution is printed. For instance, if a program is typically run with the command:
python ./mycode.py
replace that with
python -m cProfile -s time ./mycode.py
Here is a sample output from this profiler:
See Python's documentation for more details on how to use cProfile.
One of the most popular profilers for R is profvis. It is not available by default with R so it will need to be installed locally before its first use and loaded into your environment prior to each use. To profile your code, just put how you would usually call your code as the argument into profvis:
$ R
> install.packages('profvis')
> library('profvis')
> profvis({source('mycode.R')})
Here is a sample output from profvis:
More information on profvis is available here.
First, enter the Python/ subdirectory of the code containing the python script ns.py. Profile this code with cProfile to determine the most expensive functions of the code. Next, rerun and profile with array as an argument to ns.py. Which version runs faster? Can you determine why it runs faster?
Execute the following commands:
python -m cProfile -s time ./ns.py
python -m cProfile -s time ./ns.py array
In the original code, 66 seconds out of 68 seconds are spent in presPoissPeriodic. When the array argument is passed, the time spent in this function drops to approximately 1 second and the total runtime goes down to about 2 seconds.
The speedup comes from vectorizing the main computation in the body of presPoissPeriodic by replacing nested for loops with whole-array operations.
Now, enter the R/ subdirectory of the code containing the R script lu.R. Make sure that you have the R module loaded. First, run the code with profvis without any additional arguments and then again with frmt="matrix".
Which version of the code runs faster? Can you tell why it runs faster based on the profile?
Runtime for the default version is 28 seconds while the runtime when frmt="matrix" is 20 seconds.
Here is the profile with default arguments:
And here is the profile with frmt="matrix":
We can see that most of the time is being spent in lu_decomposition. The difference, however, is that the dataframe version seems to have a much higher overhead associated with accessing elements of the dataframe. On the other hand, the profile of the matrix version seems to be much flatter with fewer functions being called during LU decomposition. This reduction in overhead by using a matrix instead of a dataframe results in the better performance.
This article discusses memory tuning strategies for VASP.
Typically the first approach for memory sensitive VASP issues is to tweak the data distribution (via NCORE or NPAR). The information at https://www.vasp.at/wiki/index.php/NPAR covers a variety of machines. OSC has fast communications via Infiniband.
Performance and memory consumption are dependent on the simulation model. So we recommend a series of benchmarks varying the number of nodes and NCORE. The recommended initial value for NCORE is the processor count per node which is the ntasks-per-node value in Slurm (the ppn value in PBS). Of course, if this benchmarking is intractable then one must reexamine the model. For general points see: https://www.vasp.at/wiki/index.php/Memory_requirements and https://www.vasp.at/wiki/index.php/Not_enough_memory And of course one should start small and incrementally improve or scale up one's model.
Using the key parameters with respect to memory scaling listed at the VASP memory requirements page, one can rationalize VASP memory usage. The general approach is to study working calculations and then apply that understanding to scaled-up or failing calculations. This might help one identify if a calculation is close to a node's memory limit and happens to cross over the limit for reasons that might be out of one's control, in which case one might need to switch to higher memory nodes.
Here is an example of rationalizing memory consumption. Extract from a simulation output the key parameters:
Dimension of arrays:
  k-points           NKPTS =     18   k-points in BZ     NKDIM =     18
  number of bands    NBANDS=   1344
  total plane-waves  NPLWV =  752640
  ...
  dimension x,y,z NGXF=  160 NGYF= 168 NGZF= 224
  support grid    NGXF=  320 NGYF= 336 NGZF= 448
This yields 273 GB of memory, NKDIM*NBANDS*NPLWV*16 + 4*(NGXF/2+1)*NGYF*NGZF*16, according to
https://www.vasp.at/wiki/index.php/Memory_requirements
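As a quick sanity check, this estimate can be reproduced on the command line from the values above (NKDIM=18, NBANDS=1344, NPLWV=752640, and the support-grid dimensions):
$ echo "scale=1; (18*1344*752640*16 + 4*(320/2+1)*336*448*16)/2^30" | bc
272.7
which is consistent with the roughly 273 GB quoted above.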
This estimate should be compared to actual memory reports. See, for example, XDMoD and grafana. Note that most application software has an overhead in the ballpark of ten to twenty percent. In addition, disk caching can consume significant memory. Thus, one must adjust the memory estimate upward. It can then be compared to the available memory per cluster and per cluster node type.
rclone
is a tool that can be used to upload and download files to a cloud storage (like Microsoft OneDrive, BuckeyeBox) from the command line. It's shipped as a standalone binary, but requires some user configuration before using. In this page, we will provide instructions on how to use rclone
to upload data to OneDrive. For instructions with other cloud storage, check rclone
Online documentation.
Before configuration, please first log into OSC OnDemand and request a Pitzer Lightweight Desktop session. Walltime of 1 hour should be sufficient to finish the configuration.
Once the session is ready, open a terminal. In the terminal, run the command
rclone config
It prompts you with a bunch of questions:
Create an empty hello.txt
file and upload it to OneDrive using 'rclone copy' as below in a terminal:
touch hello.txt
rclone copy hello.txt OneDrive:/test
This creates a toplevel directory in OneDrive called 'test' if it does not already exist, and uploads the file hello.txt
to it.
To verify the uploading is successful, you can either login to OneDrive in a web browser to check the file, or use rclone ls
command in the terminal as:
rclone ls OneDrive:/test
Be careful when running rclone ls on a large directory, because it's recursive. You can add a '--max-depth 1' flag to stop the recursion.
Copy the contents of a source directory from a configured OneDrive remote, OneDrive:/src/dir/path, into a destination directory in your OSC session, /dest/dir/path, using the code below:
rclone copy OneDrive:/src/dir/path /dest/dir/path
Identical files on the source and destination directories are not transferred. Only the contents of the provided source directory are copied, not the directory name and contents.
copy
does not delete files from the destination. To delete files from the destination directory in order to match the source directory, use the sync
command instead.
If only one file is being transferred, use the copyto
command instead.
The --no-traverse option can be used to increase efficiency by stopping rclone from listing the destination. It should be used when copying a small number of files and/or when there is a large number of files on the destination, but not when a large number of files are being copied.
For folders that have been shared with you, rclone ls OneDrive:/path/to/shared_folder and rclone copy OneDrive:/path/to/shared_folder /dest/dir/path will work normally even though the shared folder does not appear when listing the source directory.
If rclone remains unused for 90 days, the refresh token will expire, leading to issues with authorization. This can be easily resolved by executing the rclone config reconnect remote:
command, which generates a fresh token and refresh token.
It's important to note that OneDrive is case insensitive, which prohibits the coexistence of files such as "Hello.doc" and "hello.doc". Certain characters are prohibited from being in OneDrive filenames and are commonly encountered on non-Windows platforms. Rclone addresses this by converting these filenames to their visually equivalent Unicode alternatives.
The largest allowed file size is 250 GiB for both OneDrive Personal and OneDrive for Business (Updated 13 Jan 2021).
The entire path, including the file name, must contain fewer than 400 characters for OneDrive, OneDrive for Business and SharePoint Online. It is important to know the limitation when encrypting file and folder names with rclone, as the encrypted names are typically longer than the original ones.
OneDrive seems to be OK with at least 50,000 files in a folder, but at 100,000 rclone will get errors listing the directory like couldn’t list files: UnknownError:
.
rclone
is a tool that can be used to upload and download files to a cloud storage (like Microsoft OneDrive, BuckeyeBox) from the command line. It's shipped as a standalone binary, but requires some user configuration before using. In this page, we will provide instructions on how to use rclone
to upload data from Google Drive. For instructions with other cloud storage, check rclone
Online documentation.
Before configuration, please first log into OSC OnDemand and request a Pitzer Lightweight Desktop session. Walltime of 1 hour should be sufficient to finish the configuration.
Once the session is ready, open a terminal. In the terminal, run the command
rclone config
It prompts you with a bunch of questions:
Copy the contents of a source directory from a configured Google Drive remote, GDrive:/src/dir/path
, into a destination directory in your OSC session, /dest/dir/path
, using the code below:
rclone copy GDrive:/src/dir/path /dest/dir/path --progress
Identical files on the source and destination directories are not transferred. Only the contents of the provided source directory are copied, not the directory name and contents.
copy
does not delete files from the destination. To delete files from the destination directory in order to match the source directory, use the sync
command instead.
If only one file is being transferred, use the copyto
command instead.
The --no-traverse option can be used to increase efficiency by stopping rclone from listing the destination. It should be used when copying a small number of files and/or when there is a large number of files on the destination, but not when a large number of files are being copied.
For folders that have been shared with you, rclone ls GDrive:/path/to/shared_folder and rclone copy GDrive:/path/to/shared_folder /dest/dir/path will work normally even though the shared folder does not appear when listing the source directory.
If rclone remains unused for 90 days, the refresh token will expire, leading to issues with authorization. This can be easily resolved by executing the rclone config reconnect remote:
command, which generates a fresh token and refresh token.
It's important to note that Google Drive is case insensitive, which prohibits the coexistence of files such as "Hello.doc" and "hello.doc". Certain characters are prohibited from being in Google Drive filenames and are commonly encountered on non-Windows platforms. Rclone addresses this by converting these filenames to their visually equivalent Unicode alternatives.
Address Sanitizer is a tool developed by Google to detect memory access errors such as use-after-free and memory leaks. It is built into GCC versions >= 4.8 and can be used on both C and C++ codes. Address Sanitizer uses runtime instrumentation to track memory allocations, which means you must build your code with Address Sanitizer to take advantage of its features.
There is extensive documentation on the AddressSanitizer Github Wiki.
Memory leaks can increase the total memory used by your program. It's important to properly free memory when it's no longer required. For small programs, losing a few bytes here and there may not seem like a big deal. However, for long running programs that use gigabytes of memory, avoiding memory leaks becomes increasingly vital. If your program fails to free the memory it uses when it no longer needs it, it can run out of memory, resulting in early termination of the application. AddressSanitizer can help detect these memory leaks.
Additionally, AddressSanitizer can detect use-after-free bugs. A use-after-free bug occurs when a program tries to read or write to memory that has already been freed. This is undefined behavior and can lead to corrupted data, incorrect results, and even program crashes.
We need to use gcc to build our code, so we'll load the gcc module:
module load gnu/9.1.0
The "-fsanitize=address" flag is used to tell the compiler to add AddressSanitizer.
Additionally, due to some environmental configuration settings on OSC systems, we must also statically link against Asan. This is done using the "-static-libasan" flag.
It's helpful to compile the code with debug symbols. AddressSanitizer will print line numbers if debug symbols are present. To do this, add the "-g" flag. Additionally, the "-fno-omit-frame-pointer" flag may be helpful if you find that your stack traces do not look quite correct.
In one command, this looks like:
gcc main.c -o main -fsanitize=address -static-libasan -g
Or, splitting into separate compiling and linking stages:
gcc -c main.c -fsanitize=address -g
gcc main.o -o main -fsanitize=address -static-libasan
Notice that both the compilation and linking steps require the "-fsanitize=address" flag, but only the linking step requires "-static-libasan". If your build system is more complex, it might make sense to put these flags in the CFLAGS and LDFLAGS environment variables.
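As a minimal sketch of that approach (assuming a Makefile-based build that honors CFLAGS and LDFLAGS), you could export the flags before building:
# compile steps pick up the ASan instrumentation; the link step adds the static runtime
export CFLAGS="-fsanitize=address -g"
export LDFLAGS="-fsanitize=address -static-libasan"
make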
And that's it!
First, let's look at a program that has no memory leaks (noleak.c):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, const char *argv[])
{
    char *s = malloc(100);
    strcpy(s, "Hello world!");
    printf("string is: %s\n", s);
    free(s);
    return 0;
}
To build this we run:
gcc noleak.c -o noleak -fsanitize=address -static-libasan -g
And, the output we get after running it:
string is: Hello world!
That looks correct! Since there are no memory leaks in this program, AddressSanitizer did not print anything. But, what happens if there are leaks?
Let's look at the above program again, but this time, remove the free call (leak.c):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, const char *argv[])
{
    char *s = malloc(100);
    strcpy(s, "Hello world!");
    printf("string is: %s\n", s);
    return 0;
}
Then, to build:
gcc leak.c -o leak -fsanitize=address -static-libasan
And the output:
string is: Hello world!

=================================================================
==235624==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 100 byte(s) in 1 object(s) allocated from:
    #0 0x4eaaa8 in __interceptor_malloc ../../.././libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x5283dd in main /users/PZS0710/edanish/test/asan/leak.c:6
    #2 0x2b0c29909544 in __libc_start_main (/lib64/libc.so.6+0x22544)

SUMMARY: AddressSanitizer: 100 byte(s) leaked in 1 allocation(s).
This is a leak report from AddressSanitizer. It detected that 100 bytes were allocated but never freed. Looking at the stack trace it provides, we can see that the memory was allocated on line 6 in leak.c.
Say we found the above leak in our code, and we wanted to fix it. We need to add a call to free. But, what if we add it in the wrong spot?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, const char *argv[])
{
    char *s = malloc(100);
    free(s);
    strcpy(s, "Hello world!");
    printf("string is: %s\n", s);
    return 0;
}
The above (uaf.c) is clearly wrong. Albeit a contrived example, the allocated memory, pointed to by "s", is written to and read from after it has been freed.
To Build:
gcc uaf.c -o uaf -fsanitize=address -static-libasan
Building it and running it, we get the following report from AddressSanitizer:
=================================================================
==244157==ERROR: AddressSanitizer: heap-use-after-free on address 0x60b0000000f0 at pc 0x00000047a560 bp 0x7ffcdf0d59f0 sp 0x7ffcdf0d51a0
WRITE of size 13 at 0x60b0000000f0 thread T0
    #0 0x47a55f in __interceptor_memcpy ../../.././libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790
    #1 0x528403 in main /users/PZS0710/edanish/test/asan/uaf.c:8
    #2 0x2b47dd204544 in __libc_start_main (/lib64/libc.so.6+0x22544)
    #3 0x405f5c (/users/PZS0710/edanish/test/asan/uaf+0x405f5c)

0x60b0000000f0 is located 0 bytes inside of 100-byte region [0x60b0000000f0,0x60b000000154)
freed by thread T0 here:
    #0 0x4ea6f7 in __interceptor_free ../../.././libsanitizer/asan/asan_malloc_linux.cc:122
    #1 0x5283ed in main /users/PZS0710/edanish/test/asan/uaf.c:7
    #2 0x2b47dd204544 in __libc_start_main (/lib64/libc.so.6+0x22544)

previously allocated by thread T0 here:
    #0 0x4eaaa8 in __interceptor_malloc ../../.././libsanitizer/asan/asan_malloc_linux.cc:144
    #1 0x5283dd in main /users/PZS0710/edanish/test/asan/uaf.c:6
    #2 0x2b47dd204544 in __libc_start_main (/lib64/libc.so.6+0x22544)

SUMMARY: AddressSanitizer: heap-use-after-free ../../.././libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790 in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x0c167fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c167fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c167fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c167fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c167fff8000: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x0c167fff8010: fd fd fd fd fd fa fa fa fa fa fa fa fa fa[fd]fd
  0x0c167fff8020: fd fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa
  0x0c167fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c167fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Freed heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  Container overflow:    fc
  Array cookie:          ac
  Intra object redzone:  bb
  ASan internal:         fe
  Left alloca redzone:   ca
  Right alloca redzone:  cb
  Shadow gap:            cc
==244157==ABORTING
This is a bit intimidating. It looks like there's a lot going on here, but it's not as bad as it looks. Starting at the top, we see what AddressSanitizer detected. In this case, a "WRITE" of 13 bytes (from our strcpy). Immediately below that, we get a stack trace of where the write occurred. This tells us that the write occurred on line 8 in uaf.c in the function called "main".
Next, AddressSanitizer reports where the memory was located. We can ignore this for now, but depending on your use case, it could be helpful information.
Two key pieces of information follow. AddressSanitizer tells us where the memory was freed (the "freed by thread T0 here" section), giving us another stack trace indicating the memory was freed on line 7. Then, it reports where it was originally allocated ("previously allocated by thread T0 here:"), line 6 in uaf.c.
This is likely enough information to start to debug the issue. The rest of the report provides details about how the memory is laid out, and exactly which addresses were accessed/written to. You probably won't need to pay too much attention to this section. It's a bit "down in the weeds" for most use cases.
AddressSanitizer can also detect heap overflows. Consider the following code (overflow.c):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, const char *argv[])
{
    // whoops, forgot c strings are null-terminated
    // and not enough memory was allocated for the copy
    char *s = malloc(12);
    strcpy(s, "Hello world!");
    printf("string is: %s\n", s);
    free(s);
    return 0;
}
The "Hello world!" string is 13 characters long including the null terminator, but we've only allocated 12 bytes, so the strcpy above will overflow the buffer that was allocated. To build this:
gcc overflow.c -o overflow -fsanitize=address -static-libasan -g -Wall
Then, running it, we get the following report from AddressSanitizer:
==168232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000003c at pc 0x000000423454 bp 0x7ffdd58700e0 sp 0x7ffdd586f890
WRITE of size 13 at 0x60200000003c thread T0
    #0 0x423453 in __interceptor_memcpy /apps_src/gnu/8.4.0/src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:737
    #1 0x5097c9 in main /users/PZS0710/edanish/test/asan/overflow.c:8
    #2 0x2ad93cbd7544 in __libc_start_main (/lib64/libc.so.6+0x22544)
    #3 0x405d7b (/users/PZS0710/edanish/test/asan/overflow+0x405d7b)

0x60200000003c is located 0 bytes to the right of 12-byte region [0x602000000030,0x60200000003c)
allocated by thread T0 here:
    #0 0x4cd5d0 in __interceptor_malloc /apps_src/gnu/8.4.0/src/libsanitizer/asan/asan_malloc_linux.cc:86
    #1 0x5097af in main /users/PZS0710/edanish/test/asan/overflow.c:7
    #2 0x2ad93cbd7544 in __libc_start_main (/lib64/libc.so.6+0x22544)

SUMMARY: AddressSanitizer: heap-buffer-overflow /apps_src/gnu/8.4.0/src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:737 in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x0c047fff7fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c047fff7fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c047fff7fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c047fff7fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c047fff7ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c047fff8000: fa fa 00 fa fa fa 00[04]fa fa fa fa fa fa fa fa
  0x0c047fff8010: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff8020: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff8030: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff8040: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:     fa
  Freed heap region:     fd
  Stack left redzone:    f1
  Stack mid redzone:     f2
  Stack right redzone:   f3
  Stack after return:    f5
  Stack use after scope: f8
  Global redzone:        f9
  Global init order:     f6
  Poisoned by user:      f7
  Container overflow:    fc
  Array cookie:          ac
  Intra object redzone:  bb
  ASan internal:         fe
  Left alloca redzone:   ca
  Right alloca redzone:  cb
==168232==ABORTING
This is similar to the use-after-free report we looked at above. It tells us that a heap buffer overflow occurred, then goes on to report where the write happened and where the memory was originally allocated. Again, the rest of this report describes the layout of the heap and probably isn't too important for your use case.
AddressSanitizer can be used on C++ codes as well. Consider the following (bad_delete.cxx):
#include <iostream>
#include <cstring>

int main(int argc, const char *argv[])
{
    char *cstr = new char[100];
    strcpy(cstr, "Hello World");
    std::cout << cstr << std::endl;
    delete cstr;
    return 0;
}
What's the problem here? The memory pointed to by "cstr" was allocated with new[]. An array allocation must be deleted with the delete[] operator, not "delete".
To build this code, just use g++ instead of gcc:
g++ bad_delete.cxx -o bad_delete -fsanitize=address -static-libasan -g
And running it, we get the following output:
Hello World
=================================================================
==257438==ERROR: AddressSanitizer: alloc-dealloc-mismatch (operator new [] vs operator delete) on 0x60b000000040
    #0 0x4d0a78 in operator delete(void*, unsigned long) /apps_src/gnu/8.4.0/src/libsanitizer/asan/asan_new_delete.cc:151
    #1 0x509ea8 in main /users/PZS0710/edanish/test/asan/bad_delete.cxx:9
    #2 0x2b8232878544 in __libc_start_main (/lib64/libc.so.6+0x22544)
    #3 0x40642b (/users/PZS0710/edanish/test/asan/bad_delete+0x40642b)

0x60b000000040 is located 0 bytes inside of 100-byte region [0x60b000000040,0x60b0000000a4)
allocated by thread T0 here:
    #0 0x4cf840 in operator new[](unsigned long) /apps_src/gnu/8.4.0/src/libsanitizer/asan/asan_new_delete.cc:93
    #1 0x509e5f in main /users/PZS0710/edanish/test/asan/bad_delete.cxx:5
    #2 0x2b8232878544 in __libc_start_main (/lib64/libc.so.6+0x22544)

SUMMARY: AddressSanitizer: alloc-dealloc-mismatch /apps_src/gnu/8.4.0/src/libsanitizer/asan/asan_new_delete.cc:151 in operator delete(void*, unsigned long)
==257438==HINT: if you don't care about these errors you may set ASAN_OPTIONS=alloc_dealloc_mismatch=0
==257438==ABORTING
This is similar to the other AddressSanitizer outputs we've looked at. This time, it tells us there's a mismatch between new and delete. It prints a stack trace for where the delete occurred (line 9) and also a stack trace for where the allocation occurred (line 5).
The documentation states:
This tool is very fast. The average slowdown of the instrumented program is ~2x
AddressSanitizer is much faster than tools that do similar analysis such as valgrind. This allows for usage on HPC codes.
However, if you find that AddressSanitizer is too slow for your code, there are compiler directives that can be used to disable it for specific functions. This way, you can use AddressSanitizer on the colder parts of your code while manually auditing the hot paths.
The compiler directive to skip analyzing functions is:
__attribute__((no_sanitize_address))
It is possible to utilize Cron and the OSCusage command to send regular usage reports via email.
It is easy to create Cron jobs on the Owens and Pitzer clusters at OSC. Cron is a Linux utility which allows the user to schedule a command or script to run automatically at a specific date and time. A cron job is the task that is scheduled.
Shell scripts run as cron jobs are usually used to update and modify files or databases; however, they can perform other tasks as well. For example, a cron job can send an email notification.
In order to use what cron has to offer, here is the crontab command usage and the options that can be used:
Usage:
 crontab [options] file
 crontab [options]
 crontab -n [hostname]

Options:
 -u  define user
 -e  edit user's crontab
 -l  list user's crontab
 -r  delete user's crontab
 -i  prompt before deleting
 -n  set host in cluster to run users' crontabs
 -c  get host in cluster to run users' crontabs
 -s  selinux context
 -x  enable debugging
crontab -l
crontab -e
MIN HOUR DOM MON DOW CMD
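The five time fields are, in order: minute (0-59), hour (0-23), day of month (1-31), month (1-12), and day of week (0-6, with Sunday as 0); CMD is the command to run. For example, a hypothetical entry that mails the output of a command every Monday at 08:00 might look like:
0 8 * * 1 {cmd} | mail -s "weekly report" {your email}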
* * * * * {cmd} | mail -s "title of the email notification" {your email}
12 15 * * * /opt/osc/bin/OSCusage | mail -s "OSC usage on $(date)" {your email} > /path/to/file/for/stdout/and/stderr 2>&1
$ /opt/osc/bin/OSCusage --help
usage: OSCusage.py [-h] [-u USER]
                   [-s {opt,pitzer,glenn,bale,oak,oakley,owens,ruby}] [-A]
                   [-P PROJECT] [-q] [-H] [-r] [-n] [-v]
                   [start_date] [end_date]

positional arguments:
  start_date            start date (default: 2020-04-23)
  end_date              end date (default: 2020-04-24)

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  username to run as. Be sure to include -P or -A. (default: kalattar)
  -s {opt,pitzer,glenn,bale,oak,oakley,owens,ruby}, --system {opt,pitzer,glenn,bale,oak,oakley,owens,ruby}
  -A                    Show all
  -P PROJECT, --project PROJECT
                        project to query (default: PZS0715)
  -q                    show user data
  -H                    show hours
  -r                    show raw
  -n                    show job ID
  -v                    do not summarize
OSCusage 2018-01-24
OSCusage 2018-01-24 2018-01-25
ps aux | grep crontab
kill {PID}
crontab -e
It is now possible to run Docker and Apptainer/Singularity containers on the Owens and Pitzer clusters at OSC. Single-node jobs are currently supported, including GPU jobs; MPI jobs are planned for the future.
From the Docker website: "A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings."
As of June 21st, 2022, Singularity is replaced with Apptainer, which is just a renamed open-source project. For more information visit the Apptainer/Singularity page
This document will describe how to run Docker and Apptainer/Singularity containers on Owens and Pitzer. You can use containers from Docker Hub, Sylabs Cloud, or any other source. As examples we will use hello-world
from Singularity Hub and ubuntu
from Docker Hub.
If you encounter any errors, check out Known Issues on using Apptainer/Singularity at OSC. If the issue cannot be resolved, please contact OSC Help.
The most up-to-date help on Apptainer/Singularity comes from the command itself.
apptainer help
User guides and examples can be found in Apptainer documents.
No setup is required. You can use Apptainer/Singularity directly on all clusters.
An Apptainer/Singularity container is a single file with a .sif
extension.
You can simply download ("pull") a container from a hub. Popular hubs are Docker Hub and Singularity Hub. You can go there and search whether they have a container that meets your needs. Docker Hub has more containers and may be more up to date, but it serves a much wider community than just HPC. Singularity Hub is for HPC, but the number of available containers is smaller. Additionally, there are domain and vendor repositories, such as biocontainers and NVIDIA HPC containers, that may have relevant containers.
Pull from the 7.2.0 branch of the gcc repository on Docker Hub. The 7.2.0 is called a tag.
apptainer pull docker://gcc:7.2.0
Filename: gcc_7.2.0.sif
Pull an Ubuntu container from Docker Hub.
apptainer pull docker://ubuntu:18.04
Filename: ubuntu_18.04.sif
Pull the singularityhub/hello-world
container from the Singularity hub. Since no tag is specified it pulls from the master branch of the repository.
apptainer pull shub://singularityhub/hello-world
Filename: hello-world_latest.sif
Downloading containers from the hubs is not the only way to get one. You can, for example, get a copy from a colleague's computer or directory. If you would like to create your own container, you can start from the user guide below. If you have any questions, please contact OSC Help.
There are four ways to run a container under Apptainer/Singularity.
You can do this either in a batch job or on a login node.
We note that the operating system on Owens is Red Hat:
[owens-login01]$ cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.5 (Maipo)"
ID="rhel"
[..more..]
In the examples below we will often check the operating system to show that we are really inside a container.
If you simply run the container image it will execute the container’s runscript.
Example: Run singularityhub/hello-world
Note that this container returns you to your native OS after you run it.
[owens-login01]$ ./hello-world_latest.sif
Tacotacotaco
The Apptainer “run” sub-command does the same thing as running a container directly as described above. That is, it executes the container’s runscript.
Example: Run a container from a local file
[owens-login01]$ apptainer run hello-world_latest.sif
Tacotacotaco
Example: Run a container from a hub without explicitly downloading it
[owens-login01]$ apptainer run shub://singularityhub/hello-world
INFO: Downloading shub image
Progress |===================================| 100.0%
Tacotacotaco
The Apptainer “exec” sub-command lets you execute an arbitrary command within your container instead of just the runscript.
Example: Find out what operating system the singularityhub/hello-world
container uses
[owens-login01]$ apptainer exec hello-world_latest.sif cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
[..more..]
The Apptainer “shell” sub-command invokes an interactive shell within a container.
Example: Run an Ubuntu shell. Note the “Apptainer” prompt within the shell.
[owens-login01 ~]$ apptainer shell ubuntu_18.04.sif
Singularity ubuntu_18.04.sif:~> cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04 LTS (Bionic Beaver)"
ID=ubuntu
[.. more ..]
Singularity ubuntu_18.04.sif:~> exit
exit
When you use a container you run within the container’s environment. The directories available to you by default from the host environment are
/fs/ess
/tmp
You can review our Available File Systems page for more details about our file system access policy.
If you run the container within a job you will have the usual access to the $PFSDIR
environment variable by adding the node attribute "pfsdir
" to the job request (--gres=pfsdir
). You can access most of our file systems from a container without any special treatment.
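As a minimal sketch (the project account and the use of the Ubuntu container pulled earlier are just placeholders), a batch script that requests local scratch and then uses it from a container might look like:
#!/bin/bash
#SBATCH --account=PZS0001
#SBATCH --ntasks=1
#SBATCH --gres=pfsdir
# $PFSDIR from the host environment is visible inside the container as well
apptainer exec ubuntu_18.04.sif ls $PFSDIR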
If you have a GPU-enabled container you can easily run it on Owens or Pitzer just by adding the --nv
flag to the apptainer exec or run command. The example below comes from the "exec" command section of Apptainer User Guide. It runs a TensorFlow example using a GPU on Owens. (Output has been omitted from the example for brevity.)
[owens-login01]$ sinteractive -n 28 -g 1
...
[o0756]$ git clone https://github.com/tensorflow/models.git
[o0756]$ apptainer exec --nv docker://tensorflow/tensorflow:latest-gpu \
    python ./models/tutorials/image/mnist/convolutional.py
In some cases it may be necessary to bind the CUDA_HOME path and add $CUDA_HOME/lib64
to the shared library search path:
[owens-login01]$ sinteractive -n 28 -g 1
...
[o0756]$ module load cuda
[o0756]$ export APPTAINER_BINDPATH=$CUDA_HOME
[o0756]$ export APPTAINERENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
[o0756]$ apptainer exec --nv my_container mycmd
It is possible to build or create a custom container, but it will require additional setup. Please contact OSC support for more details.
JupyterLab stores the main build of JupyterLab, with associated data including extensions, in the Application Directory. The default Application Directory is the JupyterLab installation directory, which is read-only for OSC users. Unlike Jupyter Notebook, JupyterLab cannot accommodate multiple paths for extension management. Therefore we set the user's home directory as the Application Directory to allow users to manage extensions.
After launching a JupyterLab session, open a notebook and run
!jupyter lab path
Check that your home directory is set as the Application Directory:
Application directory: /users/PXX1234/user/.jupyter/lab/3.0 User Settings directory: /users/PXX1234/user/.jupyter/lab/user-settings Workspaces directory: /users/PXX1234/user/ondemand/data/sys/dashboard/batch_connect/dev/bc_osc_jupyter/output/f2a4f918-b18c-4d2a-88bc-4f4e1bdfe03e
If the home directory is NOT set, try removing the corresponding directory, e.g. if you are using JupyterLab 2.2, remove the entire directory $HOME/.jupyter/lab/2.2
and re-launch JupyterLab 2.2.
If this is the first time you are using extensions, or you are using extensions that were installed with a different Jupyter version or on a different cluster, you will need to run
!jupyter lab build
to initialize the JupyterLab application.
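If you want to check which extensions are currently installed without opening the Extension Manager, you can also list them from a notebook cell, for example:
!jupyter labextension list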
To manage and install extensions, simply click Extension Manager icon at the side bar:
If you plan on using GPUs in tensorflow or pytorch see HOWTO: Use GPU with Tensorflow and PyTorch
This is an example of utilizing a GPU to improve the performance of Python computations. We will make use of the Numba Python library. Numba provides numerous tools to improve the performance of your Python code, including GPU support.
This tutorial is only a high-level overview of the basics of running Python on a GPU. For more detailed documentation and instructions, refer to the official Numba documentation: https://numba.pydata.org/numba-doc/latest/cuda/index.html
To begin, you first need to create a new conda environment or use an already existing one. See HOWTO: Create Python Environment for more details.
Once you have an environment created and activated run the following command to install the latest version of Numba into the environment.
conda install numba
conda install cudatoolkit
You can specify a specific version by replacing numba
with numba={version}
. In this tutorial we will be using numba version 0.57.0 and cudatoolkit version 11.8.0.
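For example, to pin the versions used in this tutorial:
conda install numba=0.57.0
conda install cudatoolkit=11.8.0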
Now we can use numba to write a kernel function. (a kernel function is a GPU function that is called from CPU code).
To invoke a kernel, you need to include the @cuda.jit
decorator above your gpu function as such:
@cuda.jit
def my_function(array):
    # function code
Next, to invoke a kernel you must first specify the thread hierarchy: the number of blocks per grid and threads per block you want on your GPU:
threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
For more details on thread hierarchy see: https://numba.pydata.org/numba-doc/latest/cuda/kernels.html
Now you can call your kernel as such:
my_function[blockspergrid, threadsperblock](an_array)
Kernel instantiation is done by taking the compiled kernel function (here my_function
) and indexing it with a tuple of integers.
Run the kernel, by passing it the input array (and any separate output arrays if necessary). By default, running a kernel is synchronous: the function returns when the kernel has finished executing and the data is synchronized back.
Note: Kernels cannot explicitly return a value; as a result, all results should be written to a reference. For example, you can write your output data to an array which was passed in as an argument (for scalars you can use a one-element array).
Before we can use a kernel on an array of data we need to transfer the data from host memory to gpu memory.
This can be done by (assume arr
is already created and filled with the data):
d_arr = cuda.to_device(arr)
d_arr
is a reference to the data stored in the gpu memory.
Now, to get the GPU data back into host memory, we can run (assume gpu_arr
has already been initialized as an empty array):
d_arr.copy_to_host(gpu_arr)
from numba import cuda
import numpy as np
from timeit import default_timer as timer

# gpu kernel function
@cuda.jit
def increment_by_one_gpu(an_array):
    # get the absolute position of the current thread in our 1 dimensional grid
    pos = cuda.grid(1)
    # increment the entry in the array based on its thread position
    if pos < an_array.size:
        an_array[pos] += 1

# cpu function
def increment_by_one_nogpu(an_array):
    # increment each position using standard iterative approach
    pos = 0
    while pos < an_array.size:
        an_array[pos] += 1
        pos += 1

if __name__ == "__main__":
    # create numpy array of 10 million 1s
    n = 10_000_000
    arr = np.ones(n)

    # copy the array to gpu memory
    d_arr = cuda.to_device(arr)

    # print initial array values
    print("GPU Array: ", arr)
    print("NON-GPU Array: ", arr)

    # specify threads
    threadsperblock = 32
    blockspergrid = (len(arr) + (threadsperblock - 1)) // threadsperblock

    # start timer
    start = timer()
    # run gpu kernel
    increment_by_one_gpu[blockspergrid, threadsperblock](d_arr)
    # get time elapsed for gpu
    dt = timer() - start

    print("Time With GPU: ", dt)

    # restart timer
    start = timer()
    # run cpu function
    increment_by_one_nogpu(arr)
    # get time elapsed for cpu
    dt = timer() - start

    print("Time Without GPU: ", dt)

    # create empty array
    gpu_arr = np.empty(shape=d_arr.shape, dtype=d_arr.dtype)
    # move data back to host memory
    d_arr.copy_to_host(gpu_arr)

    print("GPU Array: ", gpu_arr)
    print("NON-GPU Array: ", arr)
Now we need to write a job script to submit the python code.
#!/bin/bash
#SBATCH --account <project-id>
#SBATCH --job-name Python_ExampleJob
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --gpus-per-node=1

module load miniconda3
module list

source activate gpu_env

python gpu_test.py

conda deactivate
Running the above job returns the following output:
GPU Array:  [1. 1. 1. ... 1. 1. 1.]
NON-GPU Array:  [1. 1. 1. ... 1. 1. 1.]
Time With GPU:  0.34201269410550594
Time Without GPU:  2.2052815910428762
GPU Array:  [2. 2. 2. ... 2. 2. 2.]
NON-GPU Array:  [2. 2. 2. ... 2. 2. 2.]
As we can see, running the function on a GPU resulted in a significant speed increase.
See HOWTO: Use a Conda/Virtual Environment With Jupyter for more information on how to set up Jupyter kernels.
Once you have your Jupyter kernel created, activate your Python environment on the command line (source activate ENV
).
Install numba and cudatoolkit the same as was done above:
conda install numba
conda install cudatoolkit
Now you should have numba installed into your jupyter kernel.
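As an optional sanity check, Numba can report whether it detects a CUDA-capable GPU. For example, from a terminal with the environment activated (the grep filter is just for convenience):
numba -s | grep -i cuda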
See Python page for more information on how to access your jupyter notebook on OnDemand.
Make sure you select a node with a GPU before launching your Jupyter app:
If you are using TensorFlow, PyTorch, or other machine learning frameworks, you may also want to consider using Horovod. Horovod will take single-GPU training scripts and scale them to train across many GPUs in parallel.
Globus is a cloud-based service designed to let users move, share, and discover research data via a single interface, regardless of its location or number of files or size.
Globus was developed and is maintained at the University of Chicago and is used extensively at supercomputer centers and major research facilities.
Globus is available as a free service that any user can access. More on how Globus works can be found on the Globus "How It Works" page.
Globus can be used to transfer data between source and destination systems including OSC storage, cloud storage, storage at other HPC centers with Globus support, laptops, and desktops.
If you would like to transfer data between OSC storage and your own laptop/desktop that does not yet have Globus Connect Personal installed, please go to 'Globus Connect Personal Installation' first.
Log into https://www.globus.org/
When prompted to login, select "Ohio Supercomputer Center (OSC)" from the drop-down list of organizations and then click Continue. This will redirect you to the Ohio Supercomputer Center login page where you can log in with your OSC username and password.
Click 'File Manager' on the left of the page. Switch to 'two panel' view by clicking the icons next to 'Panels'. One panel will act as the source while the other is the destination.
Click 'Collection' to search the collection of your data.
For OSC storage, use 'OSC endpoints' information to locate the collection.
Select the file(s) or directory that you would like to transfer between collections.
Click the "Transfer or Sync to..." button in the center control panel.
Click the blue "Start" button above the file selector.
A ribbon should appear that recognizes the transfer request. You can hit View Details to take you to the Activity tab in the command menu.
Click Activity in the command menu on the left of the page to go to the Activity page.
A green checkmark will appear at the top of the page with a Transfer Complete Message.
The email you have set up with your Globus profile will receive a confirmation receipt of the request.
The files will now be accessible in the transfer location.
Globus Installation on Windows
Launch the application installer.
If you have local administrator permissions on your machine, and will be the only user, click on 'Install'.
After installation has completed GCP will launch. Click on 'Log In' in order to authenticate with Globus and begin the Collection Setup process.
Grant the required consents to GCP Setup.
Enter the details for your GCP Collection.
Exit the Setup process or open the Globus web app to view collection details or move data to or from your collection.
At the end of the installation, you will see an icon in the menu bar at the bottom of your screen, indicating that Globus Connect Personal is running and your new collection is ready to be used.
Storage location | Endpoint |
---|---|
OSC's home directory | OSC $HOME |
OSC's project directory | OSC /fs/project |
OSC's scratch directory | OSC /fs/scratch |
OSC's ess storage | OSC /fs/ess |
AWS S3 storage | OSC S3 |
OSC high assurance | OSC /fs/ess/ High Assurance for project storage; OSC /fs/scratch/ High Assurance for scratch storage |
With Globus, you can easily share research data with your collaborators. You don’t need to create accounts on the server(s) where your data is stored. You can share data with anyone using their identity or their email address.
To share data, you’ll create a guest collection and grant your collaborators access as described in the instructions below. If you like, you can designate other Globus users as "access managers" for the guest collection, allowing them to grant or revoke access privileges for other Globus users.
Log into Globus and navigate to the File Manager.
Select the collection that has the files/folders you wish to share and, if necessary, activate the collection.
Highlight the folder that you would like to share and Click Share in the right command pane.
If Share is not available, contact the endpoint’s administrator or refer to the Globus Connect Server Installation Guide for instructions on enabling sharing. If you’re using a Globus Connect Personal endpoint and you’re a Globus Plus user, enable sharing by opening the Preferences for Globus Connect Personal, clicking the Access tab, and checking the Sharable box.
Provide a name for the guest collection, and click Create Share. If this is the first time you are accessing the collection, you may need to authenticate and consent to allow Globus services to manage your collections on your behalf.
When your collection is created, you’ll be taken to the Sharing tab, where you can set permissions. The starting permissions give read and write access (and the Administrator role) to the person who created the collection.
Click the Add Permissions button or icon to share access with others. You can add permissions for an individual user, for a group, or for all logged-in users. In the Identity/E-mail field, type a person’s name or username (if user is selected) or a group name (if group is selected) and press Enter. Globus will display matching identities. Pick from the list. If the user hasn’t used Globus before or you only have an email address, enter the email address and click Add.
You can add permissions to subfolders by entering a path in the Path field.
After receiving the email notification, your colleague can click on the link to log into Globus and access the guest collection.
You can allow others to manage the permissions for a collection you create. Use the Roles tab to manage roles for other users. You can assign roles to individual users or to groups. The default is for the person who created the collection to have the Administrator role.
The Access Manager role grants the ability to manage permissions for a collection. (Users with this role automatically have read/write access for the collection.)
When a role is assigned to a group, all members of the group have the assigned role.
Sometimes, a group may need to share data uploaded by several OSC users with external entities using Globus. To simplify this process OSC can help set up a service account that owns the data and create a Globus share that makes the data accessible to individuals. Contact OSC Help for this service.
Further Reading
Before creating a new collection, please set up an S3 bucket and configure the IAM access permissions for that bucket. If you need more information on how to do that, see the AWS S3 documentation and Amazon Web Services S3 Connector pages.
Globus is a cloud-based service designed to let users move, share, and discover research data via a single interface, regardless of its location or number of files or size.
This makes Globus incredibly useful for transferring large files. This service is also able to work alongside OneDrive, making this storage even more accessible. The OneDrive connection to Globus is only available for Ohio State clients with a valid OSU email.
Log into https://www.globus.org/
When prompted to login, select "Ohio Supercomputer Center (OSC)" from the drop-down list of organizations and then click Continue. This will redirect you to the Ohio Supercomputer Center login page where you can log in with your OSC username and password.
Select the File Manager tab on the left hand toolbar. You will be introduced to the file exchange function in the two-panel format.
In the left panel, select the collection that you would like to import the data to. In the right panel, you can simply type "OSU OneDrive" or "OSU OneDrive Student" and the collection will appear. Students will need to use their buckeyemail.osu.edu emails in order to access the student OneDrive.
The first time that you access this collection, you will be prompted for some initial account setup.
Complete the Authentication Request and, if prompted, verify that you wish to grant access to the Collection.
Once opened, the default location will be My Files. Click the "up one folder" icon to see the other locations.
Select the file(s) or directory that you would like to transfer between collections. You can now select the "Transfer or Sync to..." and hit the blue "Start" icon above the file selector.
Click Activity in the command menu on the left of the page to go to the Activity page. You will now be able to monitor the processing of the request and the confirmation receipt will appear here.
To follow a SharePoint site, log into the OSU SharePoint service with your OSU name.# credentials. Next, navigate to the site you would like to connect to via Globus and click the star icon on the site to follow it:
Finally, return to Globus and click the "up one folder" button until you see the "Shared libraries" and the SharePoint site will now be available.
OSC clients who are affiliated with Ohio State can deploy their own endpoint on a server using OSU subscriptions. Please follow the steps below:
SSHing directly to a compute node at OSC - even if that node has been assigned to you in a current batch job - and starting VNC is an "unsafe" thing to do. When your batch job ends (and the node is assigned to other users), stray processes will be left behind and negatively impact other users. However, it is possible to use VNC on compute nodes safely.
The examples below are for Pitzer. If you use other systems, please see this page for supported versions of TurboVNC on our systems.
Step one is to create your VNC server inside a batch job.
The preferred method is to start an interactive job, requesting a GPU node, and then once your job starts, you can start the VNC server.
salloc --nodes=1 --ntasks-per-node=40 --gpus-per-node=1 --gres=vis --constraint=40core srun --pty /bin/bash
This command requests an entire GPU node, and tells the batch system you wish to use the GPUs for visualization. This will ensure that the X11 server can access the GPU for acceleration. In this example, I have not specified a duration, which will then default to 1 hour.
module load virtualgl
module load turbovnc
Then start your VNC server. (The first time you run this command, it may ask you for a password - this is to secure your VNC session from unauthorized connections. Set it to whatever password you desire. We recommend a strong password.)
vncserver
You can change your VNC password at any time using the vncpasswd
command.
The output of the vncserver command is important: it tells you where to point your client to access your desktop. Specifically, we need both the host name (before the :), and the screen (after the :).
New 'X' desktop is p0302.ten.osc.edu:1
Because the compute nodes of our clusters are not directly accessible, you must log in to one of the login nodes and allow your VNC client to "tunnel" through SSH to the compute node. The specific method of doing so may vary depending on your client software.
The port assigned to the vncserver will be needed. It is usually 5900 + <display_number>. e.g.
New 'X' desktop is p0302.ten.osc.edu:1
would use port 5901.
I will be providing the basic command line syntax, which works on Linux and MacOS. You would issue this in a new terminal window on your local machine, creating a new connection to Pitzer.
ssh -L <port>:<node_hostname>.ten.osc.edu:<port> <username>@pitzer.osc.edu
The above command establishes a proper ssh connection for the vnc client to use for tunneling to the node.
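For example, using the node and display from the vncserver output above (replace username with your own OSC username):
ssh -L 5901:p0302.ten.osc.edu:5901 username@pitzer.osc.edu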
Open your VNC client, and connect to localhost:<screen_number>
, which will tunnel to the correct node on Pitzer.
This example uses Chicken of the VNC, a MacOS VNC client. It is a vncserver started on host n0302 with port 5901 and display 1.
The default window that comes up for Chicken requires the host to connect to, the screen (or port) number, and optionally allows you to specify a host to tunnel through via SSH. This screenshot shows a proper configuration for the output of vncserver shown above. Substitute your host, screen, and username as appropriate.
When you click [Connect], you will be prompted for your HPC password (to establish the tunnel, provided you did not input it into the "password" box on this dialog), and then (if you set one), for your VNC password. If your passwords are correct, the desktop will display in your client.
This example shows how to create a SSH tunnel through your ssh client. We will be using Putty in this example, but these steps are applicable to most SSH clients.
First, make sure you have x11 forwarding enabled in your SSH client.
Next, open up the port forwarding/tunnels settings and enter the hostname and port you got earlier in the destination field. You will need to add 5900 to the port number when specifying it here. Some clients may have separate boxes for the destination hostname and port.
For source port, pick a number between 11-99 and add 5900 to it. This number between 11-99 will be the port you connect to in your VNC client.
Make sure to add the forwarded port, and save the changes you've made before exiting the configurations window.
Now start a SSH session to the respective cluster your vncserver is running on. The port forwarding will automatically happen in the background. Closing this SSH session will close the forwarded port; leave the session open as long as you want to use VNC.
Now start a VNC client. TurboVNC has been tested with our systems and is recommended. Enter localhost:[port], replacing [port] with the port between 11-99 you chose earlier.
If you've set up a VNC password you will be prompted for it now. A desktop display should pop up now if everything is configured correctly.
Occasionally you may make a mistake and start a VNC server on a login node or somewhere else you did not want to. In this case it is important to know how to properly kill your VNC server so no processes are left behind.
The command syntax to kill a VNC session is:
vncserver -kill :[screen]
In the example above, screen would be 1.
You need to make sure you are on the same node you spawned the VNC server on when running this command.
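For example, to kill the session from the output shown earlier (display :1), run the following on that same node:
vncserver -kill :1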
The IPython kernel for a Conda/virtual environment must be installed on Jupyter prior to use. This tutorial will walk you though the installation and setup procedure.
Load the preferred version of Python or Miniconda3 using the command:
module load python
or
module load miniconda3
Replace "python" or "miniconda3" with the appropriate version, which could be the version you used to create your Conda/venv environment. You can check available Python versions by using the command:
module spider python
Run one of the following commands based on how your Conda/virtual environment was created. Replace "MYENV" with the name of your Conda environment or the path to the environment.
If the Conda environment was created via conda create -n MYENV
command, use the following command:
~support/classroom/tools/create_jupyter_kernel conda MYENV
If the Conda environment was created via conda create -p /path/to/MYENV
command, use the following command:
~support/classroom/tools/create_jupyter_kernel conda /path/to/MYENV
If the Python virtual environment was created via python3 -m venv /path/to/MYENV
command, use the following command
~support/classroom/tools/create_jupyter_kernel venv /path/to/MYENV
The resulting kernel name appears as "MYENV [/path/to/MYENV]" in the Jupyter kernel list. You can change the display name by appending a preferred name in the above commands. For example:
~support/classroom/tools/create_jupyter_kernel conda MYENV "My_Research_Project"
This results in the kernel name "My_Research_Project" in the Jupyter kernel list.
You should now be able to access the new Jupyter kernel on OnDemand in a jupyter session. See Usage section of Python page for more details on accessing the Jupyter app.
According to the JupyterLab page, the debugger requires ipykernel >= 6. Please create your own kernel with conda using the following commands:
module load miniconda
conda create -n jupyterlab-debugger -c conda-forge "ipykernel>=6" xeus-python
~support/classroom/tools/create_jupyter_kernel conda jupyterlab-debugger
You should see a kernelspec 'conda_jupyterlab-debugger' created in your home directory. Once the debugger kernel is created, you can use it:
1. go to OnDemand
2. request a JupyterLab 3 app
3. open a notebook with the debugger kernel.
4. you can enable debug mode at the upper-right corner of the notebook
If the create_jupyter_kernel
script does not work for you, try the following steps to manually install kernel:
# change to the proper version of python
module load python
# replace with the name of conda env
MYENV=useful-project-name
# create the conda environment
conda create -n $MYENV
# Activate your conda/virtual environment
## For Conda environment
source activate $MYENV
# ONLY if you created venv instead of conda env
## For Python Virtual environment
source /path/to/$MYENV/bin/activate
# Install Jupyter kernel
python -m ipykernel install --user --name $MYENV --display-name "Python ($MYENV)"
If the environment is rebuilt or renamed, users may want to erase any custom Jupyter kernel installations.
rm -rf ~/.local/share/jupyter/kernels/${MYENV}
Many software packages require a license. These licenses are usually made available via a license server, which allows software to check out necessary licenses. In this document external refers to a license server that is not hosted inside OSC.
If you have such a software license server set up using a license manager, such as FlexNet, this guide will instruct you on the necessary steps to connect to and use the licenses at OSC.
Users who wish to host their software licenses inside OSC should consult OSC Help.
Broadly speaking, there are two different ways in which the external license server's network may be configured. These differ by whether the license server is directly externally reachable or if it sits behind a private internal network with a port forwarding firewall.
If your license server sits behind a private internal network with a port forwarding firewall you will need to take additional steps to allow the connection from our systems to the license server to be properly routed.
If you are unsure about which category your situation falls under contact your local IT administrator.
In order for connections from OSC to reach the license server, the license server's firewall will need to be configured. All outbound network traffic from all of OSC's compute nodes are routed through a network address translation host (NAT).
The license server should be configured to allow connections from nat.osc.edu including the following IP addresses to the SERVER:PORT where the license server is running:
A typical FlexNet-based license server uses two ports: one is the server port and the other is the daemon port, and the firewall should be configured for both ports. A typical license file looks, for example, like:
SERVER licXXX.osc.edu 0050XXXXX5C 28000 VENDOR {license name} port=28001
In this example, "28000" is the server port, and "28001" is the daemon port. The daemon port is not mandatory if you use it on a local network, however it becomes necessary if you want to use it outside of your local network. So, please make sure you declared the daemon port in the license file and configured the firewall for the port.
The firewall settings should be verified by attempting to connect to the license server from the compute environment using telnet.
Get on to a compute node by requesting a short, small, interactive job and test the connection using telnet:
telnet <License Server IP Address> <Port#>
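For example, using the server and ports from the sample license file shown above, a successful connection would report something like "Connected to licXXX.osc.edu":
telnet licXXX.osc.edu 28000
telnet licXXX.osc.edu 28001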
It is also recommended to restrict accessibility using the remote license server's access control mechanisms, such as limiting access to particular usernames in the options.dat file used with FlexNet-based license servers.
For FlexNet tools, you can add the following line to your options.dat file, one for each user.
INCLUDEALL USER <OSC username>
If you have a large number of users to give access to, you may want to define a group using GROUP
within the options.dat file and give access to that whole group using INCLUDEALL GROUP <group name>
.
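For example (the group name and usernames here are placeholders), the corresponding options.dat entries might look like:
GROUP osc_team alice bob carol
INCLUDEALL GROUP osc_team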
Users who use other license managers should consult the license manager's documentation.
The software must now be told to contact the license server for its licenses. The exact method of doing so can vary between software packages, but most use an environment variable that specifies the license server IP address and port number to use.
For example, LS-DYNA uses the environment variables LSTC_LICENSE
and LSTC_LICENSE_SERVER
to know where to look for the license. The following lines would be added to a job script to tell LS-DYNA to use licenses from port 2345 on server 1.2.3.4, if you use bash:
export LSTC_LICENSE=network
export LSTC_LICENSE_SERVER=2345@1.2.3.4
or, if you use csh:
setenv LSTC_LICENSE network
setenv LSTC_LICENSE_SERVER 2345@1.2.3.4
If the license server is behind a port forwarding firewall, and has a different IP address from the IP address of the firewall, additional steps must be taken to allow connections to be properly routed within the license server's internal network.
The following outlines details particular to a specific software package.
Uses the following environment variables:
ANSYSLI_SERVERS=<port>@<IP>
ANSYSLMD_LICENSE_FILE=<port>@<IP>
If your license server is behind a port forwarding firewall and you cannot use a fully qualified domain name in the license file, you can add ANSYSLI_EXTERNAL_IP={external IP address} to ansyslmd.ini on the license server.
This document shows you how to set soft limits using the ulimit
command.
The ulimit
command sets or reports user process resource limits. The default limits are defined and applied when a new user is added to the system. Limits are categorized as either soft or hard. With the ulimit
command, you can change your soft limits for the current shell environment, up to the maximum set by the hard limits. You must have root user authority to change resource hard limits.
ulimit [-HSTabcdefilmnpqrstuvx [Limit]]
flags | description |
---|---|
-H | Specifies that the hard limit for the given resource is set. If you have root user authority, you can increase the hard limit. Anyone can decrease it |
-S | Specifies that the soft limit for the given resource is set. A soft limit can be increased up to the value of the hard limit. If neither the -H nor -S flags are specified, the limit applies to both |
-a | Lists all of the current resource limits |
-b | The maximum socket buffer size |
-c | The maximum size of core files created |
-d | The maximum size of a process's data segment |
-e | The maximum scheduling priority ("nice") |
-f | The maximum size of files written by the shell and its children |
-i | The maximum number of pending signals |
-l | The maximum size that may be locked into memory |
-m | The maximum resident set size (many systems do not honor this limit) |
-n | The maximum number of open file descriptors (most systems do not allow this value to be set) |
-p | The pipe size in 512-byte blocks (this may not be set) |
-q | The maximum number of bytes in POSIX message queues |
-r | The maximum real-time scheduling priority |
-s | The maximum stack size |
-t | The maximum amount of cpu time in seconds |
-u | The maximum number of processes available to a single user |
-v | The maximum amount of virtual memory available to the shell and, on some systems, to its children |
-x | The maximum number of file locks |
-T | The maximum number of threads |
The limit for a specified resource is set when the Limit parameter is specified. The value of the Limit parameter can be a number in the unit specified with each resource, or the value "unlimited." For example, to set the file size limit to 51,200 bytes, use:
ulimit -f 100
To set the size of core dumps to unlimited, use:
ulimit -c unlimited
The ulimit
command affects the current shell environment. When an MPI program is started, it does not spawn in the current shell. You have to use srun to start a wrapper script that sets the limit if you want to set the limit for each process. Below is how you set the limit for each shell (we use ulimit -c unlimited
to allow unlimited core dumps, as an example):
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --time=5:00:00
#SBATCH ...
...
srun ./test1
...
#!/bin/bash
ulimit -c unlimited
.....(your own program)
sbatch myjob
The data transfer speed between OSC and another network can be tested.
Connect to a data mover host at OSC and note the hostname.
$ ssh sftp.osc.edu    # login
$ hostname
datamover02.hpc.osc.edu
# the hostname may also be datamover01.hpc.osc.edu
From there, an iperf3 server process can be started. Note the port used.
iperf3 -s -p 5201
Server listening on 5201
# the above port number could be different
Next, on your local machine, try to connect to the iperf3 server process
iperf3 -c datamover02.hpc.osc.edu -p 5201
If it connects successfully, it will start testing and then finish with a summary:
Connecting to host datamover02.hpc.osc.edu, port 5201
...
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  7]   0.00-10.00  sec  13.8 MBytes  11.6 Mbits/sec                  sender
[  7]   0.00-10.00  sec  13.8 MBytes  11.6 Mbits/sec                  receiver
For the data download speed, you can also test the network performance in the reverse direction, with the server on datamover02 sending data and the client on your computer receiving data:
iperf3 -c datamover02.hpc.osc.edu -p 5201 -R
Docker can be used if iperf3 is not installed on the client machine, but Docker is.
$ docker run --rm -it networkstatic/iperf3 -c datamover02.hpc.osc.edu -p 5201