HOWTO: Using MLflow to track ML training and models

MLflow is a tool for managing the training and deployment of machine learning models.

At OSC, MLflow is available to help researchers and developers track training runs and manage models. This guide explains how to access MLflow at OSC, run example notebooks, and visualize your experiment data using the MLflow UI. MLflow is available on OSC clusters as part of the PyTorch module, or you can install it into your own virtual environment with a package manager such as pip, conda, or uv.
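
As a quick check that MLflow is importable from the Python environment you have active (for example, after loading the PyTorch module or activating your own virtual environment), a minimal sketch along these lines may help. It uses only the Python standard library; the helper name installed_version is hypothetical and is not part of MLflow.

```python
import importlib.util
from importlib import metadata


def installed_version(package="mlflow"):
    """Return the installed version of `package`, or None if it cannot be imported.

    `installed_version` is a hypothetical helper written for this guide.
    """
    # find_spec returns None when a top-level package is not on the import path.
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but distributed under a different distribution name.
        return "unknown"


print("MLflow version:", installed_version())
```

If this prints None, MLflow is not visible to the interpreter you are running, which usually means the module or virtual environment has not been activated in the current session.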

We provide a repository with marimo notebooks demonstrating how to integrate MLflow into your training and inference codes on UCR.

To run them at OSC:

1. Clone the repository.
2. Select the marimo OnDemand app from the list of apps.
3. In the field labeled Working Directory or Notebook, specify the path to one of the notebooks in the repo.
4. Select the Sandbox environment checkbox.
The first time you run a notebook in sandbox mode, you may be asked to install missing package dependencies. After the packages have been installed, restart the kernel or start a new marimo OnDemand job.

Running the code in the notebooks creates an mlruns/ subdirectory in your local copy of the repository. This directory contains all of the logged training run data and any registered models. As described in the notebooks, this tracking data can be accessed through the MLflow Python API. You can also view the collected data graphically in the MLflow UI, which is available via the MLflow OnDemand app. To view the data generated by these notebooks, set the Tracking URI directory to your local copy of the repository.
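
To illustrate the Python API route, here is a minimal sketch that logs one small run into an mlruns/ directory and then reads the logged runs back. It is not taken from the notebooks: the function names, the experiment name "howto-demo", and the parameter and metric values are all placeholders chosen for illustration, and it assumes the file-based mlruns/ store described above. The MLflow calls used (set_tracking_uri, set_experiment, start_run, log_param, log_metric, search_runs) are part of the public MLflow API, but the search_all_experiments argument may not exist in older releases.

```python
from pathlib import Path


def tracking_uri_for(repo_dir):
    """Build a file: URI for the mlruns/ directory inside repo_dir (hypothetical helper)."""
    return (Path(repo_dir).expanduser().resolve() / "mlruns").as_uri()


def log_demo_run(repo_dir):
    """Log one tiny run with placeholder values and return its run ID."""
    # Imported here so tracking_uri_for stays usable without MLflow loaded.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri_for(repo_dir))
    mlflow.set_experiment("howto-demo")  # placeholder experiment name
    with mlflow.start_run(run_name="demo") as run:
        mlflow.log_param("learning_rate", 0.01)  # placeholder hyperparameter
        # Placeholder loss values standing in for a real training loop.
        for step, loss in enumerate([0.9, 0.5, 0.3]):
            mlflow.log_metric("loss", loss, step=step)
    return run.info.run_id


def load_runs(repo_dir):
    """Return the logged runs as a pandas DataFrame (requires pandas)."""
    import mlflow

    mlflow.set_tracking_uri(tracking_uri_for(repo_dir))
    # One row per run; params and metrics appear as "params.*" and "metrics.*" columns.
    return mlflow.search_runs(search_all_experiments=True)
```

For example, calling log_demo_run(".") from the root of your copy of the repository and then load_runs(".") should return a table that includes the demo run alongside any runs the notebooks have already logged.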

For more information about how to use MLflow, see the MLflow documentation.

Note that MLflow offers several options for deploying MLflow servers, as described in the MLflow docs. No servers have been deployed at OSC; if you need one for your research, please submit a ticket.