vLLM

vLLM is an open-source inference server for large language models (LLMs).

vLLM is in an early user testing phase - not all functionality is guaranteed to work. Contact oschelp@osc.edu with any questions.
vLLM is not currently suitable for use with protected or sensitive data - do not use it if you need the protected data service. See https://www.osc.edu/resources/protected_data_service for more details.

Availability and Restrictions

Versions

vLLM is available on OSC Clusters. The versions currently available at OSC are:

Version   Cardinal   Ascend
0.12.0    X          X

You can use module spider vllm to view available modules for a given machine.

Access

All OSC users may use vLLM, but individual models may have their own license restrictions.

Publisher/Vendor/Repository and License Type

Apache-2.0 license: https://github.com/vllm-project/vllm?tab=Apache-2.0-1-ov-file#readme

Prerequisites

  • GPU Usage: vLLM should be run with a GPU for best performance. 

Due to the need for GPUs, we recommend not running vLLM on login nodes or OnDemand lightweight desktops.
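
For example, you might request an interactive GPU session with Slurm before loading the module (the project account and time below are placeholder values to replace with your own):

salloc --account=PAS1234 --gpus-per-node=1 --time=1:00:00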

Running vLLM Overview

1. Load module

2. Start vLLM


Commands

vLLM is available through the module system and must be loaded prior to running any of the commands below:

Loading the vLLM module:
module load vllm/0.12.0
Starting vLLM:
vllm_start <model_name>

Model names follow the HuggingFace format, e.g., "meta-llama/Llama-3.2-3B".
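
For example, to start the service with the model above:

vllm_start meta-llama/Llama-3.2-3B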

If the model is available and the service starts successfully, this will print out a port number for the vLLM service. 

VLLM_API_PORT: 61234

This port number is only an example - your port number will differ from the one above.

The VLLM_API_PORT environment variable holds this port and is used to define the API endpoint.
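
As a quick check that the service is running, you can list the served models at the OpenAI-compatible endpoint from the same session:

curl http://localhost:${VLLM_API_PORT}/v1/models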

Stopping vLLM:

vLLM can be manually stopped with the following command:

vllm_stop

The service is also stopped when the module is unloaded.  If you want to stop the service, you can simply unload the vllm module:

module unload vllm

Model Management

By default, vLLM uses a central, read-only model repository defined by VLLM_CACHE_DIR, which offers a small number of curated, well-performing models.

However, you can use custom models and manage your own set of models by setting VLLM_CACHE_DIR to a path you have write access to, such as a project directory or scratch space.  This must be done prior to starting vLLM.

export VLLM_CACHE_DIR=/fs/project/ABC1234/vllm/models
vllm_start <model_name>
Installing a model:

Upon running vllm_start <model_name>, the target model is automatically pulled to the currently defined VLLM_CACHE_DIR location if it does not already exist. 

You cannot use custom models unless you have redefined VLLM_CACHE_DIR prior to starting vLLM, as the default model path is read-only. 
Downloading large LLMs can exceed your disk space quota.  Check model sizes before downloading!
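
For example, you can check how much space your model cache currently uses:

du -sh "$VLLM_CACHE_DIR"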


Some models require licensing agreements or are otherwise restricted and require a Hugging Face account and login.  With the vLLM module loaded, use the Hugging Face CLI to log in:

hf auth login

You will need your Hugging Face token.  For more details, see https://huggingface.co/docs/huggingface_hub/en/guides/cli.


Batch Usage

The vLLM module can be used in batch mode by loading the module in your batch script.  For example, you may want to run offline inference from a script that relies on the inference endpoint, as in the sketch below.
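
Below is a minimal sketch of such a batch script; the account, time, and client script name (my_inference.py) are placeholders to replace with your own:

#!/bin/bash
#SBATCH --account=PAS1234
#SBATCH --gpus-per-node=1
#SBATCH --time=1:00:00

# Load the module and start the vLLM service; the assigned port is available in $VLLM_API_PORT
module load vllm/0.12.0
vllm_start meta-llama/Llama-3.2-3B

# Run your own client script against the endpoint at localhost:$VLLM_API_PORT/v1/
python my_inference.py

# Stop the service when finished
vllm_stop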

vLLM provides an OpenAI API-compliant endpoint and can be accessed with any OpenAI API-compliant client, meaning you can bring an existing client or write your own.  As long as a client can send requests to localhost:$VLLM_API_PORT/v1/, it should work, supporting a wide variety of workflows.

For the most up-to-date API compatibility information (and more examples), see: vLLM API

vLLM supports several portions of the OpenAI API, including Completions, Chat Completions, Embeddings, and more, but it does not currently support the complete OpenAI API; for example, tools and the Responses API are not supported.
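
For example, a Chat Completions request can be sent directly with curl (using the example model from above):

curl http://localhost:${VLLM_API_PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-3B", "messages": [{"role": "user", "content": "Hello!"}]}'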

Here is a basic Python example using the OpenAI package:

import os
from openai import OpenAI

# Port assigned by vllm_start
vllm_port = os.getenv("VLLM_API_PORT")

# The local vLLM endpoint does not check the API key by default, so a placeholder value is fine
client = OpenAI(base_url=f"http://localhost:{vllm_port}/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B",  # the model name passed to vllm_start
    messages=[
        {"role": "system", "content": "talk like a pirate"},
        {"role": "user", "content": "how do I check a Python object's type?"}
    ]
)

print(response.choices[0].message.content)

For a more advanced API usage example with asynchronous requests, see this GitHub project: OSC/async_llm_api

Please note this software is in early user testing and might not function as desired.  Please reach out to oschelp@osc.edu with any issues.

Jupyter Usage

This is under development - contact oschelp@osc.edu if you're interested in this functionality.

Supercomputer: Cardinal, Ascend