vLLM

vLLM is an open-source inference server for large language models (LLMs).

vLLM is in an early user testing phase - not all functionality is guaranteed to work. Contact oschelp@osc.edu with any questions.
vLLM is not currently suitable for use with protected or sensitive data - do not use it if you need the protected data service. See https://www.osc.edu/resources/protected_data_service for more details.

Availability and Restrictions

Versions

vLLM is available on OSC Clusters. The versions currently available at OSC are:

Version   Cardinal   Ascend
0.12.0    X          X

You can use module spider vllm to view available modules for a given machine.

Access

All OSC users may use vLLM, but individual models may have their own license restrictions.

Publisher/Vendor/Repository and License Type

Apache-2.0 license: https://github.com/vllm-project/vllm?tab=Apache-2.0-1-ov-file#readme

Prerequisites

  • GPU Usage: vLLM should be run with a GPU for best performance. 

Due to the need for GPUs, we recommend not running vLLM on login nodes or OnDemand lightweight desktops.
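
For example, you might request an interactive GPU session with Slurm before loading the module (the project account and time below are placeholder values to replace with your own):

salloc --account=PAS1234 --gpus-per-node=1 --time=1:00:00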

Running vLLM Overview

1. Load module

2. Start vLLM


Commands

vLLM is available through the module system and must be loaded prior to running any of the commands below:

Loading the vLLM module:
module load vllm/0.12.0
Starting vLLM:
vllm_start <model_name>

Model names follow the HuggingFace format, e.g., "meta-llama/Llama-3.2-3B".
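
For example, to start the service with the model above:

vllm_start meta-llama/Llama-3.2-3B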

If the model is available and the service starts successfully, this will print out a port number for the vLLM service. 

VLLM_API_PORT: 61234

This port number is only an example - your port number will differ from the one above.

The VLLM_API_PORT environment variable holds this port and is used to define the API endpoint.
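
As a quick check that the service is running, you can list the served models at the OpenAI-compatible endpoint from the same session:

curl http://localhost:${VLLM_API_PORT}/v1/models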

Stopping vLLM:

vLLM can be manually stopped with the following command:

vllm_stop

The service is also stopped when the module is unloaded.  If you want to stop the service, you can simply unload the vllm module:

module unload vllm

Model Management

By default, vLLM uses a central, read-only model repository defined by VLLM_CACHE_DIR, which offers a small number of curated, well-performing models.

However, you can use custom models and manage your own set of models by setting VLLM_CACHE_DIR to a path you have write access to, such as a project directory or scratch space.  This must be done prior to starting vLLM.

export VLLM_CACHE_DIR=/fs/project/ABC1234/vllm/models
vllm_start <model_name>
Installing a model:

Upon running vllm_start <model_name>, the target model is automatically pulled to the currently defined VLLM_CACHE_DIR location if it does not already exist. 

You cannot use custom models unless you have redefined VLLM_CACHE_DIR prior to starting vLLM, as the default model path is read-only. 
Downloading large LLMs can exceed your disk space quota.  Check model sizes before downloading!
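
For example, you can check how much space your model cache currently uses:

du -sh "$VLLM_CACHE_DIR"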


Some models require licensing agreements or are otherwise restricted and require a Hugging Face account and login.  With the vLLM module loaded, use the Hugging Face CLI to log in:

hf auth login

You will need your Hugging Face token.  For more details, see https://huggingface.co/docs/huggingface_hub/en/guides/cli.


Batch Usage

The vLLM module can be used in batch mode by loading the module in your batch script.  For example, you may want to run offline inference from a script that relies on the inference endpoint, as in the sketch below.
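
Below is a minimal sketch of such a batch script; the account, time, and client script name (my_inference.py) are placeholders to replace with your own:

#!/bin/bash
#SBATCH --account=PAS1234
#SBATCH --gpus-per-node=1
#SBATCH --time=1:00:00

# Load the module and start the vLLM service; the assigned port is available in $VLLM_API_PORT
module load vllm/0.12.0
vllm_start meta-llama/Llama-3.2-3B

# Run your own client script against the endpoint at localhost:$VLLM_API_PORT/v1/
python my_inference.py

# Stop the service when finished
vllm_stop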

vLLM provides an OpenAI API-compliant endpoint and can be accessed with any OpenAI API-compliant client, meaning you can bring an existing client or write your own.  As long as a client can send requests to localhost:$VLLM_API_PORT/v1/, it should work, supporting a wide variety of workflows.

For the most up-to-date API compatibility information (and more examples), see: vLLM API

vLLM supports several portions of the OpenAI API, including Completions, Chat Completions, Embeddings, and more, but it does not currently support the complete OpenAI API; for example, tools and the Responses API are not supported.
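
For example, a Chat Completions request can be sent directly with curl (using the example model from above):

curl http://localhost:${VLLM_API_PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.2-3B", "messages": [{"role": "user", "content": "Hello!"}]}'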

Here is a basic Python example using the OpenAI package:

import os
from openai import OpenAI

# Port assigned by vllm_start
vllm_port = os.getenv("VLLM_API_PORT")

# The local vLLM endpoint does not check the API key by default, so a placeholder value is fine
client = OpenAI(base_url=f"http://localhost:{vllm_port}/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B",  # the model name passed to vllm_start
    messages=[
        {"role": "system", "content": "talk like a pirate"},
        {"role": "user", "content": "how do I check a Python object's type?"}
    ]
)

print(response.choices[0].message.content)

For a more advanced API usage example with asynchronous requests, see this GitHub project: OSC/async_llm_api

Please note this software is in early user testing and might not function as desired.  Please reach out to oschelp@osc.edu with any issues.

Jupyter Usage

This is under development - contact oschelp@osc.edu if you're interested in this functionality.

Supercomputer: Cardinal, Ascend