vLLM is an open-source inference server for large language models (LLMs).
vLLM is available on OSC clusters. The versions currently installed at OSC are:
| Version | Cardinal | Ascend |
|---|---|---|
| 0.12.0 | X | X |
You can use module spider vllm to view available modules for a given machine.
All OSC users may use vLLM, but individual models may have their own license restrictions.
vLLM is released under the Apache-2.0 license: https://github.com/vllm-project/vllm?tab=Apache-2.0-1-ov-file#readme
Due to the need for GPUs, we recommend not running vLLM on login nodes nor OnDemand lightweight desktops.
1. Load module
2. Start vLLM
vLLM is available through the module system and must be loaded prior to running any of the commands below:
module load vllm/0.12.0
vllm_start <model_name>
Model names follow the HuggingFace format, e.g., "meta-llama/Llama-3.2-3B".
If the model is available and the service starts successfully, this will print out a port number for the vLLM service.
VLLM_API_PORT: 61234
This port number is only an example - your port number will differ from the one above.
The VLLM_API_PORT environment variable will be used to define the API endpoint.
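For example, once the service is running you can verify the endpoint from the same node by listing the served models. This is a minimal check using curl against the OpenAI-compatible /v1/models route:
# List the models currently served by this vLLM instance.
curl http://localhost:${VLLM_API_PORT}/v1/models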
vLLM can be manually stopped with the following command:
vllm_stop
The service is also stopped when the module is unloaded, so if you want to stop it you can simply unload the vllm module:
module unload vllm
By default, vLLM uses a central, read-only model repository defined by VLLM_CACHE_DIR, offering clients the use of a small number of well-performing, curated models.
However, you can use custom models and manage your own set of models by setting VLLM_CACHE_DIR to a path you have write access to, such as a project directory or scratch space. This must be done prior to starting vLLM.
export VLLM_CACHE_DIR=/fs/project/ABC1234/vllm/models
vllm_start <model_name>
Upon running vllm_start <model_name>, the target model is automatically pulled to the currently defined VLLM_CACHE_DIR location if it does not already exist.
Some models require licensing agreements or are otherwise restricted and require a Hugging Face account and login. With the vLLM module loaded, use the Hugging Face CLI to log in:
hf auth login
You will need your Hugging Face token. For more details, see https://huggingface.co/docs/huggingface_hub/en/guides/cli.
The vLLM module can be used in batch mode by loading the module in your batch script. For example, you may want to run offline (non-interactive) inference with a script that sends requests to the inference endpoint.
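As a sketch, a job script might look like the following. The account, resource requests, and client script name are placeholders, and it assumes vllm_start sets VLLM_API_PORT and returns once the service is running, as in the interactive usage above:
#!/bin/bash
#SBATCH --job-name=vllm_batch
#SBATCH --account=PAS1234          # placeholder: your OSC project
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1          # vLLM requires a GPU

module load vllm/0.12.0

# Start the model service; VLLM_API_PORT is set for the rest of the job.
vllm_start meta-llama/Llama-3.2-3B

# Run your own client script against http://localhost:$VLLM_API_PORT/v1/
python my_inference_script.py

# Stop the service before the job ends.
vllm_stop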
vLLM provides an OpenAI API-compliant endpoint and can be accessed with any OpenAI API-compliant client, meaning you can bring your own clients or write your own. As long as your client can send requests to localhost:$VLLM_API_PORT/v1/, it should work, supporting a wide variety of workflows.
For the most up-to-date API compatibility information (and more examples), see: vLLM API.
vLLM supports a number of portions of the OpenAI API, including Completions, Chat Completions, Embeddings, and more, but it does not currently support the complete OpenAI API; for example, tools and the Responses API are not supported.
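As an illustration, a Completions request can be sent directly with curl. This is a minimal sketch; the model name is the example model from earlier and must already have been started with vllm_start:
# Minimal text-completion request to the OpenAI-compatible endpoint.
curl http://localhost:${VLLM_API_PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-3B",
        "prompt": "The three laws of thermodynamics are",
        "max_tokens": 64
      }'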
Here is a basic Python example using the OpenAI package:
import os
from openai import OpenAI
vllm_port = os.getenv("VLLM_API_PORT")  # port set by vllm_start
client = OpenAI(
    base_url=f"http://localhost:{vllm_port}/v1",
    api_key="EMPTY",  # placeholder value; the local endpoint does not require a real key
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B",  # the model started with vllm_start
    messages=[
        {"role": "system", "content": "talk like a pirate"},
        {"role": "user", "content": "how do I check a Python object's type?"},
    ],
)
print(response.choices[0].message.content)
For a more advanced API usage example with asynchronous requests, see this GitHub project: OSC/async_llm_api
Please note this software is in early user testing and might not function as desired. Please reach out to oschelp@osc.edu with any issues.
This functionality is under development; contact oschelp@osc.edu if you are interested in it.