How-to · Beginner · 3 min read

How to use vLLM Python API

Quick answer
Use the vllm Python package to load and run large language models locally by creating an LLM instance and calling its generate method with prompts and SamplingParams. For serving models via HTTP, start the vLLM server CLI and query it using the OpenAI-compatible Python SDK with a custom base_url.

PREREQUISITES

  • Python 3.8+
  • pip install vllm
  • pip install openai (for HTTP client usage)
  • Local vLLM model files or access to Hugging Face models

Setup

Install the vllm package via pip to use the Python API. If you plan to query a running vLLM server over HTTP, install the openai package as well.

bash
pip install vllm openai

Step by step

This example shows how to run local inference with the vllm Python API by loading a model and generating text from a prompt.

python
from vllm import LLM, SamplingParams

# Load a model by name; weights are downloaded from Hugging Face on first use
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Define sampling parameters
params = SamplingParams(temperature=0.7, max_tokens=50)

# Generate text from a prompt
outputs = llm.generate(["Hello, how are you today?"], params)

# Extract and print the generated text
print(outputs[0].outputs[0].text)
output
Hello, how are you today? I'm doing well, thank you for asking. How can I assist you today?
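Note that `generate` accepts a list of prompts and returns one `RequestOutput` per prompt, each holding a list of `CompletionOutput` objects (more than one when `SamplingParams(n=...)` asks for multiple samples), which is why the example indexes `outputs[0].outputs[0].text`. A small helper can flatten this structure; since real inference needs a GPU and model weights, the sketch below demonstrates it on stand-in objects that only mimic that shape (the helper name and the fake data are illustrative, not part of vLLM).

python
from types import SimpleNamespace

def collect_texts(outputs):
    """Flatten vLLM-style results into one list of generated strings per prompt.

    Each result object holds a list of completions in its .outputs attribute;
    each completion exposes the generated string as .text.
    """
    return [[completion.text for completion in result.outputs] for result in outputs]

# Stand-in objects mimicking the RequestOutput / CompletionOutput shape,
# so the helper can be demonstrated without loading a model
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text="Hi there!")]),
    SimpleNamespace(outputs=[SimpleNamespace(text="All good."),
                             SimpleNamespace(text="Fine, thanks.")]),
]
print(collect_texts(fake_outputs))
# [['Hi there!'], ['All good.', 'Fine, thanks.']]

The same helper works unchanged on real `llm.generate(...)` results, because it only relies on the `.outputs` / `.text` attributes shown in the step-by-step example.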

Common variations

To serve models via HTTP, start the vLLM server with the CLI and query it using the OpenAI Python SDK with a custom base_url. This enables integration with existing OpenAI-compatible clients.

bash
# Start the vLLM server (CLI command)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

python
# Python client code to query the running server
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key,
# so any placeholder value works here
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from vLLM server!"}]
)
print(response.choices[0].message.content)
output
Hello from vLLM server! How can I help you today?
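Because the server speaks the OpenAI chat-completions protocol, you are not tied to the `openai` SDK: any HTTP client can POST the same JSON body to `/v1/chat/completions`. The sketch below assembles that payload by hand (the `build_chat_request` helper is our own illustration, not part of either library), matching the parameters used earlier.

python
import json

def build_chat_request(model, user_message, temperature=0.7, max_tokens=50):
    """Assemble an OpenAI-style /v1/chat/completions request body.

    Useful when querying the vLLM server with plain HTTP tools (requests,
    curl, etc.) instead of the openai SDK.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct",
                             "Hello from vLLM server!")
print(json.dumps(payload, indent=2))

Sending this dict as the JSON body of a POST to `http://localhost:8000/v1/chat/completions` is equivalent to the SDK call above.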

Troubleshooting

  • If you see ModuleNotFoundError, ensure vllm is installed with pip install vllm.
  • If the server does not start, check that the model path is correct and you have sufficient GPU memory.
  • For HTTP client errors, verify the base_url matches the server address and port.
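One quick way to separate "server not running" from "wrong model name" is to probe the `/v1/models` endpoint, which the vLLM server exposes. A minimal stdlib-only sketch (the `server_is_up` helper name is ours, not a vLLM API):

python
import json
import urllib.error
import urllib.request

def server_is_up(base_url="http://localhost:8000/v1", timeout=2.0):
    """Return True if an OpenAI-compatible server answers at base_url.

    Probes the /models endpoint; a refused connection or timeout means
    the server is not reachable at that address and port.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            data = json.load(resp)
            # Print the model IDs the server reports, to catch name mismatches
            print([m["id"] for m in data.get("data", [])])
            return True
    except (urllib.error.URLError, OSError):
        return False

print(server_is_up())

If this returns False, fix the address or start the server before debugging client code; if it returns True but your requests fail, compare your `model` argument against the printed IDs.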

Key Takeaways

  • Use vllm.LLM and SamplingParams for efficient local model inference.
  • Start the vLLM server CLI to serve models over HTTP and query with OpenAI-compatible clients.
  • Always install vllm and verify model availability before running inference.
  • Set base_url in OpenAI SDK to connect to a local vLLM server.
  • Troubleshoot by checking installation, model paths, and server connectivity.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct