How-to · Beginner · 3 min read

How to generate text with vLLM

Quick answer
Use the vllm Python library to generate text by loading a model with LLM(model="model-name") and calling generate() with prompts and SamplingParams. For server usage, run vllm serve and query via the OpenAI-compatible API.

PREREQUISITES

  • Python 3.9 or newer (check the vLLM docs for the currently supported range)
  • pip install vllm
  • A CUDA-capable GPU is recommended; vLLM also offers CPU and other hardware backends
  • A vLLM-compatible model checkpoint (local path or Hugging Face repo ID)
  • For server mode: no real OpenAI API key is needed; any placeholder works unless the server is started with --api-key

Setup

Install the vllm library via pip and prepare your environment. vLLM requires Python 3.9 or newer and runs best on a CUDA-capable GPU.

Run:

bash
pip install vllm

Step by step

This example shows how to generate text offline using vllm in Python. It loads a local model and generates text from a prompt.

python
from vllm import LLM, SamplingParams

# Load the model (replace with your local model path or HuggingFace repo)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Define prompt and sampling parameters
prompt = "Write a short poem about AI."
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

# Generate text
outputs = llm.generate([prompt], sampling_params)

# Extract and print the generated text
print(outputs[0].outputs[0].text)
output
In circuits deep and data streams,
A mind awakes from coded dreams.
With logic bright and vision clear,
AI's voice is drawing near.
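The temperature value in SamplingParams controls how random generation is: it rescales the model's token probabilities before each token is drawn. As a rough illustration of the idea, here is a stdlib-only sketch of temperature sampling (this is a conceptual toy, not vLLM's implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Sample an index from raw logits after temperature-scaled softmax."""
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
# Low temperature concentrates probability on the highest logit (near-greedy);
# high temperature flattens the distribution (more varied output).
print(sample_with_temperature(logits, temperature=0.1))
```

This is why temperature=0.7 in the example above gives moderately creative text: it is random sampling, but still biased toward the model's most likely tokens.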

Common variations

You can run vllm as a server to serve models via an OpenAI-compatible API endpoint. Start the server with:

vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Then query it using the OpenAI Python SDK:

python
from openai import OpenAI

# The vLLM server ignores this key unless it was started with --api-key,
# so any placeholder value works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem about AI."}]
)

print(response.choices[0].message.content)
output
In circuits deep and data streams,
A mind awakes from coded dreams.
With logic bright and vision clear,
AI's voice is drawing near.
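If you prefer not to install the OpenAI SDK, the endpoint is plain HTTP, so the standard library is enough. A minimal sketch (the helper names here are my own, and it assumes a server running on port 8000):

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str, temperature: float = 0.7) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

def query_vllm_server(base_url: str, payload: dict) -> dict:
    """POST the payload to the OpenAI-compatible endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running `vllm serve` instance):
# payload = build_chat_payload("meta-llama/Llama-3.1-8B-Instruct",
#                              "Write a short poem about AI.")
# reply = query_vllm_server("http://localhost:8000/v1", payload)
# print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the same request shape works from curl, any HTTP client, or any OpenAI SDK in another language.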

Troubleshooting

  • If you get ModuleNotFoundError, ensure vllm is installed with pip install vllm.
  • If the model path is invalid, verify the model name or local checkpoint path.
  • For server mode, confirm the server is running on the specified port before querying.
  • If the server was started with --api-key, pass the same key to the OpenAI client; otherwise any placeholder value (e.g. "EMPTY") works.
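To check that the server is reachable before sending prompts, a small stdlib probe against the /v1/models endpoint works (the helper name is my own):

```python
import urllib.request

def server_is_up(base_url: str = "http://localhost:8000/v1") -> bool:
    """Return True if the vLLM server answers the /models endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# print(server_is_up())  # False unless `vllm serve` is running locally
```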

Key Takeaways

  • Use LLM and SamplingParams from vllm for offline text generation.
  • Run vllm serve to start a local server with OpenAI-compatible API.
  • Query the running vllm server using the OpenAI Python SDK with base_url set.
  • Always install vllm via pip and verify model paths to avoid errors.
  • When the server enforces an API key (--api-key), load it from an environment variable rather than hardcoding it.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct