How to use vLLM to serve LLMs in Python
Direct answer
Use the `vllm` Python library to load and serve large language models efficiently by creating an `LLM` instance and calling its `generate` method with your prompts.

Setup
```
pip install vllm
```

Imports
```python
from vllm import LLM, SamplingParams
```

Examples
- In: Generate a short poem about spring.
  Out: Spring breathes life anew, Blossoms dance in morning dew, Warmth paints skies bright.
- In: Explain the concept of recursion in programming.
  Out: Recursion is a technique where a function calls itself to solve smaller instances of a problem until reaching a base case.
- In: Translate 'Hello, how are you?' to French.
  Out: Bonjour, comment ça va ?
Integration steps
- Install the vLLM Python package using pip.
- Import `LLM` and `SamplingParams` from the `vllm` library.
- Initialize `LLM` with the desired model checkpoint path or Hugging Face model ID.
- Call the `generate` method with a list of prompt strings and optional `SamplingParams`.
- Process the returned `RequestOutput` objects to extract the generated text.
- Release the model when done (drop all references so GPU memory can be reclaimed).
Full code
```python
from vllm import LLM, SamplingParams

# Initialize the engine with a model checkpoint or Hugging Face model ID
llm = LLM(model="huggyllama/llama-7b")

# Sampling parameters control decoding (temperature, output length, etc.)
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# Define prompts to generate completions for
prompts = [
    "Write a short story about a robot learning emotions.",
    "Summarize the benefits of renewable energy."
]

# Generate completions; vLLM batches the prompts internally
outputs = llm.generate(prompts, sampling_params)

# Print generated outputs
for i, output in enumerate(outputs):
    print(f"Prompt {i+1}: {output.prompt}")
    print(f"Completion: {output.outputs[0].text}")
    print("---")
```

Output
```
Prompt 1: Write a short story about a robot learning emotions.
Completion: In a quiet lab, a robot named Aiden discovered feelings for the first time, learning joy and sorrow through human connection.
---
Prompt 2: Summarize the benefits of renewable energy.
Completion: Renewable energy reduces carbon emissions, lowers pollution, and provides sustainable power sources for the future.
---
```
API trace
When serving the same model over HTTP with vLLM's OpenAI-compatible server (`vllm serve huggyllama/llama-7b`), a completion call looks like this (`/v1/completions` accepts a single prompt string or a list):

Request
```
POST /v1/completions
{"model": "huggyllama/llama-7b", "prompt": ["Write a short story about a robot learning emotions.", "Summarize the benefits of renewable energy."], "max_tokens": 128}
```

Response
```
{"choices": [{"text": "In a quiet lab, a robot named Aiden discovered feelings..."}, {"text": "Renewable energy reduces carbon emissions..."}]}
```

Extract
```python
outputs = llm.generate(prompts)
for output in outputs:
    print(output.outputs[0].text)
```

Variants
Streaming generation
The offline `LLM` API returns completed outputs only; it has no `stream=True` flag. For real-time token output in chat or interactive apps, run vLLM's OpenAI-compatible server and stream from a client:

```python
# First, in a terminal: vllm serve huggyllama/llama-7b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stream tokens as they are generated
stream = client.completions.create(
    model="huggyllama/llama-7b",
    prompt="Explain quantum computing in simple terms.",
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
```

Batch generation with concurrency
vLLM handles concurrency for you: pass all prompts in a single `generate` call and its continuous batching scheduler runs them in parallel on the GPU (there is no `max_concurrent_requests` argument).

```python
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")
prompts = ["Describe the water cycle.", "What is the capital of France?"]

# One call; vLLM's continuous batching processes the prompts in parallel
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for output in outputs:
    print(output.outputs[0].text)
```

Performance
- Latency: ~500 ms to 2 s per prompt, depending on model size and hardware
- Cost: depends on your hardware; vLLM is optimized for GPU efficiency, reducing inference cost compared to naive serving
- Rate limits: none enforced; vLLM runs locally or on your own infrastructure
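Since the server runs on your own infrastructure, you can also call its OpenAI-compatible endpoint directly over HTTP. A minimal sketch of assembling such a request (the `/v1/completions` path and default port 8000 follow vLLM's server defaults; `build_completion_request` is a hypothetical helper name, not part of vLLM):

```python
import json

# Hypothetical helper: assemble the JSON body for vLLM's OpenAI-compatible
# /v1/completions endpoint (served at http://localhost:8000 by default).
def build_completion_request(model: str, prompt: str, max_tokens: int = 128) -> str:
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(payload)

body = build_completion_request("huggyllama/llama-7b", "What is the capital of France?")
print(body)
# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d "$BODY"
```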
- Use batch generation to amortize overhead across multiple prompts.
- Limit max tokens per generation to reduce latency and memory usage.
- Use smaller models or quantized checkpoints for faster inference.
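A back-of-envelope sketch of why batching amortizes overhead (the 200 ms fixed overhead and 300 ms per-prompt figures are illustrative assumptions, not vLLM measurements):

```python
def per_prompt_latency_ms(batch_size: int,
                          fixed_overhead_ms: float = 200.0,
                          per_prompt_ms: float = 300.0) -> float:
    """Amortized latency per prompt when fixed overhead is shared across a batch."""
    return fixed_overhead_ms / batch_size + per_prompt_ms

print(per_prompt_latency_ms(1))   # 500.0 ms, matching the ~500 ms single-prompt figure above
print(per_prompt_latency_ms(8))   # 325.0 ms
print(per_prompt_latency_ms(32))  # 306.25 ms
```

The fixed cost is paid once per batch, so per-prompt latency approaches the pure generation cost as the batch grows.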
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| vLLM local serving | ~500ms-2s | Hardware dependent | High-throughput, low-latency local inference |
| Streaming generation | ~100 ms per token | Hardware dependent | Interactive applications needing real-time output |
| Batch concurrent generation | ~1-3s for batch | Hardware dependent | Handling multiple requests efficiently |
Quick tip
Release GPU memory when you are done: vLLM's `LLM` class has no `close()` method, so drop all references to the instance (e.g. `del llm`) and let garbage collection reclaim the resources.
Common mistake
Holding on to an `LLM` instance you no longer need keeps its GPU memory allocated, which can exhaust resources when loading another model in the same process.
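A minimal cleanup sketch (a common pattern rather than an official vLLM API; the `torch` cache call is optional and guarded, so this runs even without a GPU):

```python
import gc

def release_gpu_memory() -> None:
    """Best-effort cleanup after dropping all references to the LLM object."""
    gc.collect()
    try:
        import torch  # optional dependency in this sketch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

# Usage sketch:
# del llm              # drop your reference to the vLLM LLM instance first
release_gpu_memory()   # safe to call even on a machine without CUDA
```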