vLLM Cheat Sheet: Fast LLM Inference Reference
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel

Batch inference engine that maximizes GPU throughput via continuous batching.
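Like an airport TSA line: new passengers don't wait for the slowest person to fully clear security. They join at different processing stages, keeping every station busy. In vLLM terms, a sequence that finishes leaves the batch at the next decoding step and a waiting request takes over its slot.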
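The analogy can be made concrete with a toy step counter. This is a simplified sketch, not vLLM's scheduler: it assumes each active request produces one token per decoding step, a fixed number of slots, and a made-up workload. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """A freed slot is refilled from the queue at the next decoding step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decoding step produces one token for every active request;
        # requests with nothing left to generate drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) for six requests.
lengths = [100, 10, 10, 10, 10, 10]
print(static_batching_steps(lengths, slots=2))      # 120
print(continuous_batching_steps(lengths, slots=2))  # 100
```

With two slots, the long request occupies one slot for 100 steps while the five short requests cycle through the other, so the short requests add no extra steps.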
Common Inference Patterns

from vllm import LLM, SamplingParams

# Single prompt: load the model once, then generate.
llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
prompt = "What is machine learning?"
outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

# Sample output:
# Machine learning is a subset of artificial intelligence that...

from vllm import LLM, SamplingParams

# Batch of prompts: pass them in one generate() call so vLLM can batch them.
llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.8, max_tokens=50)
prompts = [
    "Explain quantum computing",
    "What is deep learning?",
    "Define neural networks",
]
outputs = llm.generate(prompts, sampling_params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {output.outputs[0].text}")

from vllm import LLM, SamplingParams
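
# Token-by-token printing. generate() returns only after the completion is
# finished, so this replays the tokens rather than streaming them live.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
tokenizer = llm.get_tokenizer()
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
prompt = "Write a short poem about AI"
for output in llm.generate([prompt], sampling_params, use_tqdm=False):
    for token_id in output.outputs[0].token_ids:
        # Decoding one token at a time may drop spaces with some tokenizers.
        print(tokenizer.decode([token_id]), end="", flush=True)

from vllm import LLM, SamplingParams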

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Deterministic (low temperature)
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=100,
)
outputs = llm.generate(["What is 2+2?"], sampling_params)

# Creative (high temperature, top-k)
sampling_params = SamplingParams(
    temperature=1.5,
    top_k=40,
    top_p=0.9,
    repetition_penalty=1.2,
    max_tokens=100,
)
outputs = llm.generate(["Tell a creative story"], sampling_params)

from vllm import LLM, SamplingParams

# AWQ quantized model
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.95,
)

# GPTQ quantized model (an alternative to the AWQ load above: with
# gpu_memory_utilization=0.95, only one engine fits on a GPU at a time)
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    gpu_memory_utilization=0.95,
)
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello, world!"], sampling_params)

from vllm import LLM, SamplingParams

# Split model across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Requires 4x A100 80GB
    gpu_memory_utilization=0.95,
)
sampling_params = SamplingParams(max_tokens=100)
prompts = ["Question 1", "Question 2"]
outputs = llm.generate(prompts, sampling_params)

Key Sampling Parameters
SamplingParams
| Parameter | Type | Default | Notes |
|---|---|---|---|
| temperature | float | 1.0 | 0.0 = greedy (deterministic) decoding; values above 1.0 give more varied output. |
| top_p | float | 1.0 | Nucleus sampling: keep the smallest set of tokens whose cumulative probability reaches top_p (0.9 keeps the top 90% of probability mass). |
| top_k | int | -1 (disabled) | Keep only the K highest-probability tokens. |
| max_tokens | int | 16 | Max output length in tokens. The default is short; set it higher for long outputs. |
| repetition_penalty | float | 1.0 | >1.0 discourages repetition. 1.2-1.3 is a common range. |
| frequency_penalty | float | 0.0 | Penalize tokens in proportion to how often they already appear in the output. |
| presence_penalty | float | 0.0 | Penalize any token that has already appeared in the output, regardless of how often. |
| best_of | int | 1 | Generate N candidates, return the one with the highest log probability. |
| use_beam_search | bool | False | Enable beam search (slower, often more coherent). Availability depends on the vLLM version; check the SamplingParams reference for your release. |
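To show how temperature, top_k, and top_p interact, here is a small pure-Python sketch that filters a toy logit vector. It illustrates the general technique only and is not vLLM's sampler. The function name, the example logits, and the exact order of the filters are assumptions made for illustration.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    if temperature == 0.0:
        # Greedy decoding: all probability mass on the argmax token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # top-k: keep only the k most likely tokens (-1 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so their probabilities sum to 1.
    mass = sum(p for p, _ in kept)
    return {idx: p / mass for p, idx in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(filtered_distribution(logits, temperature=0.0))  # {0: 1.0}
print(sorted(filtered_distribution(logits, top_k=2)))  # [0, 1]
```

Lower temperatures sharpen the distribution before the filters run, so the same top_p keeps fewer tokens as the temperature drops.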
Common Errors & Fixes
RuntimeError: CUDA out of memory
Cause: Model weights plus KV cache exceed GPU memory. vLLM reserves a fixed share of GPU memory (gpu_memory_utilization) up front for weights and KV cache, so a model that barely fits leaves little room for concurrent sequences.
Fix: Reduce gpu_memory_utilization (default 0.9) or max_num_seqs, e.g. llm = LLM(model='...', gpu_memory_utilization=0.7, max_num_seqs=4). A 7B model in 16-bit precision needs roughly 14 GB for weights alone, so on an 8 GB GPU load a quantized version. For 70B: use tensor_parallel_size=2+ or load a quantized (AWQ) version.

AttributeError: 'LLM' object has no attribute 'serve'
Cause: Calling a .serve() method, which does not exist in the vLLM Python library.
Fix: Use vLLM's OpenAI-compatible API server instead: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf. Then query it over HTTP as you would OpenAI's API.

AssertionError: len(prompt_token_ids) > 0
Cause: Empty prompt, or the tokenizer failed to encode it (blank strings, None, or encoding errors).
Fix: Strip and validate prompts before passing them: prompts = [p.strip() for p in prompts if p and p.strip()]. Test the tokenizer: llm.get_tokenizer().encode('test').

ValueError: tensor_parallel_size must divide num_gpus
Cause: Requested tensor_parallel_size doesn't match the available GPU count.
Fix: Match exactly: tensor_parallel_size=4 requires 4 GPUs. Check how many are available: nvidia-smi --query-gpu=count --format=csv,noheader | wc -l. By default vLLM starts its own worker processes for tensor parallelism, so a separate launcher such as torch.distributed.launch is normally not needed.

Production Gotchas
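If you load an AWQ model without quantization='awq', the result depends on the vLLM version: some versions detect the method from the checkpoint's config, while a missing or mismatched setting can otherwise mis-load the weights or return garbage. Always check the model card for the quantization type and pass the matching quantization= argument rather than relying on detection. Implicit: LLM('TheBloke/...-AWQ'). Explicit: LLM('TheBloke/...-AWQ', quantization='awq').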
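A back-of-the-envelope estimate shows why the precision of the loaded weights matters. This sketch counts weights only; the KV cache, activations, and the per-group scale data that quantized formats carry all come on top. The helper name is hypothetical.

```python
def weight_memory_gb(num_params_billion, bits_per_weight, tensor_parallel_size=1):
    """Approximate per-GPU memory for model weights only, in GB (1e9 bytes)."""
    total_bytes = num_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / tensor_parallel_size / 1e9


print(weight_memory_gb(7, 16))      # 14.0  (7B, 16-bit)
print(weight_memory_gb(7, 4))       # 3.5   (7B, 4-bit such as AWQ or GPTQ)
print(weight_memory_gb(70, 16, 4))  # 35.0  (70B, 16-bit, split over 4 GPUs)
```

A 4-bit checkpoint that is accidentally treated as 16-bit would need about four times the weight memory, which is one reason a mismatch tends to surface as an out-of-memory error.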
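Each prompt can have its own max_tokens: pass generate() a list of SamplingParams, one per prompt. The total number of output tokens across the batch is then hard to predict. max_tokens is an upper bound, so generation may stop earlier; if you want the same cap of N tokens for every prompt, share one SamplingParams with max_tokens=N. If the batch's combined token demand exceeds the available KV cache, vLLM may have to preempt and later resume requests, which slows the batch down.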
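A minimal sketch of per-prompt limits follows. It assumes an already constructed LLM instance is passed in as llm; both helper names are hypothetical. It relies on generate() accepting a list of SamplingParams with the same length as the prompt list.

```python
def per_prompt_kwargs(token_budgets, **shared):
    """Build one SamplingParams keyword dict per prompt (hypothetical helper)."""
    return [dict(shared, max_tokens=budget) for budget in token_budgets]


def generate_with_budgets(llm, prompts, token_budgets, **shared):
    """Run one batch where prompt i is capped at token_budgets[i] tokens."""
    from vllm import SamplingParams  # imported lazily; requires vLLM installed

    if len(prompts) != len(token_budgets):
        raise ValueError("need exactly one token budget per prompt")
    params = [
        SamplingParams(**kwargs)
        for kwargs in per_prompt_kwargs(token_budgets, **shared)
    ]
    # Entry i of params is paired with entry i of prompts.
    return llm.generate(prompts, params)


print(per_prompt_kwargs([10, 200], temperature=0.7))
```

Calling generate_with_budgets(llm, ["Summarize X", "Write an essay on Y"], [50, 400], temperature=0.7) would then cap the two prompts at 50 and 400 tokens respectively.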
llm.generate([prompt1, prompt2]) doesn't return until both complete. For HTTP streaming (token-by-token to client), use the OpenAI-compatible API server (vllm.entrypoints.openai.api_server), not the Python LLM() class directly.
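On the client side, the server's streamed response arrives as server-sent events. The sketch below parses such lines. It assumes the completions endpoint was called with "stream": true and that each event follows the OpenAI-style "data: {json}" layout, ending with "data: [DONE]". Chat completions put the text under a delta field instead, which this sketch does not handle. The function names and sample lines are hypothetical.

```python
import json


def text_from_sse_line(line):
    """Return the text piece carried by one server-sent-events line, or None."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator line or comment
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    chunk = json.loads(payload)
    return chunk["choices"][0].get("text")


def stream_text(lines):
    """Yield text pieces from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        piece = text_from_sse_line(raw)
        if piece:
            yield piece


# Made-up lines shaped like a streamed completions response.
sample = [
    'data: {"choices": [{"text": "Hello"}]}',
    "",
    'data: {"choices": [{"text": " world"}]}',
    "data: [DONE]",
]
print("".join(stream_text(sample)))  # Hello world
```

In practice the lines would come from a streaming HTTP response, or the OpenAI SDK can do this parsing for you.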
If you call llm.generate([single_prompt]), it runs alone: no batch benefit. Continuous batching helps throughput when many requests arrive concurrently. Single-request latency is unchanged.
temperature=0.0 is valid in vLLM and selects greedy decoding, which is the usual way to get deterministic output; there is no need to substitute a tiny value such as 1e-6. Also: top_p and top_k only matter when sampling, so with greedy decoding they have no effect.
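The list returned by llm.generate() keeps every RequestOutput (prompt, token IDs, text) alive as ordinary Python objects. If memory is tight, extract what you need and then drop the rest with `del outputs` (or set outputs = None) before the next generate() call. Nothing is released until your code drops its references.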
Complete production-ready example: batch inference with error handling and memory optimization
from vllm import LLM, SamplingParams


def batch_generate(prompts: list, batch_size: int = 32) -> list:
    """
    Generate completions for prompts with continuous batching.
    Automatically chunks large batches to avoid OOM.
    """
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        gpu_memory_utilization=0.85,
        max_num_seqs=16,  # Limit concurrent requests
    )
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=150,
        repetition_penalty=1.1,
    )
    all_outputs = []
    # Chunk into smaller batches to control memory
    for i in range(0, len(prompts), batch_size):
        chunk = prompts[i:i + batch_size]
        try:
            outputs = llm.generate(chunk, sampling_params)
            all_outputs.extend(output.outputs[0].text for output in outputs)
            # Drop references to the full output objects
            del outputs
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            print(f"OOM on batch {i // batch_size}. Retrying with smaller batch...")
            # Retry with half the batch size (at least 1, so range() gets a valid step)
            half = max(1, batch_size // 2)
            for j in range(0, len(chunk), half):
                sub_chunk = chunk[j:j + half]
                outputs = llm.generate(sub_chunk, sampling_params)
                all_outputs.extend(output.outputs[0].text for output in outputs)
                del outputs
    return all_outputs


# Usage
if __name__ == "__main__":
    test_prompts = [
        "What is machine learning?",
        "Explain quantum computing",
        "Define neural networks",
    ]
    results = batch_generate(test_prompts)
    for prompt, result in zip(test_prompts, results):
        print(f"Q: {prompt}")
        print(f"A: {result}\n")

vLLM Comparison
| Aspect | vLLM Python API | vLLM OpenAI API Server |
|---|---|---|
| Use Case | Batch inference, local scripts | HTTP API, production microservice |
| Streaming | None during generate(); tokens can only be replayed afterwards | Server-side HTTP streaming |
| Setup | pip install vllm; 1 minute | vllm.entrypoints.openai.api_server; 2 minutes |
| Throughput | Max (continuous batching) | Slightly lower (HTTP overhead) |
| Client Libraries | Python only | Any language (OpenAI SDK compatible) |
| Latency | Lower (in-process) | Higher (network round-trip) |