Cheat Sheet intermediate · 8 min read

vLLM Cheat Sheet: Fast LLM Inference Reference

version 0.8.x

LLM inference optimized for throughput and latency

install pip install vllm

core imports

python

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel

Mental model

Batch inference engine that maximizes GPU throughput via continuous batching

Like an airport TSA line: new passengers don't wait for the slowest person to fully clear security. They join at different processing stages, keeping all stations always busy.
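The difference can be made concrete with a toy simulation. This is only a sketch of the scheduling idea, not vLLM's scheduler: each request is reduced to a number of decode steps, and the function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Steps needed when every batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a finished request's slot is refilled immediately."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each running request produces one token; drop the ones that are done.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


lengths = [2, 8, 2, 2, 8, 2]  # decode steps each request needs
print(static_batching_steps(lengths, batch_size=2))      # 18
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

With a single request both functions return the same number, which is why continuous batching improves throughput under load rather than the latency of one isolated request.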

Common Inference Patterns

01 Single-Prompt Generation

Generate one text completion from one prompt

python

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

prompt = "What is machine learning?"
outputs = llm.generate([prompt], sampling_params)

for output in outputs:
    print(output.outputs[0].text)

output Machine learning is a subset of artificial intelligence that...

generate() always returns a list of RequestOutput objects, even for a single prompt. Access outputs[0].outputs[0].text, not outputs.text.

02 Batch Multiple Prompts

Process 10+ prompts efficiently in one call

python

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.8, max_tokens=50)

prompts = [
    "Explain quantum computing",
    "What is deep learning?",
    "Define neural networks"
]

outputs = llm.generate(prompts, sampling_params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {output.outputs[0].text}")

Batch size is automatic: vLLM uses continuous batching. Don't manually chunk. Pass all prompts at once for best throughput.

03 Stream Tokens in Real-Time

Output tokens as they generate (user-facing API)

python

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)

prompt = "Write a short poem about AI"

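# generate() blocks until the request is finished, so this loop replays the
# completed tokens one at a time; it is not live streaming.
tokenizer = llm.get_tokenizer()
outputs = llm.generate([prompt], sampling_params, use_tqdm=False)
token_ids = list(outputs[0].outputs[0].token_ids)

printed = ""
for n in range(1, len(token_ids) + 1):
    # Decode the growing prefix and print only the new characters, so
    # multi-token characters and leading spaces come out correctly.
    text = tokenizer.decode(token_ids[:n], skip_special_tokens=True)
    print(text[len(printed):], end="", flush=True)
    printed = text

LLM.generate() returns only after generation completes, so the loop above replays finished tokens rather than streaming them. For true token-by-token streaming, use vLLM's OpenAI-compatible API server (or the async engine), not the offline LLM class.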

04 Fine-Tune Sampling Behavior

Control temperature, top-k, top-p, repetition penalty

python

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Greedy decoding (temperature 0 always picks the most likely token)
sampling_params = SamplingParams(
    temperature=0.0,
    top_p=1.0,
    max_tokens=100
)
outputs = llm.generate(["What is 2+2?"], sampling_params)

# Creative (high temperature, top-k)
sampling_params = SamplingParams(
    temperature=1.5,
    top_k=40,
    top_p=0.9,
    repetition_penalty=1.2,
    max_tokens=100
)
outputs = llm.generate(["Tell a creative story"], sampling_params)

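temperature=0.0 selects greedy decoding (always the highest-probability token) and is supported by vLLM; no 1e-6 workaround is needed. Greedy output can still differ slightly between runs or batch compositions because of floating-point effects. repetition_penalty > 1.0 discourages loops; the default of 1.0 applies no penalty.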
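To see how these knobs interact, here is a small sketch that applies temperature, then top_k, then top_p to a toy set of logits. It illustrates the general technique only and is not vLLM's sampler; the function name and the example logits are hypothetical, and temperature 0 is treated as greedy argmax.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p."""
    if temperature == 0.0:
        # Greedy decoding: all probability mass goes to the argmax token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most likely tokens (k <= 0 disables it).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
print(filtered_distribution(toy_logits, temperature=0.0))       # {'a': 1.0}
print(sorted(filtered_distribution(toy_logits, top_k=2)))       # ['a', 'b']
print(sorted(filtered_distribution(toy_logits, top_p=0.9)))     # ['a', 'b', 'c']
```

Raising the temperature flattens the distribution before the filters run, so the same top_p value keeps more tokens at high temperature than at low temperature.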

05 Load Quantized Models (AWQ, GPTQ)

Cut weight memory roughly 4x (16-bit to 4-bit), usually with minor quality loss

python

from vllm import LLM, SamplingParams

# AWQ quantized model
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.95
)

# GPTQ alternative: load one model per process, not both,
# because each engine reserves most of the GPU.
# llm = LLM(
#     model="TheBloke/Mistral-7B-Instruct-v0.1-GPTQ",
#     quantization="gptq",
#     gpu_memory_utilization=0.95
# )

sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello, world!"], sampling_params)

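The quantization format must match the model weights. vLLM normally reads the method from the checkpoint's config, so quantization= is optional for properly packaged models, and a conflicting value (for example quantization='gptq' on an AWQ checkpoint) is typically rejected at load time rather than producing silently wrong output. Check the model card on Hugging Face for the format.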
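The 4x figure follows from simple arithmetic on bits per weight. The sketch below estimates weight memory only; real AWQ and GPTQ checkpoints also store scales and zero points, so actual files are somewhat larger. The function name and the 7e9 parameter count are illustrative assumptions.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


params_7b = 7e9  # assumed round figure for a "7B" model
fp16 = weight_memory_gib(params_7b, 16)
int4 = weight_memory_gib(params_7b, 4)

print(f"16-bit weights: {fp16:.2f} GiB")
print(f"4-bit weights:  {int4:.2f} GiB")
print(f"reduction:      {fp16 / int4:.0f}x")  # 4x
```

The saving applies to weights; the KV cache and activations are not shrunk by weight-only quantization.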

06 Multi-GPU Tensor Parallelism

For 70B+ models that don't fit on a single GPU

python

from vllm import LLM, SamplingParams

# Split model across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # One shard per GPU, e.g. 4x A100 80GB
    gpu_memory_utilization=0.95
)

sampling_params = SamplingParams(max_tokens=100)
prompts = ["Question 1", "Question 2"]
outputs = llm.generate(prompts, sampling_params)

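Tensor parallelism does not need torch.distributed.launch or torchrun. LLM(tensor_parallel_size=N) starts its own worker processes on a single node (Ray is an option for multi-node setups). Run the script with plain python, guard the entry point with if __name__ == "__main__":, and make sure N GPUs are visible and the model's attention-head count is divisible by N.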

Key Sampling Parameters

SamplingParams

Parameter	Type	Default	Notes
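`temperature`	float	1.0	0.0 = greedy (always the most likely token); higher values add randomness.
`top_p`	float	1.0	Nucleus sampling: keep the smallest token set whose cumulative probability reaches top_p (0.9 keeps the top 90% of mass). 1.0 disables it.
`top_k`	int	-1 (disabled)	Keep only the K highest-probability tokens.
`max_tokens`	int	16	Max output length in tokens. The default is short; raise it for long outputs.
`repetition_penalty`	float	1.0	>1.0 discourages tokens already in the prompt or output. 1.1-1.3 is a common range.
`frequency_penalty`	float	0.0	Penalize tokens in proportion to how often they appear in the output so far.
`presence_penalty`	float	0.0	Penalize any token that has already appeared in the output at least once.
`best_of`	int	None (same as n)	Generate best_of candidates and return the n with the highest log-probability. May be deprecated or unsupported in newer engine versions; check your release.
`use_beam_search`	bool	False	Removed from SamplingParams in recent releases, where beam search is exposed through LLM.beam_search() with BeamSearchParams. Check your release.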
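The three penalties in the table act on logits in different ways. The following sketch shows one common formulation, assumed here to mirror the documented behaviour: repetition_penalty rescales tokens seen in the prompt or output, while frequency_penalty and presence_penalty subtract from tokens seen in the output. The function name and toy tokens are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, prompt_tokens, output_tokens,
                    repetition_penalty=1.0,
                    frequency_penalty=0.0,
                    presence_penalty=0.0):
    """Return a new {token: logit} dict with the three penalties applied."""
    counts = Counter(output_tokens)
    seen = set(prompt_tokens) | set(output_tokens)
    adjusted = {}
    for tok, logit in logits.items():
        if tok in seen:
            # Multiplicative: shrink positive logits, push negative ones lower.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        # Additive, scaled by how many times the token was generated.
        logit -= frequency_penalty * counts[tok]
        # Additive, applied once if the token was generated at all.
        if counts[tok] > 0:
            logit -= presence_penalty
        adjusted[tok] = logit
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "sat": -1.0}
result = apply_penalties(
    logits,
    prompt_tokens=["the"],
    output_tokens=["cat", "cat"],
    repetition_penalty=2.0,
    frequency_penalty=0.5,
    presence_penalty=0.25,
)
print(result)  # {'the': 1.0, 'cat': -0.75, 'sat': -1.0}
```

Here "the" is only rescaled because it appears in the prompt, "cat" receives all three penalties because it was generated twice, and "sat" is untouched.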

Common Errors & Fixes

01 RuntimeError: CUDA out of memory

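Cause: Model weights plus the pre-allocated KV cache exceed GPU memory. vLLM reserves a gpu_memory_utilization fraction of the GPU up front, and KV-cache blocks are released as each request finishes, so OOM usually means the model itself doesn't fit, another process is holding GPU memory, or memory spikes during startup profiling.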

Fix:

python

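Lower max_model_len or max_num_seqs to shrink activation and KV-cache needs, and lower gpu_memory_utilization (default 0.9) if another process shares the GPU. A 7B model in 16-bit needs about 14 GB for weights alone, so on an 8GB GPU use a 4-bit quantized checkpoint: llm = LLM(model='...', quantization='awq', max_model_len=2048, max_num_seqs=4). For 70B: use tensor_parallel_size=2+ or load a quantized (AWQ) version.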
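A rough budget helps when choosing these values. The sketch below estimates KV-cache size per token and how many tokens fit after weights are loaded. The architecture numbers (32 layers, 32 KV heads, head dimension 128) are those of Llama-2-7B; the 13 GiB weight figure and the 1 GiB overhead allowance are assumptions for illustration, and the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (key and value, every layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_gib, gpu_memory_utilization, weights_gib,
                      bytes_per_token, overhead_gib=1.0):
    """Tokens that fit in the KV cache after weights and a fixed overhead."""
    budget_gib = gpu_gib * gpu_memory_utilization - weights_gib - overhead_gib
    return max(0, int(budget_gib * 1024**3 // bytes_per_token))


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 512 KiB per token in 16-bit

tokens = max_cached_tokens(
    gpu_gib=24,
    gpu_memory_utilization=0.9,
    weights_gib=13.0,       # assumed 16-bit weight size for a 7B model
    bytes_per_token=per_token,
)
print(tokens)  # 15564 tokens shared by all concurrent sequences
```

That token budget is shared across every running sequence, which is why capping max_num_seqs or max_model_len trades concurrency for headroom.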

02 AttributeError: 'LLM' object has no attribute 'serve'

Cause: The LLM class has no .serve() method; serving is a separate entry point, not part of the offline Python API.

Fix:

bash

Use vLLM's OpenAI-compatible API server instead: vllm serve meta-llama/Llama-2-7b-hf (or python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf). Then query it over HTTP with any OpenAI-compatible client.

03 AssertionError: len(prompt_token_ids) > 0

Cause: Empty prompt or tokenizer failed to encode. Blank strings, None, or encoding errors.

Fix:

python

Strip and validate prompts before passing: prompts = [p.strip() for p in prompts if p and p.strip()]. Test the tokenizer: llm.get_tokenizer().encode('test').
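Filtering silently changes the list length, which makes it hard to line results up with inputs. A small helper that records original positions avoids that; this is a sketch with a hypothetical function name, independent of vLLM itself.

```python
def split_valid_prompts(prompts):
    """Separate usable prompts from empty, blank, or non-string entries.

    Returns (valid, skipped): valid is a list of (original_index, prompt)
    pairs and skipped is a list of original indices that were rejected.
    """
    valid, skipped = [], []
    for index, prompt in enumerate(prompts):
        if isinstance(prompt, str) and prompt.strip():
            valid.append((index, prompt))
        else:
            skipped.append(index)
    return valid, skipped


raw = ["What is ML?", "", None, "   ", "Define RAG"]
valid, skipped = split_valid_prompts(raw)
print(valid)    # [(0, 'What is ML?'), (4, 'Define RAG')]
print(skipped)  # [1, 2, 3]
```

Pass [p for _, p in valid] to generate() and use the stored indices to write each completion back to its original slot.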

04 ValueError: tensor_parallel_size must divide num_gpus

Cause: Requested tensor_parallel_size doesn't fit the GPUs visible to the process.

Fix:

bash

tensor_parallel_size=4 needs at least 4 visible GPUs (check CUDA_VISIBLE_DEVICES). Count GPUs with nvidia-smi -L | wc -l. The model's attention-head count must also be divisible by tensor_parallel_size. Launch with plain python script.py; vLLM starts the worker processes itself.
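Before launching, it can help to list which tensor_parallel_size values are even candidates. The sketch below assumes the usual constraint that the attention-head count must be divisible by the parallel size; the function name is hypothetical and 64 heads is the Llama-2-70B figure.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, num_gpus: int) -> list:
    """Sizes that fit the GPU count and evenly divide the attention heads."""
    return [
        tp for tp in range(1, num_gpus + 1)
        if num_attention_heads % tp == 0
    ]


print(valid_tensor_parallel_sizes(num_attention_heads=64, num_gpus=8))  # [1, 2, 4, 8]
print(valid_tensor_parallel_sizes(num_attention_heads=64, num_gpus=6))  # [1, 2, 4]
```

On a 6-GPU machine the largest usable size for a 64-head model is therefore 4, leaving two GPUs idle for that engine.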

Production Gotchas

⚠ Quantization format must match the checkpoint

vLLM normally detects the quantization method from the model's config, so LLM('TheBloke/...-AWQ') usually loads correctly on its own. Passing a method that conflicts with the checkpoint (for example quantization='gptq' on an AWQ model) is typically rejected at load time. Passing the matching value explicitly, as in LLM('TheBloke/...-AWQ', quantization='awq'), documents intent and surfaces packaging problems early. Always check the model card for the quantization type.

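⚠ max_tokens is per-request, not batch-wide

Each prompt can have its own max_tokens by passing a list of SamplingParams (one per prompt) to generate(). max_tokens is an upper bound: generation stops earlier on an end-of-sequence token or stop string, so total output tokens across a batch is unpredictable. If you need exactly N tokens per prompt, set max_tokens=N together with ignore_eos=True. When the KV cache fills up, the scheduler queues or preempts requests and resumes them later; the batch gets slower rather than failing.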

⚠ generate() blocks until all requests finish: no true streaming

llm.generate([prompt1, prompt2]) doesn't return until both complete. For HTTP streaming (token-by-token to client), use the OpenAI-compatible API server (vllm.entrypoints.openai.api_server), not the Python LLM() class directly.
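On the client side, a streamed response from an OpenAI-compatible completions endpoint arrives as server-sent events: lines of the form data: {json}, ending with data: [DONE]. The sketch below parses such lines into text pieces. The function name is hypothetical and the sample lines are hand-written to match that format, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style completion stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text


# Hand-written sample in the completions streaming format.
sample_lines = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": ", world"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))  # Hello, world
```

In practice the lines would come from an HTTP response iterated line by line, or the OpenAI SDK can be pointed at the server and handle this parsing itself.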

⚠ Continuous batching doesn't reduce latency for single requests

If you call llm.generate([single_prompt]), it runs alone: no batch benefit. Continuous batching helps throughput when many requests arrive concurrently. Single-request latency is unchanged.

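⚠ temperature=0.0 means greedy decoding

vLLM treats temperature=0.0 as greedy decoding, which is the supported way to get deterministic-style output; substituting 1e-6 is unnecessary. With greedy decoding, top_p and top_k have no effect because only the most likely token is chosen. Small run-to-run differences can still appear from floating-point effects when batch composition or hardware changes.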

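⚠ GPU memory stays reserved between generate() calls

vLLM allocates weights and the KV cache when LLM() is constructed and keeps that memory for the engine's lifetime; this is by design. The objects returned by llm.generate() live in host memory, so del outputs frees CPU RAM only. To release the GPU, delete the LLM object, then run gc.collect() and torch.cuda.empty_cache() (plus destroy_model_parallel() if tensor parallelism was used). Starting a fresh process is the most reliable option.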

Complete example: batch inference with input validation and a reusable engine

python

from vllm import LLM, SamplingParams


def batch_generate(llm: LLM, prompts: list[str]) -> list[str]:
    """
    Generate one completion per prompt, in input order.

    All prompts go to vLLM in a single call: its scheduler batches them
    continuously and queues whatever does not fit in the KV cache, so
    manual chunking and OOM retry loops are unnecessary.
    """
    # Fail fast on empty or non-string prompts instead of deep in the engine.
    for i, prompt in enumerate(prompts):
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"Prompt {i} is empty or not a string")

    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=150,
        repetition_penalty=1.1,
    )

    outputs = llm.generate(prompts, sampling_params)
    # generate() returns results in the same order as the input prompts.
    return [output.outputs[0].text for output in outputs]


# Usage
if __name__ == "__main__":
    # Build the engine once and reuse it; loading weights is the slow part.
    llm = LLM(
        model="meta-llama/Llama-2-7b-hf",
        gpu_memory_utilization=0.85,
        max_num_seqs=16,  # Cap on sequences scheduled concurrently
    )
    test_prompts = [
        "What is machine learning?",
        "Explain quantum computing",
        "Define neural networks",
    ]
    results = batch_generate(llm, test_prompts)
    for prompt, result in zip(test_prompts, results):
        print(f"Q: {prompt}")
        print(f"A: {result}\n")

vLLM Comparison

Aspect	vLLM Python API	vLLM OpenAI API Server
Use Case	Batch inference, local scripts	HTTP API, production microservice
Streaming	None in LLM.generate() (blocks until done)	Server-side HTTP streaming
Setup	pip install vllm; 1 minute	vllm.entrypoints.openai.api_server; 2 minutes
Throughput	Max (continuous batching)	Slightly lower (HTTP overhead)
Client Libraries	Python only	Any language (OpenAI SDK compatible)
Latency	Lower (in-process)	Higher (network round-trip)

Verified 2026-04 · v0.8.x
