Critical severity · Intermediate · Fix: 2-5 min

RuntimeError

vllm.RuntimeError: CUDA out of memory

What this error means

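vLLM raises this error when a CUDA memory allocation fails, typically because the model weights, the KV cache, and the batch of sequences being processed together need more memory than the GPU has available. The exception is Python's built-in RuntimeError; on recent PyTorch versions it surfaces as torch.cuda.OutOfMemoryError, which is a subclass of RuntimeError.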

Stack trace

traceback

Traceback (most recent call last):
  File "generate.py", line 42, in <module>
    outputs = llm.generate(prompts, batch_size=64)  # triggers error
  File "/usr/local/lib/python3.9/site-packages/vllm/llm.py", line 210, in generate
    raise RuntimeError("CUDA out of memory")
vllm.RuntimeError: CUDA out of memory

QUICK FIX

Reduce how many sequences vLLM batches at once, for example by passing a smaller max_num_seqs to LLM(...), or send fewer prompts per llm.generate() call so the workload fits in your GPU memory.

Why it happens

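vLLM processes multiple prompts in batches on the GPU. The model weights, the KV cache, and the activations for every sequence in a batch must all fit in VRAM at the same time. If the batch is too large relative to the memory left over after the weights are loaded, the CUDA allocator cannot satisfy the request and PyTorch raises a RuntimeError. This is common with large models, long sequences, or GPUs with limited VRAM.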

Detection

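Check free GPU memory before creating the engine or calling generate(), and catch RuntimeError exceptions whose message contains "out of memory" so that out-of-memory conditions are detected early and not confused with other runtime failures.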
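As a minimal sketch of that pre-flight check, the snippet below reads free and total GPU memory through PyTorch's torch.cuda.mem_get_info() and classifies a caught exception as an out-of-memory failure. The helper names and the 20% threshold are illustrative assumptions, not part of vLLM.

```python
def has_headroom(free_bytes: int, total_bytes: int, min_free_fraction: float = 0.2) -> bool:
    """Return True when at least min_free_fraction of GPU memory is free."""
    if total_bytes <= 0:
        return False
    return free_bytes / total_bytes >= min_free_fraction


def is_cuda_oom(exc: BaseException) -> bool:
    """Heuristic: a RuntimeError whose message mentions 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def gpu_has_headroom(device: int = 0, min_free_fraction: float = 0.2) -> bool:
    """Query a real device; requires PyTorch and a CUDA-capable GPU."""
    import torch  # imported lazily so the pure helpers work without torch

    # mem_get_info returns (free_bytes, total_bytes) for the device.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return has_headroom(free_bytes, total_bytes, min_free_fraction)
```

Calling gpu_has_headroom() before constructing the engine gives an early warning when another process already holds most of the device; is_cuda_oom() can be used inside an except block to separate memory failures from unrelated RuntimeErrors.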

Causes & fixes

Batch size exceeds available GPU memory capacity

✓ Fix

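Reduce the number of sequences processed at once, for example by lowering max_num_seqs on the LLM(...) constructor or by passing fewer prompts to each llm.generate() call, so that the batch fits within GPU memory limits.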

Model size too large for current GPU memory with given batch size

✓ Fix

Use a smaller or quantized model variant, switch to a GPU with more VRAM, or shard the model across several GPUs with tensor_parallel_size.

Other processes occupying GPU memory reducing available capacity

✓ Fix

Free GPU memory by terminating other GPU-intensive processes (nvidia-smi lists the processes holding memory on each device), or restart the GPU environment.

Code: broken vs fixed

Broken - triggers the error

python

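from vllm import LLM

# Engine defaults allow a large number of sequences to be batched at once
llm = LLM(model="llama-3.3-70b")
prompts = ["Hello world"] * 64
outputs = llm.generate(prompts)  # can raise CUDA out of memory when VRAM is insufficient
print(outputs)

Fixed - works correctly

python

import os

# Select the GPU before importing vllm so the setting reliably takes effect
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM

llm = LLM(
    model="llama-3.3-70b",
    max_num_seqs=16,  # cap on sequences batched per scheduling step
)
prompts = ["Hello world"] * 16  # fewer prompts per call
outputs = llm.generate(prompts)
for output in outputs:
    print(output.outputs[0].text)  # generated text of the first completion

Capping max_num_seqs at 16 and sending 16 prompts instead of 64 limits how many sequences vLLM schedules at once, which can lower peak GPU memory use enough to avoid the CUDA out of memory RuntimeError. Note that llm.generate() has no batch_size argument; batching is controlled through engine arguments passed to LLM(...).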

⚠ Workaround

Catch the RuntimeError and retry generation with progressively smaller batches of prompts until it succeeds.
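The retry idea can be sketched as a small backoff loop. This is an illustration under assumptions, not a vLLM feature: generate_with_backoff and generate_fn are hypothetical names, and generate_fn stands for any callable that takes a chunk of prompts and returns one result per prompt.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def generate_with_backoff(
    prompts: Sequence[P],
    generate_fn: Callable[[Sequence[P]], List[R]],
    chunk_size: int,
) -> List[R]:
    """Run generate_fn over prompts in chunks, halving the chunk on CUDA OOM."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    results: List[R] = []
    start = 0
    while start < len(prompts):
        chunk = prompts[start:start + chunk_size]
        try:
            results.extend(generate_fn(chunk))
        except RuntimeError as exc:
            # Re-raise anything that is not an out-of-memory failure,
            # and give up once even a single prompt does not fit.
            if "out of memory" not in str(exc).lower() or chunk_size == 1:
                raise
            chunk_size = max(1, chunk_size // 2)
            continue  # retry the same position with a smaller chunk
        start += len(chunk)
    return results
```

With vLLM, generate_fn could wrap a call such as llm.generate(list(chunk)) on an existing engine. An out-of-memory failure inside the engine may leave it in an unusable state, so treat this as a stopgap and be prepared to recreate the engine between retries.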

✓ Prevention

Size batches dynamically from the GPU memory that is actually available, or profile memory use ahead of time and set the batch limit before calling generate().
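One rough way to size a batch ahead of time is to estimate the KV-cache footprint of a single sequence and divide the memory budget by it. The sketch below uses the common estimate of 2 (keys and values) × layers × KV heads × head dimension × bytes per element × sequence length. The function names, the safety fraction, and the example model dimensions are assumptions for illustration; the estimate ignores model weights and activations, which must be subtracted from the free memory first.

```python
def kv_cache_bytes_per_sequence(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    dtype_bytes: int = 2,
) -> int:
    """Bytes of key/value cache one sequence needs at full length."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len


def max_concurrent_sequences(
    free_bytes: int,
    per_sequence_bytes: int,
    safety_fraction: float = 0.9,
) -> int:
    """How many full-length sequences fit in a fraction of the free memory."""
    if per_sequence_bytes <= 0:
        raise ValueError("per_sequence_bytes must be positive")
    budget = int(free_bytes * safety_fraction)
    return budget // per_sequence_bytes


# Example with assumed dimensions: 32 layers, 8 KV heads, head_dim 128,
# 4096-token sequences, 2-byte (fp16/bf16) cache entries.
per_seq = kv_cache_bytes_per_sequence(32, 8, 128, 4096)
limit = max_concurrent_sequences(16 * 1024**3, per_seq, safety_fraction=0.5)
```

The resulting limit is only a starting point for a setting such as max_num_seqs; vLLM performs its own memory profiling at startup, so measured behaviour on the target GPU should take precedence over this estimate.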

Python 3.9+ · vllm >=0.1.0 · tested on 0.3.0

Verified 2026-04
