RuntimeError
vllm.RuntimeError: CUDA out of memory
Stack trace
Traceback (most recent call last):
  File "generate.py", line 42, in <module>
    outputs = llm.generate(prompts, batch_size=64)  # triggers error
  File "/usr/local/lib/python3.9/site-packages/vllm/llm.py", line 210, in generate
    raise RuntimeError("CUDA out of memory")
vllm.RuntimeError: CUDA out of memory
Why it happens
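vLLM processes multiple prompts together in batches on the GPU. Model weights, the KV cache, and per-batch activations all have to fit in VRAM at the same time. If the batch is too large for the memory left over after the weights are loaded, the CUDA allocator cannot satisfy the request and a RuntimeError is raised. This is most common with large models or GPUs with limited VRAM.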
Detection
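Check free GPU memory before calling generate() (for example with nvidia-smi or torch.cuda.mem_get_info()), and wrap the call in a try/except for RuntimeError so that out-of-memory conditions are detected and reported early instead of crashing the script.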
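The two checks described above can be sketched as follows. This is a minimal sketch, not part of vLLM: the helper names free_gpu_memory_gib and is_cuda_oom are hypothetical, and it assumes PyTorch is the CUDA backend. torch.cuda.mem_get_info() is a real PyTorch call that returns free and total bytes for a device; the import is done lazily so the helper returns None on a machine without PyTorch or without a GPU.

```python
from typing import Optional


def free_gpu_memory_gib(device: int = 0) -> Optional[float]:
    """Return free memory on the given GPU in GiB, or None if CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 2**30


def is_cuda_oom(exc: BaseException) -> bool:
    """Heuristic (assumption): treat a RuntimeError mentioning 'out of memory' as an OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()
```

A typical use would be to log free_gpu_memory_gib() before the generate() call and, in the except RuntimeError handler, call is_cuda_oom(exc) to separate memory failures from other runtime errors.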
Causes & fixes
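Cause: The batch exceeds the available GPU memory.
Fix: Reduce the number of sequences processed at once. Note that vLLM's LLM.generate() does not take a batch_size argument; cap concurrency with the max_num_seqs engine argument passed to LLM(), or submit fewer prompts per call.
Cause: The model is too large for the GPU's memory at the given batch size.
Fix: Use a smaller model variant, shard the model across several GPUs with tensor_parallel_size, or switch to a GPU with more VRAM.
Cause: Other processes are occupying GPU memory and reducing the available capacity.
Fix: Free GPU memory by terminating other GPU-intensive processes (nvidia-smi lists them), or restart the GPU environment.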
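To judge whether the model itself is the problem, a rough lower bound on weight memory is parameter count times bytes per parameter. The sketch below is illustrative arithmetic only: the function names are hypothetical, 2 bytes per parameter assumes 16-bit weights, and the 0.9 usable fraction is an assumed headroom factor. It ignores the KV cache and activations, so the real requirement is higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Lower bound on memory for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def weights_fit(
    num_params: float,
    vram_gib: float,
    bytes_per_param: int = 2,
    usable_fraction: float = 0.9,  # assumption: share of VRAM available to the model
) -> bool:
    """True if the weights alone fit in the usable share of one GPU's VRAM."""
    return weight_memory_gib(num_params, bytes_per_param) < vram_gib * usable_fraction
```

By this estimate a 70-billion-parameter model in 16-bit precision needs roughly 130 GiB for weights alone, so weights_fit(70e9, 80) is False: on a single 80 GiB GPU no batch size reduction would help, and a smaller model, more GPUs, or more VRAM is required.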
Code: broken vs fixed
# Broken
from vllm import LLM

llm = LLM(model="llama-3.3-70b")
prompts = ["Hello world"] * 64
outputs = llm.generate(prompts)  # all 64 prompts batched at once: triggers CUDA out of memory error
print(outputs)

# Fixed
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # select the GPU before vllm initializes CUDA

from vllm import LLM

# generate() has no batch_size argument; max_num_seqs caps how many
# sequences the engine batches together on the GPU
llm = LLM(model="llama-3.3-70b", max_num_seqs=16)
prompts = ["Hello world"] * 16  # fewer prompts per call to fit GPU memory
outputs = llm.generate(prompts)  # fixed: smaller batch
print(outputs)
Workaround
Catch the RuntimeError and retry generation with a progressively smaller batch size until it succeeds or a minimum batch size is reached.
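The retry loop described above could look like the sketch below. It is written against a generic callable rather than vLLM directly: generate_with_backoff and its parameters are hypothetical names, and generate_fn stands in for something like lambda chunk: llm.generate(list(chunk)). It also assumes the engine is still usable after an out-of-memory error, which may not hold in practice; if it does not, the engine would have to be recreated before retrying.

```python
from typing import Callable, List, Sequence


def generate_with_backoff(
    generate_fn: Callable[[Sequence[str]], List],
    prompts: Sequence[str],
    initial_batch_size: int,
    min_batch_size: int = 1,
) -> List:
    """Run generate_fn over prompts in chunks, halving the chunk size on OOM."""
    if min_batch_size < 1:
        raise ValueError("min_batch_size must be at least 1")
    batch_size = max(initial_batch_size, min_batch_size)
    results: List = []
    start = 0
    while start < len(prompts):
        chunk = prompts[start:start + batch_size]
        try:
            results.extend(generate_fn(chunk))
        except RuntimeError as exc:
            is_oom = "out of memory" in str(exc).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # not an OOM, or nothing left to shrink
            batch_size = max(batch_size // 2, min_batch_size)
            continue  # retry the same chunk position with a smaller batch
        start += len(chunk)
    return results
```

Once a smaller size succeeds it is kept for the remaining prompts, so the loop does not repeatedly hit the same failure. Errors that are not out-of-memory conditions are re-raised unchanged.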
Prevention
Size batches dynamically from the GPU memory that is actually free, or profile memory use ahead of time and fix the batch size before any generation call.
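One way to derive a batch size from free memory is sketched below. Everything here is an assumption for illustration: pick_batch_size is a hypothetical helper, bytes_per_sequence must come from your own profiling of the model at your prompt and output lengths, and the 0.8 safety margin is an arbitrary headroom factor.

```python
def pick_batch_size(
    free_bytes: int,
    bytes_per_sequence: int,
    safety_margin: float = 0.8,  # assumption: use only 80% of free memory
    max_batch_size: int = 64,
) -> int:
    """Largest batch size whose estimated footprint fits in the usable free memory."""
    if bytes_per_sequence <= 0:
        raise ValueError("bytes_per_sequence must be positive")
    usable_bytes = free_bytes * safety_margin
    fitting = int(usable_bytes // bytes_per_sequence)
    # Always return at least 1 so the caller can still attempt generation
    return max(1, min(fitting, max_batch_size))
```

For example, with 8 GiB free and a measured 512 MiB per sequence, pick_batch_size(8 * 2**30, 512 * 2**20) yields 12. The free_bytes value could come from torch.cuda.mem_get_info() read just before the call.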