Fix vLLM CUDA out of memory error
Quick answer
A vLLM CUDA out of memory error occurs when the GPU does not have enough free memory for the requested batch size or sequence length. Reduce the batch size, lower `max_tokens` in `SamplingParams`, and manage GPU memory by limiting concurrency or offloading to CPU where supported.

Error type: config_error

Quick fix: Reduce the batch size or `max_tokens` in vLLM `SamplingParams` so the workload fits within your GPU memory limits.

Why this happens

The CUDA out of memory error in vLLM arises when the GPU cannot hold the model's weights plus the KV cache for the batch of input and generated tokens. This typically happens when you set a large batch size or a high `max_tokens` for generation, or when your GPU has limited VRAM.
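To see why large batches exhaust VRAM, the KV cache alone can be estimated with a few multiplications. This is a back-of-the-envelope sketch: the Llama-3.1-8B shape constants below are assumptions based on its published config, and real usage adds model weights, activations, and allocator overhead on top.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, dtype_bytes: int = 2) -> int:
    """Rough KV-cache size: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head dim 128, fp16.
per_token = kv_cache_bytes(32, 8, 128, n_tokens=1)       # 131072 B = 128 KiB
batch = kv_cache_bytes(32, 8, 128, n_tokens=16 * 512)    # 2**30 B = 1 GiB
print(per_token, batch)
```

So a batch of 16 prompts each generating 512 tokens needs roughly a gigabyte of KV cache on top of the ~16 GB of fp16 weights, which is why an otherwise-working setup can tip over the limit.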
Example triggering code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Hello, world!"] * 16,
    SamplingParams(max_tokens=512),  # large batch and token count
)
```
This can cause the error:

```text
RuntimeError: CUDA out of memory. Tried to allocate ... bytes.
```
The fix
Reduce the batch size and `max_tokens` so the request fits in GPU memory. For example, lower the batch from 16 prompts to 4 and `max_tokens` from 512 to 128; this shrinks the memory needed per generation call.
Example fixed code:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(
    ["Hello, world!"] * 4,
    SamplingParams(max_tokens=128),  # reduced batch and tokens
)
for output in outputs:
    print(output.outputs[0].text)
```

Output:

```text
Hello, world! ... (model generated text)
```
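Besides shrinking the request, vLLM's engine arguments also control how much VRAM the engine reserves and how large the KV cache can grow. A sketch of the relevant knobs (parameter names are from vLLM's `LLM` constructor; the values are illustrative, not recommendations):

```python
from vllm import LLM

# Engine-level memory knobs; tune the values for your GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.85,  # fraction of VRAM vLLM may reserve (default 0.9)
    max_model_len=2048,           # cap context length to shrink the KV cache
)
```

Lowering `max_model_len` is often the cheapest fix when your prompts are short, since the KV cache is sized for the maximum context length.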
Preventing it in production
Implement dynamic batch sizing based on available GPU memory, and use memory profiling tools to monitor VRAM usage. Consider these strategies:

- Use a smaller batch size and `max_tokens` during peak load.
- Call `llm.generate` with concurrency limits.
- Offload parts of the model to CPU if supported.
- Catch `RuntimeError` and retry with smaller batch sizes.
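The catch-and-retry strategy above can be sketched as a small wrapper. The helper name and the halving policy are illustrative assumptions, not part of the vLLM API; `generate_fn` is whatever callable wraps your `llm.generate` call.

```python
def generate_with_backoff(generate_fn, prompts, min_batch=1):
    """Run generate_fn over prompts, halving the chunk size after a CUDA OOM.

    generate_fn: callable taking a list of prompts and returning a list of
    outputs (e.g. a thin wrapper around llm.generate). On a RuntimeError
    mentioning 'out of memory', all work is retried with smaller chunks.
    """
    batch = len(prompts)
    while True:
        try:
            outputs = []
            for i in range(0, len(prompts), batch):
                outputs.extend(generate_fn(prompts[i:i + batch]))
            return outputs
        except RuntimeError as err:
            if "out of memory" not in str(err) or batch <= min_batch:
                raise  # not an OOM, or already at the smallest batch
            batch = max(min_batch, batch // 2)  # retry with a smaller batch
```

Note that a failure discards partial results and reruns the whole prompt list; for long jobs you may prefer to checkpoint completed chunks.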
Key Takeaways
- Always tune the batch size and `max_tokens` to fit your GPU memory.
- Use `SamplingParams` to control generation length and memory footprint.
- Monitor GPU memory usage and implement retry logic with smaller batches.
- Offload model parts to CPU if your GPU VRAM is limited.
- Catch and handle `RuntimeError` to maintain app stability.