How to · Intermediate · 3 min read

vLLM memory optimization techniques

Quick answer
Reduce vLLM GPU memory usage with quantization, a lower gpu_memory_utilization, a shorter max_model_len, and a capped max_tokens in SamplingParams. On multi-GPU machines, raise tensor_parallel_size to split model weights across devices.

PREREQUISITES

  • Python 3.9+
  • pip install vllm
  • NVIDIA GPU with CUDA support (optional but recommended)

Setup

Install the vllm package and verify your environment supports CUDA for GPU acceleration. Set up your Python environment with the required dependencies.

bash
pip install vllm
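To confirm a CUDA GPU is actually visible before loading a model, a quick stdlib-only check like the following works (an illustrative helper, not part of vLLM; it just shells out to nvidia-smi):

```python
import shutil
import subprocess

def cuda_gpu_visible() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    )
    return result.returncode == 0 and bool(result.stdout.strip())

print("CUDA GPU visible:", cuda_gpu_visible())
```

If this prints False, vLLM will either fail to start or fall back to a much slower CPU path, depending on your build.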

Step by step

This example demonstrates how to reduce memory usage by loading the model with quantization and limiting output length with SamplingParams.

python
from vllm import LLM, SamplingParams

# Initialize the LLM with online FP8 quantization to shrink weight memory;
# gpu_memory_utilization caps the fraction of GPU memory vLLM may allocate
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    gpu_memory_utilization=0.85,
)

# Define prompts
prompts = ["Explain memory optimization in vLLM."]

# Set sampling parameters with max_tokens to limit output length
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print the generated text
for output in outputs:
    print(output.outputs[0].text.strip())
output
Memory optimization in vLLM includes techniques like quantization, batch size tuning, and offloading to reduce GPU usage and improve inference efficiency.
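To see why context length and precision matter so much, here is a back-of-the-envelope KV-cache estimate. The architecture numbers (32 layers, 8 KV heads via grouped-query attention, head dimension 128) are the published Llama-3.1-8B values; the helper itself is plain arithmetic, not a vLLM API:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int) -> int:
    """Bytes of KV cache per token: keys + values across every layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3.1-8B uses grouped-query attention: 32 layers, 8 KV heads, head dim 128
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)  # FP16 KV cache
print(per_token)                     # 131072 bytes (128 KiB) per token
print(per_token * 4096 / 2**20)      # 512.0 MiB for one 4096-token sequence
print(kv_cache_bytes_per_token(32, 8, 128, 1) * 4096 / 2**20)  # 256.0 with an FP8 KV cache
```

This is why lowering max_model_len or quantizing the KV cache frees so much memory: the cache grows linearly with both context length and bytes per element.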

Common variations

You can optimize further with CPU offloading (the cpu_offload_gb argument), a lower max_num_seqs to cap how many requests are batched concurrently, or more aggressive 4-bit quantization. Note that 4-bit in vLLM means loading a pre-quantized AWQ or GPTQ checkpoint rather than passing a quantize flag, and on multi-GPU machines you can raise tensor_parallel_size to split the weights across devices.

python
from vllm import LLM, SamplingParams

# 4-bit AWQ checkpoint, split across two GPUs with tensor parallelism
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",  # example pre-quantized AWQ build
    quantization="awq",
    tensor_parallel_size=2,
)

prompts = ["Optimize memory usage in vLLM."]
sampling_params = SamplingParams(temperature=0.5, max_tokens=30)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text.strip())
output
To optimize memory in vLLM, use smaller batch sizes, int4 quantization, and tensor parallelism to distribute GPU memory load effectively.

Troubleshooting

If you hit CUDA out-of-memory errors, lower gpu_memory_utilization, shrink max_model_len (a shorter context means a smaller KV cache), reduce max_num_seqs, or enable quantization. Also verify your GPU driver and CUDA toolkit are current. If inference is slow rather than memory-bound, tensor parallelism or CPU offloading can trade speed against memory headroom.
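When deciding how aggressively to quantize, a rough weights-only estimate (plain arithmetic, excluding KV cache and activations) shows the headroom each precision buys:

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight memory in GB: params * bits / 8, with GB ~= 1e9 bytes."""
    return params_billions * bits / 8

# An 8B-parameter model at common precisions (weights only)
print(weight_gb(8, 16))  # 16.0 GB in FP16/BF16
print(weight_gb(8, 8))   # 8.0 GB in FP8/INT8
print(weight_gb(8, 4))   # 4.0 GB with 4-bit AWQ/GPTQ
```

On a 24 GB card, for example, FP16 weights alone leave only ~8 GB for the KV cache and activations, while a 4-bit checkpoint leaves ~20 GB.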

Key Takeaways

  • Use quantization (FP8, or 4-bit via pre-quantized AWQ/GPTQ checkpoints) to significantly reduce GPU memory usage in vLLM.
  • Cap max_tokens in SamplingParams, and max_num_seqs and max_model_len on the LLM, to control the memory footprint during inference.
  • Enable tensor parallelism to distribute model memory across multiple GPUs for large models.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct