High severity · intermediate · Fix: 2-5 min

RuntimeError

torch.cuda.OutOfMemoryError

What this error means
This error occurs when the GPU runs out of memory while processing a batch of embeddings, causing the embedding generation to fail.

Stack trace

traceback
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 8.00 GiB total capacity; 6.50 GiB already allocated; 256.00 MiB free; 6.75 GiB reserved in total by PyTorch)
QUICK FIX
Immediately reduce the batch size for embedding generation to avoid exceeding GPU memory limits.

Why it happens

When embeddings are generated in batches on a GPU, the model weights plus the per-batch activations (which grow with batch size and padded sequence length) can exceed the available GPU memory. PyTorch then raises a CUDA out of memory error because it cannot allocate the tensors required for the embedding computation.

Detection

Monitor GPU memory usage during embedding batch processing with nvidia-smi or PyTorch's torch.cuda.memory_allocated() and torch.cuda.memory_reserved(), so you can see usage approaching the device limit before the error occurs.
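A minimal sketch of such a check follows. The helper names `near_memory_limit` and `gpu_memory_snapshot` and the 0.9 threshold are assumptions made for illustration, not part of PyTorch. The `torch.cuda` calls inside the snapshot function are real PyTorch APIs.

```python
def near_memory_limit(used_bytes, total_bytes, threshold=0.9):
    """Return True when used/total meets or exceeds the threshold fraction."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def gpu_memory_snapshot(device=0):
    """Read the current memory counters for one CUDA device, in bytes."""
    # Imported here so near_memory_limit() stays usable on machines without a GPU.
    import torch

    free, total = torch.cuda.mem_get_info(device)
    return {
        "allocated": torch.cuda.memory_allocated(device),  # held by live tensors
        "reserved": torch.cuda.memory_reserved(device),    # held by PyTorch's caching allocator
        "free": free,    # free on the device, across all processes
        "total": total,
    }
```

On a CUDA machine, call `snap = gpu_memory_snapshot()` before each batch and pass `snap["total"] - snap["free"]` and `snap["total"]` to `near_memory_limit` to decide whether to shrink the next batch.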

Causes & fixes

1

Batch size is too large for the available GPU memory.

✓ Fix

Reduce the batch size for embedding generation to fit within the GPU memory limits.

2

Multiple processes or models are using the GPU simultaneously, reducing available memory.

✓ Fix

Ensure exclusive GPU access or reduce concurrent GPU workloads to free memory for embedding tasks.

3

Model or embedding dimension is too large for the GPU capacity.

✓ Fix

Use a smaller embedding model or reduce embedding dimensionality to lower memory consumption.

4

GPU memory fragmentation due to previous allocations not being freed.

✓ Fix

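Restart the Python process, or drop references to tensors you no longer need and then call torch.cuda.empty_cache() before running embeddings. empty_cache() only releases cached blocks that no live tensor is using, so it cannot free memory still held by referenced tensors.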

Code: broken vs fixed

Broken - triggers the error
python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2').cuda()

texts = ['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8']
batch_size = 8  # Too large for GPU memory

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt').to('cuda')
    outputs = model(**inputs)  # RuntimeError: CUDA out of memory here
Fixed - works correctly
python
import os

# Pin the process to GPU 0; set this before torch initializes CUDA
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2').cuda()

texts = ['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8']
batch_size = 2  # Reduced batch size to fit GPU memory

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors='pt').to('cuda')
    # Inference only: skip autograd bookkeeping so activations are not kept for backprop
    with torch.no_grad():
        outputs = model(**inputs)
    print(f'Processed batch {i//batch_size + 1}')
Reduced batch size from 8 to 2 to fit within GPU memory limits and prevent CUDA out of memory errors. The forward pass also runs under torch.no_grad(), so PyTorch does not retain activations for backpropagation, which embedding inference does not need.

Workaround

Catch the RuntimeError (in recent PyTorch releases, torch.cuda.OutOfMemoryError is a subclass of it), clear the GPU cache with torch.cuda.empty_cache(), and retry with a dynamically reduced batch size.
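One way to sketch this retry loop is shown below. The function `run_with_backoff` and its parameters are hypothetical names chosen for illustration. `process_batch` stands in for whatever function runs the tokenizer and model on one batch, and `on_oom` is an optional hook where `torch.cuda.empty_cache` could be passed. Halving the batch size on each failure is one reasonable policy, not the only one.

```python
def run_with_backoff(items, process_batch, batch_size, on_oom=None, min_batch_size=1):
    """Process items in batches, halving the batch size after each CUDA OOM error."""
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        oom = False
        try:
            out = process_batch(batch)
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err).lower():
                raise
            oom = True
        if oom:
            # Recover outside the except block so the exception no longer
            # holds references to the failed batch's tensors.
            if batch_size <= min_batch_size:
                raise RuntimeError("out of memory even at the minimum batch size")
            if on_oom is not None:
                on_oom()  # e.g. torch.cuda.empty_cache
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        results.extend(out)
        i += len(batch)
    return results
```

With PyTorch, a call might look like `run_with_backoff(texts, embed_batch, batch_size=8, on_oom=torch.cuda.empty_cache)`, where `embed_batch` is your own function that returns one embedding per input text.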

Prevention

Implement dynamic batch sizing based on available GPU memory and monitor memory usage to avoid over-allocation during embedding generation.
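A rough sketch of memory-based batch sizing is given below. The names `pick_batch_size` and `batch_size_for_device`, the 0.8 safety factor, and the cap of 64 are illustrative assumptions. `bytes_per_item` is an estimate you would have to measure yourself, for example by checking `torch.cuda.max_memory_allocated()` after a small trial batch. `torch.cuda.mem_get_info` is a real PyTorch call that returns free and total device memory in bytes.

```python
def pick_batch_size(free_bytes, bytes_per_item, max_batch_size=64, safety=0.8):
    """Largest batch size whose estimated footprint fits in a fraction of free memory."""
    if bytes_per_item <= 0:
        raise ValueError("bytes_per_item must be positive")
    budget = int(free_bytes * safety)  # leave headroom for allocator overhead
    # Never return less than 1, never more than the configured cap.
    return max(1, min(max_batch_size, budget // bytes_per_item))


def batch_size_for_device(bytes_per_item, device=0, **kwargs):
    """Query free memory on a CUDA device and size the batch from it."""
    # Imported here so pick_batch_size() stays usable without a GPU.
    import torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return pick_batch_size(free_bytes, bytes_per_item, **kwargs)
```

The estimate is only as good as `bytes_per_item`: padded sequence length changes the per-item cost, so measuring with the longest inputs you expect gives a more conservative result.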

Python 3.9+ · transformers >=4.0.0 · tested on 4.30.0
Verified 2026-04