How-to · Intermediate · 3 min read

Fix Modal GPU out of memory

Quick answer
To fix Modal GPU out-of-memory errors, reduce the batch size or input length in your function, switch to a smaller model, or use mixed precision if supported. Also release GPU memory promptly: delete tensors you no longer need and call torch.cuda.empty_cache() inside your Modal function.
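
Before changing anything, it helps to estimate whether a model's weights even fit on your GPU: parameter count × bytes per parameter (4 for fp32, 2 for fp16). A rough back-of-the-envelope sketch (the helper name and parameter counts are illustrative; gpt2 has roughly 124M parameters):

```python
def estimate_weight_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Rough GPU memory needed just for model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

# gpt2 (~124M params) in fp32 vs. a 7B model in fp16:
print(round(estimate_weight_gib(124_000_000, bytes_per_param=4), 2))   # ~0.46 GiB
print(round(estimate_weight_gib(7_000_000_000, bytes_per_param=2), 1)) # ~13.0 GiB
```

Activations, the KV cache, and CUDA overhead come on top of this, so leave generous headroom; an A10G has 24 GB.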

PREREQUISITES

  • Python 3.8+
  • Modal account and CLI installed
  • Modal Python package installed (pip install modal)
  • Basic knowledge of GPU programming and PyTorch

Setup

Install the modal package and make sure your Modal account has GPU quota. Authenticate the CLI once, or supply token credentials through environment variables for headless runs.

  • Install the Modal SDK: pip install modal
  • Authenticate the Modal CLI: modal setup
  • Headless alternative: export MODAL_TOKEN_ID=... and MODAL_TOKEN_SECRET=...
bash
pip install modal
output
Collecting modal
  Downloading modal-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: modal
Successfully installed modal-1.x.x

Step by step

Use a smaller batch size and clear GPU cache inside your Modal function to avoid out of memory errors. Here is a complete example that loads a model, runs inference with batch size 1, and clears cache after use.

python
import modal

app = modal.App()

@app.function(gpu="A10G", image=modal.Image.debian_slim().pip_install("torch", "transformers"))
def run_inference(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load smaller model to reduce memory usage
    model_name = "gpt2"  # smaller than large models
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Use batch size 1
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=50)

    # Clear GPU cache
    del inputs, model
    torch.cuda.empty_cache()

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    with app.run():
        result = run_inference.remote("Hello, Modal GPU memory!")
        print(result)
output
Hello, Modal GPU memory! (and generated text...)
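
The same idea scales to many prompts: instead of tokenizing them all at once, split the work into micro-batches so only a small slice sits on the GPU at a time. A minimal, framework-free sketch of the batching logic (the helper name and batch size of 2 are illustrative):

```python
from typing import Iterator, List

def micro_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = ["a", "b", "c", "d", "e"]
print(list(micro_batches(prompts, batch_size=2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Inside the Modal function you would run inference on each slice in turn, clearing the cache between slices if needed.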

Common variations

You can also try these variations to reduce GPU memory usage:

  • Use mixed precision by wrapping inference in torch.autocast("cuda").
  • Switch to a smaller or quantized model if one is available.
  • Use CPU instead of GPU by dropping gpu="A10G" from @app.function and loading the model on "cpu".
  • Stream outputs to reduce the memory footprint.
python
import modal

app = modal.App()

@app.function(gpu="A10G", image=modal.Image.debian_slim().pip_install("torch", "transformers"))
def run_inference_mixed_precision(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad(), torch.autocast("cuda"):
        outputs = model.generate(**inputs, max_length=50)

    del inputs, model
    torch.cuda.empty_cache()

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
output
Hello, Modal GPU memory! (and generated text...)
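
The streaming variation deserves a sketch of its own. In transformers you would use TextIteratorStreamer; stripped down to plain Python (the whitespace tokenizer below is purely illustrative), the idea is to consume output piece by piece instead of materializing it all at once:

```python
from typing import Iterator

def stream_tokens(text: str) -> Iterator[str]:
    """Yield one token at a time instead of building the full output first."""
    for token in text.split():
        yield token

# Only one token is held by the consumer at any moment.
for tok in stream_tokens("hello from a streamed response"):
    print(tok)
```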

Troubleshooting

If you still get CUDA out of memory errors:

  • Reduce batch size or input length further.
  • Restart your Modal function to clear stale GPU memory.
  • Check for memory leaks by ensuring variables are deleted and cache cleared.
  • Inspect runtime errors with modal app logs.
  • Consider using a GPU with more memory or switch to CPU for testing.
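
A common defensive pattern ties the troubleshooting steps above together: catch the out-of-memory error and retry with a halved batch size. The sketch below uses a stand-in workload and catches Python's MemoryError; in real code you would catch torch.cuda.OutOfMemoryError instead:

```python
def run_with_backoff(run, batch_size: int, min_batch_size: int = 1):
    """Retry run(batch_size), halving batch_size on each out-of-memory error."""
    while batch_size >= min_batch_size:
        try:
            return run(batch_size)
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            batch_size //= 2
    raise MemoryError("out of memory even at the minimum batch size")

# Simulated workload that only fits at batch size <= 4.
def fake_inference(batch_size: int) -> int:
    if batch_size > 4:
        raise MemoryError
    return batch_size

print(run_with_backoff(fake_inference, batch_size=16))  # 4
```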

Key Takeaways

  • Reduce batch size and input length to fit GPU memory constraints in Modal functions.
  • Clear GPU memory with torch.cuda.empty_cache() after inference to prevent leaks.
  • Use smaller or quantized models to lower GPU memory usage.
  • Enable mixed precision with torch.cuda.amp.autocast() to save memory during inference.
  • Monitor Modal logs and restart functions if persistent out of memory errors occur.