How-to · Intermediate · 3 min read

Fix Modal GPU out of memory

Quick answer
To fix Modal GPU out-of-memory errors, reduce the batch size or input length in your function, switch to a smaller model, or use mixed precision if supported. Also release GPU memory promptly: delete tensors you no longer need and call torch.cuda.empty_cache() inside your Modal function.
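
Before changing anything, it helps to estimate whether a model's weights even fit on your GPU: parameter count × bytes per parameter (4 for fp32, 2 for fp16). A rough back-of-the-envelope sketch (the helper name and parameter counts are illustrative; gpt2 has roughly 124M parameters):

```python
def estimate_weight_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Rough GPU memory needed just for model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

# gpt2 (~124M params) in fp32 vs. a 7B model in fp16:
print(round(estimate_weight_gib(124_000_000, bytes_per_param=4), 2))   # ~0.46 GiB
print(round(estimate_weight_gib(7_000_000_000, bytes_per_param=2), 1)) # ~13.0 GiB
```

Activations, the KV cache, and CUDA overhead come on top of this, so leave generous headroom; an A10G has 24 GB.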

PREREQUISITES

  • Python 3.8+
  • Modal account and CLI installed
  • Modal Python package installed (pip install modal)
  • Basic knowledge of GPU programming and PyTorch

Setup

Install the modal package and make sure your Modal account has GPU quota. Authenticate the CLI once, or supply token credentials through environment variables for headless runs.

  • Install the Modal SDK: pip install modal
  • Authenticate the Modal CLI: modal setup
  • Headless alternative: export MODAL_TOKEN_ID=... and MODAL_TOKEN_SECRET=...
bash
pip install modal
output
Collecting modal
  Downloading modal-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: modal
Successfully installed modal-1.x.x

Step by step

Use a smaller batch size and clear GPU cache inside your Modal function to avoid out of memory errors. Here is a complete example that loads a model, runs inference with batch size 1, and clears cache after use.

python
import modal

app = modal.App()

@app.function(gpu="A10G", image=modal.Image.debian_slim().pip_install("torch", "transformers"))
def run_inference(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load smaller model to reduce memory usage
    model_name = "gpt2"  # smaller than large models
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Use batch size 1
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=50)

    # Clear GPU cache
    del inputs, model
    torch.cuda.empty_cache()

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    with app.run():
        result = run_inference.remote("Hello, Modal GPU memory!")
        print(result)
output
Hello, Modal GPU memory! (and generated text...)
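
The same idea scales to many prompts: instead of tokenizing them all at once, split the work into micro-batches so only a small slice sits on the GPU at a time. A minimal, framework-free sketch of the batching logic (the helper name and batch size of 2 are illustrative):

```python
from typing import Iterator, List

def micro_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

prompts = ["a", "b", "c", "d", "e"]
print(list(micro_batches(prompts, batch_size=2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Inside the Modal function you would run inference on each slice in turn, clearing the cache between slices if needed.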

Common variations

You can also try these variations to reduce GPU memory usage:

  • Use mixed precision by wrapping inference in torch.autocast("cuda").
  • Switch to a smaller or quantized model if one is available.
  • Use CPU instead of GPU by dropping gpu="A10G" from @app.function and loading the model on "cpu".
  • Stream outputs to reduce the memory footprint.
python
import modal

app = modal.App()

@app.function(gpu="A10G", image=modal.Image.debian_slim().pip_install("torch", "transformers"))
def run_inference_mixed_precision(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad(), torch.autocast("cuda"):
        outputs = model.generate(**inputs, max_length=50)

    del inputs, model
    torch.cuda.empty_cache()

    return tokenizer.decode(outputs[0], skip_special_tokens=True)
output
Hello, Modal GPU memory! (and generated text...)
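
The streaming variation deserves a sketch of its own. In transformers you would use TextIteratorStreamer; stripped down to plain Python (the whitespace tokenizer below is purely illustrative), the idea is to consume output piece by piece instead of materializing it all at once:

```python
from typing import Iterator

def stream_tokens(text: str) -> Iterator[str]:
    """Yield one token at a time instead of building the full output first."""
    for token in text.split():
        yield token

# Only one token is held by the consumer at any moment.
for tok in stream_tokens("hello from a streamed response"):
    print(tok)
```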

Troubleshooting

If you still get CUDA out of memory errors:

  • Reduce batch size or input length further.
  • Restart your Modal function to clear stale GPU memory.
  • Check for memory leaks by ensuring variables are deleted and cache cleared.
  • Inspect runtime errors with modal app logs.
  • Consider using a GPU with more memory or switch to CPU for testing.
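
A common defensive pattern ties the troubleshooting steps above together: catch the out-of-memory error and retry with a halved batch size. The sketch below uses a stand-in workload and catches Python's MemoryError; in real code you would catch torch.cuda.OutOfMemoryError instead:

```python
def run_with_backoff(run, batch_size: int, min_batch_size: int = 1):
    """Retry run(batch_size), halving batch_size on each out-of-memory error."""
    while batch_size >= min_batch_size:
        try:
            return run(batch_size)
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            batch_size //= 2
    raise MemoryError("out of memory even at the minimum batch size")

# Simulated workload that only fits at batch size <= 4.
def fake_inference(batch_size: int) -> int:
    if batch_size > 4:
        raise MemoryError
    return batch_size

print(run_with_backoff(fake_inference, batch_size=16))  # 4
```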

Key Takeaways

  • Reduce batch size and input length to fit GPU memory constraints in Modal functions.
  • Clear GPU memory with torch.cuda.empty_cache() after inference to prevent leaks.
  • Use smaller or quantized models to lower GPU memory usage.
  • Enable mixed precision with torch.cuda.amp.autocast() to save memory during inference.
  • Monitor Modal logs and restart functions if persistent out of memory errors occur.