How-to · Beginner · 3 min read

How to cache models in Modal

Quick answer
Attach a persistent modal.Volume to your Modal function and point your model library's cache directory (e.g., the cache_dir parameter in Hugging Face transformers) at the mount path. Model files are then downloaded once and reused on every subsequent invocation, cutting cold-start time and bandwidth costs.
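The core idea, independent of Modal's API, is "check a persistent directory before downloading." A minimal plain-Python sketch of the pattern (the temporary directory here is just a stand-in for a volume mounted at /cache):

```python
import os
import tempfile

# Stand-in for a persistent Modal volume mounted at /cache
CACHE_DIR = tempfile.mkdtemp()

def fetch_model(name):
    """Return the path to a cached model file, 'downloading' it only once."""
    path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(path):
        # First call: the expensive download (simulated by writing a file)
        with open(path, "w") as f:
            f.write("model weights")
    return path

first = fetch_model("gpt2.bin")   # triggers the "download"
second = fetch_model("gpt2.bin")  # cache hit, no download
print(first == second)  # True
```

Everything below applies this same check-before-download pattern, with Modal's volume providing the persistence across containers.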

Prerequisites

  • Python 3.8+
  • Modal account and CLI installed
  • Modal Python package installed (pip install modal)
  • Basic knowledge of Python and AI model loading

Setup

Install the modal Python package and authenticate with your Modal account. If your model requires an access token (e.g., gated Hugging Face models), store it as a Modal secret rather than hardcoding it.

bash
pip install modal
modal setup

Step by step

This example shows how to cache a Hugging Face model inside a Modal function using a persistent volume. The volume stores the model files so they are downloaded only once and reused on subsequent runs.

python
import modal

# Create a Modal app (modal.Stub was renamed to modal.App)
app = modal.App("model-cache-example")

# Create (or look up) a persistent volume for caching
cache_volume = modal.Volume.from_name("model-cache-volume", create_if_missing=True)

@app.function(
    gpu="A10G",
    image=modal.Image.debian_slim().pip_install("transformers", "torch"),
    volumes={"/cache": cache_volume},
)
def load_model():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"
    cache_dir = "/cache/huggingface"

    # Point the Hugging Face cache at the volume so weights are
    # downloaded only on the first run and read from disk afterwards
    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)

    # Persist newly written files to the volume
    cache_volume.commit()

    # Simple inference
    inputs = tokenizer("Hello, Modal!", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=20)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

@app.local_entrypoint()
def main():
    print(load_model.remote())
output
Hello, Modal! ... (the exact continuation depends on the model and decoding settings)
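To verify the cache is paying off, time the model load on a cold run (empty volume) against a warm run. A small timing helper like the one below (plain Python, usable inside load_model) is enough; the lambda here is just a cheap stand-in for an expensive load:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for an expensive model load
result, elapsed = timed(lambda: sum(range(1000)))
print(result)  # 499500
```

On a warm run with the volume populated, from_pretrained should complete in a fraction of the cold-run time.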

Common variations

  • Use modal.Secret to pass API keys securely if your model requires authentication (e.g., gated Hugging Face repos).
  • For CPU-only usage, omit the gpu parameter.
  • Cache other model types the same way: attach a volume and point the library's cache path at the mount.
  • For caching small computed results (rather than files), a named modal.Dict persists values across invocations.
python
import modal

app = modal.App("dict-cache-example")

# A named, persistent key-value store shared across invocations
results = modal.Dict.from_name("result-cache", create_if_missing=True)

@app.function()
def cached_double(x):
    key = str(x)
    try:
        return results[key]  # cache hit: skip recomputation
    except KeyError:
        pass
    value = x * 2
    results[key] = value  # store for future calls
    return value

@app.local_entrypoint()
def main():
    print(cached_double.remote(10))
output
20

Troubleshooting

  • If model files are not persisting, confirm the volume is attached via the volumes={...} argument and call cache_volume.commit() after writing so changes are saved.
  • Check your function's logs in the Modal dashboard for volume mount errors.
  • Verify you have enough volume storage quota in your workspace.
  • For very large models, use a dedicated volume per model or external storage such as a cloud bucket mount.
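When debugging, it also helps to inspect what is actually in the cache. A small helper (plain Python; run it inside any function that mounts the volume at /cache) reports file count and total size. The demo below uses a throwaway directory standing in for the mount:

```python
import os
import tempfile

def cache_report(root):
    """Return (file_count, total_bytes) for everything under root."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total

# Demo against a temporary directory standing in for /cache
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "weights.bin"), "wb") as f:
    f.write(b"x" * 1024)
print(cache_report(demo))  # (1, 1024)
```

An empty report after a run that should have populated the cache usually means writes went to the container's ephemeral filesystem instead of the mounted volume, or were never committed.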

Key Takeaways

  • Use Modal persistent volumes to cache model files across function invocations.
  • Attach volumes with the volumes={...} argument to store and reuse large models efficiently.
  • The cache_dir parameter in model-loading libraries such as transformers directs downloads into the mounted volume.
  • A named modal.Dict provides simple cross-invocation caching for small computed results.
  • Monitor volume usage and function logs to troubleshoot caching issues.
Verified 2026-04 · gpt2