How-to · Beginner · 3 min read

How to minimize Modal costs

Quick answer
To minimize Modal costs, request only the CPU, GPU, and memory your functions actually need, and build on lightweight container images. Batch many inputs into fewer invocations and keep deployed functions warm so you don't pay for repeated cold starts or unnecessary runtime.

PREREQUISITES

  • Python 3.8+
  • Modal account with API token
  • pip install modal

Setup

Install the modal Python package, then authenticate by running modal setup (or, in non-interactive environments such as CI, by setting the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables).

bash
pip install modal
output
Collecting modal
  Downloading modal-1.0.0-py3-none-any.whl (50 kB)
Installing collected packages: modal
Successfully installed modal-1.0.0
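
Authentication is handled by the Modal CLI; for non-interactive environments the token can instead come from environment variables. The token values below are placeholders, not real credentials:

```shell
# Interactive: opens a browser and writes credentials to ~/.modal.toml
modal setup

# Non-interactive (e.g. CI): export your token ID and secret
export MODAL_TOKEN_ID=ak-...       # placeholder
export MODAL_TOKEN_SECRET=as-...   # placeholder
```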

Step by step

Request minimal resources, deploy a slim base image, and batch your inference calls so many inputs share warm containers instead of each one paying its own startup cost.

python
import modal

# Authenticate beforehand with `modal setup`, or set MODAL_TOKEN_ID
# and MODAL_TOKEN_SECRET in the environment.

app = modal.App("cost-optimized-app")

# Request only what the workload needs: a slim image, one CPU, modest
# memory. Omit `gpu=` entirely unless the model actually needs one.
@app.function(cpu=1, memory=2048, image=modal.Image.debian_slim())
def run_inference(prompt: str) -> str:
    # Simulate lightweight inference
    return f"Processed: {prompt}"

@app.local_entrypoint()
def main():
    # Fan a batch of prompts out with .map() instead of one call per
    # invocation; inputs share warm containers, so you pay for fewer
    # cold starts. Run this file with `modal run`.
    prompts = ["Hello", "How are you?", "Modal cost optimization"]
    for result in run_inference.map(prompts):
        print(result)
output
Processed: Hello
Processed: How are you?
Processed: Modal cost optimization
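
To see why batching matters, here is a back-of-the-envelope cost model. The per-second rate and cold-start duration are made-up illustrative numbers, not Modal's actual pricing:

```python
# Illustrative cost model: a per-second compute rate plus a cold-start
# penalty paid once per container spin-up. All numbers are made up.
RATE_PER_SEC = 0.000306   # hypothetical $/s for a small CPU container
COLD_START_SEC = 5.0      # hypothetical cold-start duration
WORK_SEC = 0.2            # per-request compute time

def cost(n_requests: int, containers: int) -> float:
    """Billed seconds times rate: each container pays one cold start."""
    billed = containers * COLD_START_SEC + n_requests * WORK_SEC
    return billed * RATE_PER_SEC

# 100 requests: one fresh container per request vs. one warm container
naive = cost(100, containers=100)
batched = cost(100, containers=1)
print(f"naive: ${naive:.4f}, batched: ${batched:.4f}")
```

Under these assumptions the batched run spends almost all of its billed time on real work, while the naive run spends most of it on cold starts.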

Common variations

You can use asynchronous functions to issue many remote calls concurrently, reducing total wall-clock time. Prefer CPU-only functions when a GPU isn't required, and note that Modal caches image builds, so redeploying an unchanged image is fast.

python
import asyncio

import modal

app = modal.App("async-cost-optimized")

@app.function(cpu=1, memory=1024)
async def async_inference(prompt: str) -> str:
    # Simulate I/O-bound async processing
    await asyncio.sleep(0.1)
    return f"Async processed: {prompt}"

@app.local_entrypoint()
async def main():
    prompts = ["Async Hello", "Async World"]
    # Launch both remote calls concurrently via the async call
    # interface, then await them together. Run with `modal run`.
    results = await asyncio.gather(
        *(async_inference.remote.aio(p) for p in prompts)
    )
    for r in results:
        print(r)
output
Async processed: Async Hello
Async processed: Async World
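
The concurrency win is easy to demonstrate with plain asyncio, no Modal account needed: ten 0.1 s calls gathered together finish in roughly 0.1 s of wall-clock time instead of about 1 s sequentially.

```python
import asyncio
import time

async def fake_inference(prompt: str) -> str:
    # Stand-in for an I/O-bound remote call
    await asyncio.sleep(0.1)
    return f"done: {prompt}"

async def run_concurrent(prompts):
    # All calls sleep at the same time, so total wait ~= one call
    return await asyncio.gather(*(fake_inference(p) for p in prompts))

prompts = [f"p{i}" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(run_concurrent(prompts))
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```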

Troubleshooting

If you notice unexpectedly high costs, check for frequent cold starts and over-provisioned resources. Lower memory or GPU allocation where utilization is low, and only keep warm capacity where latency justifies the idle cost. Use Modal's dashboard to track per-function usage and tune accordingly.
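
As a sketch of cold-start tuning (parameter names here assume Modal 1.0; check the current API reference): a longer idle window means fewer cold starts but more idle billing, while a warm pool pre-pays for capacity to cut latency.

```python
import modal

app = modal.App("warm-tuned")

# scaledown_window: how long an idle container lingers before shutdown
#   (longer = fewer cold starts, more idle billing).
# min_containers: size of an always-warm pool you pay for continuously;
#   keep it at 0 unless latency is critical. Tune both per function.
@app.function(cpu=1, memory=1024, scaledown_window=120, min_containers=0)
def handler(x: str) -> str:
    return f"Handled: {x}"
```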

Key Takeaways

  • Specify minimal CPU, GPU, and memory in @app.function so you don't pay for idle capacity.
  • Batch inputs (e.g. with .map()) so many requests share warm containers and fewer cold starts.
  • Use asynchronous functions for concurrent remote calls to cut total wall-clock time.
  • Keep container images slim; Modal caches image builds, so redeployments stay fast.
  • Monitor usage in the Modal dashboard and adjust resource allocation accordingly.
Verified 2026-04