How-to · Beginner · 3 min read

How to minimize Modal costs

Quick answer
To minimize Modal costs, request only the CPU, GPU, and memory your functions actually need, and build on lightweight container images. Batch many inputs into fewer invocations and keep deployed functions warm so you don't pay for repeated cold starts or unnecessary runtime.

PREREQUISITES

  • Python 3.8+
  • Modal account with API token
  • pip install modal

Setup

Install the modal Python package, then authenticate by running modal setup (or, in non-interactive environments such as CI, by setting the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables).

bash
pip install modal
output
Collecting modal
  Downloading modal-1.0.0-py3-none-any.whl (50 kB)
Installing collected packages: modal
Successfully installed modal-1.0.0
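
Authentication is handled by the Modal CLI; for non-interactive environments the token can instead come from environment variables. The token values below are placeholders, not real credentials:

```shell
# Interactive: opens a browser and writes credentials to ~/.modal.toml
modal setup

# Non-interactive (e.g. CI): export your token ID and secret
export MODAL_TOKEN_ID=ak-...       # placeholder
export MODAL_TOKEN_SECRET=as-...   # placeholder
```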

Step by step

Request minimal resources, deploy a slim base image, and batch your inference calls so many inputs share warm containers instead of each one paying its own startup cost.

python
import modal

# Authenticate beforehand with `modal setup`, or set MODAL_TOKEN_ID
# and MODAL_TOKEN_SECRET in the environment.

app = modal.App("cost-optimized-app")

# Request only what the workload needs: a slim image, one CPU, modest
# memory. Omit `gpu=` entirely unless the model actually needs one.
@app.function(cpu=1, memory=2048, image=modal.Image.debian_slim())
def run_inference(prompt: str) -> str:
    # Simulate lightweight inference
    return f"Processed: {prompt}"

@app.local_entrypoint()
def main():
    # Fan a batch of prompts out with .map() instead of one call per
    # invocation; inputs share warm containers, so you pay for fewer
    # cold starts. Run this file with `modal run`.
    prompts = ["Hello", "How are you?", "Modal cost optimization"]
    for result in run_inference.map(prompts):
        print(result)
output
Processed: Hello
Processed: How are you?
Processed: Modal cost optimization
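
To see why batching matters, here is a back-of-the-envelope cost model. The per-second rate and cold-start duration are made-up illustrative numbers, not Modal's actual pricing:

```python
# Illustrative cost model: a per-second compute rate plus a cold-start
# penalty paid once per container spin-up. All numbers are made up.
RATE_PER_SEC = 0.000306   # hypothetical $/s for a small CPU container
COLD_START_SEC = 5.0      # hypothetical cold-start duration
WORK_SEC = 0.2            # per-request compute time

def cost(n_requests: int, containers: int) -> float:
    """Billed seconds times rate: each container pays one cold start."""
    billed = containers * COLD_START_SEC + n_requests * WORK_SEC
    return billed * RATE_PER_SEC

# 100 requests: one fresh container per request vs. one warm container
naive = cost(100, containers=100)
batched = cost(100, containers=1)
print(f"naive: ${naive:.4f}, batched: ${batched:.4f}")
```

Under these assumptions the batched run spends almost all of its billed time on real work, while the naive run spends most of it on cold starts.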

Common variations

You can use asynchronous functions to issue many remote calls concurrently, reducing total wall-clock time. Prefer CPU-only functions when a GPU isn't required, and note that Modal caches image builds, so redeploying an unchanged image is fast.

python
import asyncio

import modal

app = modal.App("async-cost-optimized")

@app.function(cpu=1, memory=1024)
async def async_inference(prompt: str) -> str:
    # Simulate I/O-bound async processing
    await asyncio.sleep(0.1)
    return f"Async processed: {prompt}"

@app.local_entrypoint()
async def main():
    prompts = ["Async Hello", "Async World"]
    # Launch both remote calls concurrently via the async call
    # interface, then await them together. Run with `modal run`.
    results = await asyncio.gather(
        *(async_inference.remote.aio(p) for p in prompts)
    )
    for r in results:
        print(r)
output
Async processed: Async Hello
Async processed: Async World
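
The concurrency win is easy to demonstrate with plain asyncio, no Modal account needed: ten 0.1 s calls gathered together finish in roughly 0.1 s of wall-clock time instead of about 1 s sequentially.

```python
import asyncio
import time

async def fake_inference(prompt: str) -> str:
    # Stand-in for an I/O-bound remote call
    await asyncio.sleep(0.1)
    return f"done: {prompt}"

async def run_concurrent(prompts):
    # All calls sleep at the same time, so total wait ~= one call
    return await asyncio.gather(*(fake_inference(p) for p in prompts))

prompts = [f"p{i}" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(run_concurrent(prompts))
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```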

Troubleshooting

If you notice unexpectedly high costs, check for frequent cold starts and over-provisioned resources. Lower memory or GPU allocation where utilization is low, and only keep warm capacity where latency justifies the idle cost. Use Modal's dashboard to track per-function usage and tune accordingly.
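
As a sketch of cold-start tuning (parameter names here assume Modal 1.0; check the current API reference): a longer idle window means fewer cold starts but more idle billing, while a warm pool pre-pays for capacity to cut latency.

```python
import modal

app = modal.App("warm-tuned")

# scaledown_window: how long an idle container lingers before shutdown
#   (longer = fewer cold starts, more idle billing).
# min_containers: size of an always-warm pool you pay for continuously;
#   keep it at 0 unless latency is critical. Tune both per function.
@app.function(cpu=1, memory=1024, scaledown_window=120, min_containers=0)
def handler(x: str) -> str:
    return f"Handled: {x}"
```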

Key Takeaways

  • Specify minimal CPU, GPU, and memory in @app.function so you don't pay for idle capacity.
  • Batch inputs (e.g. with .map()) so many requests share warm containers and fewer cold starts.
  • Use asynchronous functions for concurrent remote calls to cut total wall-clock time.
  • Keep container images slim; Modal caches image builds, so redeployments stay fast.
  • Monitor usage in the Modal dashboard and adjust resource allocation accordingly.
Verified 2026-04