How to minimize Modal costs
Quick answer
To minimize Modal costs, right-size each function's resources by requesting only the GPU, CPU, and memory the workload actually needs, and build on lightweight container images. Batch requests and reuse your deployed app, rather than redeploying per call, to avoid repeated cold starts and keep billed runtime short.
PREREQUISITES
- Python 3.8+
- Modal account with API token
- pip install modal
Setup
Install the modal Python package and authenticate. Modal uses a token ID and secret rather than a single API key: run modal token new to store them locally, or export them as the MODAL_TOKEN_ID and MODAL_TOKEN_SECRET environment variables (useful in CI).
pip install modal

output

Collecting modal
  Downloading modal-1.0.0-py3-none-any.whl (50 kB)
Installing collected packages: modal
Successfully installed modal-1.0.0
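As an alternative to exporting environment variables, the Modal CLI can store credentials for you. A minimal sketch (the interactive flow requires a Modal account, so there is no output to verify here):

```shell
# Authenticate this machine once; this opens a browser flow and writes
# the token to ~/.modal.toml. In CI, export MODAL_TOKEN_ID and
# MODAL_TOKEN_SECRET instead of running this command.
modal token new
```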
Step by step
Right-size the resource specification on each function and reuse the deployed app instead of redeploying per call. Build on a slim image, and batch inference calls with .map() so many inputs share warm containers.
import modal

app = modal.App("cost-optimized-app")

# Request only what the workload needs: one CPU, modest memory, a
# mid-tier GPU, and a slim base image that pulls quickly on cold start.
@app.function(gpu="A10G", cpu=1, memory=2048, image=modal.Image.debian_slim())
def run_inference(prompt: str) -> str:
    # Simulate lightweight inference
    return f"Processed: {prompt}"

@app.local_entrypoint()
def main():
    # Batch prompts through .map() so they fan out over a small pool of
    # warm containers instead of paying per-invocation overhead.
    # Run this file with `modal run`.
    prompts = ["Hello", "How are you?", "Modal cost optimization"]
    for result in run_inference.map(prompts):
        print(result)

output

Processed: Hello
Processed: How are you?
Processed: Modal cost optimization
Common variations
You can use asynchronous functions to handle multiple requests concurrently, further reducing wall-clock runtime. Also, consider CPU-only functions when a GPU is not required, and rely on Modal's image build caching so unchanged images are not rebuilt on every deploy.
import asyncio

import modal

app = modal.App("async-cost-optimized")

@app.function(cpu=1, memory=1024)
async def async_inference(prompt: str) -> str:
    # Simulate async processing
    await asyncio.sleep(0.1)
    return f"Async processed: {prompt}"

@app.local_entrypoint()
async def main():
    prompts = ["Async Hello", "Async World"]
    # .remote.aio() returns an awaitable, so the calls run concurrently
    tasks = [async_inference.remote.aio(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r)

output

Async processed: Async Hello
Async processed: Async World
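The CPU-only and image-caching points above can be sketched as follows. The package pin is illustrative, and running the snippet assumes the modal package and an authenticated account, so treat it as a configuration sketch rather than a tested program:

```python
import modal

app = modal.App("cpu-only-app")

# Pinning dependencies keeps the image definition stable, so Modal's
# build cache can reuse it across deploys instead of rebuilding.
image = modal.Image.debian_slim().pip_install("numpy==1.26.4")

# No gpu= argument: dropping the GPU is usually the largest single
# saving when the model runs acceptably on CPU.
@app.function(cpu=1, memory=1024, image=image)
def cpu_inference(prompt: str) -> str:
    return f"CPU processed: {prompt}"
```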
Troubleshooting
If you notice unexpectedly high costs, check for frequent cold starts: keep one deployed app serving requests instead of redeploying per call, and avoid over-provisioning resources. Monitor function runtimes in Modal's dashboard and reduce memory or GPU allocation where utilization is low.
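To see why cold starts dominate cost, a back-of-the-envelope model helps. The per-second rate, cold-start duration, and work time below are hypothetical placeholders, not Modal's actual pricing:

```python
# Hypothetical figures for illustration only, not real Modal pricing.
GPU_RATE_PER_SEC = 0.0003  # placeholder $/s for a mid-tier GPU
COLD_START_SEC = 8.0       # billed container spin-up per cold start
WORK_SEC = 0.5             # actual inference time per request

def estimated_cost(requests: int, cold_starts: int) -> float:
    """Billed seconds x rate: spin-up time plus real work."""
    billed = cold_starts * COLD_START_SEC + requests * WORK_SEC
    return billed * GPU_RATE_PER_SEC

# 100 requests: a cold start on every call vs. one warm, reused container.
naive = estimated_cost(100, cold_starts=100)
warm = estimated_cost(100, cold_starts=1)
print(f"cold every call: ${naive:.4f}  warm container: ${warm:.4f}")
```

With these placeholder numbers, spin-up time accounts for over 90% of the naive total, which is why reusing warm containers is the first lever to pull.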
Key Takeaways
- Specify minimal CPU, GPU, and memory resources in @app.function to reduce costs.
- Reuse one deployed app and its warm containers to avoid repeated cold starts and deployment overhead.
- Batch multiple requests with .map() to amortize per-invocation overhead.
- Use asynchronous functions for concurrent processing to reduce total wall-clock runtime.
- Monitor usage via Modal's dashboard and adjust resource allocation accordingly.