Replicate API Cheat Sheet — Run ML Models in Production
HTTP API to run open-source ML models on serverless infrastructure.
Like calling Uber instead of owning a car fleet. You request a model run, it executes on shared hardware, you get results back: pay per execution.
Core Patterns
import replicate

# The client reads REPLICATE_API_TOKEN from the environment.
output = replicate.run(
    "stability-ai/stable-diffusion-v3",
    input={
        "prompt": "a cat wearing sunglasses",
        "num_outputs": 1,
        "height": 768,
        "width": 768,
    },
)
print(output)  # a URL or a list of outputs, depending on the model
Output:
["https://cdn.replicate.com/output_uuid.jpg"]

import replicate
import time

# Start the prediction without waiting for the result.
# The client reads REPLICATE_API_TOKEN from the environment.
prediction = replicate.predictions.create(
    version="deepseek-ai/deepseek-v2-chat",
    input={
        "prompt": "Explain quantum computing in 500 words",
        "max_tokens": 500,
    },
)
print(f"Prediction ID: {prediction.id}")
print(f"Status: {prediction.status}")

# Poll until the prediction reaches a terminal state.
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(2)
    prediction = replicate.predictions.get(prediction.id)
    print(f"Status: {prediction.status}")

if prediction.status == "succeeded":
    print(f"Output: {prediction.output}")
else:
    print(f"Error: {prediction.error}")
Output:
Status: processing
Status: processing
Status: succeeded
Output: ['Quantum computing leverages...']

import replicate
import os

# Start a prediction that reports back through a webhook.
prediction = replicate.predictions.create(
    version="openai/whisper",
    input={
        "audio": "https://example.com/audio.mp3",
        "model": "large",
        "language": "en",
    },
    # Replicate POSTs to this URL.
    webhook=os.environ["WEBHOOK_URL"],
    # "completed" fires once when the prediction reaches a terminal state
    # (succeeded, failed, or canceled), not only on success.
    webhook_events_filter=["completed"],
)
print(f"Submitted: {prediction.id}")
print(f"Webhook will POST to: {os.environ['WEBHOOK_URL']}")
Output:
Submitted: 12abc345def
Webhook will POST to: https://api.example.com/webhooks/replicate

import replicate
import time

# Create a batch prediction.
batch = replicate.batch.create(
    version="stability-ai/stable-diffusion-v3",
    inputs=[
        {"prompt": "a dog", "num_outputs": 1},
        {"prompt": "a cat", "num_outputs": 1},
        {"prompt": "a bird", "num_outputs": 1},
    ],
)
print(f"Batch ID: {batch.id}")
print(f"Predictions: {len(batch.predictions)}")

# Poll the batch status until it completes.
while not batch.completed_at:
    batch = replicate.batch.get(batch.id)
    print(f"Succeeded: {batch.succeeded_count}, Failed: {batch.failed_count}")
    time.sleep(3)

for pred in batch.predictions:
    print(f"{pred.input['prompt']}: {pred.output}")
Output:
Batch ID: 8a9b0c1d2e3f4g5h
Succeeded: 3, Failed: 0

import replicate
# Pinning the version ID keeps the model version fixed across runs.
output = replicate.run(
    "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21aef6f8f3a9eaa50d",
    input={
        "prompt": "a red dragon",
        "num_inference_steps": 50,
    },
)
print(output)
Output:
["https://cdn.replicate.com/output.jpg"]

import replicate
prediction = replicate.predictions.create(
    version="meta/llama-2-70b-chat",
    input={"prompt": "long task"},
)
print(f"Running: {prediction.id}")

# Cancel it.
replicate.predictions.cancel(prediction.id)
print("Canceled")
Output:
Running: pred_123
Canceled

Common Input Parameters by Model Type
Prediction Parameters
| Parameter | Type | Default | Notes |
|---|---|---|---|
| prompt | string | required | Text description (image gen, text models). Max 2000 chars typically. |
| num_outputs | int | 1 | Number of results to return. Image models: 1-4. More = higher cost. |
| num_inference_steps | int | 20-50 | Quality/speed tradeoff. 20=fast/blurry, 50+=slow/sharp. Image models only. |
| guidance_scale | float | 7.5 | Prompt adherence (image gen). 1.0=ignore prompt, 20+=over-literal. |
| seed | int | random | Deterministic output. Same seed + same params = identical output. |
| temperature | float | 0.8 | Text generation randomness. 0=deterministic, 1.0=creative. LLMs only. |
| max_tokens | int | model-specific | Max output length (LLMs). Affects cost and latency. |
| webhook | string | null | Callback URL that receives a POST when the prediction is done. Must be HTTPS. |
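The ranges above can be checked client-side before a request is sent, which avoids a round trip that ends in a ValidationError. The following is a minimal sketch, not part of the Replicate client: `check_image_input` is a hypothetical helper, and the limits it enforces (1-4 outputs, dimensions in multiples of 64) are assumptions taken from this sheet. Real limits come from each model's input schema.

```python
def check_image_input(params: dict) -> list[str]:
    """Return a list of problems found in an image-model input dict."""
    problems = []
    prompt = params.get("prompt")
    if not isinstance(prompt, str) or not prompt:
        problems.append("prompt is required")
    # Assumed limit from the parameter table: 1-4 outputs per request.
    if not 1 <= params.get("num_outputs", 1) <= 4:
        problems.append("num_outputs must be between 1 and 4")
    # Assumed constraint: dimensions must be multiples of 64.
    for key in ("height", "width"):
        value = params.get(key)
        if value is not None and value % 64 != 0:
            problems.append(f"{key} must be a multiple of 64")
    return problems


good = {"prompt": "a cat wearing sunglasses", "num_outputs": 1, "height": 768, "width": 768}
bad = {"prompt": "", "num_outputs": 8, "height": 700}
print(check_image_input(good))  # -> []
print(check_image_input(bad))
```

An empty list means the input passed every check; anything else can be logged or raised before the request is sent.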
API Methods Reference
| Method / Property | Description | Returns |
|---|---|---|
| replicate.run(model, input, wait=True) | Synchronous blocking call. Waits up to 5 min for the result. Simplest for <30 sec tasks. | Model output (list, dict, string depending on model) |
| replicate.predictions.create(version, input, webhook) | Start async prediction. Returns immediately with prediction object. Poll status or use webhook. | Prediction object with id, status, output, error fields |
| replicate.predictions.get(prediction_id) | Fetch prediction status/results by ID. Free call. | Prediction object (status: starting/processing/succeeded/failed/canceled) |
| replicate.predictions.cancel(prediction_id) | Stop in-flight prediction. Stops billing immediately. | Prediction object with status=canceled |
| replicate.batch.create(version, inputs) | Submit 100+ predictions of same model. 50% cheaper but 1-3 hr latency. | Batch object with id, predictions list, completed_at |
| replicate.batch.get(batch_id) | Check batch status and individual prediction results. | Batch object with succeeded_count, failed_count, predictions |
Common Errors & Fixes
AuthenticationError: Invalid API token
Cause: REPLICATE_API_TOKEN env var is missing, empty, or wrong, or the token has expired or been revoked.
export REPLICATE_API_TOKEN="r8_..."
# OR pass the token explicitly through a client:
client = replicate.Client(api_token=os.environ.get("REPLICATE_API_TOKEN"))
client.run(
    "model/name",
    input={...},
)
# Verify token at https://replicate.com/account/api-tokens

RequestError: 404 Not Found (version not found)
Cause: Typo in the model name or version ID, a deprecated or private model, or an old version ID that no longer exists.
# Check that the model exists and get its latest version ID:
import os
import requests
r = requests.get(
    "https://api.replicate.com/v1/models/stability-ai/stable-diffusion",
    headers={"Authorization": f"Token {os.environ['REPLICATE_API_TOKEN']}"},
)
print(r.json()["latest_version"]["id"])
# Use the correct format: owner/model or owner/model:version
output = replicate.run(
    "stability-ai/stable-diffusion-v3",  # correct
    input={"prompt": "test"},
)

TimeoutError: Prediction timed out after 5 minutes
Cause: Sync run() used for a slow model (video processing, large LLM) that exceeds the 5 min time limit.
# Switch to async polling:
import time
prediction = replicate.predictions.create(
    version="openai/whisper",
    input={"audio": "https://example.com/long-video.mp4"},
)
# Include "canceled" so the loop cannot spin forever on a canceled prediction.
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(5)  # poll every 5 sec
    prediction = replicate.predictions.get(prediction.id)
if prediction.status == "succeeded":
    print(prediction.output)
else:
    print(f"Failed: {prediction.error}")

ValidationError: Invalid input: height must be multiple of 64
Cause: Model input constraints violated (dimensions, ranges, enums). Read the model's input schema.
# Get the model's input schema:
import os
import requests
resp = requests.get(
    "https://api.replicate.com/v1/models/stability-ai/stable-diffusion-v3",
    headers={"Authorization": f"Token {os.environ['REPLICATE_API_TOKEN']}"},
)
latest = resp.json()["latest_version"]
print(latest["openapi_schema"]["components"]["schemas"]["Input"])
# Then match the constraints:
output = replicate.run(
    "stability-ai/stable-diffusion-v3",
    input={
        "prompt": "test",
        "height": 768,  # must be 512, 576, 640, 704, 768, etc.
        "width": 768,
    },
)

InsufficientCreditsError or RateLimitError: You have exceeded your credits/rate limit
Cause: Account out of credits, or request rate limit hit (1000 req/min free tier). A batch job can also fail its billing check.
# Check account status:
replicate.account.get(api_token=os.environ["REPLICATE_API_TOKEN"])
# Add a delay between requests:
import time
for prompt in prompts:
    output = replicate.run(...)
    time.sleep(1)  # 1 sec between requests
# Or use the batch API for bulk work (50% discount):
batch = replicate.batch.create(
    version="model-id",
    inputs=[{"prompt": p} for p in prompts],
)

Production Gotchas
replicate.run() blocks your process for up to 5 min. Slower models (video, large LLMs) will time out and fail. For anything over 30 sec, use async predictions.create() with polling or webhooks instead. This is the #1 production failure.
Pinning a version ID (the long sha-256 hash) guarantees reproducibility. Using the short name like 'stability-ai/stable-diffusion' always pulls the latest version. In production, always pin to a specific version ID after testing. The short name can break your output determinism.
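A pinned reference can be told apart from a short name mechanically, which makes the rule easy to enforce at startup or in CI. The sketch below assumes version IDs are 64 lowercase hex characters, as in the example on this sheet; `is_pinned` is a hypothetical helper, not part of the Replicate client.

```python
import re

# owner/name, a colon, then a 64-character hex version ID (assumed format).
PINNED_REF = re.compile(r"[\w.-]+/[\w.-]+:[0-9a-f]{64}")


def is_pinned(ref: str) -> bool:
    """True when the model reference includes a full version ID."""
    return PINNED_REF.fullmatch(ref) is not None


pinned = "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21aef6f8f3a9eaa50d"
floating = "stability-ai/stable-diffusion"
print(is_pinned(pinned))    # -> True
print(is_pinned(floating))  # -> False
```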
Replicate retries failed webhooks up to 5 times. Your endpoint may receive the same prediction.id multiple times. Always check if you've already processed this ID before updating your database. Use prediction.id as idempotency key.
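The deduplication step can be sketched as a handler that records each prediction ID it has processed. This sketch assumes the webhook body is the prediction object serialized as JSON with `id`, `status`, and `output` fields. The in-memory set and the `save_row` callback are illustrative stand-ins for a database table with a unique constraint on the ID.

```python
import json


def handle_webhook(body: bytes, processed_ids: set, save_result) -> bool:
    """Process a webhook payload once; return False for a repeat delivery."""
    payload = json.loads(body)
    prediction_id = payload["id"]
    if prediction_id in processed_ids:
        return False  # already handled, ignore the retry
    save_result(prediction_id, payload.get("status"), payload.get("output"))
    processed_ids.add(prediction_id)
    return True


saved = []


def save_row(prediction_id, status, output):
    # Hypothetical persistence step; a real handler would write to a database.
    saved.append((prediction_id, status, output))


seen = set()
body = json.dumps(
    {"id": "12abc345def", "status": "succeeded", "output": ["https://example.com/out.jpg"]}
).encode()
first = handle_webhook(body, seen, save_row)
second = handle_webhook(body, seen, save_row)  # simulated retry of the same delivery
print(first, second, len(saved))  # -> True False 1
```

With several workers, the set lookup would need to become an atomic operation such as an insert that fails on a duplicate key. A check followed by a separate write can still let two workers through.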
Each predictions.get() call is not metered, but status may lag 1-2 sec behind actual completion. Don't poll faster than every 2 sec. Polling every 100ms wastes your time and the server's resources with no benefit.
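A polling loop with a fixed interval and an overall deadline might look like the sketch below. `poll_until_done` is a hypothetical helper: it takes any `fetch` callable, so the demo uses plain dicts and canned statuses in place of real prediction objects and runs without network access. With the real client, `fetch` would wrap the status lookup and the code would read an attribute instead of a dict key.

```python
import time

TERMINAL = {"succeeded", "failed", "canceled"}


def poll_until_done(fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch() every `interval` seconds until a terminal status or timeout."""
    waited = 0.0
    prediction = fetch()
    while prediction["status"] not in TERMINAL:
        if waited >= timeout:
            raise TimeoutError(f"still {prediction['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        prediction = fetch()
    return prediction


# Stand-in for a status lookup: replays canned statuses in order.
statuses = iter(["starting", "processing", "processing", "succeeded"])
naps = []  # records requested sleeps instead of actually sleeping
result = poll_until_done(lambda: {"status": next(statuses)}, sleep=naps.append)
print(result["status"], naps)  # -> succeeded [2.0, 2.0, 2.0]
```

Injecting `sleep` keeps the helper testable, and the default 2-second interval matches the floor suggested above.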
Batch predictions are 50% cheaper but they queue and run in bulk. You get ~1-3 hour latency. Use replicate.run() or async predictions for real-time use cases. Batch is for background jobs only.
Replicate CDN URLs live for 1 week. If you need the image or audio permanently, download it and store it in S3 or your database immediately after the prediction completes. Requests to the URL after 7 days will return a 404.
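Copying an output to storage you control can be sketched with the standard library alone. `save_output` is a hypothetical helper that streams a URL to a local path. The demo feeds it a `file://` URL so it runs offline, whereas in production the URL would be the one returned by the prediction and the destination might be an S3 upload instead.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def save_output(url: str, dest: Path) -> Path:
    """Stream the file at `url` to `dest` and return the destination path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Demo with a local file:// URL so the sketch runs without network access.
workdir = Path(tempfile.mkdtemp())
source = workdir / "output.jpg"
source.write_bytes(b"fake image bytes")
copy = save_output(source.as_uri(), workdir / "archive" / "output.jpg")
print(copy.read_bytes())  # -> b'fake image bytes'
```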
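File inputs (images, audio, video) are normally passed to replicate.run() as HTTP/HTTPS URLs. A local path string is not read from disk. Upload the file to S3 or a public CDN first and pass the URL, or, with the Python client, pass an open file handle such as open("input.jpg", "rb") and let the client upload it.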
Same seed + same params should give identical output. But if the model version itself changes (weights updated), seed no longer guarantees reproducibility. Always pin version ID for reproducible research.
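One way to make a run auditable is to store a fingerprint of everything that determines its output: the pinned model reference plus the full input dict, seed included. This is a minimal sketch; `run_fingerprint` is a hypothetical helper, and the seed value is an arbitrary example.

```python
import hashlib
import json


def run_fingerprint(model_ref: str, inputs: dict) -> str:
    """Hash the pinned model reference and inputs into a stable hex key."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps({"model": model_ref, "input": inputs}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


ref = "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21aef6f8f3a9eaa50d"
a = run_fingerprint(ref, {"prompt": "a red dragon", "seed": 42, "num_inference_steps": 50})
b = run_fingerprint(ref, {"num_inference_steps": 50, "seed": 42, "prompt": "a red dragon"})
c = run_fingerprint(ref, {"prompt": "a red dragon", "seed": 43, "num_inference_steps": 50})
print(a == b)  # -> True: same run, keys in a different order
print(a == c)  # -> False: a different seed is a different run
```

Storing the fingerprint next to each saved output shows later whether two results came from the same version, seed, and parameters.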
Replicate vs Alternatives
| Feature | Replicate | Hugging Face Inference API | AWS SageMaker |
|---|---|---|---|
| Model selection | 200+ open-source models (curated) | 100k+ models (all HF hub) | Bring your own or AWS models |
| Sync/async | Both (run + predictions.create) | Sync only (simple API) | Async (real infrastructure) |
| Pricing model | Pay per prediction + GPU time | Free tier, then per-token | Per-hour infrastructure (expensive) |
| Webhooks | Yes, included | No (use background jobs) | Yes, via SNS/Lambda |
| Batch API | Yes, 50% discount | No | Yes (SageMaker Batch Transform) |
| Setup time | 1 min (API token) | 1 min (HF token) | 30+ min (VPC, IAM, model upload) |
| Latency | 2-30 sec typical (depends on model) | 1-5 sec (T4 inference) | 5-60 sec (cold start) |