Replicate API Cheat Sheet — Run ML Models in Production

version 2024.12

Run open-source AI models via simple API

Auth env var: REPLICATE_API_TOKEN

Install: pip install replicate

Core imports

python

import replicate
import os
from typing import Optional

Mental model

HTTP API to run open-source ML models on serverless infrastructure.

Like calling Uber instead of owning a car fleet. You request a model run, it executes on shared hardware, you get results back: pay per execution.

Core Patterns

01 Simple Text/Image Prediction

Single sync request → get result immediately

python

import replicate

# The client reads REPLICATE_API_TOKEN from the environment;
# run() does not take an api_token argument.
output = replicate.run(
  "stability-ai/stable-diffusion-v3",
  input={
    "prompt": "a cat wearing sunglasses",
    "num_outputs": 1,
    "height": 768,
    "width": 768,
  },
)

print(output[0])  # URL or list of outputs

output ["https://cdn.replicate.com/output_uuid.jpg"]

Times out after 5 min if model takes longer. Use async polling or webhooks for slow models (LLMs, video).

02 Async Polling for Long Tasks

Model takes >30 sec (video, LLM); need real-time progress

python

import replicate
import time

# Start prediction, don't wait.
# model= takes "owner/name"; version= expects a version ID instead.
prediction = replicate.predictions.create(
  model="deepseek-ai/deepseek-v2-chat",
  input={
    "prompt": "Explain quantum computing in 500 words",
    "max_tokens": 500,
  },
)

print(f"Prediction ID: {prediction.id}")
print(f"Status: {prediction.status}")

# Poll until the prediction reaches a terminal state
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(2)
    prediction = replicate.predictions.get(prediction.id)
    print(f"Status: {prediction.status}")

if prediction.status == "succeeded":
    print(f"Output: {prediction.output}")
else:
    print(f"Error: {prediction.error}")

output

Status: processing
Status: processing
Status: succeeded
Output: ['Quantum computing leverages...']

Polling too fast risks hitting the rate limit. Sleep at least 1-2 sec between checks. Status checks are not billed, but hammering the endpoint gains nothing.
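
The polling advice above can be wrapped in a small helper that backs off between checks and gives up after a deadline. This is a minimal sketch, not part of the Replicate client: `poll_until_done` and its parameters are hypothetical names, and the status source is passed in as a callable so the loop does not depend on any particular client version.

```python
import time
from typing import Callable

TERMINAL = {"succeeded", "failed", "canceled"}

def poll_until_done(fetch_status: Callable[[], str],
                    interval: float = 2.0,
                    max_interval: float = 10.0,
                    timeout: float = 600.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch_status() until it returns a terminal status.

    The wait between checks grows by 1.5x up to max_interval.
    Raises TimeoutError once the accumulated wait reaches `timeout`.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 1.5, max_interval)
```

With the client from the example above, the callable would be something like `lambda: replicate.predictions.get(prediction.id).status`.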

03 Webhooks for Async Completion

Fire-and-forget; callback when done. Best for background jobs.

python

import replicate
import os

# Start prediction with webhook
prediction = replicate.predictions.create(
  model="openai/whisper",
  input={
    "audio": "https://example.com/audio.mp3",
    "model": "large",
    "language": "en"
  },
  webhook=os.environ["WEBHOOK_URL"],  # POST to this URL when done
  webhook_events_filter=["completed"],  # fires on any terminal state, including failed and canceled
)

print(f"Submitted: {prediction.id}")
print(f"Webhook will POST to: {os.environ['WEBHOOK_URL']}")

output

Submitted: 12abc345def
Webhook will POST to: https://api.example.com/webhooks/replicate

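Verify the signature on every webhook POST to prevent spoofing. Replicate's webhook documentation describes an HMAC-SHA256 signature carried in the webhook-id, webhook-timestamp, and webhook-signature headers; check it before trusting the body. Replicate may retry failed webhooks.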
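
A sketch of signature checking, assuming the scheme in Replicate's webhook documentation: the signed content is `{webhook-id}.{webhook-timestamp}.{body}`, the key is the base64 part of a `whsec_...` signing secret, and the `webhook-signature` header holds space-separated `v1,<base64>` entries. The function names and the 5-minute tolerance are illustrative choices; confirm the details against the current docs before relying on this.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

def expected_signature(secret: str, webhook_id: str, timestamp: str, body: bytes) -> str:
    """Compute the base64 HMAC-SHA256 signature for one webhook delivery."""
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{webhook_id}.{timestamp}.".encode() + body
    digest = hmac.new(key, signed, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

def verify_webhook(secret: str, webhook_id: str, timestamp: str, body: bytes,
                   signature_header: str, now: Optional[float] = None,
                   tolerance_s: int = 300) -> bool:
    """Return True if any v1 signature matches and the timestamp is recent."""
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > tolerance_s:
        return False  # stale or future-dated delivery: possible replay
    expected = expected_signature(secret, webhook_id, timestamp, body)
    for entry in signature_header.split():
        version, _, sig = entry.partition(",")
        if version == "v1" and hmac.compare_digest(sig, expected):
            return True
    return False
```

Pass the raw request bytes as `body`; re-serializing parsed JSON can change the bytes and break the comparison.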

04 Batch API for Multiple Predictions

100+ predictions of same model. Much cheaper than sequential.

python

import replicate
import time

# Assumes your installed client exposes replicate.batch; verify before relying on it.
# Create batch prediction
batch = replicate.batch.create(
  version="stability-ai/stable-diffusion-v3",
  inputs=[
    {"prompt": "a dog", "num_outputs": 1},
    {"prompt": "a cat", "num_outputs": 1},
    {"prompt": "a bird", "num_outputs": 1},
  ],
)

print(f"Batch ID: {batch.id}")
print(f"Predictions: {len(batch.predictions)}")

# Poll batch status
while not batch.completed_at:
    batch = replicate.batch.get(batch.id)
    print(f"Succeeded: {batch.succeeded_count}, Failed: {batch.failed_count}")
    time.sleep(3)

for pred in batch.predictions:
    print(f"{pred.input['prompt']}: {pred.output}")

output

Batch ID: 8a9b0c1d2e3f4g5h
Succeeded: 3, Failed: 0

Batch pricing 50% cheaper but latency 1-3 hours. Cannot track individual prediction status in real-time: only batch status. Use only for non-time-critical jobs.

05 Use Specific Model Version (Pinned for Reproducibility)

Production: need deterministic results. Don't want model updates.

python

import replicate

# Pinned version ID = same model weights on every run
output = replicate.run(
  "stability-ai/stable-diffusion:ac732df83cea7fff18b8472768c88ad041fa750ff7682a21aef6f8f3a9eaa50d",
  input={
    "prompt": "a red dragon",
    "num_inference_steps": 50,
  },
)

print(output)

output ["https://cdn.replicate.com/output.jpg"]

Using the bare model identifier (e.g., 'stability-ai/stable-diffusion') resolves to the latest version, so results can change without notice. Old versions are eventually deprecated. Pin the version ID in production.
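
One way to enforce pinning is a guard that rejects unpinned references before they reach `replicate.run()`. The sketch below is an illustration only: `is_pinned` is a hypothetical helper, and it assumes version IDs are 64 lowercase hex characters, as in the example above.

```python
import re

# owner/name:version, where version is assumed to be a 64-char hex digest
PINNED_REF = re.compile(r"[\w.-]+/[\w.-]+:[0-9a-f]{64}")

def is_pinned(ref: str) -> bool:
    """Return True if the model reference includes a full version ID."""
    return PINNED_REF.fullmatch(ref) is not None
```

A deploy-time check such as `assert is_pinned(MODEL_REF)` catches a bare `owner/name` string before it ships.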

06 Cancel In-Flight Prediction

The user cancels or a timeout fires: stop the run so it stops consuming credits

python

import replicate

# model= takes "owner/name"; version= expects a version ID instead.
prediction = replicate.predictions.create(
  model="meta/llama-2-70b-chat",
  input={"prompt": "long task"},
)

print(f"Running: {prediction.id}")

# Cancel it
replicate.predictions.cancel(prediction.id)

print("Canceled")

output

Running: pred_123
Canceled

Cannot cancel synchronous run() calls: only async predictions.create(). Cancellation may not be instant; check status after.
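
A deadline-based cancel can be sketched as below. `wait_or_cancel` is a hypothetical helper; it relies only on a prediction object exposing `status`, `reload()`, and `cancel()`, which is an assumption to check against your installed client. The status is re-read after cancelling because cancellation may not be instant.

```python
import time

TERMINAL = {"succeeded", "failed", "canceled"}

def wait_or_cancel(prediction, deadline_s: float, interval: float = 2.0,
                   sleep=time.sleep, clock=time.monotonic) -> str:
    """Wait for a prediction; cancel it if the deadline passes first.

    Returns the last observed status, which may still be non-terminal
    right after a cancel request.
    """
    start = clock()
    while prediction.status not in TERMINAL:
        if clock() - start >= deadline_s:
            prediction.cancel()
            prediction.reload()  # cancellation may lag; re-read the status
            break
        sleep(interval)
        prediction.reload()
    return prediction.status
```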

Common Input Parameters by Model Type

Prediction Parameters

Parameter	Type	Default	Notes
`prompt`	string	required	Text description (image gen, text models). Max 2000 chars typically.
`num_outputs`	int	1	Number of results to return. Image models: 1-4. More = higher cost.
`num_inference_steps`	int	20-50	Quality/speed tradeoff. 20=fast/blurry, 50+=slow/sharp. Image models only.
`guidance_scale`	float	7.5	Prompt adherence (image gen). 1.0=ignore prompt, 20+=over-literal.
`seed`	int	random	Deterministic output. Same seed + same params = identical output.
`temperature`	float	0.8	Text generation randomness. 0=deterministic, 1.0=creative. LLMs only.
`max_tokens`	int	model-specific	Max output length (LLMs). Affects cost and latency.
`webhook`	string	null	POST callback URL when prediction done. Must be HTTPS.

API Methods Reference

Method / Property	Description	Returns
`replicate.run(model, input, wait=True)`	Synchronous blocking call. Wait up to 5 min for result. Simplest for <30 sec tasks.	Model output (list, dict, string depending on model)
`replicate.predictions.create(version or model, input, webhook)`	Start async prediction. Returns immediately with prediction object. Poll status or use webhook.	Prediction object with id, status, output, error fields
`replicate.predictions.get(prediction_id)`	Fetch prediction status/results by ID. Free call.	Prediction object (status: starting/processing/succeeded/failed/canceled)
`replicate.predictions.cancel(prediction_id)`	Stop in-flight prediction. Stops billing immediately.	Prediction object with status=canceled
`replicate.batch.create(version, inputs)`	Submit 100+ predictions of same model. 50% cheaper but 1-3 hr latency.	Batch object with id, predictions list, completed_at
`replicate.batch.get(batch_id)`	Check batch status and individual prediction results.	Batch object with succeeded_count, failed_count, predictions

Common Errors & Fixes

01 AuthenticationError: Invalid API token

Cause: REPLICATE_API_TOKEN env var is missing, empty, or wrong, or the token has expired or been revoked.

Fix:

python

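# Shell: set the token before starting Python
#   export REPLICATE_API_TOKEN="r8_..."

# OR pass it explicitly through a client object:
import os
import replicate

client = replicate.Client(api_token=os.environ.get("REPLICATE_API_TOKEN"))
client.run(
  "model/name",
  input={...},
)
# Verify token at https://replicate.com/account/api-tokens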

02 RequestError: 404 Not Found (version not found)

Cause: Model name or version ID typo. Model deprecated or private. Using old version ID that no longer exists.

Fix:

python

# Check model exists and get latest version:
import os
import replicate
import requests

r = requests.get(
  "https://api.replicate.com/v1/models/stability-ai/stable-diffusion",
  headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}
)
r.raise_for_status()  # a 404 here means the model name is wrong or private
print(r.json()["latest_version"])

# Use correct format: owner/model or owner/model:version
output = replicate.run(
  "stability-ai/stable-diffusion-v3",  # correct
  input={"prompt": "test"},
)

03 TimeoutError: Prediction timed out after 5 minutes

Cause: Using sync run() for slow model (video processing, large LLM). Model exceeds 5 min time limit.

Fix:

python

# Switch to async polling:
import time
import replicate

prediction = replicate.predictions.create(
  model="openai/whisper",
  input={"audio": "https://example.com/long-video.mp4"},
)

# Include "canceled", or a canceled prediction loops forever
while prediction.status not in ["succeeded", "failed", "canceled"]:
    time.sleep(5)  # poll every 5 sec
    prediction = replicate.predictions.get(prediction.id)

if prediction.status == "succeeded":
    print(prediction.output)
else:
    print(f"Failed: {prediction.error}")

04 ValidationError: Invalid input: height must be multiple of 64

Cause: Model input constraints violated (dimensions, ranges, enums). Read model input schema.

Fix:

python

# Get model input schema:
import os
import replicate
import requests

resp = requests.get(
  "https://api.replicate.com/v1/models/stability-ai/stable-diffusion-v3",
  headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}
)
latest = resp.json()["latest_version"]
print(latest["openapi_schema"]["components"]["schemas"]["Input"])

# Then match constraints:
output = replicate.run(
  "stability-ai/stable-diffusion-v3",
  input={
    "prompt": "test",
    "height": 768,  # must be 512, 576, 640, 704, 768, etc.
    "width": 768,
  },
)
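
When dimensions come from user input, snapping them to the nearest valid value avoids this error. A minimal sketch, assuming the multiple-of-64 constraint from the error message above; `snap_to_multiple` is a hypothetical helper, and the real constraints should be read from the model's input schema.

```python
def snap_to_multiple(value: int, multiple: int = 64, minimum: int = 64) -> int:
    """Round value to the nearest multiple, never going below minimum."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(minimum, snapped)
```

For example, a requested height of 750 becomes 768 and 1000 becomes 1024.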

05 InsufficientCreditsError or RateLimitError: You have exceeded your credits/rate limit

Cause: Account out of credits or hit request rate limits (1000 req/min free tier). Batch job failed billing check.

Fix:

python

import time
import replicate

# Check which account the token belongs to (credit balance is shown in the web dashboard):
print(replicate.accounts.current())

# Add delay between requests:
prompts = ["a dog", "a cat"]  # example inputs
for prompt in prompts:
    output = replicate.run("model/name", input={"prompt": prompt})
    time.sleep(1)  # 1 sec between requests

# Or use batch API for bulk (50% discount):
batch = replicate.batch.create(
  version="model-id",
  inputs=[{"prompt": p} for p in prompts],
)
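
A fixed one-second sleep slows every request, even when no limit is being hit. An alternative is to retry only on failure with exponential backoff. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and the exception types to retry on are left to the caller because the exact rate-limit exception depends on the client version.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_backoff(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on a listed exception wait base_delay * 2**n and retry.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Usage would look like `call_with_backoff(lambda: replicate.run("model/name", input={"prompt": p}))`, with `retry_on` narrowed to the error your client raises for rate limiting.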

Production Gotchas

⚠ Sync run() blocks the caller; timeout is 5 minutes

replicate.run() will block your process for up to 5 min. Longer-running models (video, LLM) will time out and fail. For anything >30 sec, use async predictions.create() + polling or webhooks instead. This is the #1 production failure.

⚠ Model versions are immutable; updates create new versions

Pinning a version ID (the long sha-256 hash) guarantees reproducibility. Using the short name like 'stability-ai/stable-diffusion' always pulls the latest version. In production, always pin to a specific version ID after testing. The short name can break your output determinism.

⚠ Webhook retry logic can deliver duplicates; idempotency is your job

Replicate retries failed webhooks up to 5 times. Your endpoint may receive the same prediction.id multiple times. Always check if you've already processed this ID before updating your database. Use prediction.id as idempotency key.
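
The deduplication step can be sketched as follows. This is an illustration under assumptions: the webhook body is taken to be the prediction as JSON with `id` and `output` fields, `WebhookProcessor` is a hypothetical class, and the in-memory set stands in for a database table with a unique constraint on the prediction ID.

```python
import json

class WebhookProcessor:
    """Processes each prediction ID at most once."""

    def __init__(self):
        self._seen = set()   # stand-in for a unique-keyed DB table
        self.results = {}

    def handle(self, raw_body: bytes) -> bool:
        """Store the output; return False if this ID was already handled."""
        payload = json.loads(raw_body)
        prediction_id = payload["id"]
        if prediction_id in self._seen:
            return False  # duplicate delivery from a retry: ignore
        self._seen.add(prediction_id)
        self.results[prediction_id] = payload.get("output")
        return True
```

Return a 2xx response for duplicates as well, so the sender stops retrying.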

⚠ Polling predictions.get() is free but has stale status

Each predictions.get() call is not metered, but status may lag 1-2 sec behind actual completion. Don't poll faster than every 2 sec. Polling every 100ms wastes your time and the server's resources with no benefit.

⚠ Batch API latency is 1-3 hours; not for real-time use

Batch predictions are 50% cheaper but they queue and run in bulk. You get ~1-3 hour latency. Use replicate.run() or async predictions for real-time use cases. Batch is for background jobs only.

⚠ Output URLs expire; download immediately or store in your DB

Replicate CDN URLs live for 1 week. If you need the image/audio forever, download it and store in S3 or your database immediately after the prediction completes. Relying on the URL after 7 days will 404.
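
Persisting outputs right after completion can be as small as the sketch below. It assumes outputs arrive as URL strings, as in the examples above; newer client versions may return file-like objects instead, in which case this needs adapting. `save_outputs` is a hypothetical helper that writes to local disk as a stand-in for S3 or other durable storage.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def save_outputs(urls, dest_dir, opener=urllib.request.urlopen):
    """Download each URL into dest_dir and return the saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Keep the remote file name, prefixed with the index to avoid clashes
        name = Path(urlparse(url).path).name or "output"
        target = dest / f"{index:03d}_{name}"
        with opener(url) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved
```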

⚠ File inputs must be URLs or open file handles, not path strings

For images, audio, and video, pass an HTTP/HTTPS URL or, with the Python client, an open binary file handle such as open("photo.jpg", "rb"). A bare local path string is sent as plain text and the model cannot read it. For large files, upload to S3 or a public CDN first, then pass the URL.

⚠ Seed != reproducibility if the model version changes

Same seed + same params should give identical output. But if the model version itself changes (weights updated), seed no longer guarantees reproducibility. Always pin version ID for reproducible research.

Replicate vs Alternatives

Feature	Replicate	Hugging Face Inference API	AWS SageMaker
Model selection	200+ open-source models (curated)	100k+ models (all HF hub)	Bring your own or AWS models
Sync/async	Both (run + predictions.create)	Sync only (simple API)	Async (real infrastructure)
Pricing model	Pay per prediction + GPU time	Free tier, then per-token	Per-hour infrastructure (expensive)
Webhooks	Yes, included	No (use background jobs)	Yes, via SNS/Lambda
Batch API	Yes, 50% discount	No	Yes (SageMaker Batch Transform)
Setup time	1 min (API token)	1 min (HF token)	30+ min (VPC, IAM, model upload)
Latency	2-30 sec typical (depends on model)	1-5 sec (T4 inference)	5-60 sec (cold start)

Verified 2026-04 · v2024.12
