LiteLLM Cheat Sheet — Proxy, Router & Multi-Model — LiteLLM
from litellm import completion, acompletion, Router
from litellm.proxy.server import app

Single API wrapper for 100+ LLM providers with automatic routing and fallback.
Like a universal adapter that lets you plug any charger into any device. Your code always calls completion(), but the actual provider (OpenAI, Claude, Gemini) can change without touching your logic.
Core Patterns
from litellm import completion
import os

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello"}],
    api_key=os.environ["OPENAI_API_KEY"],
)
print(response.choices[0].message.content)
# Example output: Hello! How can I help you today?

import asyncio
from litellm import acompletion
import os

async def call_models():
    # Send the same prompt to both providers concurrently
    responses = await asyncio.gather(
        acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Q1"}],
            api_key=os.environ["OPENAI_API_KEY"],
        ),
        acompletion(
            model="claude-3-5-sonnet-20241022",
            messages=[{"role": "user", "content": "Q1"}],
            api_key=os.environ["ANTHROPIC_API_KEY"],
        ),
    )
    return responses

results = asyncio.run(call_models())

from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "gpt-4-cheap",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
                # Custom pricing in USD per token (example values; check current rates)
                "input_cost_per_token": 0.000015,
                "output_cost_per_token": 0.0006,
            },
        },
        {
            # Same model_name, so the Router treats both entries as one group
            "model_name": "gpt-4-cheap",
            "litellm_params": {
                "model": "claude-3-5-haiku-20241022",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "input_cost_per_token": 0.00008,
                "output_cost_per_token": 0.0004,
            },
        },
    ]
)

response = router.completion(
    model="gpt-4-cheap",
    messages=[{"role": "user", "content": "Hello"}],
    timeout=5,
    num_retries=2,
)
print(response.choices[0].message.content)

# config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

# terminal: litellm --config config.yaml --port 8000

# Your app (only the base URL changes; openai>=1.0 client):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="anything")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hi"}],
)
print(response.choices[0].message.content)

from litellm import completion, completion_cost, cost_per_token
import os

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 100"}],
    api_key=os.environ["OPENAI_API_KEY"],
)

# Token counts are reported in response.usage
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")
# completion_cost() prices the call from LiteLLM's pricing map
# (response._response_ms is latency, not cost)
print(f"Total cost: ${completion_cost(completion_response=response):.6f}")

# Manual cost lookup from token counts; returns (prompt cost, completion cost) in USD
prompt_cost, output_cost = cost_per_token(
    model="gpt-4o",
    prompt_tokens=response.usage.prompt_tokens,
    completion_tokens=response.usage.completion_tokens,
)
print(f"Cost: ${prompt_cost + output_cost:.6f}")

from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "production-gpt",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_key": os.environ["AZURE_API_KEY"],
                "api_base": "https://myazure.openai.azure.com",
                "api_version": "2024-08-01-preview",
            },
        },
        {
            # Same model_name: second deployment in the same group
            "model_name": "production-gpt",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
                "timeout": 10,
            },
        },
    ]
)

response = router.completion(
    model="production-gpt",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

completion() Parameters (Most Common)
| Parameter | Type | Default | Notes |
|---|---|---|---|
| model | str | required | Provider/model string: 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet-20241022', 'bedrock/anthropic.claude-3-sonnet' |
| messages | list[dict] | required | [{"role": "user", "content": "..."}, ...] |
| temperature | float | None (provider default) | Range 0–2 for OpenAI models; other providers may use a narrower range. Higher = more creative, lower = more deterministic |
| max_tokens | int | None | Cap output length. None = provider default, which varies by model |
| top_p | float | None (provider default) | Nucleus sampling: 0–1. Lower = more focused. Tune either temperature or top_p, not both |
| timeout | float | 600 | Request timeout in seconds. Router default 600; per-request override supported |
| num_retries | int | 0 | Retry count on failure (timeout, rate limit, auth error). Accepted by completion() and by Router |
| api_key | str | from env | Override env var. Recommended: use os.environ["KEY"] |
| api_base | str | provider default | Custom endpoint URL (Azure, local proxy, private LLM) |
Router API Reference
| Method / Property | Description | Returns |
|---|---|---|
| Router.completion(model, messages, **kwargs) | Sync completion call. Routes to a deployment from model_list and retries or falls back on failure. | ModelResponse with .choices[0].message.content, .usage (tokens), ._response_ms (latency) |
| Router.acompletion(model, messages, **kwargs) | Async version. Returns awaitable coroutine. | Coroutine[ModelResponse] |
| Router.get_model_names() | List the model_name strings configured in model_list. | list[str] of model names |
| Router.reset() | Clear the router's cache and callbacks. Rarely needed. | None |
| cost_per_token(model, prompt_tokens, completion_tokens) | Manual cost lookup from the LiteLLM pricing DB. Models missing from the DB cannot be priced, so handle that case explicitly. | tuple of floats: (prompt cost, completion cost) in USD |
| get_valid_models() | Return the models LiteLLM can call with the provider API keys currently set in the environment. | list[str] of model IDs |
Common Errors & Fixes
litellm.RateLimitError: Rate limit exceeded
Cause: Provider rate limit hit, from a single large call or from sustained load. Limits depend on provider, model, and account tier (for example OpenAI: 3500 RPM / gpt-4o, Anthropic: 50,000 TPM).
Set num_retries on Router or add exponential backoff:
import os
import time
import litellm
from litellm import completion

for attempt in range(3):
    try:
        response = completion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hi"}],
            api_key=os.environ["OPENAI_API_KEY"],
        )
        break
    except litellm.RateLimitError:
        # Any other exception propagates unchanged
        wait = 2 ** attempt  # 1s, 2s, 4s
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
else:
    # Loop finished without break: every attempt was rate limited
    raise RuntimeError("Still rate limited after 3 attempts")

litellm.AuthenticationError: API key invalid or expired
Cause: api_key env var missing, wrong key, or key revoked at provider. Most common with multi-provider setups.
Verify the key exists and is active:
import os
import litellm
from litellm import completion

# Check key is set
if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("OPENAI_API_KEY not in environment")

# Test with a simple call
try:
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "test"}],
        api_key=os.environ["OPENAI_API_KEY"],
        timeout=5,
    )
except litellm.AuthenticationError as e:
    print(f"Auth failed: {e}")
    # Print only a short prefix so the full key never reaches the logs
    print(f"Key prefix: {os.environ['OPENAI_API_KEY'][:8]}...")

Router: No models available (all failed or down)
Cause: All models in model_list have failed health checks or thrown unrecoverable errors, so the Router exhausted its retries.
Add a fallback model group and retries:
router = Router(
    model_list=[
        {"model_name": "primary", ...},
        {"model_name": "primary", ...},
        {"model_name": "fallback", "litellm_params": {"model": "gpt-4o-mini", ...}},
    ],
    fallbacks=[{"primary": ["fallback"]}],  # Try "fallback" when every "primary" deployment fails
    num_retries=2,  # Retry before fallback
)
# Log which model was used
response = router.completion(model="primary", messages=[...])
print(f"Model used: {response.model}")

litellm.APIConnectionError: Connection timeout / Failed to connect
Cause: Network issue, provider API down, or custom api_base unreachable. Happens with proxy, Azure, or private LLM endpoints.
Test the endpoint, then call with an explicit timeout:
import requests
import os
from litellm import completion

# Test endpoint before calling
api_base = "https://myazure.openai.azure.com"
try:
    # Any HTTP status, even 404, proves the host is reachable
    probe = requests.get(api_base, timeout=5)
    print(f"Endpoint reachable: {probe.status_code}")
except requests.RequestException:
    print(f"Endpoint unreachable: {api_base}")

# Call with an explicit timeout
response = completion(
    model="azure/gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    api_base=api_base,
    api_key=os.environ["AZURE_API_KEY"],
    timeout=30,  # Seconds; explicit value instead of the 600-second default
)

Router returns None or empty response
Cause: Model config missing litellm_params, model string invalid, or API returned empty choice.
Validate model_list config and check response object:
from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "openai/gpt-4o",  # Use the explicit 'provider/model' form
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        }
    ]
)

response = router.completion(
    model="my-model",
    messages=[{"role": "user", "content": "test"}],
)
if response and response.choices:
    print(response.choices[0].message.content)
else:
    print(f"Empty response: {response}")

Production Gotchas
❌ 'gpt-4o' → works locally, fails in Router without an explicit api_key per model
❌ 'openai-gpt-4o' → wrong format
✅ 'openai/gpt-4o' → explicit provider (works everywhere)
✅ 'gpt-4o' → implicit provider (works when the matching API key is set in the environment)
In Router model_list, always use the full 'provider/model' format to avoid ambiguity.
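The naming rule above can be checked before a model string ever reaches the Router. The following is a minimal sketch only: `split_model_string` and `KNOWN_PROVIDERS` are hypothetical names invented for illustration, not part of LiteLLM, and the provider set is a hand-picked subset.

```python
# Hypothetical helper (not a LiteLLM API): validate 'provider/model' strings.
KNOWN_PROVIDERS = {"openai", "anthropic", "azure", "bedrock"}  # illustrative subset


def split_model_string(model: str) -> tuple[str, str]:
    """Return (provider, model_id); raise ValueError if the prefix is missing or unknown."""
    provider, sep, model_id = model.partition("/")
    if not sep or not model_id:
        raise ValueError(f"{model!r} has no 'provider/' prefix")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider {provider!r} in {model!r}")
    return provider, model_id


print(split_model_string("openai/gpt-4o"))
```

Running a check like this over every `litellm_params["model"]` value at startup would surface a string such as 'openai-gpt-4o' as a configuration error rather than a failed request.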
num_retries only catches: timeout, rate limit, auth error, network failure.
It does NOT retry on:
- Invalid JSON in response
- Model refusing to answer (the refusal arrives as a normal message, not as an HTTP error)
- Hallucinations or wrong output format
If you need to retry on logic errors, wrap the call in your own try/except.
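A retry on logic errors could look like the sketch below. It assumes the caller wants a JSON object back; `call_with_validation` is a hypothetical helper, and the lambda is a stub standing in for a function that calls completion() and returns the message text.

```python
import json
from typing import Callable


def call_with_validation(call: Callable[[], str], max_attempts: int = 3) -> dict:
    """Invoke call() until its text parses as a JSON object, up to max_attempts times."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        text = call()
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError as exc:
            last_error = exc  # Malformed output: try again
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stub replies: the first is malformed, the second is valid.
replies = iter(["not json", '{"answer": 42}'])
result = call_with_validation(lambda: next(replies))
print(result)
```

Each extra attempt is a billed request, so a small `max_attempts` is usually the safer default.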
The proxy accepts ANY Bearer token or api_key header string. It's not validated unless you set master_key in config.
❌ Security risk in untrusted networks: anyone who can reach the proxy can spend against your provider API keys
✅ Use only in: internal networks, behind an auth proxy (nginx, Cloudflare), or with master_key enabled
Always run behind a firewall or authentication layer in production.
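Once master_key is set, clients have to present it as a Bearer token. The sketch below builds such a request with the standard library and deliberately does not send it. The key value "sk-1234", the port, and the `build_chat_request` helper are assumptions for illustration; the master key itself is configured under general_settings in the proxy config.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8000"  # Assumption: proxy started with --port 8000
MASTER_KEY = "sk-1234"               # Assumption: placeholder for the configured master_key


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completion request for the proxy."""
    body = json.dumps(
        {"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{PROXY_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("Hi")
print(req.full_url, req.get_method())
# urllib.request.urlopen(req) would send it; omitted so the sketch runs without a proxy.
```

In a real deployment the key would come from an environment variable or secret store rather than a constant in the source.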
OpenAI, Anthropic, Google: the pricing DB is generally accurate and regularly updated.
Bedrock, Azure, custom models: pricing may be stale or None.
For accurate cost tracking:
- Hardcode custom model costs in model_list, inside litellm_params: "input_cost_per_token": 0.00X, "output_cost_per_token": 0.00Y
- Validate with the provider's actual billing dashboard
- Don't rely solely on LiteLLM cost for billing
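For cross-checking against a billing dashboard, the per-call arithmetic is simple enough to reproduce independently. A minimal sketch, assuming per-token prices you supply yourself; `call_cost` is a hypothetical helper and the prices are placeholders, not real rates for any model.

```python
from decimal import Decimal


def call_cost(prompt_tokens: int, completion_tokens: int,
              input_price: str, output_price: str) -> Decimal:
    """USD cost of one call; prices are per token, passed as strings to keep decimals exact."""
    return prompt_tokens * Decimal(input_price) + completion_tokens * Decimal(output_price)


# Placeholder prices for illustration only.
cost = call_cost(1_000, 500, input_price="0.000002", output_price="0.000008")
print(f"${cost}")
```

Decimal avoids the rounding drift that accumulates when many tiny float costs are summed over a billing period.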
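By default the Router load-balances across deployments that share a model_name (the "simple-shuffle" strategy). It picks the cheapest deployment only when you opt in with routing_strategy="cost-based-routing".
❌ Putting the expensive model first doesn't prioritize it; list order is not a priority order
✅ To bias selection: set rpm, tpm, or weight in litellm_params, or choose a routing_strategy that matches your goal
Use the same model_name for multiple providers to enable automatic fallback on failure.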
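To make the difference between selection policies concrete, here is a toy comparison of a cost-first pick and a weighted random pick. This illustrates the two ideas only and is not LiteLLM's implementation; the deployment names, costs, and weights are made up.

```python
import random

# Hypothetical deployments; cost and weight values are placeholders.
deployments = [
    {"name": "deployment-a", "cost": 0.15, "weight": 3},
    {"name": "deployment-b", "cost": 0.80, "weight": 1},
]


def pick_cheapest(deps: list[dict]) -> dict:
    """Cost-first policy: always the deployment with the lowest cost."""
    return min(deps, key=lambda d: d["cost"])


def pick_weighted(deps: list[dict], rng: random.Random) -> dict:
    """Weighted shuffle policy: random pick, proportional to each deployment's weight."""
    return rng.choices(deps, weights=[d["weight"] for d in deps], k=1)[0]


rng = random.Random(0)
picks = [pick_weighted(deployments, rng)["name"] for _ in range(1000)]
print(pick_cheapest(deployments)["name"], picks.count("deployment-a"))
```

Under the weighted policy the cheaper deployment still receives most of the traffic here, but only because of its weight, and the other deployment is never starved entirely.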
Calling acompletion() 1000 times in parallel will create 1000 HTTP connections.
✅ Use async for 10-100 concurrent requests (reasonable)
❌ Don't launch 10k+ concurrent requests without connection pooling
For massive parallelism, use asyncio.Semaphore to limit concurrent calls:
sem = asyncio.Semaphore(50)

async def bounded_call(msg):
    async with sem:
        return await acompletion(...)

tasks = [bounded_call(msg) for msg in messages]
results = await asyncio.gather(*tasks)
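The semaphore pattern can be verified without spending tokens by swapping in a fake call that records how many requests are in flight. In this sketch `fake_llm_call` and `run_bounded` are hypothetical stand-ins; in real code the fake would be replaced by acompletion().

```python
import asyncio


async def fake_llm_call(msg: str, state: dict) -> str:
    """Stand-in for acompletion(); records the number of calls in flight."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # Simulated network latency
    state["active"] -= 1
    return msg.upper()


async def run_bounded(messages: list[str], limit: int) -> tuple[list[str], int]:
    """Run all messages with at most `limit` calls in flight; return results and observed peak."""
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}

    async def bounded(msg: str) -> str:
        async with sem:
            return await fake_llm_call(msg, state)

    results = await asyncio.gather(*(bounded(m) for m in messages))
    return list(results), state["peak"]


results, peak = asyncio.run(run_bounded([f"q{i}" for i in range(20)], limit=5))
print(f"{len(results)} results, peak concurrency {peak}")
```

asyncio.gather returns results in input order, so the bound changes throughput but not which answer belongs to which prompt.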
timeout=10 means the entire request (connect + wait for response) must complete in 10 seconds.
With slow providers or long generations (Anthropic can take 20s+), you MUST increase timeout:
response = completion(..., timeout=60)  # 60 seconds for long generations
The default of 600 seconds is safe for most cases, but a stalled request can then hold a worker for up to 10 minutes. Set the timeout explicitly.