Cheat Sheet intermediate · 8 min read

LiteLLM Cheat Sheet — Proxy, Router & Multi-Model — LiteLLM

version 1.x

Unified API for all LLM providers, cost tracking, fallback routing

env vars OPENAI_API_KEY · ANTHROPIC_API_KEY · GOOGLE_API_KEY · LITELLM_LOG · LITELLM_PROXY_BASE_URL

install pip install litellm

core imports

python

from litellm import completion, acompletion, Router
from litellm.proxy.proxy_server import app  # requires: pip install 'litellm[proxy]'

Mental model

Single API wrapper for 100+ LLM providers with automatic routing and fallback.

Think of it as a universal adapter that lets you plug any charger into any device. Your code always calls completion(), but the actual provider (OpenAI, Anthropic, Google) can change without touching your logic.

Core Patterns

01 Simple completion call

Call any provider with one function

python

from litellm import completion
import os

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello"}],
    api_key=os.environ["OPENAI_API_KEY"]
)
print(response.choices[0].message.content)

output Hello! How can I help you today?

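Model names must match the provider's identifier exactly: 'gpt-4o' and 'gpt-4' are different models. The provider prefix is optional for well-known models ('openai/gpt-4o' and 'gpt-4o' both work), but pairing a model with another provider's API key causes an authentication error.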

02 Async completion (non-blocking)

Concurrent requests to multiple providers or models

python

import asyncio
from litellm import acompletion
import os

async def call_models():
    responses = await asyncio.gather(
        acompletion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Q1"}],
            api_key=os.environ["OPENAI_API_KEY"]
        ),
        acompletion(
            model="claude-3-5-sonnet-20241022",
            messages=[{"role": "user", "content": "Q1"}],
            api_key=os.environ["ANTHROPIC_API_KEY"]
        )
    )
    return responses

results = asyncio.run(call_models())

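acompletion() returns a coroutine, so it must be awaited or run with asyncio.run(). If you forget await, you get a coroutine object instead of a response; Python emits only a "never awaited" RuntimeWarning and raises no error.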
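When fanning out to several models, one failing call makes asyncio.gather raise and discard the other results unless return_exceptions=True is passed. A minimal sketch of that pattern, using a hypothetical stand-in coroutine `fake_call` in place of acompletion() so it runs without API keys:

```python
import asyncio

async def fake_call(model: str) -> str:
    """Hypothetical stand-in for acompletion(); fails for one model name."""
    await asyncio.sleep(0)  # yield control like a real network call would
    if model == "broken-model":
        raise RuntimeError(f"{model} is unavailable")
    return f"reply from {model}"

async def call_all(models: list[str]) -> dict[str, str]:
    # return_exceptions=True keeps successful results when one call fails
    results = await asyncio.gather(
        *(fake_call(m) for m in models), return_exceptions=True
    )
    return {
        m: (f"error: {r}" if isinstance(r, Exception) else r)
        for m, r in zip(models, results)
    }

outcome = asyncio.run(call_all(["gpt-4o", "broken-model"]))
print(outcome)
```

Each entry is either the reply text or an error string, so the caller can decide per model whether to retry.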

03 Router with load balancing & fallback

Route requests to cheapest/fastest, fallback on failure

python

from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "gpt-4-cheap",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
                # Optional custom pricing, USD per token; check current provider prices
                "input_cost_per_token": 0.00000015,
                "output_cost_per_token": 0.0000006
            }
        },
        {
            "model_name": "gpt-4-cheap",
            "litellm_params": {
                "model": "claude-3-5-haiku-20241022",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "input_cost_per_token": 0.0000008,
                "output_cost_per_token": 0.000004
            }
        }
    ]
)

response = router.completion(
    model="gpt-4-cheap",
    messages=[{"role": "user", "content": "Hello"}],
    timeout=5,
    num_retries=2
)
print(response.choices[0].message.content)

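By default the Router load-balances across deployments that share a model_name ('gpt-4-cheap' here) with its "simple-shuffle" strategy; list order does not set priority. Picking the cheapest deployment is opt-in via routing_strategy="cost-based-routing", which compares per-token prices (custom values or LiteLLM's pricing map). num_retries controls how often a failed request is retried; falling back to a different model group is configured separately through the Router's fallbacks parameter.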
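To make "pick by lowest cost" concrete, the sketch below ranks deployments by the estimated price of one request. It is an illustration only, not LiteLLM's implementation; the `Deployment` class, the `cheapest` helper, and the prices are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    name: str
    input_cost_per_token: float   # USD per prompt token (illustrative value)
    output_cost_per_token: float  # USD per completion token (illustrative value)

    def estimate(self, prompt_tokens: int, completion_tokens: int) -> float:
        return (prompt_tokens * self.input_cost_per_token
                + completion_tokens * self.output_cost_per_token)

def cheapest(deployments, prompt_tokens, completion_tokens):
    """Return the deployment with the lowest estimated request cost."""
    return min(deployments, key=lambda d: d.estimate(prompt_tokens, completion_tokens))

pool = [
    Deployment("deployment-a", 1e-6, 2e-6),
    Deployment("deployment-b", 5e-7, 4e-6),
]
# Which deployment is cheaper depends on the prompt/completion mix
first = cheapest(pool, prompt_tokens=1000, completion_tokens=100).name
second = cheapest(pool, prompt_tokens=100, completion_tokens=1000).name
print(first, second)
```

A prompt-heavy request favors the deployment with cheap input tokens, while a generation-heavy request favors the one with cheap output tokens.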

04 Proxy server (drop-in OpenAI replacement)

Replace OpenAI endpoint in apps without code changes

yaml

# config.yaml
model_list:
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

shell

litellm --config config.yaml --port 8000

python

# Your app: only the base URL changes (openai>=1.0 client)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="anything")

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hi"}]
)
print(response.choices[0].message.content)

The proxy expects a Bearer token in the Authorization header (the api_key argument of the OpenAI client), even if it is a dummy string, so set api_key to any non-empty value. Keys are not validated and database logging is off by default: set master_key under general_settings in config.yaml to require authentication and enable key tracking.
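Because the proxy speaks the OpenAI wire format, any HTTP client can call it. The sketch below only builds the request with the standard library and does not send it. It assumes a proxy listening on localhost:8000 that serves the OpenAI-style /chat/completions path; `build_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_request(base_url: str, token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # dummy value is fine unless master_key is set
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("http://localhost:8000", "anything", "gpt-4", "Hi")
print(req.full_url, req.get_header("Authorization"))
# To send it: urllib.request.urlopen(req, timeout=10)
```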

05 Automatic cost & token tracking

Log costs per request without manual bookkeeping

python

from litellm import completion, completion_cost, cost_per_token
import os

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Count to 100"}],
    api_key=os.environ["OPENAI_API_KEY"]
)

# Token counts are reported in response.usage
print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Completion tokens: {response.usage.completion_tokens}")

# Cost in USD, computed from LiteLLM's pricing map
print(f"Total cost: ${completion_cost(completion_response=response):.6f}")

# Manual cost lookup from token counts: returns (prompt cost, completion cost)
prompt_usd, completion_usd = cost_per_token(
    model="gpt-4o",
    prompt_tokens=response.usage.prompt_tokens,
    completion_tokens=response.usage.completion_tokens
)
print(f"Cost: ${prompt_usd + completion_usd:.6f}")

Cost is only available for models in LiteLLM's pricing map; custom or fine-tuned models have no price to look up, so register custom pricing or handle the failed lookup. _response_ms is latency in milliseconds, not cost: use completion_cost() or cost_per_token() for pricing. Some providers don't report token counts (e.g., older Bedrock models).
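For aggregation, a small ledger that sums cost per model and counts requests whose cost is unknown keeps the missing-price case visible. A sketch; `CostLedger` is hypothetical and takes the cost as a plain number (or None) from whatever lookup you use.

```python
from collections import defaultdict
from typing import Optional

class CostLedger:
    """Accumulate per-model spend; track requests with no known price."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)  # model -> USD
        self.unpriced = defaultdict(int)  # model -> requests without a cost

    def record(self, model: str, cost: Optional[float]) -> None:
        if cost is None:
            self.unpriced[model] += 1  # don't silently treat unknown as free
        else:
            self.totals[model] += cost

    def total(self) -> float:
        return sum(self.totals.values())

ledger = CostLedger()
ledger.record("gpt-4o", 0.25)
ledger.record("gpt-4o", 0.5)
ledger.record("my-finetune", None)
print(ledger.total(), dict(ledger.unpriced))
```

Counting unpriced requests separately makes it obvious when a reported total understates real spend.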

06 Custom model mapping & deployment names

Map friendly names to long deployment IDs (Azure, private endpoints)

python

from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "production-gpt",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_key": os.environ["AZURE_API_KEY"],
                "api_base": "https://myazure.openai.azure.com",
                "api_version": "2024-08-01-preview"
            }
        },
        {
            "model_name": "production-gpt",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
                "timeout": 10
            }
        }
    ]
)

response = router.completion(
    model="production-gpt",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

Azure requires api_base, api_version, and the deployment name, which goes in the model param as 'azure/<deployment_name>'. If api_base is wrong or unreachable, the error is a generic 'Connection failed', so check the config first. Priority order: exact model_name match > wildcard fallback.
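Since the deployment name travels inside the model string, a small builder can keep Azure entries consistent. A sketch with a hypothetical helper `azure_entry`; the deployment name and key below are placeholders.

```python
def azure_entry(alias: str, deployment: str, api_base: str,
                api_version: str, api_key: str) -> dict:
    """Build one Router model_list entry for an Azure deployment."""
    required = {"deployment": deployment, "api_base": api_base,
                "api_version": api_version, "api_key": api_key}
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing Azure settings: {', '.join(missing)}")
    return {
        "model_name": alias,
        "litellm_params": {
            "model": f"azure/{deployment}",  # deployment name, not the base model name
            "api_key": api_key,
            "api_base": api_base,
            "api_version": api_version,
        },
    }

entry = azure_entry("production-gpt", "my-gpt4o-deployment",
                    "https://myazure.openai.azure.com", "2024-08-01-preview",
                    "placeholder-key")
print(entry["litellm_params"]["model"])
```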

completion() Parameters (Most Common)

completion()

Parameter	Type	Default	Notes
`model`	str	required	Provider/model string: 'openai/gpt-4o', 'anthropic/claude-3-5-sonnet-20241022', 'bedrock/anthropic.claude-3-sonnet'
`messages`	list[dict]	required	[{"role": "user", "content": "..."}, ...]
`temperature`	float	None	Range 0–2 for OpenAI; other providers may differ. None = provider default. Higher = more random, lower = more deterministic
`max_tokens`	int	None	Cap output length. None = provider default, which varies by model
`top_p`	float	None	Nucleus sampling: 0–1. Lower = more focused. Providers generally recommend adjusting temperature or top_p, not both
`timeout`	float	600	Request timeout in seconds. Router default 600; per-request override supported
`num_retries`	int	0	Retry count on failure (timeout, rate limit, auth error). Accepted by completion() and by Router
`api_key`	str	from env	Override env var. Recommended: use os.environ["KEY"]
`api_base`	str	provider default	Custom endpoint URL (Azure, local proxy, private LLM)

Router API Reference

Method / Property	Description	Returns
`Router.completion(model, messages, **kwargs)`	Sync completion call. Routes to a deployment from model_list and retries or falls back according to Router settings.	litellm.ModelResponse with .choices[0].message.content, .usage (tokens), ._response_ms (latency)
`Router.acompletion(model, messages, **kwargs)`	Async version. Returns awaitable coroutine.	Coroutine[litellm.ModelResponse]
`Router.get_model_names()`	List the model_name strings configured in model_list.	list[str] of model names
`Router.reset()`	Clear cache, reset model priorities, stop background health checks. Rarely needed.	None
`cost_per_token(model, prompt_tokens, completion_tokens)`	Manual cost lookup from LiteLLM's pricing map. Fails for models that are not in the map.	tuple[float, float]: (prompt cost, completion cost) in USD
`get_valid_models()`	Return the models available for the providers whose API keys are set in the environment.	list[str] of model IDs

Common Errors & Fixes

01 litellm.RateLimitError: Rate limit exceeded

Cause: Provider rate limit hit (requests or tokens per minute; the exact limits depend on provider, model, and account tier). Can be triggered by a single burst or by sustained load.

Fix: Set num_retries on Router or add exponential backoff:

python

import os
import time

import litellm
from litellm import completion

for attempt in range(3):
    try:
        response = completion(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hi"}],
            api_key=os.environ["OPENAI_API_KEY"]
        )
        break
    except litellm.RateLimitError:
        if attempt == 2:
            raise  # Out of attempts: surface the error
        wait = 2 ** attempt  # 1s, then 2s
        print(f"Rate limited. Waiting {wait}s...")
        time.sleep(wait)
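
The same backoff logic can be factored into a reusable helper. A sketch with hypothetical names (`retry_with_backoff`, `flaky`); the `retry_on` argument is where a rate-limit exception type would go, and `sleep` is injectable so the demo runs instantly.

```python
import random
import time

def retry_with_backoff(call, retry_on=(Exception,), attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait cap after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Random jitter spreads out clients that were throttled together
            sleep(random.uniform(0, base_delay * 2 ** attempt))

calls = {"n": 0}

def flaky():
    """Hypothetical call that is throttled twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("rate limited")
    return "ok"

waits = []
result = retry_with_backoff(flaky, retry_on=(TimeoutError,), sleep=waits.append)
print(result, len(waits))
```

Passing a narrow `retry_on` tuple keeps programming errors from being retried by accident.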

02 litellm.AuthenticationError: API key invalid or expired

Cause: api_key env var missing, wrong key, or key revoked at provider. Most common with multi-provider setups.

Fix: Verify key exists and is active:

python

import os

import litellm
from litellm import completion

# Check key is set
if "OPENAI_API_KEY" not in os.environ:
    raise ValueError("OPENAI_API_KEY not in environment")

# Test with simple call
try:
    response = completion(
        model="gpt-4o",
        messages=[{"role": "user", "content": "test"}],
        api_key=os.environ["OPENAI_API_KEY"],
        timeout=5
    )
except litellm.AuthenticationError as e:
    print(f"Auth failed: {e}")
    # Print only a short prefix so the key never lands in logs
    print(f"Key prefix: {os.environ['OPENAI_API_KEY'][:6]}...")

03 Router: No models available (all failed or down)

Cause: All models in model_list have failed health checks or thrown unrecoverable errors. Router exhausted retries.

Fix: Add a fallback model group and retries:

python

router = Router(
    model_list=[
        {"model_name": "primary", ...},
        {"model_name": "primary", ...},
        {"model_name": "fallback", "litellm_params": {"model": "gpt-4o-mini", ...}}
    ],
    fallbacks=[{"primary": ["fallback"]}],  # If "primary" fails, try "fallback"
    num_retries=2  # Retry before fallback
)

# Log which model was used
response = router.completion(model="primary", messages=[...])
print(f"Model used: {response.model}")
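
Conceptually, fallback means trying model groups in order until one returns. The sketch below shows that control flow only; it is not the Router's code, and `call_with_fallbacks` and the stub `fake_completion` are hypothetical.

```python
def call_with_fallbacks(groups, call):
    """Try each model group in order; return (group, result) for the first success."""
    errors = {}
    for group in groups:
        try:
            return group, call(group)
        except Exception as exc:  # in real code, catch specific provider errors
            errors[group] = exc
    raise RuntimeError(f"all model groups failed: {errors}")

def fake_completion(group):
    """Stub standing in for a completion call; 'primary' is down."""
    if group == "primary":
        raise ConnectionError("primary unavailable")
    return f"answer from {group}"

used, answer = call_with_fallbacks(["primary", "fallback"], fake_completion)
print(used, answer)
```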

04 litellm.APIConnectionError: Connection timeout / Failed to connect

Cause: Network issue, provider API down, or custom api_base unreachable. Happens with proxy, Azure, or private LLM endpoints.

Fix: Test the endpoint, then set an explicit timeout:

python

import requests
import os
from litellm import completion

# Test reachability before calling; any HTTP status means the host answered
api_base = "https://myazure.openai.azure.com"
try:
    probe = requests.get(api_base, timeout=5)
    print(f"Endpoint reachable: {probe.status_code}")
except requests.RequestException as e:
    print(f"Endpoint unreachable: {api_base} ({e})")

# Call with an explicit timeout
response = completion(
    model="azure/gpt-4o",
    messages=[{"role": "user", "content": "Hi"}],
    api_base=api_base,
    api_key=os.environ["AZURE_API_KEY"],
    api_version="2024-08-01-preview",
    timeout=30  # Seconds; the default is 600 seconds, not milliseconds
)
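
A TCP-level check avoids depending on any particular HTTP path existing on the endpoint. A sketch using only the standard library; `host_port` and `can_connect` are hypothetical helpers.

```python
import socket
from urllib.parse import urlsplit

def host_port(api_base: str) -> tuple[str, int]:
    """Extract host and port from a URL, defaulting the port by scheme."""
    parts = urlsplit(api_base)
    if not parts.hostname:
        raise ValueError(f"not a valid URL: {api_base!r}")
    default = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default

def can_connect(api_base: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(host_port(api_base), timeout=timeout):
            return True
    except OSError:
        return False

print(host_port("https://myazure.openai.azure.com"))
print(host_port("http://localhost:8000"))
```

Calling `can_connect(api_base)` before the completion call separates network problems from provider-side errors.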

05 Router returns None or empty response

Cause: Model config missing litellm_params, model string invalid, or API returned empty choice.

Fix: Validate model_list config and check response object:

python

from litellm import Router
import os

router = Router(
    model_list=[
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "openai/gpt-4o",  # 'provider/model', or a model name LiteLLM recognizes
                "api_key": os.environ["OPENAI_API_KEY"]
            }
        }
    ]
)

response = router.completion(
    model="my-model",
    messages=[{"role": "user", "content": "test"}]
)

if response and response.choices:
    print(response.choices[0].message.content)
else:
    print(f"Empty response: {response}")
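
Config mistakes are cheaper to catch before the Router is built. A sketch of a structural check; `validate_model_list` is a hypothetical helper, not a LiteLLM function.

```python
def validate_model_list(model_list):
    """Return a list of human-readable problems; empty means the shape looks fine."""
    problems = []
    for i, entry in enumerate(model_list):
        if not entry.get("model_name"):
            problems.append(f"entry {i}: missing model_name")
        params = entry.get("litellm_params")
        if not isinstance(params, dict):
            problems.append(f"entry {i}: missing litellm_params")
        elif not params.get("model"):
            problems.append(f"entry {i}: litellm_params has no model")
    return problems

good = [{"model_name": "my-model", "litellm_params": {"model": "openai/gpt-4o"}}]
bad = [{"model_name": "my-model"}, {"litellm_params": {"api_key": "x"}}]
print(validate_model_list(good))
print(validate_model_list(bad))
```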

Production Gotchas

⚠ Model name string must match provider exactly

❌ 'openai-gpt-4o' → wrong format
✅ 'openai/gpt-4o' → explicit provider (works everywhere)
✅ 'gpt-4o' → implicit provider (works when the matching API key is set in the environment or passed explicitly for that model)

In Router model_list, always use the full 'provider/model' format to avoid ambiguity.

⚠ Router doesn't auto-detect or retry on code (user) errors

num_retries only catches: timeout, rate limit, auth error, network failure. It does NOT retry on:
- Invalid JSON in the response
- The model refusing to answer
- Hallucinations or wrong output format

If you need to retry on logic errors, wrap the call in your own try/except loop.
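
Retrying on a logic error such as malformed JSON might look like the sketch below. `call_until_valid_json` is hypothetical, and `ask` stands for any function that returns the model's text.

```python
import json

def call_until_valid_json(ask, attempts=3):
    """Call `ask()` until its text parses as JSON; raise after `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        text = ask()
        try:
            return json.loads(text)
        except json.JSONDecodeError as exc:
            last_error = exc  # bad output format: ask again
    raise ValueError(f"no valid JSON after {attempts} attempts") from last_error

# Simulated model replies: the first is chatty prose, the second is valid JSON
replies = iter(["Sure! Here is the JSON:", '{"status": "ok"}'])
parsed = call_until_valid_json(lambda: next(replies))
print(parsed)
```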

⚠ Proxy endpoint 'authentication' is cosmetic

The proxy accepts ANY Bearer token or api_key header string. It's not validated unless you set master_key in config.
❌ Security risk in untrusted networks: anyone who can reach the proxy can spend against your provider API keys
✅ Use only in: internal networks, behind an auth proxy (nginx, Cloudflare), or with master_key enabled

Always run behind a firewall or authentication layer in production.

⚠ Cost tracking is approximate for non-OpenAI models

OpenAI, Anthropic, Google: the pricing map is generally current. Bedrock, Azure, custom models: pricing may be stale or missing. For accurate cost tracking:
- Set custom per-token prices in litellm_params: "input_cost_per_token": 0.00X, "output_cost_per_token": 0.00Y
- Validate against the provider's actual billing dashboard
- Don't rely solely on LiteLLM cost for billing

⚠ Router model_list order does not set priority

By default the Router shuffles across deployments that share a model_name; list order is ignored.
❌ Putting the expensive model first doesn't prioritize it
✅ To prefer cheaper deployments, opt in with routing_strategy="cost-based-routing" and make sure each deployment has accurate per-token prices

Use the same model_name for multiple providers to spread load across them, and the fallbacks parameter for failover to a different model group.

⚠ acompletion() does not limit concurrency for you

Calling acompletion() 1000 times in parallel sends 1000 requests at once, which can exhaust connections and trip provider rate limits.
✅ Use async for 10-100 concurrent requests (reasonable)
❌ Don't launch 10k+ concurrent requests without a concurrency limit

For massive parallelism, use asyncio.Semaphore to limit concurrent calls:

python

import asyncio
from litellm import acompletion

sem = asyncio.Semaphore(50)  # At most 50 calls in flight

async def bounded_call(msg):
    async with sem:
        return await acompletion(...)

async def main(messages):
    tasks = [bounded_call(msg) for msg in messages]
    return await asyncio.gather(*tasks)

⚠ Timeout is total request time, not connection time

timeout=10 means the entire request (connect + wait for response) must complete in 10 seconds. With slow providers (Anthropic can take 20s+), you MUST increase timeout:

python

response = completion(..., timeout=60)  # 60 seconds for long generations

The default of 600 seconds is safe for most cases, but it lets a stalled call hang for up to 10 minutes. Set it explicitly.

Key Concepts

Provider abstraction

LiteLLM translates your single completion() call to the correct API format, auth headers, and endpoint for OpenAI, Anthropic, Groq, Bedrock, Ollama, or any of 100+ providers without code changes.

Model routing

Router selects which deployment to call based on the configured routing strategy (shuffle, cost, latency, or usage); if a call fails, it retries and then tries any configured fallbacks, raising an error to your application only when every option is exhausted.

Cost tracking

LiteLLM calculates per-request cost from its built-in pricing map; token counts are in response.usage, the cost comes from completion_cost(), and both can be aggregated for billing dashboards.

Proxy server (drop-in replacement)

Standalone HTTP server that mimics OpenAI's API endpoint, letting you point your OpenAI client's base_url (api_base in older SDK versions) at localhost without rewriting application logic: useful for local testing or unified logging.

Fallback chaining

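Deployments that share a model_name in the Router's model_list are interchangeable, so a retry after a failure can land on a different deployment. Fallbacks to a different model_name are configured with the Router's fallbacks parameter and are tried in order until one succeeds or the options are exhausted.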

Token counting

LiteLLM reports prompt_tokens and completion_tokens from provider response; some providers don't return this data, in which case usage is None and cost calculation fails.
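
Code that reads token counts can guard against a missing usage object. A sketch; `token_counts` is a hypothetical helper, and SimpleNamespace stands in for a response object.

```python
from types import SimpleNamespace

def token_counts(response):
    """Return (prompt_tokens, completion_tokens), or None if usage is absent."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return None
    return usage.prompt_tokens, usage.completion_tokens

with_usage = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=12, completion_tokens=30))
without_usage = SimpleNamespace(usage=None)
print(token_counts(with_usage), token_counts(without_usage))
```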

Verified 2026-04 · v1.x · gpt-4o, gpt-4o-mini, claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022, gemini-2.0-flash
