Fallback to backup model when primary fails
Why this matters
LLM APIs fail in production: rate limits, outages, and API errors all happen. Without a fallback, a single provider failure stops your entire application. This pattern keeps your system running when the primary provider has issues.
Explanation
What it is: A fallback chain automatically tries a backup LLM if the primary one raises an exception. In LangChain, you attach one or more backup models to the primary with .with_fallbacks(); the wrapped model then slots into a chain built with the pipe operator (|) like any other step.
How it works: When you invoke a chain with fallbacks, LangChain attempts the primary model first. If it throws any exception (timeout, rate limit, 500 error), the system catches it and immediately tries the next model in the fallback sequence. You build this using RunnableWithFallbacks or the shorthand .with_fallbacks() method on any Runnable.
When to use it: Use this for any customer-facing application, batch processing job, or API where resilience matters more than strict latency guarantees. A typical setup pairs a primary model from one provider with a fallback from another (for example, GPT-4 as primary and Claude as fallback), so that one provider's outage or rate limit does not take out both.
Analogy
Like having a backup generator: your primary power is the grid, but if it cuts out, the generator kicks in automatically. Your app never goes dark.
Code
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")

# Primary model: tried first on every call.
# The keys below are placeholders; load real keys from the environment.
primary_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,
    api_key="sk-proj-primary-key-here",
)

# Fallback model: invoked only if the primary raises an exception.
fallback_model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    api_key="sk-proj-fallback-key-here",
)

# .with_fallbacks() wraps the primary in a RunnableWithFallbacks,
# so that class does not need to be imported explicitly.
chain = (
    prompt
    | primary_model.with_fallbacks([fallback_model])
    | StrOutputParser()
)

result = chain.invoke({"topic": "neural networks"})
print(f"Result: {result}")

Example output (the exact wording will vary):
Result: Neural networks are computational systems inspired by biological brains that learn patterns from data by adjusting weights through layers of interconnected nodes.
What just happened?
The code creates a prompt template, defines two ChatOpenAI models (one primary, one fallback), and attaches the fallback to the primary with .with_fallbacks(). When invoke() is called, LangChain tries the primary model first. If the primary raises an exception (API error, timeout, rate limit), LangChain catches it silently and invokes the fallback model with the same input. Whichever model answers, the response is parsed into plain text and printed.
Common gotcha
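Developers often assume .with_fallbacks() means 'use this if the primary is slow'. It does not. Fallbacks trigger only when the primary raises an exception; a slow response that eventually succeeds never triggers one. A timeout counts only if the client is configured to raise on it, for example by passing a timeout argument to ChatOpenAI so that a slow call fails instead of hanging. Keep in mind that the client may also retry internally before it raises (ChatOpenAI has a max_retries setting), which delays the switch. Finally, fallbacks are not retries: each option is tried once, in order, and if every option fails the exception bubbles up to the caller.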
Error recovery
AuthenticationError on both models
RateLimitError still raised after fallback
AttributeError: 'ChatOpenAI' object has no attribute 'with_fallbacks'
Experienced dev note
In production, order your fallbacks by cost and reliability, not just capability. The second slot should go to your cheapest, most reliable model, not necessarily your most capable one. If a gpt-4o-mini primary proves fast but flaky, a slower but stable gpt-4o fallback is what keeps something like payment processing running. Also, log which fallback was used so you know when your primary is degrading; this is your canary for outages.
Check your understanding
Why would adding more fallback models to the chain improve reliability but potentially hurt latency? What scenario would cause all fallbacks to fail?
Answer hint
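A correct answer explains that: (1) fallbacks are tried one after another, so each extra fallback adds another sequential attempt, and its wait, to the worst case when the earlier options fail; (2) all fallbacks fail when the cause is shared by every model in the chain, such as a malformed prompt, an invalid API key used by all of them, or an outage or rate limit at a provider they all depend on. A temporary rate limit on the primary alone is exactly the case a fallback handles.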