Code Beginner easy · 5 min

Fallback to backup model when primary fails

What you will learn

Use LangChain's built-in fallback mechanism to automatically switch to a backup LLM when your primary model fails or times out.

Why this matters

Production LLMs fail: rate limits, outages, API errors happen. Without fallbacks, your entire application stops. This pattern keeps your system running when the primary provider has issues.

Skip if: Your system is so latency-critical that it cannot absorb a failed primary call followed by a second attempt, or your only model is a local inference server that you already run with a guaranteed-uptime SLA.

Explanation

What it is: A fallback chain automatically tries a backup LLM if the primary one raises an exception. In LangChain, you attach backups to a model with .with_fallbacks(), which wraps it in a RunnableWithFallbacks that catches errors. The wrapped model then composes with the pipe operator (|) like any other Runnable.

How it works: When you invoke a chain with fallbacks, LangChain attempts the primary model first. If it raises an exception (for example a rate limit, a 500 error, or a client-side timeout), LangChain catches it and immediately tries the next model in the fallback sequence. By default any Exception subclass triggers the fallback. You build this with the .with_fallbacks() method available on any Runnable, which returns a RunnableWithFallbacks.
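The try-in-order behavior described above can be sketched in plain Python. This is a conceptual illustration only, not LangChain's source code: invoke_with_fallbacks and both stub functions are hypothetical names, and re-raising the first error when every candidate fails is an assumption made for the sketch.

```python
from typing import Callable, Sequence


def invoke_with_fallbacks(
    candidates: Sequence[Callable[[str], str]],
    prompt: str,
    exceptions_to_handle: tuple = (Exception,),
) -> str:
    """Call each candidate in order and return the first successful result."""
    first_error = None
    for candidate in candidates:
        try:
            return candidate(prompt)
        except exceptions_to_handle as error:
            # Remember the first failure, then move on to the next candidate.
            if first_error is None:
                first_error = error
    if first_error is None:
        raise ValueError("no candidates were provided")
    # Every candidate failed: surface the first error (assumption of this sketch).
    raise first_error


def primary_stub(prompt: str) -> str:
    # Hypothetical stand-in for a rate-limited primary model.
    raise RuntimeError("429: rate limit exceeded")


def backup_stub(prompt: str) -> str:
    # Hypothetical stand-in for a healthy backup model.
    return f"backup answer to: {prompt}"


answer = invoke_with_fallbacks([primary_stub, backup_stub], "What is a tensor?")
print(answer)
```

Each candidate is tried once. Nothing is retried, and an exception type outside exceptions_to_handle propagates immediately instead of triggering the next candidate.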

When to use it: Use this for any customer-facing application, batch processing job, or API where resilience matters more than strict latency guarantees. A typical setup pairs a primary model from one provider (for example GPT-4) with a fallback from a different provider (for example Claude), so that one provider's outage or rate limit does not take down both.

Analogy

Like having a backup generator: your primary power is the grid, but if it cuts out, the generator kicks in automatically. Your app never goes dark.

Code

Illustrative only - not runnable without a valid API key

python

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Explain {topic} in one sentence.")

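# Primary model: tried first on every request.
primary_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0.7,
    api_key="sk-proj-primary-key-here"  # placeholder, not a real key
)

# Fallback model: called only if the primary raises an exception.
fallback_model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    api_key="sk-proj-fallback-key-here"  # placeholder, not a real key
)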

chain = (
    prompt
    | primary_model.with_fallbacks([fallback_model])
    | StrOutputParser()
)

result = chain.invoke({"topic": "neural networks"})
print(f"Result: {result}")

Output

Result: Neural networks are computational systems inspired by biological brains that learn patterns from data by adjusting weights through layers of interconnected nodes.

What just happened?

The code created a prompt template, defined two ChatOpenAI models (one primary, one fallback), and attached the fallback to the primary with .with_fallbacks(). When invoke() was called, LangChain attempted the primary model first. If the primary model raises an exception (API error, timeout, rate limit), LangChain catches it and invokes the fallback model instead, so the caller sees only the final result. That result was parsed into plain text and printed.

Common gotcha

Developers often assume .with_fallbacks() means 'use this if the primary is slow'. It does not. Fallbacks trigger only when the primary raises an exception; a slow response that eventually succeeds never triggers one. A timeout counts only if the client is configured to raise on it. To fail over on latency, set a request timeout on the primary model (ChatOpenAI accepts a timeout parameter) and let the resulting exception trigger the fallback. Also, if the fallback fails too, the exception propagates to the caller: fallbacks don't retry, they try each option once, in order.

Error recovery

AuthenticationError on both models

Both API keys are invalid. Fix: verify both keys work independently by calling each model separately before chaining.

RateLimitError still raised after fallback

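The fallback model is also rate-limited, which is likely when both models share one provider account. Fix: add a third fallback with higher quotas or from a different provider, or add exponential backoff by calling .with_retry() on the primary model before .with_fallbacks().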

AttributeError: 'ChatOpenAI' object has no attribute 'with_fallbacks'

You're most likely running a very old LangChain release that predates the Runnable interface. Fix: upgrade with pip install --upgrade langchain-core langchain-openai.

Experienced dev note

In production, order your fallbacks by cost and reliability, not just capability. The second model should be the one most likely to be up when the primary is down, at a cost you can accept, rather than the most capable one available. For a critical path such as payment processing, a slower but stable fallback is worth more than a faster one that fails under the same conditions as the primary. Also log which model served each request: a rising fallback rate is your canary for primary degradation.

Check your understanding

Why would adding more fallback models to the chain improve reliability but potentially hurt latency? What scenario would cause all fallbacks to fail?

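Answer hint

A correct answer explains that: (1) fallbacks run sequentially, so each failed attempt adds its full duration before the next model is tried, and worst-case latency grows with every fallback you add; (2) all fallbacks fail when the cause is shared by every model in the list, e.g. a malformed prompt, a network outage on your side, or an invalid key or exhausted quota shared by models from the same provider. A transient error isolated to the primary does not cause this.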

VERSION LangChain 0.3.x (April 2026). The .with_fallbacks() method is available on every Runnable and predates 0.3.x. It wraps the Runnable in a RunnableWithFallbacks for you, so constructing that class directly is rarely necessary.

Next, learn how to add automatic retry logic with exponential backoff to your chains so transient failures recover without needing a fallback model.
